What's the average age of Best Actor/Actress Oscar winners since 1990? MySQL, Python and Excel/CSV can help!

 

Summary:

My good friend Michelle Roell and I built a database of Oscar-winning actors, actresses and movies within PhPMyAdmin, ran queries using Python/MySQL and composed visual data in Excel. Queries included the nationality and education background of actors/actresses, movies with the highest box office revenue, and genres which earned the most “Best Picture” awards. (Feb. – May 2015)

 

Description: 

Our workflow started by gathering data revolving around Best Actor and Actress Oscar winners.  We scraped data from tables on Wikipedia.com pages and attempted to gather biographical data from Biography.com but settled for more friendly Wikipedia.com tables. We created our ER Diagram, Relational Vocabulary and Table Sketch to begin intellectually figuring out our database.  Our flowchart helped us piece the database structure together with our already-gathered data and our future Python/SQL querying.

Once we knew what should go into each table, we created clean CSV files from each of our master data tables.  We used the python script we were taught to insert each row of data into an already structured database in phpMyAdmin.  We queried the database once it was full of data and exported a new CSV via Python.

Using that CSV file, we created a pivot table in Excel and designed a better looking, more understandable graphic to answer our main question.  We thought Best Actors were probably on average older than Best Actresses, and it’s true!  The Best Actors tally in at 44 years old and the Best Actresses are 37 years old.

One of our biggest challenges was cleaning up our CSV files to accurately represent what was needed in the database.  So many of our actors and actresses shared data, for example, being born in “Los Angeles, CA” or attending “Bristol Old Vic Theatre School”.  Then there were the few instances where they shared movies (“The Silence of the Lambs” and “As Good As It Gets”).  Picking out those shared data points, and ensuring that only one ended up in the database, was more difficult than you’d expect considering the small amount of data we had in comparison to other groups.  We re-created our CSV files more than once to ensure that our data was correct in the database.

In conclusion, our project not only taught us data scraping methods, normalizing CSV files, python scripting, SQL querying and graphic representation but also how to think about databases and the thousands of small decisions that have to happen when building even a simple database.  It was challenging, but very satisfying to get an answer to a question we had asked months earlier!