Pitchfork is the largest indie music site on the Internet (in the English-speaking world, at least), updating its pages daily with the latest indie music rumblings, interviews with budding artists, sneak previews of new albums and artist collaborations, and, most notably, a suite of music reviews by dedicated music critics forming Pitchfork's staff. I follow Pitchfork's album reviews religiously and I am not alone in feeling that their 'Best New Music' category routinely captures the best that modern music has to offer. But how do these data behave?
Answering this first requires scraping Pitchfork's webpages and parsing the relevant information for each album. This is accomplished by scrape-and-parse.py, which uses sqlite3 to produce a SQL database, pitchfork-reviews.db, with the following tables:
Album reviews:

id|Unique album identifier assigned by Pitchfork, e.g., Mac DeMarco's Salad Days album is uniquely identified by 19170, as visible in its album review URL.
album|Name of the album.
artist|Name of the album's artist.
label|Name of the label that produced the album.
released|Year the album was released (may be missing).
reviewer|Name(s) of the album's Pitchfork reviewer(s).
score|Score given to the album: 0.0 to 10.0 in increments of 0.1.
accolade|"Best New Music" or "Best New Reissue".
published|Date the review was published (YYYY-MM-DD).
url|Pitchfork URL of the album review.

Artists:

id|Unique artist identifier assigned by Pitchfork, e.g., Warpaint is uniquely identified by 28034, as visible in their artist URL.
artist|Name of the artist.
url|Pitchfork URL of the artist.

Reviewers:

id|Unique reviewer identifier, auto-assigned as new reviewers enter the database.
reviewer|Name(s) of reviewer(s).
url|Pitchfork URL of the reviewer. If none, URL is simply
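The layout described by these field lists can be sketched with Python's standard sqlite3 module. This is a minimal illustration, not the code in scrape-and-parse.py: the table names (albums, artists, reviewers), the column types, and the helper functions are assumptions; only the column names come from the field lists above.

```python
import sqlite3

# ASSUMPTION: table names and column types are illustrative guesses.
# Only the column names are taken from the field lists above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS albums (
    id        INTEGER PRIMARY KEY,  -- identifier assigned by Pitchfork
    album     TEXT,
    artist    TEXT,
    label     TEXT,
    released  INTEGER,              -- year; may be NULL
    reviewer  TEXT,
    score     REAL,                 -- 0.0 to 10.0 in steps of 0.1
    accolade  TEXT,
    published TEXT,                 -- YYYY-MM-DD
    url       TEXT
);
CREATE TABLE IF NOT EXISTS artists (
    id     INTEGER PRIMARY KEY,     -- identifier assigned by Pitchfork
    artist TEXT,
    url    TEXT
);
CREATE TABLE IF NOT EXISTS reviewers (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto-assigned
    reviewer TEXT,
    url      TEXT
);
"""


def create_database(path=":memory:"):
    """Open (or create) a database at `path` and ensure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_reviewer(conn, name, url=None):
    """Insert a reviewer and return the auto-assigned identifier."""
    cur = conn.execute(
        "INSERT INTO reviewers (reviewer, url) VALUES (?, ?)", (name, url)
    )
    conn.commit()
    return cur.lastrowid
```

Album and artist identifiers come from Pitchfork, so they are stored as given, while reviewer identifiers are left to SQLite to assign as new reviewers are inserted.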
DBI packages allow R to interact with pitchfork-reviews.db using SQLite syntax. As these data are not prohibitively large, the easiest option is to load the database into R in its entirety. This is accomplished by load-database.R. These data are current as of January 16, 2016.
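The repository performs this step in R. For readers working in Python, the same idea of loading every table into memory at once might look like the sketch below. The function name load_tables is hypothetical, and the table names are discovered from the database file instead of being assumed.

```python
import sqlite3


def load_tables(conn):
    """Return {table_name: [row_dict, ...]} for every user table.

    Reads each table in full, which is only sensible for a small database.
    """
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    names = [
        row["name"]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    return {
        name: [dict(row) for row in conn.execute(f'SELECT * FROM "{name}"')]
        for name in names
    }
```

Calling load_tables(sqlite3.connect("pitchfork-reviews.db")) would return one list of dictionaries per table, keyed by table name.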
Once the database is loaded, the munge-data.R file munges the raw data into a more usable form, including reviewer name corrections and helpful date information.
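The munging itself is done in R by munge-data.R. As a rough Python sketch of the two kinds of cleanup mentioned, the functions below tidy a reviewer name and derive date fields from the published column. Every function name is hypothetical, the choice of derived fields (year, month, weekday) is a guess at what "helpful date information" means, and the corrections mapping is a placeholder for whatever name fixes the real script applies.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def clean_reviewer(name, corrections=None):
    """Collapse stray whitespace, then apply an optional correction map."""
    cleaned = " ".join(name.split())
    return (corrections or {}).get(cleaned, cleaned)


def date_parts(published):
    """Derive year, month, and weekday from a YYYY-MM-DD string."""
    d = date.fromisoformat(published)  # requires Python 3.7+
    return {"year": d.year, "month": d.month, "weekday": WEEKDAYS[d.weekday()]}


def munge_row(row, corrections=None):
    """Return a copy of `row` with a cleaned reviewer and added date fields."""
    out = dict(row)
    out["reviewer"] = clean_reviewer(row["reviewer"], corrections)
    out.update(date_parts(row["published"]))
    return out
```

Returning a copy keeps the raw row intact, so the cleaned data can always be compared against what was scraped.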
2015-01-14-pitchfork-reviews.Rmd presents a fully reproducible data analysis of these Pitchfork album review data, making use of several helpful R packages including ggplot2. Opening this file in RStudio and "knitting" the document with knitr produces a rendered report, such as a docx file. The following is a link to the analysis hosted on my webpage: