Extracts Jeopardy! clues from the J! Archive website and dumps them into a SQLite database for use elsewhere (no particular application is intended).
Tested with Python 2.7.5 and SQLite 3.7.12. Depends on BeautifulSoup 4 and the lxml parser.
```bash
pip install beautifulsoup4
pip install lxml
git clone https://github.com/whymarrh/jeopardy-parser.git
cd jeopardy-parser
python download.py
python parser.py
```
Thanks to @knicholes for the Python download script.
The build does two important things: `download.py` downloads the game pages from the J! Archive website, and `parser.py` parses them and inserts the clues into the database.
The first part takes ~6.5 hours and the second part should take ~20 minutes (on a 1.7 GHz Core i5 with 4 GB RAM), so a full build requires ~7 hours in total. Yes, that's a rather long time. Please submit a pull request if you can think of a way to shorten it.
As an aside: the complete download of the pages is ~300MB, and the resulting database file is ~30MB.
The database is split into 5 tables:
| Table name        | What it holds                                          |
| ----------------- | ------------------------------------------------------ |
| `airdates`        | Airdates for the shows, indexed by game number         |
| `documents`       | Mappings from clue IDs to clue text and answers        |
| `categories`      | The categories                                         |
| `clues`           | Clue IDs with metadata (game number, round, and value) |
| `classifications` | Mappings from clue IDs to category IDs                 |
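
For experimenting without running the full build, the layout above can be approximated in an in-memory database. This is only a sketch: the column names are taken from the queries in this README, while the column types, the constraints, and the helper names `make_toy_database` and `table_names` are assumptions and may not match what `parser.py` actually creates.

```python
import sqlite3

# Assumed schema: column names come from the queries in this README;
# the types and constraints are guesses, not the parser's real DDL.
SCHEMA = """
CREATE TABLE airdates (game INTEGER PRIMARY KEY, airdate TEXT);
CREATE TABLE documents (id INTEGER PRIMARY KEY, clue TEXT, answer TEXT);
CREATE TABLE categories (id INTEGER PRIMARY KEY, category TEXT UNIQUE);
CREATE TABLE clues (id INTEGER PRIMARY KEY, game INTEGER, round INTEGER, value INTEGER);
CREATE TABLE classifications (clue_id INTEGER, category_id INTEGER);
"""


def make_toy_database():
    """Return an empty in-memory database with the assumed five tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """List the tables in a database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

`table_names` also works on a connection to the real database file, which is a quick way to check the actual layout against this sketch.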
To get all the clues along with their metadata:
```sql
SELECT clues.id, game, round, value, clue, answer
FROM clues
JOIN documents ON clues.id = documents.id
-- WHERE <expression>
;
```
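
From Python, the `WHERE` clause can be filled in with a bound parameter. The sketch below filters by game number. The function name `clues_for_game` and the toy rows are made up for illustration, and the table definitions are an assumption based on the column names in the query above.

```python
import sqlite3


def clues_for_game(conn, game):
    """Return (id, game, round, value, clue, answer) rows for one game."""
    query = """
        SELECT clues.id, game, round, value, clue, answer
        FROM clues
        JOIN documents ON clues.id = documents.id
        WHERE game = ?
        ORDER BY clues.id
    """
    # The "?" placeholder lets sqlite3 bind the value safely.
    return conn.execute(query, (game,)).fetchall()


# Toy data only; a real run would connect to the file written by parser.py.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clues (id INTEGER PRIMARY KEY, game INTEGER, round INTEGER, value INTEGER);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, clue TEXT, answer TEXT);
    INSERT INTO clues VALUES (1, 10, 1, 200);
    INSERT INTO clues VALUES (2, 10, 1, 400);
    INSERT INTO clues VALUES (3, 11, 2, 800);
    INSERT INTO documents VALUES (1, 'Toy clue A', 'Toy answer A');
    INSERT INTO documents VALUES (2, 'Toy clue B', 'Toy answer B');
    INSERT INTO documents VALUES (3, 'Toy clue C', 'Toy answer C');
""")

rows = clues_for_game(conn, 10)
```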
To get the category that a clue is in, given a clue id:
```sql
SELECT clue_id, category
FROM classifications
JOIN categories ON category_id = categories.id
-- WHERE <expression>
;
```
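
A minimal sketch of the same lookup from Python, assuming each clue maps to a single category. The helper name `category_for_clue` and the toy rows are hypothetical, and the table definitions are guessed from the column names in the query above.

```python
import sqlite3


def category_for_clue(conn, clue_id):
    """Return the category name for a clue ID, or None if it is unknown."""
    row = conn.execute(
        """
        SELECT category
        FROM classifications
        JOIN categories ON category_id = categories.id
        WHERE clue_id = ?
        """,
        (clue_id,),
    ).fetchone()
    return row[0] if row is not None else None


# Toy data only, for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE classifications (clue_id INTEGER, category_id INTEGER);
    INSERT INTO categories VALUES (1, 'TOY CATEGORY');
    INSERT INTO categories VALUES (2, 'ANOTHER TOY CATEGORY');
    INSERT INTO classifications VALUES (1, 1);
    INSERT INTO classifications VALUES (2, 2);
""")

category = category_for_clue(conn, 2)
```

Returning `None` for an unknown ID is a design choice of this sketch; raising an exception would be an equally reasonable alternative.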
To get everything (although it is better to pick and choose what you're looking for):
```sql
SELECT clues.id, clues.game, airdate, round, value, category, clue, answer
FROM clues
JOIN airdates ON clues.game = airdates.game
JOIN documents ON clues.id = documents.id
JOIN classifications ON clues.id = classifications.clue_id
JOIN categories ON classifications.category_id = categories.id
-- WHERE <expression>
;
```
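
With this many columns, reading rows by name is easier than by position. The sketch below uses `sqlite3.Row` to turn each result of the full join into a dictionary, filtered by category. The function name `clues_in_category`, the `AS` aliases, the toy rows, and the airdate strings are assumptions for illustration; the real airdate format is not specified here.

```python
import sqlite3

# The full join from above, with explicit aliases so the dictionary keys
# are predictable, plus a category filter in place of <expression>.
QUERY = """
    SELECT clues.id AS id, clues.game AS game, airdate, round, value,
           category, clue, answer
    FROM clues
    JOIN airdates ON clues.game = airdates.game
    JOIN documents ON clues.id = documents.id
    JOIN classifications ON clues.id = classifications.clue_id
    JOIN categories ON classifications.category_id = categories.id
    WHERE category = ?
    ORDER BY clues.id
"""


def clues_in_category(conn, category):
    """Return every clue in a category as a list of dictionaries."""
    cursor = conn.cursor()
    cursor.row_factory = sqlite3.Row  # rows become addressable by column name
    return [dict(row) for row in cursor.execute(QUERY, (category,))]


# Toy data only; table definitions are guessed from the query's column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airdates (game INTEGER PRIMARY KEY, airdate TEXT);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, clue TEXT, answer TEXT);
    CREATE TABLE categories (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE clues (id INTEGER PRIMARY KEY, game INTEGER, round INTEGER, value INTEGER);
    CREATE TABLE classifications (clue_id INTEGER, category_id INTEGER);
    INSERT INTO airdates VALUES (10, 'toy-airdate');
    INSERT INTO categories VALUES (1, 'TOY CATEGORY');
    INSERT INTO categories VALUES (2, 'ANOTHER TOY CATEGORY');
    INSERT INTO clues VALUES (1, 10, 1, 200);
    INSERT INTO clues VALUES (2, 10, 1, 400);
    INSERT INTO documents VALUES (1, 'Toy clue A', 'Toy answer A');
    INSERT INTO documents VALUES (2, 'Toy clue B', 'Toy answer B');
    INSERT INTO classifications VALUES (1, 1);
    INSERT INTO classifications VALUES (2, 2);
""")

results = clues_in_category(conn, "TOY CATEGORY")
```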
This software is released under the MIT License.