IPINST / PPP_Scraper

ScraperWiki implementation to extract data from the monthly PPP PDF documents. For use by Morph.io.


PPP_Scraper

This is an implementation of the scraperwiki package. The app connects to the UN Department of Peacekeeping Operations (DPKO) website and fetches the report on troop contributions for the previous month. The URL follows the pattern http://www.un.org/en/peacekeeping/contributors/<year>/<3 or 4 letter month abbr><last 2 numbers of year>_3.pdf. For instance, March 2014's file would be at http://www.un.org/en/peacekeeping/contributors/2014/mar14_3.pdf.
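The URL pattern above can be sketched as a small helper. This is a minimal illustration in Python 3, not the scraper's actual code: the names BASE_URL, MONTH_ABBR, previous_month and report_url are hypothetical, and the three-letter abbreviations in the lookup table are an assumption, since the README does not say which months use a four-letter form. The optional abbr argument lets the caller override the abbreviation for those months.

```python
import datetime

# Base path taken from the README's URL pattern.
BASE_URL = "http://www.un.org/en/peacekeeping/contributors"

# ASSUMPTION: three-letter abbreviations for every month. The README says
# some months use four letters but does not list them, so pass `abbr`
# explicitly to report_url() when a month needs a different spelling.
MONTH_ABBR = {
    1: "jan", 2: "feb", 3: "mar", 4: "apr", 5: "may", 6: "jun",
    7: "jul", 8: "aug", 9: "sep", 10: "oct", 11: "nov", 12: "dec",
}


def previous_month(today):
    """Return (year, month) of the month before `today`."""
    first_of_month = today.replace(day=1)
    last_of_previous = first_of_month - datetime.timedelta(days=1)
    return last_of_previous.year, last_of_previous.month


def report_url(year, month, abbr=None):
    """Build the report URL for the given year and month."""
    abbr = abbr or MONTH_ABBR[month]
    return "{}/{}/{}{:02d}_3.pdf".format(BASE_URL, year, abbr, year % 100)


if __name__ == "__main__":
    year, month = previous_month(datetime.date.today())
    print(report_url(year, month))
```

Calling report_url(2014, 3) reproduces the March 2014 address given above.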

The app then converts the PDF to XML and uses each text element's horizontal position on the page to parse it into a SQLite database. Country names sit between 130 and 140 px from the left edge, mission designations between 276 and 280 px, and data values anywhere beyond 350 px. The app is built to run on the Morph.io platform, which provides an API that the PPP_Loader app uses. It is also possible to run scraperwiki locally, though I've had trouble getting it to work.
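The position rules above can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the scraper's own code: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra installs, the function names classify and classify_page are hypothetical, and the SAMPLE page is made up for the example (only the left-offset thresholds come from this README).

```python
import xml.etree.ElementTree as ET


def classify(left):
    """Map a left offset in px to a field kind, using the README's thresholds."""
    if 130 <= left <= 140:
        return "country"
    if 276 <= left <= 280:
        return "mission"
    if left > 350:
        return "data"
    return None  # headers, footers and anything else are ignored


def classify_page(xml_string):
    """Return (kind, text) pairs for every recognised <text> element."""
    root = ET.fromstring(xml_string)
    found = []
    for element in root.iter("text"):
        kind = classify(int(element.get("left")))
        if kind is not None:
            # itertext() also collects text nested in child tags such as <b>.
            found.append((kind, "".join(element.itertext()).strip()))
    return found


# Made-up sample in the general shape of pdf2xml output (an assumption).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="20" left="50" width="200" height="12" font="0">Page header</text>
<text top="100" left="135" width="60" height="12" font="0">Albania</text>
<text top="100" left="278" width="50" height="12" font="0"><b>UNMISS</b></text>
<text top="100" left="400" width="10" height="12" font="0">1</text>
</page>
</pdf2xml>"""

if __name__ == "__main__":
    for kind, text in classify_page(SAMPLE):
        print(kind, text)
```

Because the rules depend on fixed pixel offsets, a change in the report's layout would require new thresholds.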

Morph.io runs this script directly from GitHub once daily, and data is only inserted if it doesn't already exist in the database.
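One way to get this insert-only-if-missing behaviour is a uniqueness constraint combined with INSERT OR IGNORE. The sketch below uses Python's built-in sqlite3 module rather than the scraperwiki library, so it is not necessarily how the scraper does it. The table name data, the helper names open_db and insert_if_new, and the choice of (tccIso3Alpha, mission, type, date) as the unique key are all assumptions made for illustration; the column names are taken from the data table on this page.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    # ASSUMPTION: one row per country, mission, personnel type and date.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data (
               tccIso3Alpha TEXT, mission TEXT, type TEXT, date TEXT,
               M INTEGER, F INTEGER, T INTEGER,
               UNIQUE (tccIso3Alpha, mission, type, date))"""
    )
    return conn


def insert_if_new(conn, row):
    """Insert `row` unless its key already exists; return True if inserted."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO data "
        "(tccIso3Alpha, mission, type, date, M, F, T) "
        "VALUES (:tccIso3Alpha, :mission, :type, :date, :M, :F, :T)",
        row,
    )
    conn.commit()
    return cursor.rowcount == 1


if __name__ == "__main__":
    conn = open_db()
    row = {"tccIso3Alpha": "ALB", "mission": "UNMISS",
           "type": "Individual Police", "date": "20140331",
           "M": 1, "F": 0, "T": 1}
    print(insert_if_new(conn, row))  # first daily run
    print(insert_if_new(conn, row))  # a later run with the same report
```

With this approach, repeated daily runs against the same monthly report leave the table unchanged.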

Dependencies

  • scraperwiki - Web scraping library. Its PDF-to-XML functionality converts the report into a parsable format.
  • Requests - Python HTTP library, used to check the connection
  • lxml - Python XML library to navigate the XML tree
  • Python native modules: time, datetime, calendar, urllib2
  • Data source: DPKO website

Contributors byndcivilization IPINST

Last run completed successfully.

Console output of last run

Injecting configuration and compiling...
Injecting scraper and running...
http://www.un.org/en/peacekeeping/contributors/2017/aug17_3.pdf
The pdf file has 512746 bytes
After converting to xml it has 533617 bytes
The first 200 characters are:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.24.5">
<page number="1" position="absolute" top="0" left="0" height="1188" width=

Data

Downloaded 1297 times by IPINST MikeRalphson


rows 10 / 35521

F   M    mission   dateString  tccIso3Alpha  T    date      tccIso3Num  type                tcc
0   1    UNMISS    3/31/14     ALB           1    20140331  008         Individual Police   Albania
0   5    MONUSCO   3/31/14     DZA           5    20140331  012         Experts on Mission  Algeria
0   3    MINURSO   3/31/14     ARG           3    20140331  032         Experts on Mission  Argentina
1   11   MINUSTAH  3/31/14     ARG           12   20140331  032         Individual Police   Argentina
40  528  MINUSTAH  3/31/14     ARG           568  20140331  032         Contingent Troop    Argentina
16  249  UNFICYP   3/31/14     ARG           265  20140331  032         Contingent Troop    Argentina
1   10   UNMIL     3/31/14     ARG           11   20140331  032         Individual Police   Argentina
0   4    UNMISS    3/31/14     ARG           4    20140331  032         Individual Police   Argentina
0   3    UNOCI     3/31/14     ARG           3    20140331  032         Individual Police   Argentina
0   3    UNTSO     3/31/14     ARG           3    20140331  032         Experts on Mission  Argentina
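
In the sample rows, the date column is the dateString value rewritten as YYYYMMDD, and T equals F plus M. The short sketch below shows one way to derive and check those fields. It is an illustration only: the helper names to_sortable_date and totals_match are hypothetical, and reading F, M and T as female, male and total is an assumption based on the sample values.

```python
from datetime import datetime


def to_sortable_date(date_string):
    """Convert a dateString such as '3/31/14' to the date form '20140331'."""
    return datetime.strptime(date_string, "%m/%d/%y").strftime("%Y%m%d")


def totals_match(row):
    """Check that T is the sum of F and M (assumed female, male, total)."""
    return row["F"] + row["M"] == row["T"]


if __name__ == "__main__":
    row = {"F": 40, "M": 528, "T": 568, "dateString": "3/31/14"}
    print(to_sortable_date(row["dateString"]), totals_match(row))
```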

Statistics

Average successful run time: less than a minute

Total run time: about 6 hours

Total cpu time used: 10 minutes

Total disk space used: 4.33 MB

History

  • Manually ran revision 94ba9cb0 and completed successfully.
    nothing changed in the database
  • Manually ran revision 14dfa458 and failed.
    nothing changed in the database
  • Manually ran revision 3defd6d3 and completed successfully.
    nothing changed in the database
  • Manually ran revision 5c1b56b6 and completed successfully.
    nothing changed in the database
  • Manually ran revision ba7c5a95 and failed.
    nothing changed in the database
  • ...
  • Created on morph.io


Scraper code

Python

PPP_Scraper / scraper.py