hugobarbieux / first_pdf_scraper

First test of PDF scraper


This is a scraper that runs on Morph. To get started, see the documentation.

Contributors: hugobarbieux

Last run completed successfully.

Console output of last run

Injecting configuration and compiling...
Injecting scraper and running...
The pdf file has 92908 bytes
After converting to xml it has 15093 bytes
The first 2000 characters are:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.24.5">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<fontspec id="0" size="16" family="Times" color="#000000"/>
<fontspec id="1" size="10" family="Times" color="#000000"/>
<fontspec id="2" size="16" family="Times" color="#000000"/>
<fontspec id="3" size="12" family="Times" color="#000000"/>
<fontspec id="4" size="8" family="Times" color="#000000"/>
<text top="94" left="504" width="5" height="16" font="0"><b> </b></text>
<text top="52" left="797" width="3" height="14" font="1"><b> </b></text>
<text top="65" left="728" width="74" height="21" font="0"><b>H/16–17 </b></text>
<text top="86" left="699" width="103" height="21" font="0"><b>1st Meeting </b></text>
<text top="107" left="797" width="5" height="21" font="0"><b> </b></text>
<text top="134" left="108" width="5" height="16" font="2"> </text>
<text top="1193" left="108" width="5" height="16" font="2"> </text>
<text top="148" left="352" width="194" height="21" font="0"><b>HOUSE COMMITTEE </b></text>
<text top="169" left="446" width="5" height="21" font="0"><b> </b></text>
<text top="190" left="410" width="16" height="21" font="0"><b>M</b></text>
<text top="193" left="426" width="56" height="16" font="3"><b>INUTES</b></text>
<text top="190" left="483" width="5" height="21" font="0"><b> </b></text>
<text top="211" left="108" width="5" height="21" font="0"><b> </b></text>
<text top="232" left="395" width="108" height="21" font="0"><b>29 June 2016 </b></text>
<text top="253" left="108" width="5" height="21" font="2"> </text>
<text top="273" left="108" width="64" height="21" font="2">Present: </text>
<text top="294" left="108" width="154" height="21" font="2">L. Campbell-Savours </text>
<text top="315" left="108" width="172" height="21" font="2">B. D’Souza (Chairman) </text>
<text top="336" left="108" width="162" height="21" font="2">L. Hope of Craighead </tex
The pages are numbered: ['1', '2', '3']
{'top': '94', 'font': '0', 'left': '504', 'height': '16', 'width': '5'}
<b> </b>
{'text': '<b> </b>\n', 'ID': 1}
{'top': '52', 'font': '1', 'left': '797', 'height': '14', 'width': '3'}
<b> </b>
{'text': '<b> </b>\n', 'ID': 2}
{'top': '65', 'font': '0', 'left': '728', 'height': '21', 'width': '74'}
<b>H/1617 </b>
{'text': '<b>H/1617 </b>\n', 'ID': 3}
{'top': '86', 'font': '0', 'left': '699', 'height': '21', 'width': '103'}
<b>1st Meeting </b>
{'text': '<b>1st Meeting </b>\n', 'ID': 4}
{'top': '107', 'font': '0', 'left': '797', 'height': '21', 'width': '5'}
<b> </b>
{'text': '<b> </b>\n', 'ID': 5}
{'top': '134', 'font': '2', 'left': '108', 'height': '16', 'width': '5'}
{'text': ' ', 'ID': 6}
{'top': '1193', 'font': '2', 'left': '108', 'height': '16', 'width': '5'}
{'text': ' ', 'ID': 7}
{'top': '148', 'font': '0', 'left': '108', 'height': '21', 'width': '5'}
<b> </b>
{'text': '<b> </b>\n', 'ID': 8}
{'top': '169', 'font': '2', 'left': '108', 'height': '21', 'width': '320'}
The Committee <b>took note</b>
{'text': 'The Committee <b>took note</b>\n', 'ID': 9}
{'top': '190', 'font': '2', 'left': '108', 'height': '21', 'width': '5'}
{'text': ' ', 'ID': 10}
{'top': '211', 'font': '0', 'left': '108', 'height': '21', 'width': '244'}
<b>7. ANY OTHER BUSINESS </b>
{'text': '<b>7. ANY OTHER BUSINESS </b>\n', 'ID': 11}
{'top': '232', 'font': '2', 'left': '108', 'height': '21', 'width': '5'}
{'text': ' ', 'ID': 12}
{'top': '253', 'font': '2', 'left': '108', 'height': '21', 'width': '679'}
The Committee discussed the security of Members following recent events. Members noted
{'text': 'The Committee discussed the security of Members following recent events. Members noted ', 'ID': 13}
{'top': '273', 'font': '2', 'left': '108', 'height': '21', 'width': '635'}
the importance of raising any concerns they had regarding their personal security with
{'text': 'the importance of raising any concerns they had regarding their personal security with ', 'ID': 14}
{'top': '294', 'font': '2', 'left': '108', 'height': '21', 'width': '663'}
officials so they could be dealt with swiftly and measures taken as required. Members who
{'text': 'officials so they could be dealt with swiftly and measures taken as required. Members who ', 'ID': 15}
{'top': '315', 'font': '2', 'left': '108', 'height': '21', 'width': '606'}
had concerns were invited to contact the Director of Security for Parliament. The
{'text': 'had concerns were invited to contact the Director of Security for Parliament. The ', 'ID': 16}
{'top': '336', 'font': '2', 'left': '108', 'height': '21', 'width': '655'}
Consultative Panel on Parliamentary Security was considering measures required to keep
{'text': 'Consultative Panel on Parliamentary Security was considering measures required to keep ', 'ID': 17}
{'top': '357', 'font': '2', 'left': '108', 'height': '21', 'width': '420'}
Parliamentarians safe when off the Estate as well as on it.
{'text': 'Parliamentarians safe when off the Estate as well as on it. ', 'ID': 18}
{'top': '378', 'font': '2', 'left': '108', 'height': '21', 'width': '5'}
{'text': ' ', 'ID': 19}
{'top': '399', 'font': '2', 'left': '108', 'height': '21', 'width': '5'}
{'text': ' ', 'ID': 20}
{'top': '420', 'font': '5', 'left': '108', 'height': '20', 'width': '5'}
{'text': ' ', 'ID': 21}
{'top': '440', 'font': '5', 'left': '108', 'height': '20', 'width': '5'}
{'text': ' ', 'ID': 22}
{'top': '460', 'font': '2', 'left': '676', 'height': '21', 'width': '115'}
Rob Whiteway
{'text': 'Rob Whiteway ', 'ID': 23}
{'top': '481', 'font': '2', 'left': '555', 'height': '21', 'width': '235'}
Clerk of the House Committee
{'text': 'Clerk of the House Committee ', 'ID': 24}
{'top': '501', 'font': '2', 'left': '713', 'height': '21', 'width': '77'}
June 2016
{'text': 'June 2016 ', 'ID': 25}
{'top': '528', 'font': '2', 'left': '108', 'height': '16', 'width': '5'}
{'text': ' ', 'ID': 26}
{'top': '549', 'font': '2', 'left': '108', 'height': '16', 'width': '5'}
{'text': ' ', 'ID': 27}
{'top': '570', 'font': '2', 'left': '108', 'height': '16', 'width': '5'}
{'text': ' ', 'ID': 28}
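
The console output suggests that the scraper walks the <text> elements of the pdf2xml document and emits one record per element, with an incrementing ID and the element's inner markup as text. The actual scraper.py is not shown on this page, so the following is only a minimal sketch of that idea. The function names and the sample XML are hypothetical, and the sketch uses the standard-library ElementTree parser, which may differ from what the scraper really uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the pdf2xml layout seen in the console output.
SAMPLE_XML = """<pdf2xml producer="poppler" version="0.24.5">
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
<text top="86" left="699" width="103" height="21" font="0"><b>1st Meeting </b></text>
<text top="273" left="108" width="64" height="21" font="2">Present: </text>
</page>
</pdf2xml>"""


def inner_markup(element):
    """Return an element's content as a string, keeping child tags such as <b>."""
    parts = [element.text or ""]
    for child in element:
        # ET.tostring serialises the child together with any tail text after it.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def page_numbers(xml_string):
    """List the 'number' attribute of every <page> element."""
    root = ET.fromstring(xml_string)
    return [page.get("number") for page in root.iter("page")]


def extract_records(xml_string):
    """Build one {'ID', 'text'} record per <text> element, in document order."""
    root = ET.fromstring(xml_string)
    records = []
    for page in root.iter("page"):
        for element in page.iter("text"):
            records.append({"ID": len(records) + 1, "text": inner_markup(element)})
    return records


if __name__ == "__main__":
    print("The pages are numbered:", page_numbers(SAMPLE_XML))
    for record in extract_records(SAMPLE_XML):
        print(record)
```

Numbering the records as they are appended keeps the IDs aligned with document order, which matches the way IDs 1 to 28 follow the page layout in the output above.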

Data

Downloaded 0 times


Showing 10 of 28 rows

ID  text
1   <b> </b>
2   <b> </b>
3   <b>H/1617 </b>
4   <b>1st Meeting </b>
5   <b> </b>
6
7
8   <b> </b>
9   The Committee <b>took note</b>
10
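
The table above has two columns, ID and text. A minimal sketch of how such records could be written to and read back from SQLite follows. It assumes a table named data and uses ID as the primary key so that re-running the scraper replaces rows instead of duplicating them. The function names are hypothetical, and the real scraper may store its data differently.

```python
import sqlite3


def save_records(conn, records):
    """Insert or replace records keyed on ID (assumed table name: data)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS data ("ID" INTEGER PRIMARY KEY, "text" TEXT)'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO data ("ID", "text") VALUES (:ID, :text)',
        records,
    )
    conn.commit()


def first_rows(conn, limit=10):
    """Return the first rows ordered by ID, like the preview table."""
    query = 'SELECT "ID", "text" FROM data ORDER BY "ID" LIMIT ?'
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    # An in-memory database keeps the example free of side effects.
    connection = sqlite3.connect(":memory:")
    save_records(
        connection,
        [
            {"ID": 3, "text": "<b>H/1617 </b>"},
            {"ID": 4, "text": "<b>1st Meeting </b>"},
        ],
    )
    for row in first_rows(connection):
        print(row)
```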

Statistics

Average successful run time: less than 5 seconds

Total run time: less than 10 seconds

Total cpu time used: less than 5 seconds

Total disk space used: 27.9 KB

History

  • Manually ran revision 59e07db7 and completed successfully.
    28 records added, 2 records removed in the database
  • Manually ran revision ae3317ed and failed.
    2 records added in the database
  • Created on morph.io

Scraper code

Python

first_pdf_scraper / scraper.py