morph.io takes the hassle out of web scraping.
Write your scraper in the language you know and love, push your code to GitHub, and we take care of the boring bits. Things like running your scraper regularly, alerting you if there's a problem, storing your data, and making your data available for download or through a super-simple API.
To sign in you'll need a GitHub account. This is where your scraper code is stored.
Your email address is requested from GitHub and is only used to send email alerts for scrapers you're watching. The application needs read and write access to all your public repositories so that you can create new scrapers.
Have your scraper create a SQLite database in the current working directory called data.sqlite.
How your scraper does that and what tables or fields you include is entirely up to you.
If there's a table called data then we'll show that on your scraper page first.
Here's how you can do this in your programming language:
The scraperwiki Ruby gem makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default Gemfile.

To save data all you need to do is call the ScraperWiki.save_sqlite method. Creating the database and setting up tables is done automagically by the library:
ScraperWiki.save_sqlite([:name], {name: "susan", occupation: "software developer"})
The scraperwiki PHP library makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default Composer file.

To save data all you need to do is call the scraperwiki::save_sqlite method. Creating the database and setting up tables is done automagically by the library:
scraperwiki::save_sqlite(array("name"), array("name" => "susan", "occupation" => "software developer"));
The scraperwiki Python library makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default requirements.txt.

To save data all you need to do is call the scraperwiki.sqlite.save method. Creating the database and setting up tables is done automagically by the library:
scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
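Putting it together, here's a minimal sketch of a complete Python scraper. The URL and page layout are hypothetical, and it assumes lxml is listed in your requirements.txt alongside scraperwiki:

# A minimal sketch: the URL and the page's table layout are hypothetical,
# and lxml is assumed to be in your requirements.txt alongside scraperwiki.
import scraperwiki
import lxml.html

# Fetch the page and parse the HTML
html = scraperwiki.scrape("https://example.com/staff")
root = lxml.html.fromstring(html)

# Save one record per table row. Using "name" as the unique key means
# re-running the scraper updates existing rows instead of duplicating them.
for row in root.xpath("//table[@id='staff']//tr"):
    cells = [cell.text_content().strip() for cell in row.xpath("./td")]
    if len(cells) >= 2:
        scraperwiki.sqlite.save(
            unique_keys=["name"],
            data={"name": cells[0], "occupation": cells[1]})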
The Database::DumpTruck Perl module makes it easy to store data in an SQLite database. To save data all you need to do is use the Database::DumpTruck module:
use Database::DumpTruck;

# Open a database handle
my $dt = Database::DumpTruck->new({dbname => 'data.sqlite', table => 'data'});

# Insert content
$dt->insert({ name => 'Susan', occupation => 'software developer' });

# Create an index
$dt->create_index(['name']);

# Update content. You can use upsert when storing new data too.
$dt->upsert({ name => 'Susan', occupation => 'product owner' });
The sqlite3 Node module makes it easy to store data in an SQLite database. To save data all you need to do is use the sqlite3 module to create your database and table, and then use the prepare, run, and finalize functions to insert new records:
var sqlite3 = require("sqlite3").verbose();

// Open a database handle
var db = new sqlite3.Database("data.sqlite");

db.serialize(function() {
  // Create a new table if it doesn't already exist
  db.run("CREATE TABLE IF NOT EXISTS data (title TEXT)");

  // Insert a new record
  var statement = db.prepare("INSERT INTO data(title) VALUES (?)");
  statement.run("A new title to add");
  statement.finalize();
});
You can use any libraries you want with your scraper. This makes things incredibly flexible and powerful. It's also incredibly straightforward.
You have a favorite scraping library? No problem. You need a library to convert some obscure file format into something sensible? No problem.
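For example, a Python scraper can pull in an extra parsing library just by adding it to requirements.txt next to the scraperwiki line morph.io generated for you (the extra package below is purely illustrative):

# requirements.txt -- keep the scraperwiki line morph.io generated for you,
# then add whatever extra libraries your scraper needs (illustrative example)
beautifulsoup4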
For more details read the Using any library in your scraper page.
There are two different ways you can run scrapers locally while you're writing them.
Either you can use the morph command-line client or you can install the development tools for your language of choice on your machine.
See the page "Running scrapers on your own machine" for details on how to do this.
Right now there are very few limits. We are trusting you that you won't abuse this.
However, we do impose a couple of hard limits on running scrapers so they don't take up too many resources:
If a scraper runs out of memory or runs too long it will get killed automatically.
If you're hitting up against these hard limits and still want to use morph.io, please do get in touch and let us know, and we'll see what we can do to help.
There's also a soft limit:
If a scraper generates more than 10,000 lines of log output, it will continue running uninterrupted; you just won't see any more output than that. To avoid this, simply print less stuff to the screen.
Note that we are keeping track of the amount of CPU time (and a whole bunch of other metrics) that you and your scrapers are using. So, if we do find that you are using too much (and no, we don't know what that is right now) we reserve the right to kick you out. In reality, we'll first ask you nicely to stop.
You can watch any scraper, user, or organisation on morph.io. Each day we'll email you if any of your watched scrapers have errored, so you know they need fixing.
When you're logged in you'll also see a handy list of your watched scrapers that are erroring on the home page.
On each scraper you can see who has downloaded or used the API to get the data. Your downloads are also listed there.
By being able to see openly who is using what, we aim to promote collaboration and serendipity. Creating or scraping data is important, but it's people using it that really makes it exciting. Showing who downloads what connects the people making scrapers with those who use the data.
There may be situations where you don't want to be identified as downloading some data. In that case please create an alternate GitHub (and morph.io) account which does not identify you and use that to download the data.
If that's not possible, or you have any questions, please contact us.
Please keep things legal. Don't scrape anything you're not allowed to. If you do end up doing anything you're not allowed to, it's your responsibility, not ours.