morph.io takes the hassle out of web scraping.
Write your scraper in the language you know and love, push your code to GitHub, and we take care of the boring bits. Things like running your scraper regularly, alerting you if there's a problem, storing your data, and making your data available for download or through a super-simple API.
To sign in you'll need a GitHub account. This is where your scraper code is stored.
Your email address is requested from GitHub and is only used to send email alerts for scrapers you're watching. The application needs read and write access to all your public repositories so that you can create new scrapers.
Have your scraper create a SQLite database in the current working directory called data.sqlite.
How your scraper does that and what tables or fields you include is entirely up to you.
If there's a table called data then we'll show that on your scraper page first.
Here's how you can do this in your programming language:
The scraperwiki Ruby gem makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default Gemfile.

To save data all you need to do is call the ScraperWiki.save_sqlite method. Creating the database and setting up tables is done automagically by the library:
ScraperWiki.save_sqlite([:name], {name: "susan", occupation: "software developer"})
The scraperwiki PHP library makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default Composer file.

To save data all you need to do is call the scraperwiki::save_sqlite method. Creating the database and setting up tables is done automagically by the library:
scraperwiki::save_sqlite(array("name"), array("name" => "susan", "occupation" => "software developer"));
The scraperwiki Python library makes it easy to store data in an SQLite database. You need to use the morph_defaults branch of our fork so that the correct database name is used. When you create a new scraper via morph.io we include this version in the default requirements.txt.

To save data all you need to do is call the scraperwiki.sqlite.save method. Creating the database and setting up tables is done automagically by the library:
scraperwiki.sqlite.save(unique_keys=['name'], data={"name": "susan", "occupation": "software developer"})
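Putting it together, here's a minimal sketch of a complete Python scraper. The URL and page layout are hypothetical, and it assumes lxml is listed in your requirements.txt alongside scraperwiki:

# A minimal sketch: the URL and the page's table layout are hypothetical,
# and lxml is assumed to be in your requirements.txt alongside scraperwiki.
import scraperwiki
import lxml.html

# Fetch the page and parse the HTML
html = scraperwiki.scrape("https://example.com/staff")
root = lxml.html.fromstring(html)

# Save one record per table row. Using "name" as the unique key means
# re-running the scraper updates existing rows instead of duplicating them.
for row in root.xpath("//table[@id='staff']//tr"):
    cells = [cell.text_content().strip() for cell in row.xpath("./td")]
    if len(cells) >= 2:
        scraperwiki.sqlite.save(
            unique_keys=["name"],
            data={"name": cells[0], "occupation": cells[1]})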
The Database::DumpTruck Perl module makes it easy to store data in an SQLite database. To save data all you need to do is use the Database::DumpTruck module:
use Database::DumpTruck;

# Open a database handle
my $dt = Database::DumpTruck->new({dbname => 'data.sqlite', table => 'data'});

# Insert content
$dt->insert({ name => 'Susan', occupation => 'software developer' });

# Create an index
$dt->create_index(['name']);

# Update content. You can use upsert when storing new data too.
$dt->upsert({ name => 'Susan', occupation => 'product owner' });
The sqlite3 Node module makes it easy to store data in an SQLite database. To save data all you need to do is use the sqlite3 module to create your database and table, and then use the prepare, run, and finalize functions to insert new records:
var sqlite3 = require("sqlite3").verbose();

// Open a database handle
var db = new sqlite3.Database("data.sqlite");

db.serialize(function() {
  // Create a new table if it doesn't already exist
  db.run("CREATE TABLE IF NOT EXISTS data (title TEXT)");

  // Insert a new record
  var statement = db.prepare("INSERT INTO data(title) VALUES (?)");
  statement.run("A new title to add");
  statement.finalize();
});
You can use any libraries you want with your scraper. This makes things incredibly flexible and powerful. It's also incredibly straightforward.
You have a favorite scraping library? No problem. You need a library to convert some obscure file format into something sensible? No problem.
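For example, a Python scraper can pull in an extra parsing library just by adding it to requirements.txt next to the scraperwiki line morph.io generated for you (the extra package below is purely illustrative):

# requirements.txt -- keep the scraperwiki line morph.io generated for you,
# then add whatever extra libraries your scraper needs (illustrative example)
beautifulsoup4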
For more details read the Using any library in your scraper page.
There are two different ways you can run scrapers locally while you're writing them.
Either you can use the morph command-line client or you can install the development tools for your language of choice on your machine.
See the page "Running scrapers on your own machine" for details on how to do this.
Right now there are very few limits. We are trusting you that you won't abuse this.
However, we do impose a couple of hard limits on running scrapers so they don't take up too many resources:
If a scraper runs out of memory or runs too long it will get killed automatically.
If you're hitting up against these hard limits and still want to use morph.io, please do get in touch and let us know, and we'll see what we can do to help.
There's also a soft limit:
If a scraper generates more than 10,000 lines of log output, it will continue running uninterrupted; you just won't see any more output than that. To avoid this, simply print less stuff to the screen.
Note that we are keeping track of the amount of CPU time (and a whole bunch of other metrics) that you and your scrapers are using. So, if we do find that you are using too much (and no, we don't know what that is right now) we reserve the right to kick you out. In reality, we'll first ask you nicely to stop.
You can watch any scraper, user, or organisation on morph.io. Each day we'll email you if any of your watched scrapers have errored, so you know they need fixing.
When you're logged in you'll also see a handy list of your watched scrapers that are erroring on the home page.
On each scraper you can see who has downloaded or used the API to get the data. Your downloads are also listed there.
By being able to see openly who is using what, we aim to promote collaboration and serendipity. Creating or scraping data is important, but it's people using it that really makes it exciting. Showing who downloads what connects the people making scrapers with those who use the data.
There may be situations where you don't want to be identified as downloading some data. In that case please create an alternate GitHub (and morph.io) account which does not identify you and use that to download the data.
If that's not possible, or you have any questions, please contact us.
Please keep things legal. Don't scrape anything you're not allowed to. If you do end up doing anything you're not allowed to, it's your responsibility, not ours.