How to use any library in your scraper

You can use any libraries you want with your scraper, which makes things incredibly flexible and powerful. It's also very straightforward.

You have a favorite scraping library? No problem. You need a library to convert some obscure file format into something sensible? No problem.

All you need to do is specify the libraries that you want to use in your scraper repository. Each language does this slightly differently, using the native tools for that language.

If you're already familiar with using Heroku, this will be even simpler for you, as morph.io's system for libraries is built on top of Buildpacks, the same technology that drives the installation of libraries on Heroku.

Ruby

To have morph.io install specific gems for your scraper, add a Gemfile to your repository. For instance, to install the mechanize and sqlite3 gems:

source 'https://rubygems.org'
gem "mechanize"
gem "sqlite3"

Then run bundle update. This will work out which specific versions of each gem will be installed and write the result to Gemfile.lock.
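
That is, from the root of your repository:

bundle update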

Make sure that you add both Gemfile and Gemfile.lock to your repository.
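
If you want to constrain which versions bundle update is allowed to pick, Bundler's usual version operators work in the Gemfile. As a sketch (the version numbers here are purely illustrative):

source 'https://rubygems.org'
gem "mechanize", "~> 2.7"
gem "sqlite3", "~> 1.3"

Remember that Gemfile.lock is still what pins the exact versions used on each run.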

PHP

PHP in morph.io uses Composer for managing dependencies and the runtime. Create a file called composer.json in the root of your scraper repository which says what libraries and extensions you want installed.

For example, to install the XSL PHP extension, your composer.json could look like this:

{
  "require": {
      "ext-xsl": "*"
  }
}
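
Libraries from Packagist are required in exactly the same way as extensions. As a sketch (the package and version constraint here are just an illustration), a composer.json that also pulls in the Guzzle HTTP client might look like:

{
  "require": {
      "ext-xsl": "*",
      "guzzlehttp/guzzle": "^6.0"
  }
}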

Then run composer install. Depending on whether you have Composer installed locally or globally on your machine, you can run either

php composer.phar install

or

composer install

As well as installing the libraries locally, this will also create a composer.lock file, which should be added to git alongside composer.json.

The next time the scraper runs on morph it will build an environment from this.

For more on the specifics of what can go in composer.json, see the Composer documentation.

Python

For Python, morph.io installs libraries using pip from a requirements.txt file in the root of your scraper repository. The format of requirements.txt is straightforward.

For example, to install specific versions of the Pygments and SQLAlchemy libraries, requirements.txt could look like this:

Pygments==1.4
SQLAlchemy==0.6.6
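
If you've already installed the libraries on your own machine, one quick way to produce a fully pinned requirements.txt is pip freeze. Note that it captures everything installed in your current Python environment, so you may want to trim the output down to just what your scraper uses:

pip freeze > requirements.txt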

Perl

To choose which libraries to install, you will need a file called cpanfile in the root of your scraper repository. It can install anything from CPAN and has a very straightforward syntax.

For instance, to install specific versions of HTTP::Message and XML::Parser, your cpanfile should look like this:

requires "HTTP::Message", "6.06";
requires "XML::Parser", "2.41";

You don't have to specify the versions to install, but it's recommended, as otherwise different runs of the scraper could potentially use different versions of libraries.
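
For instance, leaving the versions off entirely is still valid cpanfile syntax:

requires "HTTP::Message";
requires "XML::Parser";

but each build would then use whatever versions happen to be current on CPAN.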

Check cpanfile into git alongside your scraper and the next time it's run on morph it will install the libraries.

Node.js

To have morph.io install packages from npm for your scraper, add a package.json file to the root of your repository. You can edit this file by hand, or have npm install write it for you (there's an example command after the listing below). For instance, to install the express and sqlite3 packages:

{
  "name": "myscraper",
  "description": "a scraper that runs on morph.io",
  "version": "1.0.0",
  "dependencies": {
    "express": "^4.13.3",
    "sqlite3": "latest"
  }
}
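
If you'd rather have npm record the dependencies for you, you can run it with the --save flag (older versions of npm need the flag to write the entry into package.json; newer ones do it by default):

npm install --save express sqlite3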

Make sure that you do not add the node_modules directory to git. You should add this directory to your .gitignore file.
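
A .gitignore containing just this line is enough for that:

node_modules/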