
Running scrapers on your own machine

When you're writing scrapers it can be very handy to develop them on your own machine. Here we explain how you can do that.

It also means that you don't have to push your changes to GitHub every time you want to test your scraper on morph.io.

There are two distinct ways that you can do this:

  1. Install the morph command-line client, a small tool that you run locally. It uploads the scraper in your current directory to the morph.io server and streams the results back to you.

    Advantage:
    Works with any supported scraper language without needing to install anything extra
    Disadvantage:
    Requires a network connection and is slower than running the scraper truly locally. You also need Ruby installed if it isn't already.

  2. Install the development tools for your language of choice on your machine. This is the easiest approach if you already have a development environment set up, or if you only plan to use or write scrapers in a single language.

    Advantage:
    Faster than testing scrapers using the morph command-line client
    Disadvantage:
    Any libraries used by the scraper need to be installed locally as well

How to install the morph command-line client

Either install the morph command-line client using these instructions or install development tools on your machine. You don't need to do both.

First, check that you have Ruby 1.9 or greater installed:

ruby -v

If you need to install or update Ruby see the instructions at ruby-lang.org.

Then install the morph command-line gem:

gem install morph-cli

To run a scraper, go to the directory containing your scraper code and run:

morph

It will run your local scraper on the morph.io server and stream the console output back to you. You can use this with any supported scraper language without the hassle of having to install lots of things.

How to install development tools on your machine

Either install development tools on your machine using these instructions or install the morph command-line client. You don't need to do both.

Ruby

  1. Install Ruby

  2. Install bundler, the Ruby package manager you'll use to install the scraperwiki gem.
    Also install the SQLite development headers, because scraperwiki will need them. On Debian or Ubuntu:

    sudo apt-get install bundler libsqlite3-dev

  3. Fork the repo you want to work on, or start a new one.

  4. Clone it:

    mkdir oaf
    cd oaf
    git clone git@github.com:yourname/example.git
    cd example

  5. If there’s no Gemfile, use this simple one:

    source 'https://rubygems.org'
    gem 'scraperwiki', git: 'https://github.com/openaustralia/scraperwiki-ruby.git', branch: 'morph_defaults'
    gem 'mechanize'

  6. Use bundler to install these Ruby gems locally:

    bundle install --path ../vendor/bundle

    This will create a file called Gemfile.lock. Make sure that you add both Gemfile and Gemfile.lock to your repository.

  7. Run the scraper. Use bundler to initialize the environment:

    bundle exec ruby scraper.rb
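
A scraper run this way is just an ordinary Ruby script that writes to a local SQLite database. As a rough sketch, assuming the Gemfile above, a minimal scraper.rb might look something like this (the URL and the h1 selector are placeholders for whatever site and data you actually want to scrape):

    require 'scraperwiki'
    require 'mechanize'

    agent = Mechanize.new

    # Fetch the page to scrape (placeholder URL)
    page = agent.get('https://example.com/')

    # Pull something out of the page and save it to the local SQLite
    # database, using 'heading' as the unique key
    page.search('h1').each do |heading|
      ScraperWiki.save_sqlite(['heading'], 'heading' => heading.text.strip)
    end

Running it with bundle exec ruby scraper.rb as in step 7 writes the records to data.sqlite in the scraper directory, the same database morph.io populates when your scraper runs on the server.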

Python

  1. Install Python

  2. Install virtualenv and pip for package management, and BeautifulSoup4 for HTML parsing:

    sudo apt-get install python-pip python-bs4 python-dev python-virtualenv

  3. Create a virtualenv and activate it:

    virtualenv --system-site-packages oaf
    source oaf/bin/activate

  4. Fork and clone the scraper you're going to work on:

    git clone git@github.com:yourname/example.git
    cd example

  5. Use pip to install the dependencies:

    pip install -r requirements.txt

  6. Run the scraper locally:

    python scraper.py

Perl

  1. Install Perl

  2. Fork and clone the scraper you're going to work on:

    git clone git@github.com:yourname/example.git
    cd example

  3. If there is a cpanfile in the repository, install the modules it lists using your distribution's packages, or install each module from CPAN like this:

    cpan module

  4. Run the scraper locally:

    perl scraper.pl

Node.js

  1. Install Node.js and npm.

  2. Fork and clone the scraper you're going to work on:

    git clone git@github.com:yourname/example.git
    cd example

  3. Use npm to install the dependencies:

    npm install

  4. Run the scraper locally:

    node scraper.js