Documentation


Moving your scrapers from ScraperWiki Classic

Migrating your scrapers from ScraperWiki Classic to morph.io is extremely straightforward: it takes a single click.

Simply go to your scraper on ScraperWiki Classic and press the "Transfer to Morph.io" button.

[Screenshot: the "Transfer to Morph.io" button]

Read ScraperWiki's guide to the retirement of ScraperWiki Classic, which also explains how this works.

Default libraries are compatible with ScraperWiki Classic

To make it super easy to migrate scrapers from ScraperWiki Classic, the default libraries available to your scraper are as close as possible to those installed on ScraperWiki Classic as of January 2014.

Once you've moved over to morph.io you can use whatever libraries you want. See the page "Using any library in your scraper".
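For example, a Python scraper can commit its own requirements.txt to the repository, which overrides the defaults. The packages and versions below are purely illustrative, not recommendations:

```
# Committing your own requirements.txt overrides morph.io's default one.
# These packages and versions are illustrative examples only.
scraperwiki
requests==2.2.1
lxml==3.3.3
```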

When there is no Gemfile and Gemfile.lock in a Ruby scraper's repository, default versions of those files are installed that match the ScraperWiki environment of January 2014 as closely as possible. This is done to make migration from ScraperWiki as easy as possible.

This is the Gemfile that emulates the ScraperWiki environment:

source 'https://rubygems.org'

# ScraperWiki used Ruby 1.9.2 and we want the default to be as similar as possible,
# but buildstep doesn't seem to work with Ruby 1.9.2 (it can't find rubygems),
# so we're using Ruby 1.9.3 for the time being.
# TODO: Use ruby 1.9.2

ruby '1.9.3'

# These specific gems and versions are those installed on ScraperWiki "classic" as of 22 Jan 2014
# These gems were installed in /var/lib/gems/1.9.1/gems/ on ScraperWiki
# If there were several versions of the same gem, only the latest is installed here.

gem "Ascii85", "1.0.1"
gem "aasm", "3.0.8"
gem "active_record_inline_schema", "0.5.7"
gem "activemodel", "3.2.6"
gem "activerecord", "3.2.6"
# Updated activeresource from 3.2.1 to 3.2.6 to avoid conflict with activemodel
gem "activeresource", "3.2.6"
gem "activesupport", "3.2.6"
gem "addressable", "2.2.7"
gem "alchemy_api", "0.1.2"
gem "arel", "3.0.2"
gem "awesome_print", "1.0.2"
gem "aws-sdk", "1.6.1"
gem "builder", "3.0.0"
gem "choice", "0.1.4"
gem "chronic", "0.6.7"
gem "data_miner", "2.3.4"
gem "domain_name", "0.5.2"
gem "errata", "1.1.1"
gem "faraday", "0.8.1"
gem "fast-stemmer", "1.0.1"
gem "fastercsv", "1.5.4"
gem "fixed_width-multibyte", "0.2.3"
gem "fletcher", "0.4.1"
gem "gdata", "1.1.2"
gem "google-spreadsheet-ruby", "0.1.8"
gem "google_drive", "0.3.1"
gem "hash_digest", "1.0.0"
gem "hashie", "1.2.0"
gem "highrise", "3.0.3"
gem "hpricot", "0.8.4"
gem "htmlentities", "4.3.1"
gem "httparty", "0.8.3"
gem "httpauth", "0.1"
gem "httpclient", "2.2.4"
gem "i18n", "0.6.0"
# Leaving johnson commented out for the time being as it depends
# on the mozilla js engine which I'm having trouble finding an easy
# way to install
#gem "johnson", "2.0.0.pre3"
gem "json", "1.6.5"
gem "jwt", "0.1.4"
gem "libxml-ruby", "2.8.0"
gem "log4r", "1.1.9"
gem "mechanize", "2.5.1"
gem "mime-types", "1.17.2"
gem "money", "5.0.0"
gem "monster_mash", "0.3.0"
gem "multi_json", "1.1.0"
gem "multi_xml", "0.5.1"
gem "multipart-post", "1.1.5"
gem "net-http-digest_auth", "1.1.1"
gem "net-http-persistent", "2.5.2"
gem "nokogiri", "1.5.0"
gem "ntlm-http", "0.1.1"
gem "oauth", "0.4.5"
gem "oauth2", "0.5.2"
gem "pdf-reader", "1.0.0"
gem "pismo", "0.7.2"
gem "polylines", "0.1.0"
gem "posix-spawn", "0.3.6"
gem "rack", "1.4.1"
gem "remote_table", "2.0.2"
gem "rfgraph", "0.3"
gem "roo", "1.10.0"
gem "ruby-ole", "1.2.11.2"
gem "ruby-rc4", "0.1.5"
gem "rubyzip", "0.9.4"
gem "sanitize", "2.0.3"
gem 'scraperwiki', git: "https://github.com/openaustralia/scraperwiki-ruby.git", branch: "morph_defaults"
gem "shoulda", "3.0.1"
gem "shoulda-context", "1.0.0"
gem "shoulda-matchers", "1.0.0"
gem "simple_oauth", "0.2.0"
gem "sinew", "1.0.3"
gem "spreadsheet", "0.6.5.7"
gem "stackdeck", "0.2.0"
gem "stringex", "1.4.0"
gem "tmail", "1.2.7.1"
gem "to_regexp", "0.1.1"
gem "todonotes", "0.1.0"
gem "trollop", "1.16.2"
gem "twitter", "4.5.0"
gem "typhoeus", "0.3.3"
gem "tzinfo", "0.3.33"
gem "unf", "0.0.4"
gem "unf_ext", "0.0.4"
gem "unix_utils", "0.0.14"
gem "upsert", "0.3.4"
gem "uuidtools", "2.1.3"
gem "webrobots", "0.0.12"

When there is no composer.json and composer.lock in a PHP scraper's repository, default versions of those files are installed that match the ScraperWiki environment of January 2014 as closely as possible. This is done to make migration from ScraperWiki as easy as possible.

This is the composer.json that emulates the ScraperWiki environment:

{
    "repositories": [
        {
            "url": "https://github.com/openaustralia/scraperwiki-php.git",
            "type": "git"
        }
    ],
    "require": {
        "openaustralia/scraperwiki": "dev-morph_defaults",
        "ext-sqlite3": "*",
        "ext-pdo_sqlite": "*",
        "ext-gd": "*",
        "ext-mbstring": "*"
    }
}

When there is no requirements.txt or runtime.txt in a Python scraper's repository, default versions of those files are installed that match the ScraperWiki environment of January 2014 as closely as possible. This is done to make migration from ScraperWiki as easy as possible.

This is the runtime.txt that emulates the ScraperWiki environment:

python-2.7.6

And this is the requirements.txt:

# Install custom version of scraperwiki library
-e git+http://github.com/openaustralia/scraperwiki-python.git@morph_defaults#egg=scraperwiki

# numpy is commented out because it looks like the Python Heroku buildpack
# doesn't support installing it; gensim, matplotlib, pandas and scipy are
# commented out too because they all depend on numpy

#numpy==1.6.1
BeautifulSoup==3.2.0
Creoleparser==0.7.4
Genshi==0.6
Jinja2==2.6
Markdown==2.2.0
Pygments==1.4
SQLAlchemy==0.6.6
Twisted==11.1.0
Unidecode==0.04.9
anyjson==0.3.3
argparse==1.2.1
beautifulsoup4==4.1.3
bitlyapi==0.1.1
blinker==1.2
cartodb==0.6
certifi==0.0.8
chardet==2.1.1
ckanclient==0.10
colormath==1.0.8
csvkit==0.3.0
dataset==0.5.2
demjson==1.6
dropbox==1.4
errorhandler==1.1.1
feedparser==5.0.1
fluidinfo.py==1.1.2
gdata==2.0.15
geopy==0.94.1
gevent==0.13.6
google-api-python-client==1.0beta8
googlemaps==1.0.2
greenlet==0.3.2
html5lib==0.90
httplib2==0.7.4
imposm.parser==1.0.3
jellyfish==0.2.0
mechanize==0.2.5
mock==0.7.2
networkx==1.6
ngram==3.3.0
nose==1.1.2
oauth2==1.5.170
oauth==1.0.1
oauthlib==0.1.2
openpyxl==1.5.7
ordereddict==1.1
pbkdf2==1.3
pdfminer==20110515
pexpect==2.4
pipe2py==0.9.2
pyOpenSSL==0.13
pycrypto==2.5
pycurl==7.19.0
pyephem==3.7.5.1
pyparsing==1.5.6
pyth==0.5.6
python-Levenshtein==0.10.2
python-dateutil==1.5
python-gflags==2.0
python-modargs==1.2
python-stdnum==0.7
pytz==2011k
rdflib==3.1.0
requests-foauth==0.1.1
requests==1.0.4
selenium==2.5.0
simplejson==2.2.1
suds==0.4
tweepy==1.7.1
tweetstream==1.1.1
w3lib==1.0
wsgiref==0.1.2
xlrd==0.7.1
xlutils==1.4.1
xlwt==0.7.2
xmltodict==0.4
# Temporarily commenting out yql because it's gone missing from
# https://pypi.python.org/pypi/yql/0.7.2
#yql==0.7.2
zope.interface==3.8.0
lxml==2.3.3
chromium-compact-language-detector==0.031415
#distribute==0.6.21
#gensim==0.8.3
icalendar==3.0.1b1
#matplotlib==1.0.1
#pandas==0.4.3
pyquery==1.0
#scipy==0.10.0
scrapely==0.9

# For these packages the original version was no longer available,
# so we used the nearest available one.
# Changed this from 0.9.3 to 0.9.8
Fom==0.9.8
# Changed from 3.09 to 3.10
PyYAML==3.10
# Changed from 0.14.0.2841 to 0.14.1
Scrapy==0.14.1
# Changed from 15.2.4 to 15.6.2
adspygoogle.adwords==15.6.2
# Changed this from 2.0b9 to 3.0.2
nltk==3.0.2
# Changed this from 1.0.25 to 1.0.2
pydot==1.0.2
# Changed this from 0.21.1 to 0.22.3
M2Crypto==0.22.3


# TODO: Need to install parsley
#-e git+http://github.com/fizx/pyparsley.git@0eea1362bc38c5e0e3a1caa2358c472c8f6eb3bd#egg=pyparsley-dev

# TODO: Install igraph https://launchpad.net/igraph
#python-igraph==0.5.3

# TODO: Install R
# Changed this from 2.2.4dev-20111102 to 2.2.5
#rpy2==2.2.5

ScraperWiki Classic did not support Perl.

Nonetheless, when there is no cpanfile in the scraper repository, a default version with some useful libraries is still installed. This means that Perl scrapers on morph.io that were written before buildpacks became standard still work unmodified.

This is the default cpanfile:

requires "DBD::SQLite", "1.42";
requires "DBI";
requires "File::Slurp", "9999.19";
requires "HTML::Form", "6.03";
requires "HTML::Parser", "3.71";
requires "HTML::Tree", "5.03";
requires "HTTP::Message", "6.06";
requires "JSON", "2.90";
requires "JSON::XS", "3.01";
requires "LWP::Protocol::https", "6.04";
requires "Test::Exception", "0.32";
requires "Test::Pod", "1.48";
requires "Text::CSV", "1.32";
requires "Text::CSV_XS", "1.09";
requires "URI", "1.60";
requires "LWP", "6.06";
requires "XML::DOM", "1.44";
requires "XML::LibXML", "2.0116";
requires "XML::Parser", "2.41";
requires "XML::SAX::Expat", "0.51";
requires "XML::Simple", "2.20";
requires "XML::XPath", "1.13";
requires "Database::DumpTruck", "1.2";