

Scraping JavaScript-heavy sites

In most circumstances, web scraping is done by downloading a web page using your programming language and an HTTP library. However, in some cases you need to simulate a full web browser, either because the website is poorly programmed or because it makes extensive use of JavaScript.

You can do this by using a headless browser. On morph.io there are two different headless browsers pre-installed and ready for you to use: Google Chrome and PhantomJS. We recommend using Google Chrome, as PhantomJS is now deprecated. However, you will notice that the documentation below for using Google Chrome from the different scraper languages is mostly non-existent. If you would like to contribute to the documentation, that would be amazing!

Using Chrome Headless

The Google Chrome binary is installed at /usr/bin/google-chrome on every morph.io container as part of the build process. You can use Google Chrome directly by running google-chrome --headless --disable-gpu, or you can control it via WebDriver using ChromeDriver, which is installed on morph.io at /usr/local/bin/chromedriver.
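As a quick check that headless Chrome is working, you can ask it to render a page and print the resulting DOM to standard output (the URL below is just an example):

```shell
# Render a page in headless Chrome and print the serialised DOM to stdout.
# --disable-gpu is only needed on some platforms; --no-sandbox is often
# required when running inside a container.
google-chrome --headless --disable-gpu --no-sandbox --dump-dom https://example.com/
```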

Usage

You can control Chrome with Capybara and selenium-webdriver. To install them, add them to your scraper's Gemfile:

gem 'capybara'
gem 'selenium-webdriver'

Then in your scraper, start a Capybara session using the :selenium_chrome_headless driver:

require "capybara"
require "selenium-webdriver"

capybara = Capybara::Session.new(:selenium_chrome_headless)
# Start scraping
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text

Using PhantomJS (deprecated)

You can control PhantomJS from your scraper using JavaScript or a wrapper library. The PhantomJS binary, /usr/bin/phantomjs, is installed on morph.io as part of the phantomjs Ubuntu package, which is installed in every scraper container as part of the build process. On your own machine, you’ll need to download and install the binary yourself before you can run a scraper using PhantomJS.

Usage

The Poltergeist gem provides a PhantomJS driver for Capybara.

To install it, add poltergeist to your scraper Gemfile:

gem "poltergeist"

Then in your scraper, start a Capybara session using Poltergeist:

require "capybara/poltergeist"

capybara = Capybara::Session.new(:poltergeist)
# Start scraping
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text

For Python, the Splinter library provides a higher-level wrapper for the PhantomJS and Selenium frameworks.

To install it, add splinter to your requirements.txt:

splinter>=0.7.3

Then in your scraper:

from splinter import Browser

with Browser("phantomjs") as browser:
    # Optional, but make sure large enough that responsive pages don't
    # hide elements on you...
    browser.driver.set_window_size(1280, 1024)

    # Open the page you want...
    browser.visit("https://morph.io")

    # submit the search form...
    browser.fill("q", "parliament")
    button = browser.find_by_css("button[type='submit']")
    button.click()

    # Scrape the data you like...
    links = browser.find_by_css(".search-results .list-group-item")
    for link in links:
        print(link["href"])

Examples

Sometimes there's nothing like seeing a real-life example. If you have a scraper you would like to add to this list, please let us know.

Ruby: openaustralia/example_ruby_chrome_headless_scraper
Example scraper showing how to use headless Chrome from a Ruby scraper (scrapes morph.io, faye.morph.io, www.gravatar.com, and 6 others).

Ruby: openaustralia/example_ruby_phantomjs_scraper

Ruby: tmtmtmtm/cabo-verde-assembleia-nacionale
Cape Verde National Assembly deputies (scrapes www.parlamento.cv).

Python: wfdd/inatsisartut-scraper
A scraper for the Greenlandic parliament.