

Scraping JavaScript-heavy sites

In most circumstances, web scraping is done by downloading a web page using your programming language and an HTTP library. However, in some cases you need to simulate a full web browser, either because the website is poorly programmed or because it makes extensive use of JavaScript.

You can do this by using a headless browser. On morph.io there are two different headless browsers pre-installed and ready for you to use: Google Chrome and PhantomJS. We recommend using Google Chrome, as PhantomJS is now deprecated. However, you will notice that the documentation below for using Google Chrome from the different scraper languages is mostly non-existent. If you would like to contribute to the documentation, that would be amazing!

Using Chrome Headless

The Google Chrome binary is installed at /usr/bin/google-chrome on every morph.io container as part of the build process. You can use Google Chrome directly by running google-chrome --headless --disable-gpu, or you can control it via WebDriver using ChromeDriver, which is installed on morph.io at /usr/local/bin/chromedriver.
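As a quick check that headless Chrome is working, you can ask it to render a page and print the resulting DOM to standard output (the URL below is just an example):

```shell
# Render a page in headless Chrome and print the serialised DOM to stdout.
# --disable-gpu is only needed on some platforms; --no-sandbox is often
# required when running inside a container.
google-chrome --headless --disable-gpu --no-sandbox --dump-dom https://example.com/
```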

Usage

You can control Chrome with Capybara and selenium-webdriver. To install them, add them to your scraper's Gemfile:

gem 'capybara'
gem 'selenium-webdriver'

Then in your scraper, start a Capybara session using the :selenium_chrome_headless driver:

require "capybara"
require "selenium-webdriver"

capybara = Capybara::Session.new(:selenium_chrome_headless)
# Start scraping
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text

Using PhantomJS (deprecated)

You can control PhantomJS from your scraper using JavaScript or a wrapper library. The PhantomJS binary, /usr/bin/phantomjs, is installed on morph.io as part of the phantomjs Ubuntu package, which is installed in every scraper container as part of the build process. On your own machine, you’ll need to download and install the binary yourself before you can run a scraper using PhantomJS.

Usage

The Poltergeist gem provides a PhantomJS driver for Capybara.

To install it, add poltergeist to your scraper Gemfile:

gem "poltergeist"

Then in your scraper, start a Capybara session using Poltergeist:

require "capybara/poltergeist"

capybara = Capybara::Session.new(:poltergeist)
# Start scraping
capybara.visit("https://morph.io/")
puts capybara.find("#banner h2").text

For Python, the Splinter library provides a higher-level wrapper for the PhantomJS and Selenium frameworks.

To install it, add splinter to your requirements.txt:

splinter>=0.7.3

Then in your scraper:

from splinter import Browser

with Browser("phantomjs") as browser:
    # Optional, but make sure large enough that responsive pages don't
    # hide elements on you...
    browser.driver.set_window_size(1280, 1024)

    # Open the page you want...
    browser.visit("https://morph.io")

    # submit the search form...
    browser.fill("q", "parliament")
    button = browser.find_by_css("button[type='submit']")
    button.click()

    # Scrape the data you like...
    links = browser.find_by_css(".search-results .list-group-item")
    for link in links:
        print(link["href"])

Examples

Sometimes there's nothing like seeing a real-life example. If you have a scraper you would like to add to this list, please let us know.

Ruby: openaustralia/example_ruby_chrome_headless_scraper
Example scraper showing how to use headless Chrome from a Ruby scraper (scrapes morph.io, faye.morph.io, www.gravatar.com, and 6 others).

Ruby: openaustralia/example_ruby_phantomjs_scraper

Ruby: tmtmtmtm/cabo-verde-assembleia-nacionale
Cape Verde National Assembly deputies (scrapes www.parlamento.cv).

Python: wfdd/inatsisartut-scraper
A scraper for the Greenlandic parliament.