CAPTURE

A general whole-website links crawler.


|Author|E-Mail|
|---|---|
|Lenconda|i@lenconda.top|


Introduction

As data science becomes more and more popular, web spiders and crawlers are taking an important place: they are a cornerstone of this young field.

It is common to find crawler projects written in Python, Java, or C++; a Node.js crawler, however, is rare. capture aims to fill that gap.

Features

The crawler currently has the following features:

  • Supports data persistence (with MySQL)
  • Uses Breadth-First Search (BFS) to traverse pages
  • Based on Promise and ECMAScript™ 6 async/await
  • Supports Docker™ (coming soon)
  • Cross-platform (with Node.js)
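To make the BFS traversal concrete, here is a minimal sketch of the idea, assuming nothing about capture's internals: the `Queue` class and the `bfsCrawl` function are illustrative names, not the project's actual code.

```typescript
// A minimal FIFO queue, as used by a breadth-first traversal.
class Queue<T> {
  private items: T[] = [];
  enqueue(item: T): void { this.items.push(item); }
  dequeue(): T | undefined { return this.items.shift(); }
  get length(): number { return this.items.length; }
}

// Breadth-first crawl: visit the seed URL first, then every link found on
// it, then every link on those pages, stopping at maxDepth levels.
async function bfsCrawl(
  seedUrl: string,
  maxDepth: number,
  fetchLinks: (url: string) => Promise<string[]>
): Promise<Set<string>> {
  const visited = new Set<string>([seedUrl]);
  const queue = new Queue<{ url: string; depth: number }>();
  queue.enqueue({ url: seedUrl, depth: 0 });

  while (queue.length > 0) {
    const { url, depth } = queue.dequeue()!;
    if (depth >= maxDepth) continue; // do not expand beyond max_depth
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.enqueue({ url: link, depth: depth + 1 });
      }
    }
  }
  return visited;
}
```

BFS guarantees that pages closer to the seed URL are visited first, which is why a `max_depth` cutoff (see the configuration section below) maps naturally onto it.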

Technologies

Language and platform technologies are as below:

  • JavaScript™
  • Node.js™
  • TypeScript™
  • MySQL™
  • Docker™
  • Mocha & Chai
  • Axios
  • Cheerio

The main reason for choosing TypeScript is that it makes the project easier to maintain and helps keep problems to a minimum. With its strong typing, problems surface sooner than in plain JavaScript.

The default module for making Ajax requests is axios, which the project uses to issue asynchronous requests. The default HTML parser is cheerio, a lightweight library with a jQuery-like API. The core component of the project is built on these two modules.

Unlike many other Node.js projects, and thanks to TypeScript, ES 6 classes and the import/export pattern are available in this project, instead of Node.js' CommonJS require for importing modules.

Structure

The project contains many components; the tree below shows all the modules and their roles within the project.

```
$ tree .
.
├── index.ts              # Entry point
├── interface             # TypeScript interfaces
│   └── dbmodels.ts
├── LICENSE               # License file
├── src                   # Core component
│   └── index.ts          # Main entry
├── capture.sample.ini    # Configuration example file
├── package.json          # NPM package file
├── test                  # Test scripts
│   ├── dbmodels.test.ts
│   ├── file.test.ts
│   ├── queue.test.ts
│   ├── urls.test.ts
│   └── config.test.ts
├── tsconfig.json         # TypeScript configuration file
├── README.md             # Documents and guides
├── database.sql          # Database script
├── .gitignore            # Ignore local files and folders
├── typings               # TypeScript typings files
│   ├── configparser.ts
│   └── modern-random-ua.d.ts
└── utils
    ├── index.ts          # Utils main entry
    ├── conf.ts           # Configuration parser
    ├── dbmodels.ts       # Database operation models
    ├── file.ts           # File operation module
    ├── queue.ts          # Definition of the Queue data structure
    ├── loggers.ts        # Logger initializer
    ├── time.ts           # Time utilities
    └── urls.ts           # URL utilities
```

Install

Download Source Code

Download the source code from releases, or clone the project directly:

Via SSH:

```bash
$ git clone git@github.com:lenconda/capture.git
```

or via HTTPS:

```bash
$ git clone https://github.com/lenconda/capture.git
```

Install Dependencies

After obtaining the code, we should install the npm dependencies. Before starting the installation, check the system environment. The environment requirements are as below:

|Item|Requirement|
|----|-----------|
|Node.js|>= 8.0.0|
|npm|>= 6.0.0|

If the environment fulfills these requirements, install the dependencies:

```bash
$ npm install
```

or

```bash
$ npm install -g yarn
$ yarn install
```

Configure

capture.sample.ini

A capture.sample.ini file is provided with the code. After installing the dependencies, copy this sample configuration file to capture.ini. The file contains many configuration options, so some explanation is in order.

The following code is the whole content of capture.sample.ini:

```ini
[Application]
seed_url = http://www.example.com
max_depth = 16
timeout = 1000
log_dir = /path/to/log

[Database]
host = 127.0.0.1
user = root
password = 123
database = capture
port = 3306
```

The definitions of each block are as below:

|Name|Definition|Default|
|----|----------|-------|
|Application.log_dir|The directory for log output. If this is not null, all logs will be written to that directory. Note that the directory must already EXIST, or the project will throw an Error() on startup.|'./logs'|
|Application.timeout|System timeout, used in some large iterations, e.g. time.delay().|1000|
|Application.seed_url|The website to be crawled. The URL must contain a protocol and host. For example, https://www.example.com is valid; www.example.com is not.|'http://www.example.com'|
|Application.max_depth|The maximum depth for capture to crawl.|16|
|Database.host|Database host IP address.|'127.0.0.1'|
|Database.user|Username of the database.|'root'|
|Database.password|Password of the user.|'123'|
|Database.database|Database name.|'capture'|
|Database.port|Database port.|3306|

Edit the newly copied capture.ini and save it.
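For illustration, the `[Section] key = value` layout above maps to a nested object roughly like this. This is a hand-rolled sketch, not capture's actual conf.ts (which relies on a configparser typing):

```typescript
// Parse a minimal INI document into { section: { key: value } }.
function parseIni(text: string): Record<string, Record<string, string>> {
  const result: Record<string, Record<string, string>> = {};
  let section = '';
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    // skip blanks and comments
    if (!line || line.startsWith(';') || line.startsWith('#')) continue;
    const sectionMatch = line.match(/^\[(.+)\]$/);
    if (sectionMatch) {
      section = sectionMatch[1];
      result[section] = {};
      continue;
    }
    const eq = line.indexOf('=');
    if (eq > 0 && section) {
      result[section][line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
    }
  }
  return result;
}
```

Note that every value comes back as a string, so numeric options such as `max_depth` or `port` still need an explicit `parseInt` before use.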

Database

Since capture uses MySQL to store the data produced while it runs, a highly available and connectable MySQL server should be deployed for it. After configuring the MySQL service, log in to the server and run:

```sql
CREATE DATABASE capture;
USE capture;
SOURCE /path/to/capture/database.sql;
```

or

```bash
$ /path/to/mysql --database=capture --user=$USERNAME --host=$HOSTNAME --port=$PORT < /path/to/capture/database.sql
```

Alternatively, a better way to deploy a MySQL server is to use Docker:

```bash
# MySQL 5.7 is recommended
$ docker pull mysql:5.7
$ docker run --name capture -p 3306:3306 -v /path/to/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=$PASSWORD -d mysql:5.7
```

Then configure the parameters in capture.ini.

NOTICE: The SQL script sets up the database with the utf8mb4 character set; please ensure your MySQL version is higher than 5.5.x!

Docker

As mentioned above, capture will support Docker, but this is still experimental for now. That does not mean you cannot deploy capture with Docker; if you want to, please consider the following advice:

  • Use Docker and Docker Compose
  • Pull MySQL image from official source
  • Make MySQL data folder out of containers, just as a volume attach to containers
  • Use pm2 to manage capture
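The advice above might translate into a docker-compose.yml along these lines; this is a sketch only, and the service name, data path, and password are placeholders, not shipped by the project:

```yaml
version: '3'
services:
  db:
    image: mysql:5.7          # pulled from the official source
    environment:
      MYSQL_ROOT_PASSWORD: change_me   # placeholder password
      MYSQL_DATABASE: capture
    volumes:
      # keep MySQL data outside the container, as advised above
      - /path/to/data:/var/lib/mysql
    ports:
      - "3306:3306"
```

The crawler process itself would then run on the host (or in a second service) under pm2, pointing its capture.ini at the `db` service.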

Test

The project uses mocha and chai as the default test suite and assertion tools. To run the tests:

```bash
$ npm run test
```

Run

cd into the project root directory:

```bash
$ cd /path/to/capture
```

then start the Node.js process:

```bash
$ npm start
```

or

```bash
$ yarn start
```

The project will then start; logs will be printed to stdout and also written to the log directory.

Issues

Although this project works well in most cases, problems may still occur while it runs. For any problem, please open an issue at

https://github.com/lenconda/capture/issues

EXCLUDING the following known cases:

  • Memory usage grows too large after a period of time
  • Database connection issues
  • The crawler gets stuck after a period of time

Contribute

Thanks for your interest in this project. You are welcome to contribute to it. Before starting your contribution, please read the following advice:

  • Read the README first
  • Understand what changes you want to make
  • Look through the issue list and check if there's an issue to solve the same problem
  • Publishing and/or redistributing this project must be done under the MIT license

Issues

As said above, before starting your work you should check the issue list first. The issue list may contain known bugs, problems, feature requests, and future development plans. If you find one or more existing issues addressing the same problem, it would be great if you joined them to solve it.

Fork & Pull Requests

If you decide to contribute code, fork this project as your own repository and check out a new branch from the newest code on master. The new branch will be your workbench.

When you want to submit your changes, make a pull request. Once you submit the request, the review process will start; if the code meets the requirements, the pull request will be accepted and your code merged into the project. If the request is not accepted, please contact i@lenconda.top or prexustech@gmail.com.

LICENSE

```
MIT License

Copyright (c) 2017 Vladislav Stroev

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```

Contributors

lenconda
