A general whole-website link crawler.
| Author | E-Mail |
|---|---|
| Lenconda | firstname.lastname@example.org |
As data science becomes more and more popular, web spiders and crawlers are taking an important place: they are a cornerstone of this young field.
It is common to find crawler projects written in Python, Java, or C++; a Node.js crawler, however, is a rare find.
capture aims to fill that gap.
Since the project was published, the crawler has offered the following features:
Promise and ECMAScript™ 6
Language and platform technologies are as below:
The default module for making Ajax requests is
axios, and the project makes
async requests with it. The default HTML parser is
cheerio, a lightweight jQuery-style manipulation library. The core component of the project is built on these two modules.
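As a rough sketch of how such a fetch-and-parse core might look (a canned page and a simple regex stand in for axios and cheerio here, so the sketch stays dependency-free; `fetchPage` and `extractLinks` are illustrative names, not the project's actual API):

```typescript
// Simplified sketch of a fetch-and-parse step. The real project would use
// axios for the HTTP request and cheerio for parsing; here a stub page and
// a regex stand in so the sketch runs without external dependencies.
interface Page {
  url: string;
  html: string;
}

// Hypothetical stub standing in for an axios GET request.
function fetchPage(url: string): Page {
  return { url: url, html: '<a href="/about">About</a> <a href="https://example.org/">Elsewhere</a>' };
}

// Collect href values; with cheerio this would be $('a') plus .attr('href').
function extractLinks(html: string): string[] {
  const links: string[] = [];
  const re = /href="([^"]+)"/g;
  let match = re.exec(html);
  while (match !== null) {
    links.push(match[1]);
    match = re.exec(html);
  }
  return links;
}

const page = fetchPage('http://www.example.com');
console.log(extractLinks(page.html)); // the URLs found on the page
```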
Unlike many other Node.js projects, with the help of TypeScript the ES 6
import/export pattern is available in this project, instead of Node.js' CommonJS
require for importing other modules.
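For instance, a utility module written in this style might look like the following (the `normalizeUrl` helper is hypothetical, for illustration only):

```typescript
// ES 6 named export, compiled by TypeScript:
export function normalizeUrl(url: string): string {
  // Drop trailing slashes so equivalent URLs compare equal.
  return url.replace(/\/+$/, '');
}

// Elsewhere it would be imported as:
//   import { normalizeUrl } from './urls';
// rather than the CommonJS form:
//   const { normalizeUrl } = require('./urls');
```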
The project contains many components, and a
tree view helps us easily understand all the modules and what each one does as a component of the project.
$ tree .
├── index.ts # Entry point
├── interface # TypeScript interfaces
│ └── dbmodels.ts
├── LICENSE # License file
├── src # Core component
│ └── index.ts # Main entry
├── capture.sample.ini # Configuration example file
├── package.json # NPM package file
├── test # Test scripts
│ ├── dbmodels.test.ts
│ ├── file.test.ts
│ ├── queue.test.ts
│ ├── urls.test.ts
│ └── config.test.ts
├── tsconfig.json # TypeScript configuration file
├── README.md # Documents and guides
├── database.sql # Database script
├── .gitignore # Ignore local files and folders
├── typings # TypeScript typings files
│ ├── configparser.ts
│ └── modern-random-ua.d.ts
├── index.ts # Utils main entry
├── conf.ts # Configuration parser
├── dbmodels.ts # Database operation models
├── file.ts # File operation module
├── queue.ts # Definition of the Queue data structure
├── loggers.ts # Logger initializer
├── time.ts # Time utilities
└── urls.ts # URL utilities
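To illustrate the kind of structure `queue.ts` might define, here is a minimal FIFO queue sketch that pairs each URL with its crawl depth (an illustrative sketch only, not the project's actual implementation):

```typescript
// Minimal FIFO queue sketch. Each item carries the URL and the depth at
// which it was discovered, so the crawler can stop at max_depth.
interface QueueItem {
  url: string;
  depth: number;
}

class Queue {
  private items: QueueItem[] = [];

  enqueue(item: QueueItem): void {
    this.items.push(item); // add to the tail
  }

  dequeue(): QueueItem | undefined {
    return this.items.shift(); // remove from the head: first in, first out
  }

  size(): number {
    return this.items.length;
  }
}

// Usage: seed the queue, then expand items until max_depth is reached.
const queue = new Queue();
queue.enqueue({ url: 'http://www.example.com', depth: 0 });
const item = queue.dequeue();
console.log(item); // the seed item comes back out first
```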
Download the source code from releases, or clone the project directly:
$ git clone git@github.com:lenconda/capture.git
or via HTTPS
$ git clone https://github.com/lenconda/capture.git
After preparing the code, we should install the
npm dependencies. Before starting the installation, we should check the system environment. The environment requirements are as below:
| Item | Requirement |
|----|-----------|
| Node.js | >= 8.0.0 |
| npm | >= 6.0.0 |
If the environment requirements are fulfilled, install the dependencies:
$ npm install
$ npm install yarn -g
$ yarn install
A
capture.sample.ini file is provided with the code. After installing the dependencies, we should copy this sample configuration file to
capture.ini. However, there are many configuration options in that file, so some explanation is in order.
The following is the whole content of
capture.sample.ini:
```ini
[Application]
seed_url = http://www.example.com
max_depth = 16
timeout = 1000
log_dir = /path/to/log

[Database]
host = 127.0.0.1
user = root
password = 123
database = capture
port = 3306
```
The definitions of each option are as below:

| Option | Definition |
|---|---|
| Application.seed_url | The website to be crawled. The URL must contain a host: `https://www.example.com` is valid, `www.example.com` is not. |
| Application.max_depth | The max depth for `capture` to crawl. |
| Application.timeout | System timeout, used in some large iterations. |
| Application.log_dir | The directory to output log files. If this is not null, all logs will be written to that directory. Pay attention: the directory must really EXIST, or the project will throw an `Error()` after starting. |
| Database.host | Database host IP address. |
| Database.user | Username for the database. |
| Database.password | Password of that user. |
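How such options could be read at startup can be sketched as follows (a hand-rolled INI parser for illustration only; the project's own `conf.ts` relies on a real configparser-style module):

```typescript
// Tiny INI reader, for illustration only. Returns { Section: { key: value } }.
function parseIni(text: string): { [section: string]: { [key: string]: string } } {
  const result: { [section: string]: { [key: string]: string } } = {};
  let current = '';
  const lines = text.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === '' || line.charAt(0) === ';' || line.charAt(0) === '#') {
      continue; // skip blank lines and comments
    }
    const section = line.match(/^\[(.+)\]$/);
    if (section !== null) {
      current = section[1];          // entering a new [Section]
      result[current] = {};
      continue;
    }
    const pair = line.match(/^([^=]+)=(.*)$/);
    if (pair !== null && current !== '') {
      result[current][pair[1].trim()] = pair[2].trim(); // key = value
    }
  }
  return result;
}

const config = parseIni('[Application]\nseed_url = http://www.example.com\nmax_depth = 16');
console.log(config.Application.seed_url); // http://www.example.com
```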
Edit the newly copied
capture.ini and save it.
capture uses MySQL to store the data produced during crawling, so a highly available and reachable MySQL server should be deployed for it. After configuring the MySQL service, log in to the server and run:
```
USE capture;
SOURCE '/path/to/capture/database.sql';
```
$ /path/to/mysql --database=capture --user=$USERNAME --host=$HOSTNAME --port=$PORT < /path/to/capture/database.sql
Alternatively, a better way to deploy a MySQL server is to use Docker:
```
$ docker pull mysql:5.7
$ docker run --name capture -p 3306:3306 -v /path/to/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=$PASSWORD -d mysql:5.7
```
Then configure the parameters in
capture.ini.
NOTICE: The SQL script creates the database with the
utf8mb4 character set, so please ensure that your MySQL version is higher than 5.5.3.
As mentioned above,
capture will support Docker, but for now that support is still experimental. However, this does not mean you cannot deploy
capture with Docker. If you want to deploy with Docker, please consider the following advice:
The project uses
chai as the default test suite and assertion tool. To run the tests:
$ npm run test
cd into the project root directory:
$ cd /path/to/capture
$ npm start
$ yarn start
The project will then start; logs will be printed to
stdout, and log files will also be generated.
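This dual output can be sketched roughly as follows (a hypothetical logger for illustration; the project's actual logger setup lives in `loggers.ts`):

```typescript
import * as fs from 'fs';

// Hypothetical logger sketch: every message goes to stdout and is also
// appended to a log file in the configured log directory.
function log(logDir: string, message: string): void {
  const line = new Date().toISOString() + ' ' + message;
  console.log(line);                                        // stdout
  fs.appendFileSync(logDir + '/capture.log', line + '\n');  // log file
}

log('.', 'crawler started');
```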
Although this project is robust, many problems may still arise during crawling. For any problem, please open an issue in the repository,
EXCLUDING the following cases:
Thanks for your interest in this project. You are welcome to contribute to it. However, before starting your contribution work, please read the following advice:
As said above, before starting your work you should check the issue list first. The issue list of this project may contain known bugs, problems, feature requests and future development plans. If you find one or more existing issues addressing the same problem, it would be great if you could join them in solving it.
If you decide to write code for this project, you can fork it as your own repository and check out a new branch from the latest code on
master. The new branch will be your workbench.
When you want to commit your changes, you should open a pull request. Once you submit the request, the review process will start; if the code meets the requirements, the pull request will be accepted and your code will become part of the project. If the request is not accepted, please contact firstname.lastname@example.org or email@example.com.
```
MIT License
Copyright (c) 2017 Vladislav Stroev
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
```