Build a CLI to crawl a web page with web-crawljs
In this tutorial we are going to create a web crawler that scrapes information from Wikipedia pages. The crawler will run
from a command line interface (e.g. terminal, command prompt).
The code for this article is on github.
An example of a command that would crawl a page looks like this:
$ node crawl.js -d 3 -x wiki
The command reads a config file named `wiki` and saves the crawled data to a MongoDB collection called `wiki`.
Web Crawling
Web crawlers are programs written to get information from web pages.
“A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing” — Wikipedia
What we will be needing
For this project we will need commander, web-crawljs and mongoose.
Commander
Commander is an npm module that makes working with the command line interface easier. It makes it easy to handle command
line arguments. Check out its documentation.
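As a quick illustration, here is roughly how commander turns flags into properties on the program object. The flag names mirror the ones our crawler will use later; treat this as a sketch, not the final crawl.js:

```javascript
// cli.js — a minimal commander (2.x) sketch
const program = require('commander');

program
  .version('1.0.0')
  .option('-x, --export [export-file]', 'name of the config file to use')
  .option('-d, --depth [depth]', 'how deep the crawler should go', parseInt)
  .option('-u, --url [url]', 'comma separated url(s) to start crawling from')
  .parse(process.argv);

// commander exposes each option as a property on `program`
console.log(program.export, program.depth, program.url);
```

Running `node cli.js -d 3 -x wiki` should print `wiki 3 undefined`, since `-u` was not passed.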
web-crawljs
web-crawljs is an npm module that crawls web pages and extracts information from the page. It makes crawling web pages with nodejs easy.
The only thing web-crawljs needs to start crawling is a configuration object.
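Based on the web-crawljs README, the shape of that configuration object looks roughly like this — the field names follow the module's documentation at the time of writing, so double-check them against the README:

```javascript
const Crawler = require('web-crawljs');

const config = {
  fetchSelector: { title: 'title' },         // CSS selectors for the data to extract
  fetchSelectBy: { title: 'text' },          // how to read each selected element
  nextSelector: { links: 'a' },              // selector for the links to crawl next
  nextSelectBy: { links: ['attr', 'href'] }, // read the href attribute of each link
  fetchFn: (err, data, url) => {             // called with the extracted data
    if (err) return console.error(err.message);
    console.log(url, data.title);
  },
  depth: 1,                                  // how many levels deep to crawl
  urls: ['https://en.wikipedia.org/']        // where to start
};

Crawler.CrawlAllUrl(config);
```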
Why web-crawljs
One of the reasons I chose web-crawljs is how easy it is to crawl web pages with it. It is also a lightweight web crawler; that is, it uses far less CPU and RAM than a headless browser (e.g. PhantomJS).
Because it does not use a full browser engine, it cannot render SPA (single page application) pages. And also because I built it :).
All that is required to run it is nodejs; there is no need to install PhantomJS on your machine. As long as you have node installed, you are good to go.
mongoose
Mongoose is a MongoDB object modeling tool designed to work in an asynchronous environment. It is an Object Data Modeling (ODM) library that provides a modeling environment for MongoDB and enforces a more structured data model.
Mongoose gives us the ability to create MongoDB data models and schemas.
We are going to use mongoose to save the information extracted from a page to a MongoDB database.
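For example, a schema and model for the kind of data we plan to save could look like this. The schema fields mirror the information we will extract later; the connection url and collection name are assumptions for illustration:

```javascript
const mongoose = require('mongoose');

// connect to a local mongodb instance (adjust the url for your setup)
mongoose.connect('mongodb://localhost:27017/crawler');

// a schema describes the shape of the documents
const wikiSchema = new mongoose.Schema({
  title: String,
  body: String,
  references: [String]
});

// a model is a constructor compiled from the schema
const Wiki = mongoose.model('wiki', wikiSchema);

// saving a document is then a single call
new Wiki({ title: 'Web crawler', body: '...', references: [] })
  .save()
  .then(() => console.log('saved'))
  .catch(err => console.error(err.message));
```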
Project Structure
The structure of this project would look like this.
--crawler
------config
----------wiki.js
----------db.js
------crawl.js
------package.json
crawler/config
The main file in the crawler/config folder is db.js. This file contains the configuration for our database. The wiki.js file holds the configuration for web-crawljs.
Apart from db.js, all the other files in this folder are configurations for web-crawljs.
crawl.js
This is the main file we will run to start crawling a web page; it is the entry point of the crawl.
What we will crawl
In this article we are going to extract some information from Wikipedia and save it to a MongoDB database. The information we want to extract from the page is:
- the title of the wiki content
- the content of the wiki page
- all the reference links
Requirements
For this tutorial, nodejs and mongodb must be installed on your machine. I'll be making use of node 7.8.0 and mongodb version 2.6.10. I am also making use of ES6 syntax (arrow functions, destructuring, etc.).
- node >=v7.0.0
- mongodb
Let’s get started
Now let’s go straight to business. We will start by creating a new folder called crawler.
$ mkdir crawler
$ cd crawler #move into the folder
Now that it is done, let’s create the config directory inside the crawler directory
$ mkdir config
#create the config files
$ touch config/wiki.js config/db.js
#create the crawl.js file
$ touch crawl.js
Time to create the package.json file. Use the npm init -y
command to create it (using it because it’s easy).
$ npm init -y
Installing the dependencies
We are making use of only three dependencies in this project: the mongoose, commander and web-crawljs modules. To install them we will use our good friend npm. Run npm install --save web-crawljs mongoose commander
to install the dependencies.
$ npm install --save web-crawljs mongoose commander
Now that they are installed, let’s move on to the next step.
config/db.js
This file holds the configuration details of our mongodb database
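The full file is in the github repo; a minimal sketch of what config/db.js can contain is below. The connection url and database name are assumptions — adjust them for your setup:

```javascript
// config/db.js — database configuration
const mongoose = require('mongoose');

// connect to the local mongodb instance; the database name is an assumption
mongoose.connect('mongodb://localhost:27017/crawler');

mongoose.connection.on('error', err => {
  console.error('mongodb connection error:', err.message);
});

// export the connected mongoose instance for the other config files
module.exports = mongoose;
```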
config/wiki.js
The config/wiki.js file holds the configuration we will use to crawl our wikipedia page.
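The full file is in the github repo; a sketch of what config/wiki.js can contain is below. The CSS selectors (h1#firstHeading for the title, div#mw-content-text for the body) come from Wikipedia's page markup, and the configuration field names follow the web-crawljs README — treat both as assumptions to verify against the repo:

```javascript
// config/wiki.js — web-crawljs configuration for crawling wikipedia
const mongoose = require('./db');

// model for the data we want to save (title, body, reference links)
const Wiki = mongoose.model('wiki', new mongoose.Schema({
  title: String,
  body: String,
  references: [String]
}));

module.exports = {
  // what to extract from each page, and how to read it
  fetchSelector: {
    title: 'h1#firstHeading',
    body: 'div#mw-content-text',
    references: 'ol.references a'
  },
  fetchSelectBy: {
    title: 'text',
    body: 'text',
    references: ['attr', 'href']
  },
  // which links to follow next
  nextSelector: { links: 'div#mw-content-text a' },
  nextSelectBy: { links: ['attr', 'href'] },
  // save each crawled page to mongodb
  fetchFn: (err, data, url) => {
    if (err) return console.error(err.message);
    new Wiki({
      title: data.title,
      body: data.body,
      references: data.references
    }).save().catch(e => console.error(e.message));
  },
  depth: 1,                           // default depth; the -d flag overrides it
  urls: ['https://en.wikipedia.org/'] // default start url; the -u flag overrides it
};
```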
crawl.js
The crawl.js file is the main file of this project; it is what we will run with the node command. It’s our entry point.
It depends on two packages, web-crawljs and commander, which are imported at the top of the file.
Next, we set up the flags needed by our CLI. Thanks to commander this is very easy to achieve. Check its documentation for more.
We then apply the values gotten from the CLI to the crawl configuration.
The comments in the file should explain what’s going on.
The lines that follow just perform the web crawl operation.
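Since the file itself lives in the github repo, here is a hedged reconstruction of crawl.js matching the flow described above. The flag names follow the commands used in this article; the rest is an assumption to verify against the repo:

```javascript
#!/usr/bin/env node
'use strict';

// crawl.js — entry point of the crawler
const Crawler = require('web-crawljs');
const program = require('commander');

// set up the flags needed by our CLI
program
  .version('1.0.0')
  .option('-x, --export [export-file]', 'config file (in ./config) to use')
  .option('-d, --depth [depth]', 'how deep the crawler should go', parseInt)
  .option('-u, --url [url]', 'comma separated url(s) to start crawling from')
  .parse(process.argv);

// load the chosen config, e.g. ./config/wiki.js for `-x wiki`
const config = require(`./config/${program.export}`);

// apply the values gotten from the CLI to the configuration
if (program.depth) config.depth = program.depth;
if (program.url) config.urls = program.url.split(',');

// perform the web crawl operation
Crawler.CrawlAllUrl(config);
```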
Let’s test our crawler
Now that all the code has been written, it’s time to test the crawler.
Type the following in your terminal
$ node crawl.js -x wiki
When we check our mongodb collection we will see the title, body and references added to it.
Instead of using the default wikipedia url, we can pass our own wiki page url.
$ node crawl -u https://en.wikipedia.org/wiki/Web_crawler -x wiki
This will not start crawling from the default https://en.wikipedia.org/, but from https://en.wikipedia.org/wiki/Web_crawler. To add more urls, separate the urls with commas.
Conclusion
We now know how to create a web crawler using web-crawljs, commander and mongoose :).
And to those who didn’t know how easy it is to create a Command Line Interface with Nodejs: now you know.
This is at least one more thing you know.
Thanks for reading and please recommend this post.