Build a CLI to crawl a web page with web-crawljs

Kayode Badewa
Published in Netscape · Sep 20, 2017

In this tutorial we are going to create a web crawler that scrapes information from Wikipedia pages. The crawler will run
from a command-line interface (e.g. a terminal or command prompt).

The code for this article is on GitHub.

An example of a command that crawls a page looks like this:

$ node crawl.js -d 3 -x wiki

The command loads a config file named `wiki` and saves the crawled data to a MongoDB collection called `wiki`.

Web Crawling

Web crawlers are programs written to get information from web pages.

“A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing” — Wikipedia

What we will be needing

For this project we will need commander, web-crawljs, and mongoose.

Commander

Commander is an npm module that makes working with command-line interfaces easier. It makes it easy to handle command-line arguments. Check out its documentation.

web-crawljs

web-crawljs is an npm module that crawls web pages and extracts information from them. It makes crawling web pages with Node.js easy.

The only thing web-crawljs needs to start crawling is a configuration object.

Why web-crawljs

One of the reasons I chose web-crawljs is how easy it is to crawl web pages with it. It is also a lightweight web crawler; that is, it uses far less CPU and RAM than a headless browser (e.g. PhantomJS).

Because of this lower CPU and RAM usage, it cannot render SPA (single-page application) pages. And also because I built it :).

All that is required to run it is Node.js; there is no need to install PhantomJS on your machine. As long as you have Node installed, you are good to go.

mongoose

Mongoose is a MongoDB object modeling tool designed to work in an asynchronous environment. It is an Object Data Modeling (ODM) library that provides a modeling layer for MongoDB and enforces a more structured data model.

Mongoose gives us the ability to create MongoDB data models and schemas.

We are going to use Mongoose to save the information extracted from each page to a MongoDB database.

Project Structure

The structure of this project looks like this:

--crawler
------config
----------wiki.js
----------db.js
------crawl.js
------package.json

crawler/config

The main file in the crawler/config folder is db.js, which contains the configuration for our database. wiki.js is the JavaScript file that holds the configuration for web-crawljs.

Apart from db.js, every other file in this folder is a configuration for web-crawljs.

crawl.js

This is the main file we run to start crawling a web page. It is the entry point of the crawl.

What we will crawl

In this article we are going to extract some information from Wikipedia and save it to a MongoDB database. The information we want to extract from each page is:

  • the title of the wiki page
  • the content of the wiki page
  • all the reference links

Requirements

For this tutorial, Node.js and MongoDB must be installed on your machine. I'll be using Node 7.8.0 and MongoDB version 2.6.10. I am also making use of ES6 syntax (arrow functions, destructuring, etc.).

  • node >=v7.0.0
  • mongodb

Let’s get started

Now let's get straight to business. We will start by creating a new folder called crawler:

$ mkdir crawler
$ cd crawler #move into the folder

With that done, let's create the config directory inside the crawler directory:

$ mkdir config
#create the config files
$ touch config/wiki.js config/db.js
#create the crawl.js file
$ touch crawl.js

Time to create the package.json file. Use the npm init -y command to create it (we use -y because it's quick).

$ npm init -y

Installing the dependencies

We are using only three dependencies in this project: the mongoose, commander, and web-crawljs modules. To install them we will use our good friend npm. Run the command below to install the dependencies.

$ npm install --save web-crawljs mongoose commander

Now that they are installed, let's move on to the next step.

config/db.js

This file holds the configuration details of our MongoDB database.
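The embedded gist for this file is not reproduced here, so below is a minimal sketch of what db.js could contain: it connects to a local MongoDB instance with Mongoose and exports the connection so other files can define models on it. The host, port, and database name are assumptions; adjust them to your own setup.

/**
 * config/db.js
 * Minimal sketch: connect to MongoDB with mongoose and export the
 * connected instance. The connection values below are assumptions.
 */
const mongoose = require('mongoose');

const dbConfig = {
  host: 'localhost', // assumed local MongoDB instance
  port: 27017,       // default MongoDB port
  name: 'crawler'    // assumed database name
};

// Connect to MongoDB using mongoose
mongoose.connect(`mongodb://${dbConfig.host}:${dbConfig.port}/${dbConfig.name}`);

module.exports = mongoose;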

config/wiki.js

The config/wiki.js file holds the configuration we will use to crawl our Wikipedia page.
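As with db.js, the embedded gist is not reproduced here, so the sketch below only approximates a web-crawljs configuration for the fields this article extracts (title, body, reference links). The key names (fetchSelector, fetchSelectBy, nextSelector, nextSelectBy, fetchFn) and their value formats are assumptions; check the web-crawljs documentation for the exact schema.

/**
 * config/wiki.js
 * Rough sketch of a crawl configuration for Wikipedia pages.
 * Key names and value formats are assumptions; see the web-crawljs docs.
 */
const mongoose = require('./db'); // assumes db.js exports the connected mongoose instance

// Model for the data we want to keep: title, body and reference links
const Wiki = mongoose.model('wiki', new mongoose.Schema({
  title: String,
  body: String,
  references: [String]
}));

module.exports = {
  // CSS selectors for the parts of the page we want
  fetchSelector: {
    title: 'h1#firstHeading',
    body: 'div#mw-content-text',
    references: 'div#mw-content-text a.external'
  },
  // what to read from each selected element (text or an attribute) -- format assumed
  fetchSelectBy: {
    title: 'text',
    body: 'text',
    references: [['attr', 'href']]
  },
  // links to follow to find the next pages to crawl -- format assumed
  nextSelector: { links: 'a[href^="/wiki/"]' },
  nextSelectBy: { links: [['attr', 'href']] },
  // called with the data extracted from each crawled page; we persist it here
  fetchFn: (err, data, url) => {
    if (err) return console.error(err);
    new Wiki({ title: data.title, body: data.body, references: data.references })
      .save()
      .catch(e => console.error(e));
  }
};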

crawl.js

The crawl.js file is the main file of this project. This file is what we will run using the node command. It’s our entry point.

It depends on two packages, web-crawljs and commander, which are imported on lines 5 and 6.

From line 9 to line 18 we set up the flags our CLI needs.

Thanks to commander this is very easy to achieve. Check its documentation for more.

Lines 21 through 37 configure the values received from the CLI.

The comments in the file should explain what's going on.

The lines that follow simply perform the web crawl operation.
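Since the embedded gist for crawl.js is not reproduced here either, the sketch below shows one way the pieces could fit together. The commander calls are standard, and the flag names match the commands shown in this article (-x, -d, -u); the final web-crawljs invocation is an assumption, so follow the web-crawljs documentation for the exact call.

/**
 * crawl.js
 * Sketch of the CLI entry point: parse the flags, load the named config
 * from ./config, override it with CLI values, then start the crawl.
 * The web-crawljs invocation at the end is an assumption.
 */
const program = require('commander');
const webCrawler = require('web-crawljs');

program
  .version('1.0.0')
  .option('-x, --config <name>', 'name of the config file in ./config to use')
  .option('-d, --depth <n>', 'how many levels deep to crawl', parseInt)
  .option('-u, --urls <urls>', 'comma separated urls to start crawling from',
    urls => urls.split(','))
  .parse(process.argv);

// Load the configuration named by -x (e.g. ./config/wiki.js)
const config = require(`./config/${program.config}`);

// Override the config with any values passed on the command line
if (program.depth) config.depth = program.depth;
if (program.urls) config.urls = program.urls;

// Start the crawl (assumed invocation; see the web-crawljs docs)
const crawler = webCrawler(config);
crawler.CrawlAllUrl();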

Let’s test our crawler

Now that all the code has been written, it’s time to test the crawler.

Type the following in your terminal

$ node crawl.js -x wiki

When we check our MongoDB collection, we will see the title, body, and references added to it.

Instead of using the default Wikipedia URL, we can pass our own wiki page URL:

$ node crawl -u https://en.wikipedia.org/wiki/Web_crawler -x wiki

This will not start crawling from the default https://en.wikipedia.org/, but will start from https://en.wikipedia.org/wiki/Web_crawler. To add more URLs, separate them with commas.

Conclusion

We now know how to create a web crawler using web-crawljs, commander and mongoose :).

And to those who didn't know how easy it is to create a command-line interface with Node.js: now you know.

This is at least one more thing you know.

Thanks for reading and please recommend this post.
