How I scraped 8000+ hotels from hotels.ng

Kayode Badewa
Nov 8, 2017

This post describes how I scraped 8k+ hotels from hotels.ng using web-crawljs.

web-crawljs is an npm module that makes it easy to crawl web pages and scrape information from them. It works on the server-rendered parts of a page.

It can’t crawl the JavaScript-generated parts of a page, but since most websites are server-rendered to some degree, it’s a lighter alternative to RAM-hungry web drivers (e.g. PhantomJS).

Check out the documentation for web-crawljs on npm.

How it Crawls

web-crawljs crawls web pages in a breadth-first manner. If we represent web pages as nodes and links as edges, the crawl starts from the first node and visits all the nodes (pages) connected to it by edges (links) before moving further down the tree; the short sketch below shows this traversal.

The Wikipedia article on breadth-first search covers the algorithm in more detail: https://en.wikipedia.org/wiki/Breadth-first_search

[Image: breadth-first search traversal of a graph]
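To make that order concrete, here is a minimal sketch of a breadth-first crawl in plain Node.js. It only illustrates the order in which pages are visited; extractLinks is a hypothetical placeholder for fetching a page and pulling out its links, which web-crawljs handles for you.

    // A minimal breadth-first crawl sketch. extractLinks(url) is a hypothetical
    // async helper that fetches a page and returns the links found on it.
    async function bfsCrawl(startUrls, maxDepth, extractLinks) {
      const visited = new Set(startUrls);
      let frontier = [...startUrls];              // all pages at the current depth

      for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
        const nextFrontier = [];
        for (const url of frontier) {             // visit every page on this level...
          const links = await extractLinks(url);
          for (const link of links) {             // ...before going one level deeper
            if (!visited.has(link)) {
              visited.add(link);
              nextFrontier.push(link);
            }
          }
        }
        frontier = nextFrontier;
      }
      return visited;                             // every page reached within maxDepth
    }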

web-crawljs has the features needed to make crawling easy. It supports a depth setting, which controls how many steps the crawl should take, and it lets you limit how many nodes are selected at most. It also makes it easy to extract and save information from pages, and it supports dynamic selection of the elements to scrape and the links to crawl next. The config object that drives all of this is sketched below.
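Treat the option names in this sketch as illustrative and check the documentation for the exact API; it only shows the general shape of a crawl config.

    // Rough shape of a web-crawljs config (illustrative option names;
    // see the documentation for the exact API).
    const config = {
      urls: ['https://example.com'],       // where the crawl starts
      depth: 2,                            // how many steps the crawl should take
      fetchSelector: { title: 'h1' },      // what to extract from each page
      fetchSelectBy: { title: 'text' },
      nextSelector: { links: 'a' },        // which links to crawl next
      nextSelectBy: { links: ['attr', 'href'] },
      fetchFn: (err, data, url) => {       // runs with the data extracted from each page
        if (err) return console.error(err);
        console.log(url, data);
      }
    };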

Only three modules are required to make the crawl:

  • Mongoose: an ODM for MongoDB
  • web-crawljs: for crawling the pages
  • dotenv: for environment variables
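Install them with npm install mongoose web-crawljs dotenv. At the top of the crawl script they are pulled in like this (the .env file is assumed to hold the MongoDB connection string):

    // install first: npm install mongoose web-crawljs dotenv
    require('dotenv').config();                // load variables from .env into process.env
    const mongoose = require('mongoose');      // ODM for MongoDB
    const webCrawlJs = require('web-crawljs'); // the crawler module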

Hotel Schema and Model

Setting web-crawljs aside for a moment, let's talk about the hotel schema.

The fields needed are:

  • Name of hotel
  • location
  • city
  • state
  • price
  • features of the hotel, like bar, pool, etc.
  • link to hotels.ng description

I’m using Mongoose (an ODM for MongoDB in Node.js) to build the schema and connect to the MongoDB database.

Save the schema in a file called hotelModel.js.
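Here's a minimal sketch of that file, assuming the field names follow the list above and that the connection string lives in a MONGODB_URI variable in .env (both assumptions; the original file may differ slightly):

    // hotelModel.js: schema and model for a hotel (sketch)
    require('dotenv').config();                 // so MONGODB_URI is available here too
    const mongoose = require('mongoose');

    // connect to the database; MONGODB_URI is an assumed variable name in .env
    mongoose.connect(process.env.MONGODB_URI);

    const hotelSchema = new mongoose.Schema({
      name: { type: String, required: true },   // name of the hotel
      location: String,                         // street address
      city: String,
      state: String,
      price: String,                            // price as shown on the listing
      features: [String],                       // e.g. ['bar', 'pool']
      link: String                              // link back to the hotels.ng page
    });

    module.exports = mongoose.model('Hotel', hotelSchema);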

The crawling script

crawl.js is where the crawling happens.
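Here's a sketch of crawl.js. The web-crawljs option names, the init and start calls, and the hotels.ng CSS selectors are illustrative assumptions; swap in the real selectors and check the module's documentation for the exact API.

    // crawl.js: a sketch of the crawl script. Option names, the init/start
    // call, and the CSS selectors are illustrative; check the web-crawljs
    // docs and the hotels.ng markup for the real values.
    require('dotenv').config();
    const webCrawlJs = require('web-crawljs');
    const Hotel = require('./hotelModel');           // the model defined above

    const config = {
      urls: ['https://hotels.ng/hotels-in-lagos'],   // starting page(s); example path
      depth: 2,                                      // how deep the breadth-first crawl goes
      fetchSelector: {                               // what to scrape from a hotel page
        name: 'h1.hotel-name',
        price: '.hotel-price',
        location: '.hotel-address'
      },
      fetchSelectBy: { name: 'text', price: 'text', location: 'text' },
      nextSelector: { links: 'a.hotel-listing' },    // which links to follow next
      nextSelectBy: { links: ['attr', 'href'] },
      fetchFn: (err, data, url) => {                 // called with each page's scraped data
        if (err) return console.error(err.message);
        // assumes each crawled page describes a single hotel
        Hotel.create(Object.assign({}, data, { link: url }))
          .catch(e => console.error(e.message));
      }
    };

    // kick off the crawl (method names are illustrative)
    webCrawlJs.init(config).start();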

This is all the code needed to get the hotels (easy, right?). web-crawljs takes care of the crawling and scraping details.

To crawl more hotels, add more links to config.urls and increase config.depth.
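For example (the city listing paths below are just illustrative; use whichever hotels.ng listing pages you want to start from):

    // crawl more cities and go deeper (example starting pages)
    const config = {
      // ...same selectors as before...
      urls: [
        'https://hotels.ng/hotels-in-lagos',
        'https://hotels.ng/hotels-in-abuja',
        'https://hotels.ng/hotels-in-port-harcourt'
      ],
      depth: 4
    };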

To understand more of what is going on in the code above, read the documentation.

Conclusion

We can see how easy it is to crawl and scrape a web page using web-crawljs. With the data collected, we can create APIs, websites, mobile apps, etc.

From the data scraped from hotels.ng, I created an API that serves the hotels I crawled at https://hotels-apis.herokuapp.com/. It covers hotels available in Nigeria; not all of them, but close enough.

The API has its own documentation as well.

The code for the full API server and the crawler is at https://github.com/kayslay/hotels.

The main thing is getting the data; making use of it comes next.
