Scrape the World With Node.js

  • 6/22/2019

Table of Contents:

  1. Inspiration

  2. What is Web Scraping?

  3. What are The Needed Tools?

  4. Let’s Code.

  5. The Most Frequent Challenges You May Face

  6. Conclusion

 

Inspiration:

As a software engineer or data scientist, you will often need a data set, whether to run some analysis on it or to build a search engine that searches through it.

 

What is Web Scraping?

Web scraping is a term for the various methods used to collect information from across the Internet. Generally, this is done by simulating human web surfing to collect specific pieces of data from different websites.

 

What are The Needed Tools?

  1. Node.js

  2. hapi.js

  3. cheerio

  4. request and request-promise

 

Let’s Code.

Let’s assume that you are interested in soccer, like me, and you want to gather recent soccer news from skysports.com.

First things first: you’ll need to install Node.js if you don’t have it. After installation, running node -v in a terminal should print the installed version number.

Setup

The setup is straightforward: create a new folder and run this command inside it to create a package.json file.

npm init -y

Then we will start to install our packages.

npm install request request-promise cheerio @hapi/hapi

You may ask why we didn’t use Express or Sails instead of hapi. It depends on your requirements; in our case we only need a micro-framework to handle requests from any other service that will use this scraper.

Start Building the Scraper

Create an index.js file, which will be our entry point. After setting up the server, it will look like this:

index.js
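
A minimal sketch of such an entry point, assuming hapi v18 or later (the @hapi/hapi package installed above). The stub handler and the names routes and init are illustrative and not taken from the article's repository; in the article the real handler lives in ScrapingHandler.js.

```javascript
// Stand-in for the handler the article keeps in ScrapingHandler.js.
// hapi sends whatever the handler returns (or resolves with) as the response.
const scrapingHandler = async (request, h) => [];

// One route: GET / is answered by the scraping handler.
const routes = [{ method: 'GET', path: '/', handler: scrapingHandler }];

// Creates the server, registers the routes, and starts listening on port 3000.
const init = async () => {
  // Loaded inside init so the route table above can be inspected on its own.
  const Hapi = require('@hapi/hapi');
  const server = Hapi.server({ port: 3000, host: 'localhost' });
  server.route(routes);
  await server.start();
  console.log(`Server running at ${server.info.uri}`);
  return server;
};
```

Calling init() as the last line of index.js and running node index.js brings the server up.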

This starts our server on port 3000 and creates a route to ScrapingHandler, which will be responsible for the scraping logic. Now we need to create ScrapingHandler.js in the same directory.

scrapingHandler

Now we have to specify which data we want from this page. As you can see, this listing page has 20 articles, and each article has a title, image, link, date, and short description. Let’s say we want them all, returned as an array of items.

As we can see, every article has a news-list__item class, so we can start from there and loop over each item.

https://www.skysports.com/football/news

Now let’s code again.

scrapingHandler
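
A minimal sketch of this step, under a few assumptions: the factory name createScrapingHandler is hypothetical, the two libraries are passed in as arguments so the wiring stays easy to follow, and for now each list item is reduced to its trimmed text.

```javascript
// Listing page used throughout the article.
const NEWS_URI = 'https://www.skysports.com/football/news';

// Hypothetical factory: takes the request-promise and cheerio modules and
// returns a handler that resolves with one entry per article on the page.
function createScrapingHandler(rp, cheerio) {
  const options = {
    uri: NEWS_URI,
    // request-promise passes the response body through `transform` before
    // resolving, so `then` receives a cheerio object instead of a string.
    transform: (body) => cheerio.load(body),
  };

  return function scrapingHandler() {
    const articles = [];
    return rp(options).then(($) => {
      // Every article on the listing page carries the news-list__item class.
      $('.news-list__item').each((index, element) => {
        articles.push({ text: $(element).text().trim() });
      });
      return articles;
    });
  };
}
```

In ScrapingHandler.js this could be wired up as module.exports = createScrapingHandler(require('request-promise'), require('cheerio')).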

request-promise is what you need to send HTTP requests to any URI and retrieve the response data.
Cheerio lets you work with the incoming HTML in a jQuery-like way, so you can extract data easily if you know some jQuery basics.

Let’s explain what we did.
We call request-promise, which sends an HTTP request to the URI defined in the options object.
Once the response arrives, it calls the transform function that we defined in the options object; this function returns a cheerio object built from the response body, for use in the then callback.
In the then callback we receive the cheerio object as $, so we can use it with jQuery-like syntax.

scrapingHandler
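
Here is a sketch of how the loop body might build each article. Only the news-list__item class comes from the page as described above; the function name extractArticles and every other selector are assumptions about the markup and may need adjusting against the live HTML.

```javascript
// Collects one object per article from a cheerio-loaded listing page.
function extractArticles($) {
  const articles = [];
  $('.news-list__item').each((index, element) => {
    const item = $(element);
    const headline = item.find('.news-list__headline-link'); // assumed class
    const article = {
      title: headline.text().trim(),
      link: headline.attr('href'),
      image: item.find('img').attr('src'),
      date: item.find('.label__timestamp').text().trim(), // assumed class
      description: item.find('.news-list__snippet').text().trim(), // assumed class
    };
    articles.push(article);
  });
  return articles;
}
```

The function takes the $ received in the then callback, so it can be called there directly and its result returned to hapi.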

Now we have created an article object that will be pushed into the articles array we defined before.
As you can see, we look up specific elements in each article exactly as with standard jQuery selectors.
Now if we run it and open http://localhost:3000/, the browser will show something like this:

Scraper Results

So far, so good.
But you may notice that we have a small problem here: every image has the same URL. Interesting, right?
That’s because skysports.com lazy-loads its images. The real images are loaded by JavaScript after the page has loaded, so a plain HTTP request never sees them.

If you look again at the HTML, especially at the img tag, you will see that it has an attribute called data-src, whose value the lazy loader later copies into the src attribute.

scrapingHandler
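
The fix can be sketched as a tiny helper that prefers data-src and falls back to src. The name imageUrl is hypothetical; it only assumes it is given a cheerio-wrapped img element.

```javascript
// Returns the real image URL for a cheerio-wrapped <img>: the lazy-loading
// attribute when present, otherwise the plain src.
function imageUrl(img) {
  return img.attr('data-src') || img.attr('src');
}
```

Inside the loop it would be called as imageUrl($(element).find('img')).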

And the results will now look like this:

Scraper Results

Now you have the data you want. You can use this app as a service in your project, or you may wish to save this data to a file or something else; do whatever you like.
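
If you go the file route, a minimal sketch using only Node's built-in fs module could look like this. The function name saveArticles and the JSON format are my own choices, not something the original project prescribes.

```javascript
const fs = require('fs');

// Writes the scraped articles to `filePath` as pretty-printed JSON
// and returns the path that was written.
function saveArticles(articles, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(articles, null, 2), 'utf8');
  return filePath;
}
```

For example, saveArticles(articles, 'articles.json') inside the then callback would store each scrape next to the script.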


The Most Frequent Challenges You May Face

What should I do if I’m scraping a single-page application?

You may face a problem if you try to scrape a website built with React or another JavaScript framework: the page content is generated and rendered by JavaScript, so a plain HTTP request will not return the expected data.
What you have to do is use a web automation tool such as Puppeteer or Nightmare. These tools open a browser instance in which the website behaves just as it does in any browser, executing the JavaScript files that render the site's dynamic content.
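
With Puppeteer, the idea can be sketched roughly as below. The function names renderHtml and scrapeRenderedPage are hypothetical, and the sketch assumes puppeteer has been added with npm install puppeteer, which is not part of the setup above.

```javascript
// Renders `url` in a page of an already-launched Puppeteer browser and
// returns the HTML after client-side JavaScript has run.
async function renderHtml(browser, url) {
  const page = await browser.newPage();
  try {
    // 'networkidle2' waits until at most two network connections remain open.
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content();
  } finally {
    await page.close();
  }
}

// Launches a headless browser for a single page and always closes it again.
async function scrapeRenderedPage(url) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    return await renderHtml(browser, url);
  } finally {
    await browser.close();
  }
}
```

The returned HTML string can then be handed to cheerio.load exactly like the body of a normal response.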

What should I do to prevent my scraper from being blocked?

If you make a lot of requests to your target website, your IP may get blocked. To prevent that, make your cron task sleep a little between requests; by cron task I mean the service that talks to your scraper and tells it which URLs to scrape.
Or, if you have to scrape a lot of data in a short time, deploy the scraper on multiple servers and put a load balancer in front of them to spread the request traffic across them.
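
The sleeping part can be sketched with nothing but a timer. The names sleep and scrapeSequentially are hypothetical, and scrapeOne stands for whatever function actually scrapes a single URL; a suitable delay depends on the target site.

```javascript
// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits the URLs one at a time, pausing `delayMs` between requests
// so the target site never sees a burst of traffic from us.
async function scrapeSequentially(urls, scrapeOne, delayMs) {
  const results = [];
  for (let i = 0; i < urls.length; i += 1) {
    if (i > 0) {
      await sleep(delayMs);
    }
    results.push(await scrapeOne(urls[i]));
  }
  return results;
}
```

A call such as scrapeSequentially(urls, scrapeOne, 2000) would leave two seconds between consecutive requests.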

What should I do to scrape data that depends on a specific session?

Well, you can use the jar object in the request package; it will keep your session and cookies across all your requests.
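
As a rough sketch, one cookie jar can be created once and attached to every call. The helper name withSession is hypothetical; rp is assumed to be the module returned by require('request-promise'), which exposes the request package's jar().

```javascript
// Wraps request-promise so that every call shares one cookie jar.
function withSession(rp) {
  const jar = rp.jar(); // cookie jar provided by the request package
  // Each call keeps its own options but always carries the shared jar.
  return (options) => rp({ ...options, jar });
}
```

After const session = withSession(require('request-promise')), a login request sent through session(...) stores its cookies in the jar, and later session(...) calls send them back automatically.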


Conclusion

In this article, you learned how to scrape data using Node.js and cheerio, got your hands dirty building a simple web scraper, and got an idea of the problems you may face along the way.

You can find the code in this repo on GitHub.

Feel free to leave comments or ask me any questions.

Thank you for reading.
