There are tons of scraping tools out there for Node.js, like Cheerio, but what if the page is an empty DOM with only a bunch of JavaScript bundles? And the bigger challenge: what if you want to scrape thousands of pages of a specific website without getting your IP blocked?

I was given a similar task: scrape all the users' data from a well-known website and extract some specific fields, as the site didn't have any API to fetch the data. That website was built with React. I was comfortable with Cheerio, so I started coding with it and, as expected, I couldn't fetch any data, because Cheerio only parses the HTML the server returns and never runs the JavaScript that renders the page.

Choosing the tool

I knew that I could use Puppeteer for testing client-rendered applications, and that I would need it if I wanted the DOM to load completely before fetching the data (I didn't know any other tool for this). I wrote the script, added the loops, and everything was set up. I ran it on a cloud instance, because I was on my company's network and didn't want that IP to get blocked if anything went wrong. And again, as expected, after fetching the data of merely 80-100 users, the IP of that instance got blocked.
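To make the "wait for the DOM to load completely" step concrete, here is a minimal sketch, not the original script. It assumes Puppeteer is installed with npm; the file name, the function names, the URL and the selector are all placeholders chosen for illustration.

```javascript
// render.js - illustrative sketch; names, URL and selector are assumptions.

// Options for page.goto: resolve only once the network has gone idle,
// which gives client-side rendering time to finish.
function navigationOptions(timeoutMs = 30000) {
  return { waitUntil: 'networkidle0', timeout: timeoutMs };
}

async function fetchRenderedText(url, selector) {
  // Required lazily so the helper above works even without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, navigationOptions());
    // Wait until the element created by client-side rendering exists.
    await page.waitForSelector(selector);
    // Read the text from the fully rendered DOM.
    return await page.$eval(selector, (el) => el.textContent.trim());
  } finally {
    // Always release the browser, even if the page failed to load.
    await browser.close();
  }
}
```

Calling something like fetchRenderedText('https://example.com/users/1', '#profile-name') would then return the text that only exists after React has rendered the page, which is exactly what a static HTML parser cannot see.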

Prevent IP blocking

Now I had to come up with a new solution because, obviously, I couldn't deploy a new instance after every 100 requests. I thought about it for some time, and only one option kept coming to my mind - "Use some proxy". I chose the Tor network because it is very easy to set up and to drive from Node.js.

I was using an Ubuntu 18 instance, and the process is the same on macOS. (I have never had a chance to try this on Windows, so if you know how to do it there, please share the knowledge.)

Before installing any package, it is recommended to refresh the package index using:

sudo apt-get update

Installing Tor on Ubuntu

sudo apt-get install tor

On macOS, using Homebrew

brew install tor

After installing, Tor will automatically connect to the Tor network, which we don't want to happen. We want to connect to the network programmatically.

  • To check whether you are connected to the Tor network, run this command. It should print an IP address different from your own.
curl --socks5 127.0.0.1:9050 checkip.amazonaws.com
  • To kill the existing Tor process (matching the exact process name, so that grep itself and other processes with "tor" in their names are not hit),
sudo pkill -9 -x tor

Code

You can run this code in a loop. The index.js file will connect through a different IP on each iteration so that the website you're scraping can't block your IP address. Please don't try to harm any website, as too many requests can amount to a denial-of-service (DoS) attack.

proxy.js

To connect to the Tor network.
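A minimal sketch of what such a module could look like, assuming the tor binary is on the PATH and no other Tor process is already bound to port 9050. The function names (isBootstrapped, startTor, stopTor) are hypothetical and only for illustration; the sketch starts Tor as a child process and waits for the "Bootstrapped 100%" line that Tor prints once it is ready.

```javascript
// proxy.js - illustrative sketch; function names are assumptions.
const { spawn } = require('child_process');

const SOCKS_PORT = 9050; // Tor's default SOCKS port

// Tor logs a line containing "Bootstrapped 100%" once it is ready for traffic.
function isBootstrapped(logText) {
  return /Bootstrapped 100%/.test(String(logText));
}

// Start a Tor process and resolve with it once Tor reports it is ready.
function startTor(torBinary = 'tor') {
  return new Promise((resolve, reject) => {
    const tor = spawn(torBinary, ['--SocksPort', String(SOCKS_PORT)]);
    let log = '';
    tor.once('error', reject); // e.g. the binary was not found
    tor.once('exit', (code) => reject(new Error(`tor exited early (code ${code})`)));
    tor.stdout.on('data', (chunk) => {
      log += chunk; // accumulate, since a log line may be split across chunks
      if (isBootstrapped(log)) resolve(tor);
    });
  });
}

// Stop the Tor process; a later startTor() builds fresh circuits,
// which usually (not always) means a different exit IP.
function stopTor(tor) {
  return new Promise((resolve) => {
    if (tor.exitCode !== null) return resolve(); // already exited
    tor.once('exit', () => resolve());
    tor.kill('SIGTERM');
  });
}

module.exports = { SOCKS_PORT, isBootstrapped, startTor, stopTor };
```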

This is the minimal code without any error handling, meant as a POC. Try to handle all the errors if you're planning to deploy this to a production environment.

index.js
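A minimal sketch of the scraping loop, under stated assumptions: Puppeteer is installed, and a local ./proxy module exports hypothetical startTor, stopTor and SOCKS_PORT helpers that start and stop a Tor process. The URLs and the selector are placeholders. Each iteration starts Tor, points Chromium at Tor's SOCKS5 port through the --proxy-server flag, scrapes one page, and then stops Tor so the next iteration gets new circuits.

```javascript
// index.js - illustrative sketch; helper names, URLs and selector are assumptions.

// Chromium flag that routes all browser traffic through Tor's SOCKS5 proxy.
function launchOptions(socksPort = 9050) {
  return { args: [`--proxy-server=socks5://127.0.0.1:${socksPort}`] };
}

async function scrapeAll(urls, selector) {
  // Loaded lazily so launchOptions() stays usable on its own.
  const puppeteer = require('puppeteer');
  const { startTor, stopTor, SOCKS_PORT } = require('./proxy'); // hypothetical module
  const results = [];
  for (const url of urls) {
    const tor = await startTor(); // fresh Tor process per iteration
    let browser;
    try {
      browser = await puppeteer.launch(launchOptions(SOCKS_PORT));
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle0' });
      await page.waitForSelector(selector);
      results.push(await page.$eval(selector, (el) => el.textContent.trim()));
    } finally {
      // Clean up in reverse order, even when the page failed to load.
      if (browser) await browser.close();
      await stopTor(tor);
    }
  }
  return results;
}
```

A call such as scrapeAll(['https://example.com/users/1', 'https://example.com/users/2'], '#profile-name') would return one string per page. Restarting Tor for every page is slow; it trades speed for a better chance of a different exit IP on each request.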


Footnotes

Do let me know if there is an error in this code or if you used it for some cool stuff. Also, if you know the procedure on Windows, don't hesitate to add a comment. And once again, please do not try to harm any website using this method.

Until next time.