Design A Web Crawler
Pre-requisite: Knowledge of BFS, which is discussed in detail in our Algorithms course.
The whole web can be represented as a graph, where each URL is a vertex and there is a directed edge from URL A to URL B if the page at URL A contains a link to (i.e., points to) URL B.
In order to crawl the web, we start from a starting URL, read the HTML content of that URL, and discover all the URLs found in that content. We then repeat the same process for each of the discovered URLs.
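As a concrete example of the "discover all the URLs" step, here is a minimal sketch using Python's standard-library html.parser. The LinkExtractor class name and the sample HTML are made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="https://example.com/a">A</a> '
            '<a href="/relative/b">B</a></body></html>')
print(parser.links)  # ['https://example.com/a', '/relative/b']
```

In a real crawler you would also resolve relative links such as /relative/b against the page's own URL (urllib.parse.urljoin does this); the full sketch at the end of this article takes care of that.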
Now a very important thing to consider: we cannot just go on crawling forever. The indexed World Wide Web contained at least 5.27 billion web pages as of Wednesday, 31 March 2021. So in the graph representing the web, the paths from the root (the URL where crawling starts) to the leaves can be extremely deep. If we are not interested in discovering all the web pages out there:
- We should use BFS and not DFS, because we are more interested in discovering the nodes (web pages / links / URLs) that are closer to the root (the starting web page). Think of it as how we care more about discovering first-degree and second-degree connections than higher-degree ones. Since DFS goes deeper until it reaches a leaf and is only then forced to backtrack, using DFS we would discover some higher-degree connections before we have discovered all the lower-degree ones. Since BFS is a level-order search (it searches level by level), we discover the lower-degree connections first, as the sketch after this list illustrates.
- We should have a termination condition so that we know when to stop crawling if the crawl would otherwise take too long.
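To make the contrast concrete, here is a minimal BFS sketch over a toy in-memory "web" (the page names and link structure are made up for illustration). It visits pages in order of their distance from the root and uses a maximum depth as its termination condition:

```python
from collections import deque

# A toy in-memory "web": each URL maps to the URLs its page links to.
TOY_WEB = {
    "root": ["a", "b"],
    "a":    ["c", "root"],
    "b":    ["c", "d"],
    "c":    [],
    "d":    ["e"],
    "e":    [],
}

def bfs_crawl(start, max_depth):
    visited = {start}
    queue = deque([(start, 0)])          # (url, depth from the root)
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:           # termination condition: stop going deeper
            continue
        for neighbor in TOY_WEB.get(url, []):
            if neighbor not in visited:  # never enqueue a page twice
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return order

print(bfs_crawl("root", max_depth=2))
# [('root', 0), ('a', 1), ('b', 1), ('c', 2), ('d', 2)]
```

Note that a DFS on the same toy web would visit c (a second-degree connection) before b (a first-degree one), which is exactly the behavior we want to avoid.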
Enjoy building the simple web crawler shown in the code below. Don't forget to run the code on your local machine, play around with it, and learn as much as possible from it.
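The original code listing is gated behind a login, so here is a minimal, self-contained sketch of what such a simple BFS crawler can look like, using only the Python standard library. The start URL (https://example.com), the max_pages limit, and the helper names (LinkExtractor, get_links, crawl) are assumptions made for illustration, not the article's original implementation:

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def get_links(url):
    """Download a page and return the absolute URLs it links to."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            html = response.read().decode("utf-8", errors="ignore")
    except Exception:
        return []                       # skip pages that fail to load
    parser = LinkExtractor()
    parser.feed(html)
    # urljoin resolves relative links like "/about" against the page URL
    return [urljoin(url, link) for link in parser.links]

def crawl(start_url, max_pages=20):
    """BFS from start_url, crawling at most max_pages pages."""
    visited = {start_url}
    queue = deque([start_url])
    crawled = 0
    while queue and crawled < max_pages:  # termination condition
        url = queue.popleft()
        print("Crawling:", url)
        crawled += 1
        for link in get_links(url):
            if link.startswith("http") and link not in visited:
                visited.add(link)       # mark before enqueueing to avoid duplicates
                queue.append(link)

if __name__ == "__main__":
    crawl("https://example.com", max_pages=10)  # placeholder start URL
```

Swapping popleft() for pop() would turn the deque into a stack and this into a DFS crawler, which is exactly the trade-off discussed above.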
To learn about a more advanced web crawler, take a look at Multithreaded Web Crawler.