
Lessons learned building a modern, massive-scale web crawler

Behind the scenes, Mixnode is powered by an extremely efficient, distributed, web-scale crawler capable of visiting hundreds of thousands of web pages per second. Although you are never exposed to the web crawling side of things when using Mixnode, we are often asked how we manage to crawl so many web pages at such high speed. In this post, I'll share some of the lessons we have learned over the years building and optimizing our crawler.

Java

When choosing a programming language (or languages) for a project, many factors can affect the final decision. In-house expertise, ecosystem, and raw performance were the major criteria we had to take into account when searching for our "perfect" language. Eventually, we decided that Java would be the best fit for us, for the following reasons:

  • In-house expertise: We knew we could get started right away and produce high-quality software since our team had vast domain expertise in Java, more specifically around distributed systems and network software development.

  • Existing packages: A massive-scale web crawler needs to be built on top of robust, scalable and bullet-proof networking, system, and utility modules that have stood the test of time. Java has one of the most vibrant open source ecosystems, especially when it comes to networking and distributed applications. Packages such as Netty, Selenium, and Google Guava are a testament to the high quality of open source modules found in the Java ecosystem.

  • Existing reference projects: Apache Hadoop, Apache Cassandra, and Elasticsearch are all examples of the many large, distributed systems projects developed in Java which bring a wealth of expertise, inspiration, and precedent to the ecosystem. When things go wrong or when you have questions, there is usually someone who has been through the same or a similar situation. This creates a powerful network effect that makes it easier and more cost-effective to develop high-performance data-driven applications in Java.

  • Raw performance and reliability: Static typing, powerful garbage collection, and a mature, battle-tested virtual machine are among the most important characteristics of Java when it comes to performance and reliability.

Although our core web crawling engine is written in Java, we tend to be very pragmatic about our choice of language based on the task at hand. For example, we use other languages such as Python, Perl and Node.js for scripting, configuration, monitoring, reporting and other parts of the pipeline.

Shared-nothing architecture

Mixnode utilizes a shared-nothing architecture for clustering. The workload is partitioned and distributed across independent, stateless nodes. This eliminates single points of failure, which can be catastrophic for a massive-scale distributed system. This architecture also allows us to update and upgrade the underlying software, node by node, without disrupting the entire operation. Finally, a shared-nothing architecture dramatically reduces the communication overhead between nodes, giving us an extra boost in performance.
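
As a rough illustration of the idea (a minimal sketch, not Mixnode's actual code), a crawler can route every URL to exactly one node by hashing its hostname; each node then owns a disjoint set of hosts, so per-host state such as queues and politeness timers never has to be shared across the cluster:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch of shared-nothing URL routing: hash the hostname to pick
// the single node responsible for it. Class and method names are hypothetical.
public class HostPartitioner {

    private final int nodeCount;

    public HostPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    /** Returns the index of the node that owns the given hostname. */
    public int nodeFor(String hostname) {
        CRC32 crc = new CRC32();
        crc.update(hostname.toLowerCase().getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % nodeCount);
    }

    public static void main(String[] args) {
        HostPartitioner partitioner = new HostPartitioner(16);
        // The same hostname always maps to the same node, so no coordination is needed.
        System.out.println(partitioner.nodeFor("example.com"));
    }
}
```

In a production cluster you would use consistent hashing (or a similar scheme) so that adding or removing a node only remaps a small fraction of hosts; simple modulo hashing is shown here only to keep the sketch short.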

Politeness and rate limiting modules must be bullet-proof

Websites are primarily designed with human visitors in mind, and a typical user may browse through only a handful of pages every minute. Web crawlers, on the other hand, are capable of visiting thousands or even millions of web pages every second; so, if it is not careful, a web crawler can easily exhaust a website's resources within a very short period of time, with devastating consequences. This problem is magnified by the fact that a typical website is usually being crawled by multiple bots at the same time. It is therefore every web crawler's responsibility to rate-limit its requests, in other words, to make sure that there is always an appropriate delay between consecutive visits to a website. The most important criteria by which you need to rate-limit your requests are hostname and IP address.

Clearly, this is one area where you need to do a perfect job right from the start; there is no room for error here, as a simple mistake can have devastating consequences for the websites you are crawling. In a multi-threaded environment, you also need to be extra careful to prevent race conditions when tracking requests and rate-limiting parameters.
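
As a minimal, hypothetical sketch of that last point (not the politeness module Mixnode actually runs), a per-host gate can record the time of the last request and only grant a new slot once the configured delay has passed; using ConcurrentHashMap.compute keeps the check-and-update atomic per host, so concurrent fetcher threads cannot race:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-host politeness gate: grants at most one request per host
// within every delayMillis window. A real crawler would key on IP address as
// well as hostname, and honor crawl-delay hints from robots.txt.
public class PolitenessGate {

    private final long delayMillis;
    private final ConcurrentHashMap<String, Long> lastRequestAt = new ConcurrentHashMap<>();

    public PolitenessGate(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Returns true and records the visit if the host may be fetched now; false if the caller must retry later. */
    public boolean tryAcquire(String host) {
        long now = System.currentTimeMillis();
        boolean[] granted = new boolean[1];
        // compute() is atomic per key, so two threads can never both win the same host's slot.
        lastRequestAt.compute(host, (h, last) -> {
            if (last == null || now - last >= delayMillis) {
                granted[0] = true;
                return now;      // slot granted: remember this visit
            }
            return last;         // too soon: keep the previous timestamp
        });
        return granted[0];
    }
}
```

Callers that fail to acquire a slot should push the URL back onto the host's queue rather than spin, so that a slow host never blocks fetcher threads.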

Cache is king

Caching network transactions in at least some parts of the pipeline is usually inevitable when building large-scale data-driven applications, especially when network I/O is frequent and expensive relative to other tasks. In the case of web-scale crawling, however, not only is caching inevitable, it is something you need to think about before you write your first line of code. Two operations require immediate caching when crawling the web at scale:

  • Robots.txt lookups: fetching a fresh copy of a host's robots.txt file for every URL you visit on that host is practically impossible. You need to build a distributed, pre-emptive cache capable of holding, and regularly updating, the robots.txt files of millions of websites (a minimal single-node sketch follows this list).

  • DNS resolution: for the vast majority of URLs you need to perform at least one DNS resolution in order to download them, which very quickly adds up to thousands of lookups per second. As a consequence, DNS resolvers are bound to throttle your access or crash under the load. In either case, your crawl will come to a screeching halt unless you aggressively cache DNS resolution results and minimize unnecessary lookups.
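
As a rough, single-node sketch of the robots.txt side (the real thing is distributed and not shown here), a loading cache from Google Guava, mentioned above, can bound memory, fetch each host's file once, and refresh it periodically; fetchRobotsTxt() below is a placeholder for whatever HTTP client you use:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical single-node robots.txt cache built on Guava's LoadingCache:
// each host's file is fetched at most once per refresh interval instead of
// once per URL.
public class RobotsCache {

    private final LoadingCache<String, String> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000_000)                 // bound memory across millions of hosts
            .refreshAfterWrite(1, TimeUnit.HOURS)    // keep cached copies reasonably fresh
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String host) throws Exception {
                    return fetchRobotsTxt(host);     // one network fetch per host, not per URL
                }
            });

    public String robotsFor(String host) throws ExecutionException {
        return cache.get(host);
    }

    private String fetchRobotsTxt(String host) {
        // Placeholder: download the host's /robots.txt with your HTTP client of choice.
        return "";
    }
}
```

A DNS cache follows the same shape, keyed by hostname and holding resolved addresses with a short time-to-live.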

Parsing HTML

One of the fundamental tasks of a crawler is to extract links from every page it visits (a.k.a. parsing) in order to add them to the long queue of pages it still needs to visit (a.k.a. the frontier). If you need to crawl at scale, you had better have a high-performance HTML parser, because you are going to be doing a lot of link and metadata extraction.

The majority of HTML parsing libraries prioritize simplicity, ease of use, and generality over raw performance, which, for a typical use case, is the right design. Since we needed very fast link extraction, we eventually decided to write our own parser, optimized for finding links and capable of some primitive DOM querying.

Your HTML parser also needs to be resilient, thoroughly tested, and capable of handling the numerous irregularities that come up in the wild: not everybody creates valid HTML documents.
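
Our parser is proprietary and not shown here, but to illustrate the shape of the task, here is the same job done with jsoup, an off-the-shelf Java HTML parser (used purely for illustration, not the library Mixnode runs): extract absolute outlinks from a fetched page so they can be queued in the frontier.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

// Illustrative link extraction with jsoup (not Mixnode's custom parser):
// parse the page, resolve relative hrefs against the page URL, and collect
// the absolute outlinks that would be fed into the frontier.
public class LinkExtractor {

    public static List<String> extractLinks(String html, String pageUrl) {
        Document doc = Jsoup.parse(html, pageUrl);       // pageUrl is used to resolve relative links
        List<String> links = new ArrayList<>();
        for (Element anchor : doc.select("a[href]")) {
            String absolute = anchor.attr("abs:href");   // empty string if the href cannot be resolved
            if (!absolute.isEmpty()) {
                links.add(absolute);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href='/about'>About</a></body></html>";
        // Prints [https://example.com/about]
        System.out.println(extractLinks(html, "https://example.com/"));
    }
}
```

A general-purpose parser like jsoup already tolerates badly broken markup; a custom parser has to earn that resilience through extensive testing against real-world pages.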

Networking optimization

Operating systems are not usually pre-configured to handle the networking demands of a massive-scale web crawler, and tuning the OS networking stack to its full potential is done on a case-by-case basis. For web crawling at scale, you are typically aiming to maximize throughput and the number of simultaneously open connections. There are many helpful resources on the subject, and we refer to them frequently.
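
To give a flavor of the knobs involved, here are some typical Linux sysctl settings a crawl node might adjust; the values below are placeholders for illustration, not recommendations, and the right numbers depend entirely on your hardware and workload:

```
# Illustrative /etc/sysctl.d/ entries for a crawl node (placeholder values, not recommendations)

# Raise the system-wide open file descriptor limit
fs.file-max = 2097152

# Allow more ephemeral ports for outbound connections
net.ipv4.ip_local_port_range = 1024 65535

# Release sockets stuck in FIN-WAIT-2 sooner
net.ipv4.tcp_fin_timeout = 15

# Reuse TIME-WAIT sockets for new outbound connections
net.ipv4.tcp_tw_reuse = 1

# Larger accept backlog for listening sockets
net.core.somaxconn = 4096

# Larger backlog for packets the kernel has not yet processed
net.core.netdev_max_backlog = 65536
```

Per-process file descriptor limits (ulimit -n / limits.conf) usually need to be raised alongside these kernel-level settings.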

Conclusion

Building a massive-scale web crawler is a long-term project and a complex undertaking. Different modules must be meticulously designed and tested, while tradeoffs are carefully measured and studied. Many software components that we take for granted in our day-to-day computing do not hold up under the workloads of a web-scale crawler and need to be designed from scratch, while other components are constantly reviewed and optimized for the ever-changing and ever-expanding web.

Our web-scale crawler has come a long way in maturing into a stable platform, and we look forward to sharing more of the lessons we learn as we build out our infrastructure.
