Design a web crawler that can crawl the entire web and build a search index.

**Functional Requirements:**
- Start from a set of seed URLs and discover new URLs by parsing web pages
- Download and store the content of web pages
- Extract and follow links to discover new pages
- Respect robots.txt and crawl rate limits per domain
- Avoid crawling duplicate pages
- Prioritize important/fresh pages

**Non-Functional Requirements:**
- Crawl billions of pages
- Be polite - don't overwhelm any single website
- Handle various content types and encodings
- Fault-tolerant - resume after failures

**Scale:**
- Crawl 1 billion pages per month
- Store petabytes of web content
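The scale target can be turned into a sustained fetch rate and a rough storage growth figure with a quick back-of-envelope calculation. The sketch below is only an estimate: the 30-day month and the 100 KB average stored page size are assumptions, not figures from the problem statement, and the function names are hypothetical.

```python
# Back-of-envelope sizing for the stated crawl target.
SECONDS_PER_DAY = 24 * 60 * 60

PAGES_PER_MONTH = 1_000_000_000  # from the Scale requirement
DAYS_PER_MONTH = 30              # assumption: 30-day month
AVG_PAGE_BYTES = 100 * 1000      # assumption: 100 KB stored per page


def pages_per_second(pages_per_month: int, days: int = DAYS_PER_MONTH) -> float:
    """Sustained fetch rate needed to hit the monthly target."""
    return pages_per_month / (days * SECONDS_PER_DAY)


def bytes_per_month(pages_per_month: int, avg_page_bytes: int = AVG_PAGE_BYTES) -> int:
    """Raw content stored per month, before compression or replication."""
    return pages_per_month * avg_page_bytes


if __name__ == "__main__":
    rate = pages_per_second(PAGES_PER_MONTH)
    monthly = bytes_per_month(PAGES_PER_MONTH)
    print(f"{rate:.0f} pages/s sustained")
    print(f"{monthly / 1e12:.0f} TB of raw content per month")
```

Under these assumptions the crawler needs to sustain roughly 386 pages per second, and raw storage grows by about 100 TB per month, so the petabyte range is reached after about ten months of crawling. A different average page size scales the storage figure proportionally.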

Web Crawler

How would you detect and handle duplicate content (same content at different URLs)?
How would you decide when to recrawl a page (freshness)?
How would you extend this to handle JavaScript-rendered pages (SPAs)?
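As one possible starting point for the duplicate-content question, exact duplicates served at different URLs can be caught by fingerprinting the page text and remembering the first URL seen for each fingerprint. This is a minimal sketch, not a complete answer: `content_fingerprint` and `ContentDeduper` are hypothetical names, the normalization (collapse whitespace, lowercase) is an assumption, and the in-memory dictionary stands in for whatever shared store a real crawler would use.

```python
import hashlib
import re
from typing import Dict, Optional

_WHITESPACE = re.compile(r"\s+")


def content_fingerprint(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = _WHITESPACE.sub(" ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ContentDeduper:
    """Remembers the first URL seen for each content fingerprint."""

    def __init__(self) -> None:
        # Assumption: in-memory map; a real crawler would need a shared store.
        self._first_url: Dict[str, str] = {}

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL already holding this content, or None if it is new."""
        fingerprint = content_fingerprint(text)
        existing = self._first_url.get(fingerprint)
        if existing is None:
            self._first_url[fingerprint] = url
            return None
        return existing


if __name__ == "__main__":
    deduper = ContentDeduper()
    print(deduper.check("https://example.com/a", "Hello   World"))
    print(deduper.check("https://example.com/b?ref=1", "hello world\n"))
```

The first call records the page and returns `None`; the second returns `https://example.com/a` because both bodies normalize to the same text. This only detects exact matches after normalization. Pages that differ by a timestamp or an ad block would need a near-duplicate technique such as SimHash or MinHash, which is not shown here.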
