Web Crawler
Design a web crawler that can crawl the entire web and build a search index.

Functional Requirements:
- Start from a set of seed URLs and discover new URLs by parsing web pages
- Download and store the content of web pages
- Extract and follow links to discover new pages
- Respect robots.txt and crawl rate limits per domain
- Avoid crawling duplicate pages
- Prioritize important/fresh pages

Non-Functional Requirements:
- Crawl billions of pages
- Be polite: don't overwhelm any single website
- Handle various content types and encodings
- Fault-tolerant: resume after failures

Scale:
- Crawl 1 billion pages per month
- Store petabytes of web content
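A quick back-of-envelope estimate helps turn the scale targets into numbers a design can be checked against. The sketch below is only illustrative: the 30-day month and the 100 KB average page size are assumptions, not figures from the problem statement, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def pages_per_second(pages_per_month: int, days_per_month: int = 30) -> float:
    """Average fetch rate needed to hit the monthly target (30-day month assumed)."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


def storage_tb_per_month(pages_per_month: int, avg_page_bytes: int) -> float:
    """Raw (uncompressed) storage added per month, in decimal terabytes."""
    return pages_per_month * avg_page_bytes / 1e12


PAGES_PER_MONTH = 1_000_000_000   # from the Scale section
AVG_PAGE_BYTES = 100_000          # ASSUMPTION: ~100 KB per page, not given in the problem

rate = pages_per_second(PAGES_PER_MONTH)                          # about 385.8 pages/s
monthly_tb = storage_tb_per_month(PAGES_PER_MONTH, AVG_PAGE_BYTES)  # 100.0 TB/month
```

Under these assumptions the crawler must sustain roughly 386 fetches per second on average, and raw storage grows by about 100 TB per month, so it reaches the petabyte range after about ten months. A different average page size changes the storage figure proportionally.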
Examples
How would you detect and handle duplicate content (same content at different URLs)?
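One possible answer, sketched under stated assumptions: fingerprint the page content rather than the URL. An exact hash over whitespace-normalized text catches identical pages served at different URLs, and a SimHash-style fingerprint can flag near-duplicates by Hamming distance. The names `exact_fingerprint`, `simhash`, and `SeenContent` are hypothetical, the in-memory dict stands in for whatever distributed store a real crawler would use, and any near-duplicate threshold would need tuning on real data.

```python
import hashlib
import re
from typing import Optional


def exact_fingerprint(text: str) -> str:
    """SHA-256 of the text after collapsing whitespace and lowercasing."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def simhash(text: str) -> int:
    """64-bit SimHash over word tokens; similar texts tend to differ in few bits."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Take the first 8 bytes of MD5 as a 64-bit token hash (not used for security).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class SeenContent:
    """In-memory exact-duplicate index (a sketch; not sized for billions of pages)."""

    def __init__(self) -> None:
        self._first_url_by_hash = {}

    def check_and_add(self, url: str, text: str) -> Optional[str]:
        """Return the URL that first had this content, or None if the content is new."""
        fp = exact_fingerprint(text)
        original = self._first_url_by_hash.get(fp)
        if original is None:
            self._first_url_by_hash[fp] = url
        return original
```

When `check_and_add` returns an earlier URL, the crawler can skip storing the body again and record the new URL as an alias of the original. For near-duplicates, two pages whose `simhash` values are within a small Hamming distance would be treated as the same content.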
Approach hint
Think about the URL frontier as more than just a queue: it needs to support priorities and per-domain rate limiting.
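To make the hint concrete, here is a minimal single-process sketch of such a frontier. It keeps one priority heap per host and a per-host "next allowed fetch" time, and hands out the highest-priority URL among hosts that are currently allowed to be fetched. The class name `Frontier`, the fixed `delay_seconds` politeness interval, and the caller-supplied `now` timestamp are all assumptions for illustration; a production frontier would be distributed, persistent, and would take per-host delays from robots.txt.

```python
import heapq
import itertools
from typing import Optional
from urllib.parse import urlsplit


class Frontier:
    """URL frontier with priorities and a fixed per-host politeness delay."""

    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay = delay_seconds
        self._queues = {}        # host -> heap of (-priority, sequence, url)
        self._next_allowed = {}  # host -> earliest time the next fetch may start
        self._seq = itertools.count()  # tie-breaker: FIFO among equal priorities

    def add(self, url: str, priority: float = 0.0) -> None:
        """Queue a URL; larger priority values are fetched first."""
        host = urlsplit(url).hostname or ""
        heap = self._queues.setdefault(host, [])
        heapq.heappush(heap, (-priority, next(self._seq), url))

    def pop(self, now: float) -> Optional[str]:
        """Return the best URL whose host is ready at time `now`, or None."""
        best_host, best_key = None, None
        # Linear scan over hosts keeps the sketch short; it is O(number of hosts).
        for host, heap in self._queues.items():
            if self._next_allowed.get(host, 0.0) > now:
                continue  # host is still inside its politeness window
            key = heap[0][:2]
            if best_key is None or key < best_key:
                best_host, best_key = host, key
        if best_host is None:
            return None
        _, _, url = heapq.heappop(self._queues[best_host])
        if not self._queues[best_host]:
            del self._queues[best_host]
        self._next_allowed[best_host] = now + self.delay
        return url
```

For example, with a 10-second delay, two URLs queued for `a.example` and one for `b.example`, the first `pop(now=0)` returns the highest-priority `a.example` URL, the second returns the `b.example` URL even though its priority is lower, and the remaining `a.example` URL only becomes available at `now=10`. Passing `now` explicitly keeps the sketch deterministic and easy to test; a real caller might pass `time.monotonic()`.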
Common mistake
Skipping assumptions, edge cases, or trade-offs can make an otherwise good answer feel incomplete.