Web Crawler

Difficulty: Medium

Design a web crawler that can crawl the entire web and build a search index.

Functional Requirements:
- Start from a set of seed URLs and discover new URLs by parsing web pages
- Download and store the content of web pages
- Extract and follow links to discover new pages
- Respect robots.txt and crawl rate limits per domain
- Avoid crawling duplicate pages
- Prioritize important/fresh pages

Non-Functional Requirements:
- Crawl billions of pages
- Be polite: don't overwhelm any single website
- Handle various content types and encodings
- Be fault-tolerant: resume after failures

Scale:
- Crawl 1 billion pages per month
- Store petabytes of web content
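
The scale targets can be turned into a rough capacity estimate before any design work starts. The sketch below is back-of-the-envelope arithmetic only. The 30-day month and the 100 KB average page size are assumptions for illustration and are not part of the problem statement.

```python
SECONDS_PER_DAY = 86_400


def required_pages_per_second(pages_per_month: int, days_per_month: int = 30) -> float:
    """Average fetch rate needed to hit the monthly target (assumes a 30-day month)."""
    return pages_per_month / (days_per_month * SECONDS_PER_DAY)


def monthly_storage_bytes(pages_per_month: int, avg_page_bytes: int) -> int:
    """Raw storage added per month, before compression or replication."""
    return pages_per_month * avg_page_bytes


PAGES_PER_MONTH = 1_000_000_000   # from the problem statement
ASSUMED_AVG_PAGE_BYTES = 100_000  # hypothetical average page size (100 KB)

rate = required_pages_per_second(PAGES_PER_MONTH)
storage = monthly_storage_bytes(PAGES_PER_MONTH, ASSUMED_AVG_PAGE_BYTES)

print(f"{rate:.1f} pages/second on average")   # about 385.8
print(f"{storage / 1e12:.0f} TB of raw content per month")  # 100 under the assumed page size
```

Under these assumptions the crawler needs to sustain roughly 386 fetches per second on average, and raw content grows by about 100 TB per month, so the petabyte range is reached after several months of crawling or sooner if multiple versions of each page are kept.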

Examples

Example 1

How would you detect and handle duplicate content (same content at different URLs)?
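
One possible direction for this question is content fingerprinting: hash the page body rather than the URL, so that the same content fetched from two addresses maps to the same key. The sketch below combines an exact fingerprint (SHA-256 of normalized text) with a SimHash-style fingerprint for near-duplicates. All names here (`DuplicateDetector`, `max_distance`, and so on) are hypothetical, the Hamming-distance threshold of 3 is an assumed value, and the linear scan over stored fingerprints is a simplification that would not scale to billions of pages.

```python
import hashlib
import re
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_fingerprint(text: str) -> str:
    """SHA-256 of the normalized text; equal content gives an equal digest."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def simhash(text: str, bits: int = 64) -> int:
    """SimHash-style fingerprint: similar token sets tend to give nearby bit patterns."""
    weights = [0] * bits
    for token in normalize(text).split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")


class DuplicateDetector:
    def __init__(self, max_distance: int = 3) -> None:
        self.max_distance = max_distance             # assumed near-duplicate threshold
        self.exact: Dict[str, str] = {}              # exact fingerprint -> first URL seen
        self.near: List[Tuple[int, str]] = []        # (simhash, first URL seen)

    def check(self, url: str, text: str) -> Optional[str]:
        """Return the URL of an earlier copy, or None if the content is new."""
        fp = exact_fingerprint(text)
        if fp in self.exact:
            return self.exact[fp]
        sh = simhash(text)
        for other, other_url in self.near:           # linear scan: fine for a sketch only
            if hamming_distance(sh, other) <= self.max_distance:
                return other_url
        self.exact[fp] = url
        self.near.append((sh, url))
        return None
```

When `check` returns an earlier URL, the crawler could store only a pointer to the canonical copy instead of the content again. At real scale the near-duplicate lookup would need an index (for example, bucketing fingerprints by bit ranges) rather than a scan.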

Approach hint

Think about the URL frontier as more than just a queue: it needs to support priorities and per-domain rate limiting.
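
To make the hint concrete, here is a minimal single-process sketch of a frontier that keeps one priority queue per host and refuses to hand out a URL for a host until a delay has passed since that host's last fetch. The class name `UrlFrontier`, the `delay_seconds` parameter, and the convention that lower numbers mean higher priority are all assumptions for illustration. A production crawler would distribute this state and would likely take per-host delays from robots.txt rather than one fixed value.

```python
import heapq
import itertools
from typing import Dict, List, Optional, Set, Tuple
from urllib.parse import urlsplit


class UrlFrontier:
    def __init__(self, delay_seconds: float = 1.0) -> None:
        self.delay_seconds = delay_seconds
        # host -> heap of (priority, insertion order, url); lower priority value wins
        self._queues: Dict[str, List[Tuple[int, int, str]]] = {}
        # host -> earliest time at which the next fetch is allowed
        self._next_allowed: Dict[str, float] = {}
        self._seen: Set[str] = set()
        self._order = itertools.count()  # tie-breaker keeps equal priorities FIFO

    def add(self, url: str, priority: int = 10) -> bool:
        """Queue a URL; returns False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc.lower()
        entry = (priority, next(self._order), url)
        heapq.heappush(self._queues.setdefault(host, []), entry)
        return True

    def next_url(self, now: float) -> Optional[str]:
        """Return the best URL among hosts that may be fetched at time `now`."""
        best_host: Optional[str] = None
        for host, queue in self._queues.items():
            if not queue or self._next_allowed.get(host, 0.0) > now:
                continue  # nothing queued, or host is still cooling down
            if best_host is None or queue[0] < self._queues[best_host][0]:
                best_host = host
        if best_host is None:
            return None
        _, _, url = heapq.heappop(self._queues[best_host])
        self._next_allowed[best_host] = now + self.delay_seconds
        return url


frontier = UrlFrontier(delay_seconds=2.0)
frontier.add("https://a.example/low", priority=5)
frontier.add("https://a.example/high", priority=1)
frontier.add("https://b.example/page", priority=3)

print(frontier.next_url(now=0.0))  # https://a.example/high
print(frontier.next_url(now=0.0))  # https://b.example/page (a.example is cooling down)
print(frontier.next_url(now=1.0))  # None (no host is ready yet)
print(frontier.next_url(now=2.0))  # https://a.example/low
```

The scan over all hosts in `next_url` keeps the sketch short but costs time proportional to the number of hosts. A larger design would typically track ready hosts in a separate time-ordered structure so that selection stays cheap.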

Common mistake

Skipping assumptions, edge cases, or trade-offs can make an otherwise good answer feel incomplete.
