Logo

Pre requisites / Resources:

Goal

  • Build a search engine (a smaller version of the Google Search), consisting of:
    1. Crawling: Fetch Web Pages
    2. Indexing: Build search indexes for faster search [Faster than O(n)].
    3. Query the data

Functional Requirements

  • 1 Bil. unique new search query / daily
  • Assume avg. query ~ 20 chars (4 words) -> 20 bytes
  • Reads: faster than O(n)
    • Data is humongous so not expecting O(1)
    • Even O(log 2 n) is fine, log 2 100 Billi = ~37 operations!!
  • Writes:
    • Definitely not local -> think of shading on large distributed systems, but simplest as possible.
    • Top search terms get updated once per day (cache for faster)

Capacity Approximation

  1. For indexing, use something like Inverted Index's logic, where documents have their text indexed for querying. But needs improvement.

Trie - Approach

Basically extended version of the Typeahead Suggestion's hash index method (here, word by word suggestion, instead of cache the whole result)

Pasted image 20241103221807.png

  • With prefix: memory reqd: O(n(n+1)/2) = O(n^2)
  • Here, if length at most = 5 chars, then footprint ~ O(n) (word suggestion for all prefix)

How it will work:

  • Move pointer head from root to whatever the query is
  • Each node has value pointed to end node. (closest from pointer)
  • Cache at most "k" values (how much needed suggestions) per node. Example:
    • hap -> ["happier", "happiest", "happy"] (k=3 suggestions)

Optimisations

1. Partitioning
  • Ranged based partition would work here, distributed across multiple storage units
  • Example: Pasted image 20241103222625.png
2. Read Optimisations
  • For bidirectional connections, web sockets?
    • Avoid polling
    • Keep track of pointer easily.
    • instead of centralised server for socket connection, it can be with one of the shard's servers (ex: word: 'a..' then create socket with Partition-A)

Final System

Pasted image 20241103223901.png

  • Stream Process: Tools like [Kafka] for aggregated new search query
  • Batch Process: Something like [Spark] for batch / micro - batch process on Storage Units
  • Storage Units: Distributed file System like [HDFS] for easy sharding, or even S3 but HDFS better for this use case.

© 2025 All rights reservedBuilt with DataHub Cloud

Built with LogoDataHub Cloud