Prerequisites / Resources:
Goal
- Build a search engine (a smaller version of Google Search), consisting of:
- Crawling: Fetch Web Pages
- Indexing: Build search indexes for faster search [Faster than O(n)].
- Querying: serve search queries over the indexed data
Functional Requirements
- 1 billion unique new search queries per day
- Assume an average query of ~20 characters (~4 words) -> ~20 bytes
- Reads: faster than O(n)
- Data is humongous, so we are not expecting O(1)
- Even O(log₂ n) is fine: log₂(100 billion) ≈ 37 operations!
- Writes:
- Definitely not local -> think of sharding across large distributed systems, but keep it as simple as possible
- Top search terms get updated once per day (cache them for faster reads)
Capacity Approximation
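- Rough numbers from the requirements above: 1 billion queries/day × ~20 bytes ≈ 20 GB/day of raw query text (roughly 7 TB/year), before any indexing overhead or replication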
- For indexing, use something like an inverted index, where each document's text is indexed word by word for querying (a minimal sketch follows). But this alone needs improvement, hence the trie approach below.
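A minimal in-memory sketch of that inverted-index idea (class and method names here are illustrative assumptions; a real index would live on the distributed storage described later):

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # word -> set of document ids containing that word
        self.postings = defaultdict(set)

    def add_document(self, doc_id: int, text: str) -> None:
        # Naive whitespace tokenizer; a real crawler/indexer would normalise much more.
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query: str) -> set:
        # Documents containing every query word (AND semantics).
        results = None
        for word in query.lower().split():
            docs = self.postings.get(word, set())
            results = docs if results is None else results & docs
        return results or set()

index = InvertedIndex()
index.add_document(1, "happy days are here")
index.add_document(2, "a happy search engine")
print(index.search("happy engine"))  # {2}
```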
Trie Approach
Basically an extended version of the Typeahead Suggestion's hash-index method (here, word-by-word suggestions instead of caching the whole result)
- With every prefix stored separately, memory required is O(n(n+1)/2) = O(n^2)
- Here, if the prefix length is capped at 5 characters, the footprint stays ~O(n) (word suggestions for every prefix)
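- Worked example: a 5-character word like "happy" has 5 prefixes (h, ha, hap, happ, happy), i.e. 1+2+3+4+5 = 5(5+1)/2 = 15 stored characters if every prefix is cached as its own key; a trie shares the common path and stores only 5 nodes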
How it will work (see the sketch after this list):
- Move a pointer from the root down the path spelled out by the query prefix
- Each node holds values pointing to the closest end-of-word nodes beneath it
- Cache at most "k" values per node (however many suggestions are needed). Example:
- hap -> ["happier", "happiest", "happy"] (k=3 suggestions)
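A minimal sketch of that flow, caching at most k completions per node; class names are illustrative, and ranking here is plain alphabetical (a real system would rank by popularity):

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.suggestions = []  # cached top-k completions for this prefix
        self.is_word = False

class SuggestionTrie:
    def __init__(self, k: int = 3):
        self.root = TrieNode()
        self.k = k

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh this node's cached list, keeping at most k entries.
            if word not in node.suggestions:
                node.suggestions = sorted(node.suggestions + [word])[: self.k]
        node.is_word = True

    def suggest(self, prefix: str) -> list:
        # Walk the pointer from the root along the typed prefix.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.suggestions

trie = SuggestionTrie(k=3)
for w in ["happy", "happier", "happiest", "happen", "hat"]:
    trie.insert(w)
print(trie.suggest("hap"))  # ['happen', 'happier', 'happiest']
```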
Optimisations
1. Partitioning
- Range-based partitioning would work here, distributing prefixes across multiple storage units
- Example: prefixes starting with 'a'-'h' could live on Partition-A, 'i'-'p' on Partition-B, 'q'-'z' on Partition-C (see the routing sketch below)
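A minimal sketch of that range-based routing (partition names and character ranges are illustrative assumptions):

```python
# Each tuple is (first char of range, last char of range, partition name).
PARTITIONS = [
    ("a", "h", "Partition-A"),
    ("i", "p", "Partition-B"),
    ("q", "z", "Partition-C"),
]

def route(prefix: str) -> str:
    """Pick the storage unit that owns this prefix's key range."""
    first = prefix[0].lower()
    for lo, hi, name in PARTITIONS:
        if lo <= first <= hi:
            return name
    return "Partition-Default"  # digits, symbols, etc.

print(route("happy"))   # Partition-A
print(route("search"))  # Partition-C
```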
2. Read Optimisations
- For bidirectional connections, WebSockets?
- Avoids polling
- Keeps track of the trie pointer easily as the user types
- Instead of a centralised server for the socket connection, it can be one of the shard's servers (e.g. a word starting with 'a..' opens its socket with Partition-A); see the client sketch after this list
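A rough client-side sketch of that idea, assuming each partition exposes a WebSocket suggestion endpoint that replies with a comma-separated list; the hostnames, the wire format, and the reuse of route() from the partitioning sketch above are all assumptions:

```python
import asyncio
import websockets  # pip install websockets

PARTITION_ENDPOINTS = {  # hypothetical shard endpoints
    "Partition-A": "ws://partition-a.internal/suggest",
    "Partition-B": "ws://partition-b.internal/suggest",
    "Partition-C": "ws://partition-c.internal/suggest",
}

async def stream_suggestions(query: str) -> list:
    # Open the socket directly with the shard that owns the prefix,
    # instead of going through a centralised suggestion server.
    uri = PARTITION_ENDPOINTS.get(route(query))  # route() from the sketch above
    if uri is None:
        return []
    async with websockets.connect(uri) as ws:
        suggestions = []
        # Push the prefix typed so far on each keystroke: no polling, and the
        # server can keep its trie pointer per connection.
        for i in range(1, len(query) + 1):
            await ws.send(query[:i])
            suggestions = (await ws.recv()).split(",")
        return suggestions

# asyncio.run(stream_suggestions("happy"))
```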
Final System
- Stream processing: tools like [Kafka] to aggregate new search queries
- Batch processing: something like [Spark] for batch / micro-batch processing over the storage units (a rough job sketch follows)
- Storage units: a distributed file system like [HDFS] for easy sharding; even S3 would work, but HDFS fits this use case better
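A rough sketch of the once-a-day batch job implied above, using PySpark to turn the day's query log (landed on HDFS by the streaming layer) into top search terms; the paths, date, and top-100 cutoff are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-top-search-terms").getOrCreate()

# One raw query string per line, as written by the stream-processing layer.
queries = spark.read.text("hdfs:///search/query-log/2024-01-01/")

top_terms = (
    queries
    .groupBy("value")             # the raw query text column
    .count()
    .orderBy(F.desc("count"))
    .limit(100)                   # top terms used to refresh the cache
)

top_terms.write.mode("overwrite").parquet("hdfs:///search/top-terms/2024-01-01/")
```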