Prerequisites / Resources:
Goal
- Build a search engine (a smaller version of Google Search), consisting of:
- Crawling: Fetch Web Pages
- Indexing: Build search indexes for faster search [Faster than O(n)].
- Querying: serve search queries over the indexed data
Functional Requirements
- 1 billion unique new search queries per day
- Assume an average query of ~20 characters (~4 words) -> ~20 bytes
- Reads: faster than O(n)
- Data is humongous, so we are not expecting O(1)
- Even O(log₂ n) is fine: log₂(100 billion) ≈ 37 operations!
- Writes:
- Definitely not local -> think of sharding across large distributed systems, but keep it as simple as possible
- Top search terms get updated once per day (cache them for faster reads)
Capacity Approximation
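- Rough numbers from the requirements above: 1 billion queries/day × ~20 bytes ≈ 20 GB/day of raw query text (roughly 7 TB/year), before any indexing overhead or replication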
- For indexing, use something like an inverted index, where each document's text is indexed word by word for querying (a minimal sketch follows). But this alone needs improvement, hence the trie approach below.
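A minimal in-memory sketch of that inverted-index idea (class and method names here are illustrative assumptions; a real index would live on the distributed storage described later):

```python
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # word -> set of document ids containing that word
        self.postings = defaultdict(set)

    def add_document(self, doc_id: int, text: str) -> None:
        # Naive whitespace tokenizer; a real crawler/indexer would normalise much more.
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def search(self, query: str) -> set:
        # Documents containing every query word (AND semantics).
        results = None
        for word in query.lower().split():
            docs = self.postings.get(word, set())
            results = docs if results is None else results & docs
        return results or set()

index = InvertedIndex()
index.add_document(1, "happy days are here")
index.add_document(2, "a happy search engine")
print(index.search("happy engine"))  # {2}
```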
Trie Approach
Basically an extended version of the Typeahead Suggestion's hash-index method (here, word-by-word suggestions instead of caching the whole result)
- With every prefix stored separately, memory required is O(n(n+1)/2) = O(n^2)
- Here, if the prefix length is capped at 5 characters, the footprint stays ~O(n) (word suggestions for every prefix)
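- Worked example: a 5-character word like "happy" has 5 prefixes (h, ha, hap, happ, happy), i.e. 1+2+3+4+5 = 5(5+1)/2 = 15 stored characters if every prefix is cached as its own key; a trie shares the common path and stores only 5 nodes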
How it will work (see the sketch after this list):
- Move a pointer from the root down the path spelled out by the query prefix
- Each node holds values pointing to the closest end-of-word nodes beneath it
- Cache at most "k" values per node (however many suggestions are needed). Example:
- hap -> ["happier", "happiest", "happy"] (k=3 suggestions)
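A minimal sketch of that flow, caching at most k completions per node; class names are illustrative, and ranking here is plain alphabetical (a real system would rank by popularity):

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # char -> TrieNode
        self.suggestions = []  # cached top-k completions for this prefix
        self.is_word = False

class SuggestionTrie:
    def __init__(self, k: int = 3):
        self.root = TrieNode()
        self.k = k

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            # Refresh this node's cached list, keeping at most k entries.
            if word not in node.suggestions:
                node.suggestions = sorted(node.suggestions + [word])[: self.k]
        node.is_word = True

    def suggest(self, prefix: str) -> list:
        # Walk the pointer from the root along the typed prefix.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return node.suggestions

trie = SuggestionTrie(k=3)
for w in ["happy", "happier", "happiest", "happen", "hat"]:
    trie.insert(w)
print(trie.suggest("hap"))  # ['happen', 'happier', 'happiest']
```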
Optimisations
1. Partitioning
- Range-based partitioning would work here, distributing prefixes across multiple storage units
- Example: prefixes starting with 'a'-'h' could live on Partition-A, 'i'-'p' on Partition-B, 'q'-'z' on Partition-C (see the routing sketch below)
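A minimal sketch of that range-based routing (partition names and character ranges are illustrative assumptions):

```python
# Each tuple is (first char of range, last char of range, partition name).
PARTITIONS = [
    ("a", "h", "Partition-A"),
    ("i", "p", "Partition-B"),
    ("q", "z", "Partition-C"),
]

def route(prefix: str) -> str:
    """Pick the storage unit that owns this prefix's key range."""
    first = prefix[0].lower()
    for lo, hi, name in PARTITIONS:
        if lo <= first <= hi:
            return name
    return "Partition-Default"  # digits, symbols, etc.

print(route("happy"))   # Partition-A
print(route("search"))  # Partition-C
```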
2. Read Optimisations
- For bidirectional connections, WebSockets?
- Avoids polling
- Keeps track of the trie pointer easily as the user types
- Instead of a centralised server for the socket connection, it can be one of the shard's servers (e.g. a word starting with 'a..' opens its socket with Partition-A); see the client sketch after this list
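A rough client-side sketch of that idea, assuming each partition exposes a WebSocket suggestion endpoint that replies with a comma-separated list; the hostnames, the wire format, and the reuse of route() from the partitioning sketch above are all assumptions:

```python
import asyncio
import websockets  # pip install websockets

PARTITION_ENDPOINTS = {  # hypothetical shard endpoints
    "Partition-A": "ws://partition-a.internal/suggest",
    "Partition-B": "ws://partition-b.internal/suggest",
    "Partition-C": "ws://partition-c.internal/suggest",
}

async def stream_suggestions(query: str) -> list:
    # Open the socket directly with the shard that owns the prefix,
    # instead of going through a centralised suggestion server.
    uri = PARTITION_ENDPOINTS.get(route(query))  # route() from the sketch above
    if uri is None:
        return []
    async with websockets.connect(uri) as ws:
        suggestions = []
        # Push the prefix typed so far on each keystroke: no polling, and the
        # server can keep its trie pointer per connection.
        for i in range(1, len(query) + 1):
            await ws.send(query[:i])
            suggestions = (await ws.recv()).split(",")
        return suggestions

# asyncio.run(stream_suggestions("happy"))
```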
Final System
- Stream processing: tools like [Kafka] to aggregate new search queries
- Batch processing: something like [Spark] for batch / micro-batch processing over the storage units (a rough job sketch follows)
- Storage units: a distributed file system like [HDFS] for easy sharding; even S3 would work, but HDFS fits this use case better
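A rough sketch of the once-a-day batch job implied above, using PySpark to turn the day's query log (landed on HDFS by the streaming layer) into top search terms; the paths, date, and top-100 cutoff are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-top-search-terms").getOrCreate()

# One raw query string per line, as written by the stream-processing layer.
queries = spark.read.text("hdfs:///search/query-log/2024-01-01/")

top_terms = (
    queries
    .groupBy("value")             # the raw query text column
    .count()
    .orderBy(F.desc("count"))
    .limit(100)                   # top terms used to refresh the cache
)

top_terms.write.mode("overwrite").parquet("hdfs:///search/top-terms/2024-01-01/")
```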