Suppose we want to search some keywords in a large collection of documents.

Then we index the docs where that keywords appear to make searches O(1) fast, storing only meta-data(ids) of documents for reference.

Screenshot from 2024-07-26 08-12-20.png

Here for (x,y) x= doc_id & y = word position, that can be added as well to add some position context for ranking searches

Additionally, sorting the keywords themselves is useful, as prefix based search, example: return for "hash" if user meant "hashmap" but hash doesn't exist.

Steps to Build an Inverted Index

Fetch the Document: Removing of Stop Words: “I”, “the”, “we”, “is”, and “an”.
Stemming of Root Word Whenever I want to search for “cat”, I want to see a document that has information about it. But the word present in the document is called “cats” or “catty” instead of “cat”. To relate both words, I’ll chop some part of every word I read so that I could get the “root word”. There are standard tools for performing this like “Porter’s Stemmer”.
Record Document IDs: If the word is already present add a reference of the document to index else creates a new entry. Add additional information like the frequency of the word, location of the word, etc.