Case Study: Building a Mini Search Engine

Java Collections

17.1 Indexing Text with Maps and Sets

Indexing is the foundational step in building any search engine. It transforms raw text into a searchable structure known as an inverted index, which maps words (terms) to the documents or positions where they occur. This structure allows fast lookups during search queries.

Understanding the Inverted Index

In Java, a simple inverted index can be represented as a Map<String, Set<String>>, where:

The key is a word (term) extracted from the text (after processing),
The value is a Set of document identifiers (to avoid duplicates).

This is in contrast to a "forward index," which maps documents to the words they contain.
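To make the relationship between the two structures concrete, here is a minimal sketch that turns a forward index into an inverted one. The class name ForwardToInvertedSketch, the method invert, and the sample data are illustrative assumptions, not part of the chapter's example; TreeMap and TreeSet are used only to make the printed order predictable.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ForwardToInvertedSketch {
    // Hypothetical helper: flips document -> terms into term -> documents.
    static Map<String, Set<String>> invert(Map<String, List<String>> forwardIndex) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : forwardIndex.entrySet()) {
            String docId = entry.getKey();
            for (String term : entry.getValue()) {
                // Create the document set on first sight of the term, then add the ID
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forward = Map.of(
            "doc1", List.of("java", "language"),
            "doc2", List.of("python", "language"));
        System.out.println(invert(forward));
        // prints {java=[doc1], language=[doc1, doc2], python=[doc2]}
    }
}
```

A forward index answers "which words does this document contain?", while the inverted result answers "which documents contain this word?", which is the question a search query asks.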

Text Processing Pipeline

Before indexing, input text should undergo:

Tokenization – Splitting text into individual words.
Normalization – Lowercasing, removing punctuation, stemming, etc.
Deduplication – Avoid recording the same word–document pair multiple times.
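The three steps above can be sketched as small helper methods. This is one possible approach under stated assumptions: the names TextPipelineSketch, tokenize, and uniqueTerms are hypothetical, stemming is left out, and punctuation is replaced with a space, so "high-level" becomes two terms ("high" and "level"). The chapter's own example below deletes punctuation instead, which yields "highlevel".

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextPipelineSketch {
    // Tokenization + normalization: lowercase, turn every run of
    // non-letter characters into a single space, then split on spaces.
    static List<String> tokenize(String text) {
        String normalized = text.toLowerCase(Locale.ROOT)
                                .replaceAll("[^a-z]+", " ")
                                .trim();
        List<String> tokens = new ArrayList<>();
        if (normalized.isEmpty()) {
            return tokens; // nothing but punctuation or whitespace
        }
        for (String token : normalized.split(" ")) {
            tokens.add(token);
        }
        return tokens;
    }

    // Deduplication: keep each term once, in first-seen order.
    static Set<String> uniqueTerms(String text) {
        return new LinkedHashSet<>(tokenize(text));
    }

    public static void main(String[] args) {
        String text = "Java is a high-level language. Java!";
        System.out.println(tokenize(text));
        // prints [java, is, a, high, level, language, java]
        System.out.println(uniqueTerms(text));
        // prints [java, is, a, high, level, language]
    }
}
```

Note that this sketch only handles ASCII letters; text in other scripts would need a different character class.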

Example: Building a Simple Inverted Index

Below is a basic implementation that indexes a few text documents.

import java.util.*;

public class InvertedIndexExample {
    public static void main(String[] args) {
        // Sample documents with their IDs. A LinkedHashMap keeps insertion order
        // (Map.of has no defined iteration order), so the output is reproducible.
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "Java is a high-level programming language.");
        documents.put("doc2", "Python is a popular programming language.");
        documents.put("doc3", "Java and Python are used in many applications.");

        // Inverted index: word -> set of document IDs, in first-seen word order
        Map<String, Set<String>> invertedIndex = new LinkedHashMap<>();

        // Build the index
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String docId = entry.getKey();
            String content = entry.getValue();

            // Tokenization and normalization: lowercase, delete everything except
            // letters and spaces (so "high-level" becomes "highlevel"), then split
            String[] words = content.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");

            for (String word : words) {
                // Add document ID to the sorted set of documents for this word
                invertedIndex
                    .computeIfAbsent(word, k -> new TreeSet<>())
                    .add(docId);
            }
        }

        // Display the inverted index
        System.out.println("Inverted Index:");
        for (Map.Entry<String, Set<String>> entry : invertedIndex.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}

Expected Output

Inverted Index:
java -> [doc1, doc3]
is -> [doc1, doc2]
a -> [doc1, doc2]
highlevel -> [doc1]
programming -> [doc1, doc2]
language -> [doc1, doc2]
python -> [doc2, doc3]
popular -> [doc2]
and -> [doc3]
are -> [doc3]
used -> [doc3]
in -> [doc3]
many -> [doc3]
applications -> [doc3]

Analysis and Practical Notes

The use of a Set for document IDs ensures no duplicates, which is essential when the same word appears multiple times in the same document.
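The index is built with computeIfAbsent(), a Map method added in Java 8 that creates the set for a word the first time the word is seen and returns the existing set on every later call.
Normalization (e.g., lowercasing and punctuation removal) ensures that "Java" and "java" are treated as the same term, so a query matches regardless of capitalization. Note that simply deleting punctuation also merges "high-level" into the single term "highlevel", as the output shows.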
Stop words (e.g., "is", "a", "in") can be optionally removed to reduce index size and noise in real systems.
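As an illustration of the last point, stop-word removal can be sketched as a filter applied to the token list before indexing. The class name StopWordFilterSketch and the stop-word list are assumptions made for this example; real systems usually rely on a much longer, language-specific list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopWordFilterSketch {
    // Illustrative list only (an assumption, not a standard list)
    static final Set<String> STOP_WORDS =
        Set.of("a", "an", "and", "are", "in", "is", "the");

    // Returns the tokens that are not stop words, preserving their order.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("java", "is", "a", "programming", "language");
        System.out.println(removeStopWords(tokens));
        // prints [java, programming, language]
    }
}
```

If stop words are dropped at indexing time, the same filter should also be applied to query terms; otherwise a query containing "is" would never match anything.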

Extending the Index

In real-world systems, inverted indexes often store:

Word positions (Map<String, Map<String, List<Integer>>>, i.e., term -> document ID -> positions),
Frequencies for ranking,
Timestamps or metadata for filtering.
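The first two items can be sketched together, since a term's frequency in a document is simply the length of its position list. This is a minimal sketch, assuming the same normalization as the example above; the names PositionalIndexSketch, build, and termFrequency are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionalIndexSketch {
    // term -> (document ID -> zero-based positions of the term in that document)
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> documents) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String[] words = entry.getValue().toLowerCase()
                                  .replaceAll("[^a-z ]", "").trim().split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                if (words[pos].isEmpty()) {
                    continue; // empty document after normalization
                }
                index.computeIfAbsent(words[pos], k -> new TreeMap<>())
                     .computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    // Frequency of a term in a document = number of recorded positions.
    static int termFrequency(Map<String, Map<String, List<Integer>>> index,
                             String term, String docId) {
        return index.getOrDefault(term, Map.of())
                    .getOrDefault(docId, List.of())
                    .size();
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index =
            build(Map.of("doc1", "Java is fun and Java is fast"));
        System.out.println(index.get("java"));                 // prints {doc1=[0, 4]}
        System.out.println(termFrequency(index, "is", "doc1")); // prints 2
    }
}
```

Positions make phrase queries possible (two terms match as a phrase when their positions in the same document are adjacent), at the cost of a larger index.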

Conclusion

Using Map and Set, we can create a simple but functional inverted index for document search. This forms the basis for query processing, relevance ranking, and more advanced search engine features. The next section shows how to query this index.

17.2 Query Processing with Queues and Lists

Once documents are indexed, the next step is processing search queries. This involves parsing user input, retrieving relevant documents from the inverted index, ranking them (optionally), and returning results. In Java, Queues and Lists play important roles in this stage.

Using Queues for Query Parsing and Execution

A Queue processes query terms in First-In-First-Out (FIFO) order, that is, in the order the user entered them. In a real-world system, parsed search terms might be placed in a queue for processing, especially when queries are compound (e.g., "Java AND Python").

import java.util.*;

public class QueryProcessorWithQueue {
    public static void main(String[] args) {
        // Simulated inverted index
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2")
        );

        // Queue of query terms in the order they were entered
        Queue<String> queryTerms = new ArrayDeque<>(List.of("java", "programming"));

        // Result set to hold document IDs that match ALL terms (AND search)
        Set<String> resultSet = null;

        while (!queryTerms.isEmpty()) {
            String term = queryTerms.poll();
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());

            if (resultSet == null) {
                resultSet = new HashSet<>(docs);
            } else {
                // Retain only documents that appear in both resultSet and current term's set
                resultSet.retainAll(docs);
            }
        }

        System.out.println("Matching documents: " + resultSet);
    }
}

Expected Output

Matching documents: [doc1]

Explanation: doc1 is the only document that contains both "java" and "programming".
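The same loop adapts to OR semantics by taking the union of the per-term sets instead of their intersection. The sketch below shows both variants side by side; the class name BooleanQuerySketch and the method names matchAll and matchAny are hypothetical, and an empty query is assumed to match no documents.

```java
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeSet;

class BooleanQuerySketch {
    // AND: documents that contain every term. Consumes the queue.
    static Set<String> matchAll(Map<String, Set<String>> index, Queue<String> terms) {
        Set<String> result = null;
        while (!terms.isEmpty()) {
            Set<String> docs = index.getOrDefault(terms.poll(), Set.of());
            if (result == null) {
                result = new TreeSet<>(docs); // first term seeds the result
            } else {
                result.retainAll(docs);       // intersection with later terms
            }
        }
        // Assumption: a query with no terms matches nothing
        return result == null ? new TreeSet<>() : result;
    }

    // OR: documents that contain at least one term. Consumes the queue.
    static Set<String> matchAny(Map<String, Set<String>> index, Queue<String> terms) {
        Set<String> result = new TreeSet<>();
        while (!terms.isEmpty()) {
            result.addAll(index.getOrDefault(terms.poll(), Set.of())); // union
        }
        return result;
    }
}
```

With the simulated index above, matchAll for the terms "java" and "python" yields [doc3], while matchAny for the same terms yields [doc1, doc2, doc3].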

Using Lists for Ordered Results

When presenting search results, a List is often used because:

It supports ordering of results (e.g., by relevance or recency),
It allows indexed access for pagination (e.g., result 1–10),
It supports sorting based on scores.

Here’s a simple example that ranks documents by number of matching terms:

import java.util.*;

public class RankedResultWithList {
    public static void main(String[] args) {
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2", "doc3")
        );

        List<String> queryTerms = List.of("java", "python", "programming");

        // Score map: document -> count of matched terms
        Map<String, Integer> docScores = new HashMap<>();

        for (String term : queryTerms) {
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());
            for (String doc : docs) {
                docScores.merge(doc, 1, Integer::sum); // increment this document's match count
            }
        }

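        // Create a list of results; sort by score descending, breaking ties by
        // document ID so the order does not depend on HashMap iteration order
        List<Map.Entry<String, Integer>> results = new ArrayList<>(docScores.entrySet());
        results.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });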

        // Print ordered results
        System.out.println("Ranked Results:");
        for (Map.Entry<String, Integer> entry : results) {
            System.out.println(entry.getKey() + " (score: " + entry.getValue() + ")");
        }
    }
}

Expected Output

Ranked Results:
doc3 (score: 3)
doc1 (score: 2)
doc2 (score: 2)
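Indexed access for pagination, mentioned in the list of reasons above, can be sketched with List.subList. The class name PaginationSketch and the 1-based page numbering are assumptions made for this example.

```java
import java.util.List;

class PaginationSketch {
    // Returns the given page (1-based) of the results,
    // or an empty list when the page lies past the end.
    static <T> List<T> page(List<T> results, int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        long from = (long) (page - 1) * pageSize; // long avoids int overflow
        if (from >= results.size()) {
            return List.of();
        }
        int to = (int) Math.min(from + pageSize, results.size());
        return results.subList((int) from, to);   // a view, not a copy
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("doc3", "doc1", "doc2", "doc5", "doc4");
        System.out.println(page(ranked, 1, 2)); // prints [doc3, doc1]
        System.out.println(page(ranked, 3, 2)); // prints [doc4]
        System.out.println(page(ranked, 4, 2)); // prints []
    }
}
```

Because subList returns a view backed by the original list, callers that keep a page around while the result list changes should copy it first (for example with new ArrayList<>(...)).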

Summary

Queue ensures query terms are processed in order, especially useful for AND/OR evaluation.
List is suited for returning and sorting results, supporting rich operations like pagination and ranking.
Together, they help create an efficient and organized flow in query handling, from term analysis to result display.

In more advanced search engines, these data structures may work with priority queues (for top-k results), trees (for prefix searches), or graphs (for document linking), but Lists and Queues form the essential processing backbone.
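As one example of the first option, a PriorityQueue can keep only the k best-scoring documents without sorting the whole result list. This is a sketch under stated assumptions: the class name TopKSketch, the method topK, and the tie-breaking rule (smaller document ID wins) are illustrative choices, not something the chapter prescribes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopKSketch {
    // Returns the IDs of the k highest-scoring documents, best first.
    static List<String> topK(Map<String, Integer> scores, int k) {
        // Min-heap: the weakest kept entry sits at the head. Lower score is
        // weaker; among equal scores, the larger document ID is weaker.
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byScore = Integer.compare(a.getValue(), b.getValue());
            return byScore != 0 ? byScore : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);

        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the weakest so at most k entries remain
            }
        }

        // The heap drains weakest-first, so reverse to get best-first order
        List<String> best = new ArrayList<>();
        while (!heap.isEmpty()) {
            best.add(heap.poll().getKey());
        }
        Collections.reverse(best);
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> scores = Map.of("doc1", 2, "doc2", 2, "doc3", 3);
        System.out.println(topK(scores, 2)); // prints [doc3, doc1]
    }
}
```

The heap never holds more than k + 1 entries, so for n scored documents the work is roughly O(n log k) rather than the O(n log n) of a full sort.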
