Case Study: Building a Mini Search Engine

Java Collections

17.1 Indexing Text with Maps and Sets

Indexing is the foundational step in building any search engine. It transforms raw text into a searchable structure known as an inverted index, which maps words (terms) to the documents or positions where they occur. This structure allows fast lookups during search queries.

Understanding the Inverted Index

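An inverted index can be represented as a Map<String, Set<String>>, where each key is a term (word) and each value is the set of IDs of the documents that contain it.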

This is in contrast to a "forward index," which maps documents to the words they contain.
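To make the contrast concrete, the following is a minimal sketch that turns a forward index into an inverted one. The class name ForwardToInvertedSketch and the method invert are hypothetical names chosen for illustration, and TreeMap/TreeSet are used only so the printed order is stable.

```java
import java.util.*;

class ForwardToInvertedSketch {
    // Inverts a forward index (document ID -> terms) into an inverted index (term -> document IDs).
    static Map<String, Set<String>> invert(Map<String, List<String>> forwardIndex) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : forwardIndex.entrySet()) {
            for (String term : entry.getValue()) {
                // TreeSet keeps document IDs sorted, so the printed output is stable
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forwardIndex = Map.of(
            "doc1", List.of("java", "language"),
            "doc2", List.of("python", "language")
        );
        System.out.println(invert(forwardIndex));
        // prints {java=[doc1], language=[doc1, doc2], python=[doc2]}
    }
}
```

The forward index answers "which words are in this document?", while the inverted result answers "which documents contain this word?", which is the question a search query asks.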

Text Processing Pipeline

Before indexing, input text should undergo:

  1. Tokenization – Splitting text into individual words.
  2. Normalization – Lowercasing, removing punctuation, stemming, etc.
  3. Deduplication – Avoid recording the same word–document pair multiple times.
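The three steps above can be sketched as a single helper. This is one possible arrangement, not a definitive pipeline: the class TextPipelineSketch and the method terms are hypothetical names, stemming is left out, and normalization is applied before splitting for simplicity. Non-letter characters are replaced with a space here, so a hyphenated word is split into two terms.

```java
import java.util.*;

class TextPipelineSketch {
    // Returns the distinct normalized terms of a text, in first-seen order.
    static Set<String> terms(String text) {
        // Normalization: lowercase, then replace every run of non-letters with one space
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
        Set<String> result = new LinkedHashSet<>(); // deduplication: a Set ignores repeats
        if (normalized.isEmpty()) {
            return result;
        }
        // Tokenization: split on whitespace
        result.addAll(Arrays.asList(normalized.split("\\s+")));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(terms("Java is fun. Java is fast!"));
        // prints [java, is, fun, fast]
    }
}
```

A LinkedHashSet is used so that repeated words collapse to one entry while the first-seen order is kept, which makes the result easy to inspect.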

Example: Building a Simple Inverted Index

Below is a basic implementation that indexes a few text documents.

import java.util.*;

public class InvertedIndexExample {
    public static void main(String[] args) {
        // Sample documents with their IDs
        Map<String, String> documents = Map.of(
            "doc1", "Java is a high-level programming language.",
            "doc2", "Python is a popular programming language.",
            "doc3", "Java and Python are used in many applications."
        );

        // Inverted index: word -> set of document IDs
        Map<String, Set<String>> invertedIndex = new HashMap<>();

        // Build the index
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String docId = entry.getKey();
            String content = entry.getValue();

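            // Tokenization and normalization: lowercase, delete everything except
            // a-z and spaces (so "high-level" becomes "highlevel"), split on whitespace
            String[] words = content.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");

            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // split yields an empty token when the text starts with whitespace
                }
                // Add document ID to the set of documents for this word
                invertedIndex
                    .computeIfAbsent(word, k -> new HashSet<>())
                    .add(docId);
            }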
        }

        // Display the inverted index
        System.out.println("Inverted Index:");
        for (Map.Entry<String, Set<String>> entry : invertedIndex.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}

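Expected Output

The order of the lines, and of the document IDs within each set, may differ between runs, because Map.of, HashMap, and HashSet do not guarantee iteration order.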

Inverted Index:
java -> [doc1, doc3]
is -> [doc1, doc2]
a -> [doc1, doc2]
highlevel -> [doc1]
programming -> [doc1, doc2]
language -> [doc1, doc2]
python -> [doc2, doc3]
popular -> [doc2]
and -> [doc3]
are -> [doc3]
used -> [doc3]
in -> [doc3]
many -> [doc3]
applications -> [doc3]

Analysis and Practical Notes

Extending the Index

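In real-world systems, inverted indexes often store more than document IDs, for example the positions at which each term occurs in a document and how often it occurs.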
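As a rough illustration of one such extension, the sketch below records the positions of each term per document, from which a term frequency can also be read. The class PositionalIndexSketch and the method build are hypothetical names, and the sketch assumes non-empty input text that has already been stripped of punctuation.

```java
import java.util.*;

class PositionalIndexSketch {
    // term -> (document ID -> positions of the term in that document)
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> documents) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String[] words = entry.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            for (int position = 0; position < words.length; position++) {
                index.computeIfAbsent(words[position], k -> new TreeMap<>())
                     .computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                     .add(position);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<Integer>>> index = build(Map.of(
            "doc1", "java loves java",
            "doc2", "python loves java"
        ));
        System.out.println(index.get("java"));
        // prints {doc1=[0, 2], doc2=[2]}

        // Term frequency is the size of the position list
        System.out.println(index.get("java").get("doc1").size());
        // prints 2
    }
}
```

Positions make phrase queries possible (two terms match as a phrase when their positions are adjacent), at the cost of a larger index.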

Conclusion

Using Map and Set, we can create a simple but functional inverted index for document search. This forms the basis for query processing, relevance ranking, and more advanced search engine features. The next section covers how to query this index efficiently.

17.2 Query Processing with Queues and Lists

Once documents are indexed, the next step is processing search queries. This involves parsing user input, retrieving relevant documents from the inverted index, ranking them (optionally), and returning results. In Java, Queues and Lists play important roles in this stage.

Using Queues for Query Parsing and Execution

A Queue is well suited to processing query terms in First-In-First-Out (FIFO) order. In a real-world system, search terms might be parsed and placed in a queue for processing, especially when queries are compound (e.g., "Java AND Python").

import java.util.*;

public class QueryProcessorWithQueue {
    public static void main(String[] args) {
        // Simulated inverted index
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2")
        );

        // Queue of query terms in the order they were entered
        Queue<String> queryTerms = new LinkedList<>(List.of("java", "programming"));

        // Result set to hold document IDs that match ALL terms (AND search)
        Set<String> resultSet = null;

        while (!queryTerms.isEmpty()) {
            String term = queryTerms.poll();
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());

            if (resultSet == null) {
                resultSet = new HashSet<>(docs);
            } else {
                // Retain only documents that appear in both resultSet and current term's set
                resultSet.retainAll(docs);
            }
        }

        System.out.println("Matching documents: " + resultSet);
    }
}

Expected Output

Matching documents: [doc1]

Explanation: doc1 is the only document that appears in the sets for both "java" and "programming".
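The example above starts from a queue that is already filled. As a minimal sketch of the parsing step, the code below turns a raw query string such as "Java AND Python" into a queue and then runs the same AND search. The class QueryParsingSketch and the methods parse and search are hypothetical names, and only the AND operator is handled.

```java
import java.util.*;

class QueryParsingSketch {
    // Splits a raw query such as "Java AND Python" into a FIFO queue of lowercase terms.
    static Queue<String> parse(String rawQuery) {
        Queue<String> terms = new ArrayDeque<>();
        for (String token : rawQuery.trim().split("\\s+")) {
            // Only AND is supported in this sketch, so the operator itself is skipped
            if (!token.isEmpty() && !token.equals("AND")) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Intersects the document sets of all queued terms (AND semantics).
    static Set<String> search(Map<String, Set<String>> index, Queue<String> terms) {
        if (terms.isEmpty()) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(index.getOrDefault(terms.poll(), Set.of()));
        while (!terms.isEmpty()) {
            result.retainAll(index.getOrDefault(terms.poll(), Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3")
        );
        System.out.println(search(index, parse("Java AND Python")));
        // prints [doc3]
    }
}
```

Returning an empty set for an empty query avoids the null result that the loop-based version would print if no terms were queued.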

Using Lists for Ordered Results

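When presenting search results, a List is often used because it keeps its elements in a defined order, can be sorted by score, and supports access by position (for example, to show one page of results at a time).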

Here’s a simple example that ranks documents by number of matching terms:

import java.util.*;

public class RankedResultWithList {
    public static void main(String[] args) {
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2", "doc3")
        );

        List<String> queryTerms = List.of("java", "python", "programming");

        // Score map: document -> count of matched terms
        Map<String, Integer> docScores = new HashMap<>();

        for (String term : queryTerms) {
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());
            for (String doc : docs) {
                docScores.merge(doc, 1, Integer::sum); // increment this document's score
            }
        }

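        // Create a list of results and sort by score descending;
        // ties are broken by document ID so the output order is deterministic
        List<Map.Entry<String, Integer>> results = new ArrayList<>(docScores.entrySet());
        results.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });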

        // Print ordered results
        System.out.println("Ranked Results:");
        for (Map.Entry<String, Integer> entry : results) {
            System.out.println(entry.getKey() + " (score: " + entry.getValue() + ")");
        }
    }
}

Expected Output

Ranked Results:
doc3 (score: 3)
doc1 (score: 2)
doc2 (score: 2)

Summary

In more advanced search engines, these data structures may work with priority queues (for top-k results), trees (for prefix searches), or graphs (for document linking), but Lists and Queues form the essential processing backbone.
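As an illustration of the priority-queue idea mentioned above, here is a minimal top-k sketch. The class TopKSketch and the method topK are hypothetical names, and the scores are assumed to be simple matched-term counts like those in the ranking example.

```java
import java.util.*;

class TopKSketch {
    // Returns the IDs of the k highest-scoring documents, best first.
    static List<String> topK(Map<String, Integer> docScores, int k) {
        // Min-heap ordered by score: the weakest retained candidate sits at the head
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> entry : docScores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the current lowest score
            }
        }
        // Polling yields ascending scores, so inserting at the front reverses the order
        LinkedList<String> best = new LinkedList<>();
        while (!heap.isEmpty()) {
            best.addFirst(heap.poll().getKey());
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> docScores = Map.of("doc1", 2, "doc2", 1, "doc3", 3, "doc4", 0);
        System.out.println(topK(docScores, 2));
        // prints [doc3, doc1]
    }
}
```

Because the heap never holds more than k + 1 entries, this avoids sorting the full result list when only the best few documents are needed. The order among documents with equal scores is not defined in this sketch.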
