Indexing is the foundational step in building any search engine. It transforms raw text into a searchable structure known as an inverted index, which maps words (terms) to the documents or positions where they occur. This structure allows fast lookups during search queries.
An inverted index can be modeled as a Map<String, Set<String>>, where each key is a term and each value is a Set of the identifiers of the documents that contain it (a set avoids duplicates). This is in contrast to a "forward index," which maps documents to the words they contain.
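To make the forward/inverted distinction concrete, here is a minimal sketch that turns a small forward index into an inverted one. The class name ForwardToInverted, the invert helper, and the sample data are illustrative assumptions, and the terms are assumed to be normalized already.

```java
import java.util.*;

class ForwardToInverted {
    // Inverts a forward index (document -> terms) into an inverted index (term -> documents).
    // TreeMap and TreeSet are used only to keep the printed output in a predictable order.
    static Map<String, Set<String>> invert(Map<String, List<String>> forwardIndex) {
        Map<String, Set<String>> inverted = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : forwardIndex.entrySet()) {
            String docId = entry.getKey();
            for (String term : entry.getValue()) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        // Hypothetical forward index: document ID -> terms it contains
        Map<String, List<String>> forwardIndex = new LinkedHashMap<>();
        forwardIndex.put("doc1", List.of("java", "language"));
        forwardIndex.put("doc2", List.of("python", "language"));

        System.out.println(invert(forwardIndex));
        // prints {java=[doc1], language=[doc1, doc2], python=[doc2]}
    }
}
```

A lookup for "language" in the forward index would have to scan every document, whereas the inverted form answers it with a single map access.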
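Before indexing, input text should undergo tokenization (splitting the text into words) and normalization (for example, lowercasing and removing punctuation).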
Below is a basic implementation that indexes a few text documents.
import java.util.*;

public class InvertedIndexExample {
    public static void main(String[] args) {
        // Sample documents with their IDs. LinkedHashMap keeps insertion order;
        // Map.of does not define an iteration order.
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "Java is a high-level programming language.");
        documents.put("doc2", "Python is a popular programming language.");
        documents.put("doc3", "Java and Python are used in many applications.");

        // Inverted index: word -> set of document IDs, kept in first-seen word order
        Map<String, Set<String>> invertedIndex = new LinkedHashMap<>();

        // Build the index
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String docId = entry.getKey();
            String content = entry.getValue();

            // Tokenization and normalization: lowercase, drop everything except
            // letters and spaces (so "high-level" becomes "highlevel"), split on whitespace
            String[] words = content.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");

            for (String word : words) {
                if (word.isEmpty()) {
                    continue; // skip the empty token that leading whitespace would produce
                }
                // Add document ID to the (sorted) set of documents for this word
                invertedIndex
                    .computeIfAbsent(word, k -> new TreeSet<>())
                    .add(docId);
            }
        }

        // Display the inverted index
        System.out.println("Inverted Index:");
        for (Map.Entry<String, Set<String>> entry : invertedIndex.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
Inverted Index:
java -> [doc1, doc3]
is -> [doc1, doc2]
a -> [doc1, doc2]
highlevel -> [doc1]
programming -> [doc1, doc2]
language -> [doc1, doc2]
python -> [doc2, doc3]
popular -> [doc2]
and -> [doc3]
are -> [doc3]
used -> [doc3]
in -> [doc3]
many -> [doc3]
applications -> [doc3]
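Two details are worth noting. Using a Set for document IDs ensures no duplicates, which is essential when the same word appears multiple times in the same document. The index is populated with computeIfAbsent(), a convenient method introduced in Java 8 that creates the set for a word the first time that word is seen.
In real-world systems, inverted indexes often store more than document IDs, for example the positions at which each term occurs and how many times it occurs in each document.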
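One common extension is to record where each term occurs rather than only which documents contain it. The sketch below is an illustration under stated assumptions: the class name PositionalIndexSketch, the build helper, and the sample sentences are hypothetical, and the tokenization is the same simple lowercase-and-strip approach used above.

```java
import java.util.*;

class PositionalIndexSketch {
    // Builds term -> (document ID -> 0-based word positions of the term in that document).
    static Map<String, Map<String, List<Integer>>> build(Map<String, String> documents) {
        Map<String, Map<String, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<String, String> entry : documents.entrySet()) {
            String[] words = entry.getValue()
                .toLowerCase()
                .replaceAll("[^a-z ]", "")
                .trim()
                .split("\\s+");
            for (int pos = 0; pos < words.length; pos++) {
                if (words[pos].isEmpty()) {
                    continue; // an empty document yields a single empty token
                }
                index.computeIfAbsent(words[pos], k -> new TreeMap<>())
                     .computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                     .add(pos);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical sample documents
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "Java code calls Java libraries.");
        documents.put("doc2", "Python code.");

        Map<String, Map<String, List<Integer>>> index = build(documents);
        System.out.println(index.get("java")); // prints {doc1=[0, 3]}
        System.out.println(index.get("code")); // prints {doc1=[1], doc2=[1]}

        // The term frequency is the length of the positions list
        System.out.println(index.get("java").get("doc1").size()); // prints 2
    }
}
```

Storing positions costs more memory than a plain set of document IDs, but it makes phrase matching and frequency-based ranking possible without rereading the documents.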
Using Map and Set, we can create a simple but functional inverted index for document search. This forms the basis for query processing, relevance ranking, and more advanced search engine features. The next step, covered in the next section, is to query this index efficiently.
Once documents are indexed, the next step is processing search queries. This involves parsing user input, retrieving relevant documents from the inverted index, ranking them (optionally), and returning results. In Java, Queues and Lists play important roles in this stage.
A Queue is ideal for processing query terms in First-In-First-Out (FIFO) order. In a real-world system, search terms might be parsed and placed in a queue for processing, especially when queries are compound (e.g., "Java AND Python").
import java.util.*;

public class QueryProcessorWithQueue {
    public static void main(String[] args) {
        // Simulated inverted index
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2")
        );

        // Queue of query terms in the order they were entered
        Queue<String> queryTerms = new ArrayDeque<>(List.of("java", "programming"));

        // Result set to hold document IDs that match ALL terms (AND search);
        // null means no term has been processed yet
        Set<String> resultSet = null;

        while (!queryTerms.isEmpty()) {
            String term = queryTerms.poll();
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());
            if (resultSet == null) {
                // First term: start from a mutable, sorted copy of its documents
                resultSet = new TreeSet<>(docs);
            } else {
                // Retain only documents that appear in both resultSet and current term's set
                resultSet.retainAll(docs);
            }
        }

        // An empty query matches nothing
        if (resultSet == null) {
            resultSet = Set.of();
        }
        System.out.println("Matching documents: " + resultSet);
    }
}
Matching documents: [doc1]
Explanation: doc1 is the only document that contains both "java" and "programming".
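For comparison, an OR search can be sketched with the same queue-driven loop by taking the union of each term's documents instead of the intersection. The class name OrQuerySketch and the matchAny helper are hypothetical names chosen for this illustration; note that the helper drains the queue it is given.

```java
import java.util.*;

class OrQuerySketch {
    // Returns the IDs of documents that contain AT LEAST ONE of the queued terms.
    // The queue is consumed in FIFO order, and unknown terms contribute nothing.
    static Set<String> matchAny(Map<String, Set<String>> invertedIndex, Queue<String> queryTerms) {
        Set<String> result = new TreeSet<>(); // sorted for predictable output
        while (!queryTerms.isEmpty()) {
            String term = queryTerms.poll();
            result.addAll(invertedIndex.getOrDefault(term, Set.of()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same simulated index as in the AND example
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2")
        );

        Queue<String> queryTerms = new ArrayDeque<>(List.of("java", "programming"));
        System.out.println("Matching documents: " + matchAny(invertedIndex, queryTerms));
        // prints Matching documents: [doc1, doc2, doc3]
    }
}
```

The only difference from the AND version is addAll in place of retainAll, which is why an OR search can start from an empty result set while an AND search has to be seeded with the first term's documents.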
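When presenting search results, a List is often used because it preserves order, can be sorted by relevance, and supports index-based operations such as pagination.
Here's a simple example that ranks documents by the number of matching terms: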
import java.util.*;

public class RankedResultWithList {
    public static void main(String[] args) {
        Map<String, Set<String>> invertedIndex = Map.of(
            "java", Set.of("doc1", "doc3"),
            "python", Set.of("doc2", "doc3"),
            "programming", Set.of("doc1", "doc2", "doc3")
        );

        List<String> queryTerms = List.of("java", "python", "programming");

        // Score map: document -> count of matched terms
        Map<String, Integer> docScores = new HashMap<>();
        for (String term : queryTerms) {
            Set<String> docs = invertedIndex.getOrDefault(term, Set.of());
            for (String doc : docs) {
                // Start at 1 for a new document, otherwise add 1 to its score
                docScores.merge(doc, 1, Integer::sum);
            }
        }

        // Create a list of results and sort by score descending;
        // ties are broken by document ID so the order is deterministic
        List<Map.Entry<String, Integer>> results = new ArrayList<>(docScores.entrySet());
        results.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });

        // Print ordered results
        System.out.println("Ranked Results:");
        for (Map.Entry<String, Integer> entry : results) {
            System.out.println(entry.getKey() + " (score: " + entry.getValue() + ")");
        }
    }
}
Ranked Results:
doc3 (score: 3)
doc1 (score: 2)
doc2 (score: 2)
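Because ranked results live in a List, a single page of results can be selected by index range. The following is a minimal sketch; the class name PaginationSketch, the page helper, and the 1-based page numbering are assumptions made for illustration.

```java
import java.util.*;

class PaginationSketch {
    // Returns the requested page of results (pageNumber is 1-based).
    // A page beyond the end of the list yields an empty list.
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be at least 1");
        }
        // long arithmetic avoids int overflow for very large page numbers or sizes
        long from = (long) (pageNumber - 1) * pageSize;
        if (from >= results.size()) {
            return List.of();
        }
        long to = Math.min(from + pageSize, results.size());
        // Copy the view so the page does not depend on the backing list
        return new ArrayList<>(results.subList((int) from, (int) to));
    }

    public static void main(String[] args) {
        // Ranked document IDs, as produced by the example above
        List<String> ranked = List.of("doc3", "doc1", "doc2");

        System.out.println(page(ranked, 1, 2)); // prints [doc3, doc1]
        System.out.println(page(ranked, 2, 2)); // prints [doc2]
        System.out.println(page(ranked, 3, 2)); // prints []
    }
}
```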
To summarize: a Queue ensures query terms are processed in order, which is especially useful for AND/OR evaluation, while a List is suited for returning and sorting results, supporting rich operations like pagination and ranking.
In more advanced search engines, these data structures may work with priority queues (for top-k results), trees (for prefix searches), or graphs (for document linking), but Lists and Queues form the essential processing backbone.
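As an illustration of the priority-queue idea, the sketch below keeps only the k best-scoring documents in a bounded min-heap instead of sorting every result. The class name TopKSketch, the topK helper, and the hard-coded scores (taken from the ranking example) are assumptions for this sketch, not part of any particular search engine.

```java
import java.util.*;

class TopKSketch {
    // Returns the IDs of the k highest-scoring documents, best first.
    // Ties are broken by document ID (smaller ID ranks higher).
    static List<String> topK(Map<String, Integer> scores, int k) {
        // Min-heap whose head is the weakest kept result:
        // lowest score first; among equal scores, the larger ID first
        Comparator<Map.Entry<String, Integer>> weakestFirst = (a, b) -> {
            int byScore = Integer.compare(a.getValue(), b.getValue());
            return byScore != 0 ? byScore : b.getKey().compareTo(a.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(weakestFirst);

        for (Map.Entry<String, Integer> entry : scores.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the weakest so at most k entries remain
            }
        }

        // The heap drains weakest-first, so reverse to get best-first order
        List<String> best = new ArrayList<>();
        while (!heap.isEmpty()) {
            best.add(heap.poll().getKey());
        }
        Collections.reverse(best);
        return best;
    }

    public static void main(String[] args) {
        // Scores as computed in the ranking example
        Map<String, Integer> scores = Map.of("doc1", 2, "doc2", 2, "doc3", 3);
        System.out.println(topK(scores, 2)); // prints [doc3, doc1]
    }
}
```

With n scored documents the heap never holds more than k + 1 entries, so this does O(n log k) work rather than the O(n log n) of a full sort, which matters when only the first page of results is shown.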