"Doug Cutting, an experienced developer of text-search and retrieval tools, created Lucene. Cutting is the primary author of the V-Twin search engine (part of Apple's Copland operating system effort) and is currently a senior architect at Excite. He designed Lucene to make it easy to add indexing and search capability to a broad range of applications, including:"
"Although Lucene is well known for its full-text indexing, many developers are less aware that it can also provide powerful complementary searching, filtering, and sorting functionalities. Indeed, many searches involve combining full-text searches with filters on different fields or criteria. For example, you may want to search a database of books or articles using a full-text search, but with the possibility to limit the results to certain types of books. Traditionally, this type of criteria-based searching is in the realm of the relational database. However, Lucene offers numerous powerful features that let you efficiently combine full-text searches with criteria-based searches and sorts."
"Lucene is a free text-indexing and -searching API written in Java. To appreciate indexing techniques described later in this article, you need a basic understanding of Lucene's index structure. As I mentioned in the previous article in this series, a typical Lucene index is stored in a single directory in the filesystem on a hard disk."
"In this article I propose the approach of using Lucene, the Java-based open source search engine, to search source code by extracting and indexing relevant source code elements. I restrict the search to Java source code only. However, extending the search to any other programming language's source code should not be very different."
"While we were looking into Google, I was also looking at Lucene. Lucene has always interested me, as it isn?t a typical open source project. In my experience, most open source projects are frameworks that have evolved. Take something like Struts. Before Struts many people were rolling their own MVC layers on top of Servlets/JSPs. It made sense to not have to reinvent this wheel, so Struts came around."
"The authors review and condense a wide range of Lucene?s core material, including indexing, searching, analysis (tokenization), sorting/filtering, span queries, term vectors and other advanced search techniques; as well as applied Lucene material, such as document parsing, tools and extensions, Lucene ports and case studies, to put together the most comprehensive and authoritative guide to Lucene ever published."
"All modern search engines attempt to detect and correct spelling errors in users' search queries. Google, for example, was one of the first to offer such a facility, and today we barely notice when we are asked "Did you mean x?" after a slip on the keyboard. This article shows you one way of adding a "did you mean" suggestion facility to your own search applications using the Lucene Spell Checker, an extension written by Nicolas Maisonneuve and David Spencer."
"Editor's note: We are rerunning this Introduction to Lucene that originally ran in July 2003 in honor of the publication of "Lucene in action" by Otis Gospodnetic and Erik Hatcher. To see an example of Lucene in action, take a look at Erik's www.lucenebook.com site."
"Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web. All three have different characteristics. For instance, crawling a local filesystem is reliable compared to the other two, since network errors don't occur and caching copies of the page content is unnecessary (and actually a waste of disk space). Whole-web crawling lies at the other extreme. Crawling billions of pages creates a whole host of engineering problems to be solved: which pages do we start with? How do we partition the work between a set of crawlers? How often do we re-crawl? How do we cope with broken links, unresponsive sites, and unintelligible or duplicate content? There is another set of challenges to solve to deliver scalable search--how do we cope with hundreds of concurrent queries on such a large dataset? Building a whole-web search engine is a major investment. In " Building Nutch: Open Source Search," authors Mike Cafarella and Doug Cutting (the prime movers behind Nutch) conclude that:"
"In this two-article series, we introduced Nutch and discovered how to crawl a small collection of websites and run a Nutch search engine using the results of the crawl. We covered the basics of Nutch, but there are many other aspects to explore, such as the numerous plugins available to customize your setup, the tools for maintaining the search index (type bin/nutch to get a list), or even whole-web crawling and searching. Possibly the best thing about Nutch, though, is its vibrant user and developer community, which is continually coming up with new ideas and ways to do all things search-related."
"Enter Lucene. I'll presume you've heard of it at least, if not used it. Lucene does full text indexing, and that is it. It does this really well. The beauty (well, one) is that you can index anything. In this case, I'll index an object being persisted by OJB. The key is to embed information required to retrieve the document being indexed."
"The second example in this article shows a (reasonably) practical Web application. You can add any number of Word, PowerPoint, PDF, OpenOffice.org, HTML, and AbiWord documents to a directory of your choice. When the Web application starts up, it indexes all documents in this specified directory and lets users search these documents and download the documents found in the search process. Review Figure 1 and Figure 2 before implementing this example."
"This article introduces you to the indexing mechanism of Lucene, a popular full-text IR library written in the Java language. First, I'll demonstrate how to index your documents with Lucene, then I'll discuss how to improve the indexing performance. Finally, I'll analyze Lucene's index file structure. Keep in mind that Lucene is not a ready-to-use application, but rather an IR Library that lets you add searching and indexing functionality to your application."
"If you've ever wanted to parse XML documents but have found SAX just a little difficult, this article is for you. In this article, we examine how to use two open source tools from the Apache Jakarta project, Commons Digester and Lucene, to handle the parsing, indexing, and searching of XML documents. Digester parses the XML data, and Lucene handles indexing and searching. You'll first see how to use each tool on its own and then how to use them together, with sample code that you can compile and run."
"In this article, you learn to implement advanced searches with Lucene, as well as how to build a sample Web search application that integrates with Lucene. The end result will be that you create your own Web search application with this open source work horse."
"Erik Hatcher codes, writes, and speaks on technical topics that he finds fun and challenging. He has written software for a number of diverse industries using many diffedifferentnologies and languages. Erik coauthored Java Development with Ant (Manning, 2002) with Steve Loughran, a book that has received wonderful industry acclaim. Since the release of Erik's first book, he has spoken at numerous venues including the No Fluff, Just Stuff symposium circuit, JavaOne, O'Reilly's Open Source Convention, the Open Source Content Management Conference, and many Java User Group meetings. As an Apache Software Foundation member, he is an active contributor and committer on several Apache projects including Lucene, Ant, and Tapestry. Erik currently works at the University of Virginia's Humanities department supporting Applied Research in Patacriticism."
"Lucene is a high performance, scalable Information Retrieval (IR) library. It lets you add indexing and searching capabilities to your applications. Lucene is a mature, free, open-source project implemented in Java; it's a member of the popular Apache Jakarta family of projects, licensed under the liberal Apache Software License. As such, Lucene is currently, and has been for a few years, the most popular free Java IR library."
"Lucene is a powerful and elegant library for full-text indexing and searching in Java. In this article, we go through some Lucene basics, by adding simple yet powerful full-text index and search functions to a typical J2EE web application."