Apache Lucene is a high-performance, full-featured text search engine library. Here's a simple example of how to use Lucene for indexing and searching (using JUnit to check that the results are what we expect):

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead:
    //Directory directory = FSDirectory.open(new File("/tmp/testindex"));
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true,
                                          new IndexWriter.MaxFieldLength(25000));
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES,
                      Field.Index.ANALYZED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    IndexReader ireader = IndexReader.open(directory); // read-only=true
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
    Query query = parser.parse("text");
    ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    isearcher.close();
    ireader.close();
    directory.close();

The Lucene API is divided into several packages:
- org.apache.lucene.analysis defines an abstract Analyzer API for converting text from a java.io.Reader into a TokenStream, an enumeration of token Attributes. A TokenStream can be composed by applying TokenFilters to the output of a Tokenizer. Tokenizers and TokenFilters are strung together and applied with an Analyzer. A handful of Analyzer implementations are provided, including StopAnalyzer and the grammar-based StandardAnalyzer.
- org.apache.lucene.document provides a simple Document class. A Document is simply a set of named Fields, whose values may be strings or instances of java.io.Reader.
- org.apache.lucene.index provides two primary classes: IndexWriter, which creates and adds documents to indices; and IndexReader, which accesses the data in the index.
- org.apache.lucene.search provides data structures to represent queries (e.g. TermQuery for individual words, PhraseQuery for phrases, and BooleanQuery for boolean combinations of queries) and the abstract Searcher, which turns queries into TopDocs. IndexSearcher implements search over a single IndexReader.
- org.apache.lucene.queryParser uses JavaCC to implement a QueryParser.
- org.apache.lucene.store defines an abstract class for storing persistent data, the Directory, which is a collection of named files written by an IndexOutput and read by an IndexInput. Multiple implementations are provided, including FSDirectory, which uses a file system directory to store files, and RAMDirectory, which implements files as memory-resident data structures.
- org.apache.lucene.util contains a few handy data structures and utility classes, e.g. BitVector and PriorityQueue.
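
The composition described for org.apache.lucene.analysis (a Tokenizer produces tokens, TokenFilters transform them, and an Analyzer strings the pieces together) can be illustrated without Lucene itself. The following is a minimal plain-Java sketch of that idea only. `ToyAnalyzer`, `tokenize`, `lowerCase` and `stopFilter` are hypothetical names chosen for illustration, not Lucene classes or methods, and Lucene's real TokenStream API is attribute-based and considerably more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy model of "Tokenizer output passed through TokenFilters". Not Lucene API. */
final class ToyAnalyzer {
    // Filters run in the order they were added.
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    ToyAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    /** Tokenizer step: split on runs of characters that are neither letters nor digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Filter step: lower-case every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Filter factory: drop every token contained in stopWords. */
    static UnaryOperator<List<String>> stopFilter(Set<String> stopWords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                if (!stopWords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    /** Analyzer step: tokenize, then apply each filter in turn. */
    List<String> analyze(String text) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical stop word list, chosen only for this example.
        Set<String> stopWords = new HashSet<>(Arrays.asList("this", "is", "the", "to", "be"));
        ToyAnalyzer analyzer = new ToyAnalyzer()
                .addFilter(ToyAnalyzer::lowerCase)
                .addFilter(stopFilter(stopWords));
        System.out.println(analyzer.analyze("This is the text to be indexed."));
        // prints [text, indexed]
    }
}
```

The order of the filters matters here: the stop filter sees lower-cased tokens only because `lowerCase` was added first, which is the same kind of ordering concern that applies when chaining real TokenFilters.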

To use Lucene, an application should:

- Create Documents by adding Fields;
- Create an IndexWriter and add documents to it with addDocument();
- Call QueryParser.parse() to build a query from a string; and
- Create an IndexSearcher and pass the query to its search() method.

Some simple examples of code that does this are:

- IndexFiles.java creates an index for all the files contained in a directory.
- SearchFiles.java prompts for queries and searches an index.

To demonstrate these, try something like:

    > java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.IndexFiles rec.food.recipes/soups
    adding rec.food.recipes/soups/abalone-chowder
      [ ... ]

    > java -cp lucene.jar:lucene-demo.jar:lucene-analyzers-common.jar org.apache.lucene.demo.SearchFiles
    Query: chowder
    Searching for: chowder
    34 total matching documents
    1. rec.food.recipes/soups/spam-chowder
      [ ... thirty-four documents contain the word "chowder" ... ]

    Query: "clam chowder" AND Manhattan
    Searching for: +"clam chowder" +manhattan
    2 total matching documents
    1. rec.food.recipes/soups/clam-chowder
      [ ... two documents contain the phrase "clam chowder" and the word "manhattan" ... ]
      [ Note: "+" and "-" are canonical, but "AND", "OR" and "NOT" may be used. ]

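
The note about canonical "+" and "-" operators can be made concrete with a small rewriter. The sketch below is not Lucene's QueryParser; it is a toy in plain Java that only handles flat queries (no parentheses, no field names), assumes OR is the default relationship between clauses, and does not analyze terms, so it will not lower-case "Manhattan" the way the transcript shows. `ToyOperatorRewriter` and `rewrite` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy rewriter from AND/OR/NOT keywords to "+"/"-" prefixes. Not Lucene's QueryParser. */
final class ToyOperatorRewriter {
    // A clause is an optionally prefixed double-quoted phrase, or a run of non-space characters.
    private static final Pattern CLAUSE = Pattern.compile("[+-]?\"[^\"]*\"|\\S+");

    static String rewrite(String query) {
        List<String> clauses = new ArrayList<>();
        List<Character> prefixes = new ArrayList<>(); // '+', '-' or ' ' (optional clause)
        boolean pendingAnd = false;
        boolean pendingNot = false;

        Matcher matcher = CLAUSE.matcher(query);
        while (matcher.find()) {
            String token = matcher.group();
            if (token.equals("AND")) { pendingAnd = true; continue; }
            if (token.equals("NOT")) { pendingNot = true; continue; }
            if (token.equals("OR")) { continue; } // assumed default: clauses stay optional

            // Keep an explicit "+" or "-" that is already present on the clause.
            char prefix = ' ';
            if (token.length() > 1 && (token.charAt(0) == '+' || token.charAt(0) == '-')) {
                prefix = token.charAt(0);
                token = token.substring(1);
            }
            if (pendingNot) {
                prefix = '-';
            }
            if (pendingAnd) {
                // AND marks both neighbours as required, unless a clause is prohibited.
                int last = prefixes.size() - 1;
                if (last >= 0 && prefixes.get(last) != '-') {
                    prefixes.set(last, '+');
                }
                if (prefix != '-') {
                    prefix = '+';
                }
            }
            clauses.add(token);
            prefixes.add(prefix);
            pendingAnd = false;
            pendingNot = false;
        }

        StringBuilder out = new StringBuilder();
        for (int i = 0; i < clauses.size(); i++) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (prefixes.get(i) != ' ') {
                out.append(prefixes.get(i));
            }
            out.append(clauses.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\"clam chowder\" AND Manhattan")); // +"clam chowder" +Manhattan
        System.out.println(rewrite("chowder NOT clam"));               // chowder -clam
        System.out.println(rewrite("chowder OR bisque"));              // chowder bisque
    }
}
```

Apart from case, the first result has the same shape as the "Searching for:" line in the transcript above, which is the sense in which "+" and "-" are the canonical spelling of the keyword operators.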
Package | Description |
---|---|
org.apache.lucene | Top-level package. |
org.apache.lucene.analysis | API and code to convert text into indexable/searchable tokens. |
org.apache.lucene.analysis.ar | Analyzer for Arabic. |
org.apache.lucene.analysis.bg | Analyzer for Bulgarian. |
org.apache.lucene.analysis.br | Analyzer for Brazilian Portuguese. |
org.apache.lucene.analysis.ca | Analyzer for Catalan. |
org.apache.lucene.analysis.charfilter | CharFilters: process text before the Tokenizer |
org.apache.lucene.analysis.cjk | Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters). |
org.apache.lucene.analysis.cn | Analyzer for Chinese, which indexes unigrams (individual Chinese characters). |
org.apache.lucene.analysis.cn.smart | Analyzer for Simplified Chinese, which indexes words. |
org.apache.lucene.analysis.cn.smart.hhmm | SmartChineseAnalyzer Hidden Markov Model package. |
org.apache.lucene.analysis.compound | A filter that decomposes compound words you find in many Germanic languages into the word parts. |
org.apache.lucene.analysis.compound.hyphenation | The code for the compound word hyphenation is taken from the Apache FOP project. |
org.apache.lucene.analysis.cz | Analyzer for Czech. |
org.apache.lucene.analysis.da | Analyzer for Danish. |
org.apache.lucene.analysis.de | Analyzer for German. |
org.apache.lucene.analysis.el | Analyzer for Greek. |
org.apache.lucene.analysis.en | Analyzer for English. |
org.apache.lucene.analysis.es | Analyzer for Spanish. |
org.apache.lucene.analysis.eu | Analyzer for Basque. |
org.apache.lucene.analysis.fa | Analyzer for Persian. |
org.apache.lucene.analysis.fi | Analyzer for Finnish. |
org.apache.lucene.analysis.fr | Analyzer for French. |
org.apache.lucene.analysis.ga | Analysis for Irish. |
org.apache.lucene.analysis.gl | Analyzer for Galician. |
org.apache.lucene.analysis.hi | Analyzer for Hindi. |
org.apache.lucene.analysis.hu | Analyzer for Hungarian. |
org.apache.lucene.analysis.hunspell | Stemming TokenFilter using a Java implementation of the Hunspell stemming algorithm. |
org.apache.lucene.analysis.hy | Analyzer for Armenian. |
org.apache.lucene.analysis.icu | Analysis components based on ICU |
org.apache.lucene.analysis.icu.segmentation | Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm. |
org.apache.lucene.analysis.icu.tokenattributes | Additional ICU-specific Attributes for text analysis. |
org.apache.lucene.analysis.id | Analyzer for Indonesian. |
org.apache.lucene.analysis.in | Analysis components for Indian languages. |
org.apache.lucene.analysis.it | Analyzer for Italian. |
org.apache.lucene.analysis.ja | Analyzer for Japanese. |
org.apache.lucene.analysis.ja.dict | Kuromoji dictionary implementation. |
org.apache.lucene.analysis.ja.tokenattributes | Additional Kuromoji-specific Attributes for text analysis. |
org.apache.lucene.analysis.ja.util | Kuromoji utility classes. |
org.apache.lucene.analysis.lv | Analyzer for Latvian. |
org.apache.lucene.analysis.miscellaneous | Miscellaneous TokenStreams |
org.apache.lucene.analysis.ngram | Character n-gram tokenizers and filters. |
org.apache.lucene.analysis.nl | Analyzer for Dutch. |
org.apache.lucene.analysis.no | Analyzer for Norwegian. |
org.apache.lucene.analysis.path | Analysis components for path-like strings such as filenames. |
org.apache.lucene.analysis.payloads | Provides various convenience classes for creating payloads on Tokens. |
org.apache.lucene.analysis.phonetic | Analysis components for phonetic search. |
org.apache.lucene.analysis.pl | Analyzer for Polish. |
org.apache.lucene.analysis.position | Filter for assigning position increments. |
org.apache.lucene.analysis.pt | Analyzer for Portuguese. |
org.apache.lucene.analysis.query | Automatically filter high-frequency stopwords. |
org.apache.lucene.analysis.reverse | Filter to reverse token text. |
org.apache.lucene.analysis.ro | Analyzer for Romanian. |
org.apache.lucene.analysis.ru | Analyzer for Russian. |
org.apache.lucene.analysis.shingle | Word n-gram filters |
org.apache.lucene.analysis.sinks | Implementations of the SinkTokenizer that might be useful. |
org.apache.lucene.analysis.snowball | TokenFilter and Analyzer implementations that use Snowball stemmers. |
org.apache.lucene.analysis.standard | Standards-based analyzers implemented with JFlex. |
org.apache.lucene.analysis.standard.std31 | Backwards-compatible implementation to match Version.LUCENE_31 |
org.apache.lucene.analysis.standard.std34 | Backwards-compatible implementation to match Version.LUCENE_34 |
org.apache.lucene.analysis.stempel | Stempel: Algorithmic Stemmer |
org.apache.lucene.analysis.sv | Analyzer for Swedish. |
org.apache.lucene.analysis.synonym | Analysis components for Synonyms. |
org.apache.lucene.analysis.th | Analyzer for Thai. |
org.apache.lucene.analysis.tokenattributes | Useful Attributes for text analysis. |
org.apache.lucene.analysis.tr | Analyzer for Turkish. |
org.apache.lucene.analysis.util | Utility functions for text analysis. |
org.apache.lucene.analysis.wikipedia | Tokenizer that is aware of Wikipedia syntax. |
org.apache.lucene.benchmark | |
org.apache.lucene.benchmark.byTask | Benchmarking Lucene By Tasks. |
org.apache.lucene.benchmark.byTask.feeds | Sources for benchmark inputs: documents and queries. |
org.apache.lucene.benchmark.byTask.feeds.demohtml | Example html parser based on JavaCC |
org.apache.lucene.benchmark.byTask.programmatic | Sample performance test written programmatically - no algorithm file is needed here. |
org.apache.lucene.benchmark.byTask.stats | Statistics maintained when running benchmark tasks. |
org.apache.lucene.benchmark.byTask.tasks | Extendable benchmark tasks. |
org.apache.lucene.benchmark.byTask.utils | Utilities used for the benchmark, and for the reports. |
org.apache.lucene.benchmark.quality | Search Quality Benchmarking. |
org.apache.lucene.benchmark.quality.trec | Utilities for Trec related quality benchmarking, feeding from Trec Topics and QRels inputs. |
org.apache.lucene.benchmark.quality.utils | Miscellaneous utilities for search quality benchmarking: query parsing, submission reports. |
org.apache.lucene.benchmark.utils | Benchmark Utility functions. |
org.apache.lucene.collation | CollationKeyFilter converts each token into its binary CollationKey using the provided Collator, and then encodes the CollationKey as a String using IndexableBinaryStringTools, to allow it to be stored as an index term. |
org.apache.lucene.demo | Demo applications for indexing and searching. |
org.apache.lucene.document | The logical representation of a Document for indexing and searching. |
org.apache.lucene.facet | Provides faceted indexing and search capabilities. |
org.apache.lucene.facet.enhancements | Enhanced category features |
org.apache.lucene.facet.enhancements.association | Association category enhancements |
org.apache.lucene.facet.enhancements.params | Enhanced category features |
org.apache.lucene.facet.index | Indexing of document categories |
org.apache.lucene.facet.index.attributes | Category attributes and their properties for indexing |
org.apache.lucene.facet.index.categorypolicy | Policies for indexing categories |
org.apache.lucene.facet.index.params | Indexing-time specifications for handling facets |
org.apache.lucene.facet.index.streaming | Expert: attributes streaming definition for indexing facets |
org.apache.lucene.facet.search | Faceted Search API |
org.apache.lucene.facet.search.aggregator | Aggregating Facets during Faceted Search |
org.apache.lucene.facet.search.aggregator.association | Association-based aggregators. |
org.apache.lucene.facet.search.cache | Caching to speed up facets accumulation. |
org.apache.lucene.facet.search.params | Parameters for Faceted Search |
org.apache.lucene.facet.search.params.association | Association-based Parameters for Faceted Search. |
org.apache.lucene.facet.search.results | Results of Faceted Search |
org.apache.lucene.facet.search.sampling | Sampling for facets accumulation |
org.apache.lucene.facet.taxonomy | Taxonomy of Categories |
org.apache.lucene.facet.taxonomy.directory | Taxonomy implemented using a Lucene-Index |
org.apache.lucene.facet.taxonomy.writercache | Improves indexing time by caching a map of CategoryPath to their Ordinal |
org.apache.lucene.facet.taxonomy.writercache.cl2o | Category->Ordinal caching implementation using optimized data structures |
org.apache.lucene.facet.taxonomy.writercache.lru | An LRU cache implementation for the CategoryPath to Ordinal map |
org.apache.lucene.facet.util | Various utilities for faceted search |
org.apache.lucene.index | Code to maintain and access indices. |
org.apache.lucene.index.memory | High-performance single-document main memory Apache Lucene fulltext search index. |
org.apache.lucene.index.pruning | Static Index Pruning Tools |
org.apache.lucene.messages | For Native Language Support (NLS), system of software internationalization. |
org.apache.lucene.misc | Miscellaneous index tools. |
org.apache.lucene.queryParser | A simple query parser implemented with JavaCC. |
org.apache.lucene.queryParser.analyzing | QueryParser that passes Fuzzy-, Prefix-, Range-, and WildcardQuerys through the given analyzer. |
org.apache.lucene.queryParser.complexPhrase | QueryParser which permits complex phrase query syntax, e.g. "(john jon jonathan~) peters*" |
org.apache.lucene.queryParser.core | Contains the core classes of the flexible query parser framework |
org.apache.lucene.queryParser.core.builders | Contains the necessary classes to implement query builders |
org.apache.lucene.queryParser.core.config | Contains the base classes used to configure the query processing |
org.apache.lucene.queryParser.core.messages | Contains messages usually used by query parser implementations |
org.apache.lucene.queryParser.core.nodes | Contains query nodes that are commonly used by query parser implementations |
org.apache.lucene.queryParser.core.parser | Contains the necessary interfaces to implement text parsers |
org.apache.lucene.queryParser.core.processors | Interfaces and implementations used by query node processors |
org.apache.lucene.queryParser.core.util | Utility classes to be used with the Query Parser |
org.apache.lucene.queryParser.ext | Extendable QueryParser provides a simple and flexible extension mechanism by overloading query field names. |
org.apache.lucene.queryParser.precedence | This package contains the Precedence Query Parser Implementation |
org.apache.lucene.queryParser.precedence.processors | This package contains the processors used by Precedence Query Parser |
org.apache.lucene.queryParser.standard | Contains the implementation of the Lucene query parser using the flexible query parser frameworks |
org.apache.lucene.queryParser.standard.builders | Standard Lucene Query Node Builders |
org.apache.lucene.queryParser.standard.config | Standard Lucene Query Configuration |
org.apache.lucene.queryParser.standard.nodes | Standard Lucene Query Nodes |
org.apache.lucene.queryParser.standard.parser | Lucene Query Parser |
org.apache.lucene.queryParser.standard.processors | Lucene Query Node Processors |
org.apache.lucene.queryParser.surround.parser | This package contains the QueryParser.jj source file for the Surround parser. |
org.apache.lucene.queryParser.surround.query | This package contains SrndQuery and its subclasses. |
org.apache.lucene.search | Code to search indices. |
org.apache.lucene.search.function | Programmatic control over document scores. |
org.apache.lucene.search.grouping | This module enables search result grouping with Lucene, where hits with the same value in the specified single-valued group field are grouped together. |
org.apache.lucene.search.highlight | The highlight package contains classes to provide "keyword in context" features typically used to highlight search terms in the text of results pages. |
org.apache.lucene.search.join | This module supports index-time and query-time joins. |
org.apache.lucene.search.payloads | The payloads package provides Query mechanisms for finding and using payloads. |
org.apache.lucene.search.regex | Regular expression Query. |
org.apache.lucene.search.similar | Document similarity query generators. |
org.apache.lucene.search.spans | The calculus of spans. |
org.apache.lucene.search.spell | Suggest alternate spellings for words. |
org.apache.lucene.search.suggest | Support for Autocomplete/Autosuggest |
org.apache.lucene.search.suggest.fst | Finite-state based autosuggest. |
org.apache.lucene.search.suggest.jaspell | JaSpell-based autosuggest. |
org.apache.lucene.search.suggest.tst | Ternary Search Tree based autosuggest. |
org.apache.lucene.search.vectorhighlight | This is another highlighter implementation. |
org.apache.lucene.spatial | Support for geospatial search. |
org.apache.lucene.spatial.geohash | Support for Geohash encoding, decoding, and filtering. |
org.apache.lucene.spatial.geometry | Coordinate and distance representations. |
org.apache.lucene.spatial.geometry.shape | Shape representations. |
org.apache.lucene.spatial.tier | Support for filtering based upon geographic location. |
org.apache.lucene.spatial.tier.projections | Spatial projections. |
org.apache.lucene.store | Binary i/o API, used for all index data. |
org.apache.lucene.store.instantiated | InstantiatedIndex, alternative RAM store for small corpora. |
org.apache.lucene.util | Some utility classes. |
org.apache.lucene.util.collections | Various optimized Collections implementations. |
org.apache.lucene.util.encoding | Offers various encoders and decoders for integers, as well as the mechanisms to create new ones. |
org.apache.lucene.util.fst | Finite state transducers |
org.apache.lucene.util.packed | The packed package provides random access capable arrays of positive longs. |
org.apache.lucene.xmlparser | Parser that produces Lucene Query objects from XML streams. |
org.apache.lucene.xmlparser.builders | Builders to support various Lucene queries. |
org.egothor.stemmer | Egothor stemmer API. |
org.tartarus.snowball | Snowball stemmer API. |
org.tartarus.snowball.ext | Autogenerated snowball stemmer implementations. |