{@link org.apache.lucene.index.Fields} is the initial entry point into the postings APIs. It can be obtained in several ways:
    // access indexed fields for an index segment
    Fields fields = reader.fields();
    // access term vector fields for a specified document
    Fields fields = reader.getTermVectors(docid);
Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:
    // enumerate list of fields
    for (String field : fields) {
      // access the terms for this field
      Terms terms = fields.terms(field);
    }
{@link org.apache.lucene.index.Terms} represents the collection of terms within a field. It exposes some metadata and statistics, and provides an API for enumeration.
    // metadata about the field
    System.out.println("positions? " + terms.hasPositions());
    System.out.println("offsets? " + terms.hasOffsets());
    System.out.println("payloads? " + terms.hasPayloads());
    // iterate through terms
    TermsEnum termsEnum = terms.iterator(null);
    BytesRef term = null;
    while ((term = termsEnum.next()) != null) {
      doSomethingWith(termsEnum.term());
    }
{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.
    // seek to a specific term
    boolean found = termsEnum.seekExact(new BytesRef("foobar"), true);
    if (found) {
      // get the document frequency
      System.out.println(termsEnum.docFreq());
      // enumerate through documents
      DocsEnum docs = termsEnum.docs(null, null);
      // enumerate through documents and positions
      DocsAndPositionsEnum docsAndPositions = termsEnum.docsAndPositions(null, null);
    }
{@link org.apache.lucene.index.DocsEnum} is an extension of {@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of documents for a term, along with the term frequency within that document.
    int docid;
    while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      System.out.println(docid);
      System.out.println(docsEnum.freq());
    }
{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of {@link org.apache.lucene.index.DocsEnum} that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).
    int docid;
    while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      System.out.println(docid);
      int freq = docsAndPositionsEnum.freq();
      for (int i = 0; i < freq; i++) {
        System.out.println(docsAndPositionsEnum.nextPosition());
        System.out.println(docsAndPositionsEnum.startOffset());
        System.out.println(docsAndPositionsEnum.endOffset());
        System.out.println(docsAndPositionsEnum.getPayload());
      }
    }
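The nesting walked through above (field, then term, then document, then positions) can be pictured with a small in-memory model. This is only a sketch for building intuition: the class ToyPostings and its methods are hypothetical names, it uses plain sorted maps, and it says nothing about how Lucene actually stores or iterates postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory model of the nesting field -> term -> document -> positions. Not Lucene code. */
class ToyPostings {
    // field name -> (term -> (docid -> positions)); the sorted maps keep terms and docids ordered
    private final Map<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> fields = new TreeMap<>();

    /** Records one occurrence of a term at a position within a document's field. */
    void add(String field, String term, int docid, int position) {
        fields.computeIfAbsent(field, f -> new TreeMap<>())
              .computeIfAbsent(term, t -> new TreeMap<>())
              .computeIfAbsent(docid, d -> new ArrayList<>())
              .add(position);
    }

    /** Number of documents that contain the term at least once. */
    int docFreq(String field, String term) {
        TreeMap<Integer, List<Integer>> docs = docs(field, term);
        return docs == null ? 0 : docs.size();
    }

    /** Positions of the term inside one document; empty when the term does not occur there. */
    List<Integer> positions(String field, String term, int docid) {
        TreeMap<Integer, List<Integer>> docs = docs(field, term);
        if (docs == null) {
            return List.of();
        }
        return docs.getOrDefault(docid, List.of());
    }

    // Looks up the docid -> positions map for a term, or null if the field or term is unknown.
    private TreeMap<Integer, List<Integer>> docs(String field, String term) {
        TreeMap<String, TreeMap<Integer, List<Integer>>> terms = fields.get(field);
        return terms == null ? null : terms.get(term);
    }
}
```

In this model the term frequency within a document is simply the size of the position list, which mirrors why the loops above call nextPosition() exactly freq() times per document.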
-1) if term frequencies were omitted from the index ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) for the field. Like docFreq(), it will also count occurrences that appear in deleted documents.
-1) for some Terms implementations such as {@link org.apache.lucene.index.MultiTerms}, where it cannot be efficiently computed. Note that this count also includes terms that appear only in deleted documents: when segments are merged such terms are also merged away and the statistic is then updated.
-1) if term frequencies were omitted from the index ({@link org.apache.lucene.index.FieldInfo.IndexOptions#DOCS_ONLY DOCS_ONLY}) for the field.
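Because several of the statistics described above report -1 when they are unavailable, code that combines them should probably check for that sentinel before doing arithmetic. The following is a minimal sketch under that assumption: StatGuards and averageTermFreq are hypothetical names, and the inputs are plain numbers that a caller would have read from the index beforehand.

```java
import java.util.OptionalDouble;

/** Hypothetical helper that treats -1 as "statistic unavailable" rather than as a real count. */
class StatGuards {
    /**
     * Average number of occurrences per document containing the term.
     * Returns empty when the total is unavailable (-1) or there are no documents to divide by.
     */
    static OptionalDouble averageTermFreq(long totalTermFreq, int docFreq) {
        if (totalTermFreq == -1 || docFreq <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of((double) totalTermFreq / docFreq);
    }
}
```

Returning an empty OptionalDouble instead of a number keeps a -1 from silently turning into a negative average further down the line.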
Document statistics are available during the indexing process for an indexed field. Typically a {@link org.apache.lucene.search.similarities.Similarity} implementation will store some of these values (possibly in a lossy way) in the normalization value for the document, in its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
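To make "in a lossy way" concrete, here is a sketch of squeezing a field length into a single byte by keeping only its order of magnitude. It is an illustration only: LossyLengthNorm is a hypothetical class, and the power-of-two bucketing shown here is an assumption chosen for simplicity, not the encoding Lucene's Similarity implementations use.

```java
/** Illustrative lossy encoding of a field length into one byte. Not Lucene's norm format. */
class LossyLengthNorm {
    /** Encodes a length >= 1 as its bit length (1..31), discarding all lower-order detail. */
    static byte encode(int length) {
        if (length < 1) {
            throw new IllegalArgumentException("length must be >= 1");
        }
        return (byte) (32 - Integer.numberOfLeadingZeros(length));
    }

    /** Decodes to the smallest length that maps to the given code (a power of two). */
    static int decode(byte code) {
        if (code < 1 || code > 31) {
            throw new IllegalArgumentException("code must be in 1..31");
        }
        return 1 << (code - 1);
    }
}
```

With this scheme every length from 64 to 127 encodes to the same byte and decodes back to 64, so the exact length is lost while the rough scale of the document survives in one byte per document.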
Additional user-supplied statistics can be added to the document as DocValues fields and accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.