Code to maintain and access indices.

Table Of Contents

  1. Postings APIs
  2. Index Statistics

Postings APIs

Fields

{@link org.apache.lucene.index.Fields} is the initial entry point into the postings APIs. It can be obtained in several ways:

// access indexed fields for an index segment
Fields fields = reader.fields();
// access term vector fields for a specified document
Fields termVectorFields = reader.getTermVectors(docid);

Fields implements Java's Iterable interface, so it's easy to enumerate the list of fields:

// enumerate list of fields
for (String field : fields) {
  // access the terms for this field
  Terms terms = fields.terms(field);
}

Terms

{@link org.apache.lucene.index.Terms} represents the collection of terms within a field. It exposes some metadata and statistics, and provides an API for enumeration.

// metadata about the field
System.out.println("positions? " + terms.hasPositions());
System.out.println("offsets? " + terms.hasOffsets());
System.out.println("payloads? " + terms.hasPayloads());
// iterate through terms
TermsEnum termsEnum = terms.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
  // doSomethingWith is a placeholder for application code
  doSomethingWith(term);
}

{@link org.apache.lucene.index.TermsEnum} provides an iterator over the list of terms within a field, some statistics about the term, and methods to access the term's documents and positions.

// seek to a specific term
boolean found = termsEnum.seekExact(new BytesRef("foobar"), true);
if (found) {
  // get the document frequency
  System.out.println(termsEnum.docFreq());
  // enumerate through documents
  DocsEnum docsEnum = termsEnum.docs(null, null);
  // enumerate through documents and positions
  DocsAndPositionsEnum docsAndPositionsEnum = termsEnum.docsAndPositions(null, null);
}

Documents

{@link org.apache.lucene.index.DocsEnum} is an extension of {@link org.apache.lucene.search.DocIdSetIterator} that iterates over the list of documents for a term, along with the term frequency within each document.

int docid;
while ((docid = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  System.out.println(docsEnum.freq());
}

Positions

{@link org.apache.lucene.index.DocsAndPositionsEnum} is an extension of {@link org.apache.lucene.index.DocsEnum} that additionally allows iteration over the positions at which a term occurred within the document, and over any additional per-position information (offsets and payload).

int docid;
while ((docid = docsAndPositionsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  System.out.println(docid);
  int freq = docsAndPositionsEnum.freq();
  for (int i = 0; i < freq; i++) {
    System.out.println(docsAndPositionsEnum.nextPosition());
    System.out.println(docsAndPositionsEnum.startOffset());
    System.out.println(docsAndPositionsEnum.endOffset());
    System.out.println(docsAndPositionsEnum.getPayload());
  }
}
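
The four levels described above (fields, terms, documents, positions) nest inside one another. As a rough illustration only, the sketch below models that nesting with plain java.util collections. The class ToyPostings and its methods are hypothetical names invented for this example. They are not part of Lucene, and this is not how Lucene stores or encodes postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory postings: field -> term -> docid -> positions. Hypothetical, not Lucene code. */
public class ToyPostings {
    // Sorted maps are used so that fields, terms, and docids enumerate in order.
    private final Map<String, Map<String, Map<Integer, List<Integer>>>> fields = new TreeMap<>();

    /** Indexes the whitespace-separated tokens of text under the given field and docid. */
    public void add(int docid, String field, String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            fields.computeIfAbsent(field, f -> new TreeMap<>())
                  .computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                  .computeIfAbsent(docid, d -> new ArrayList<>())
                  .add(pos);
        }
    }

    /** Number of documents that contain the term in the field. */
    public int docFreq(String field, String term) {
        return postings(field, term).size();
    }

    /** Positions of the term in one document; the list size is the term frequency. */
    public List<Integer> positions(String field, String term, int docid) {
        return postings(field, term).getOrDefault(docid, List.of());
    }

    // Returns an empty map when the field or term is unknown.
    private Map<Integer, List<Integer>> postings(String field, String term) {
        return fields.getOrDefault(field, Map.of()).getOrDefault(term, Map.of());
    }

    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.add(0, "body", "to be or not to be");
        index.add(1, "body", "to do");
        System.out.println(index.docFreq("body", "to"));       // 2
        System.out.println(index.positions("body", "to", 0));  // [0, 4]
    }
}
```

In this model, the per-document position list plays the role that freq() and nextPosition() play together in the real API: its size is the term frequency, and its elements are the positions.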

Index Statistics

Term statistics

Field statistics

Segment statistics

Document statistics

Document statistics are available during the indexing process for an indexed field. Typically, a {@link org.apache.lucene.search.similarities.Similarity} implementation stores some of these values, possibly in a lossy way, in the normalization value for the document, which it computes in its {@link org.apache.lucene.search.similarities.Similarity#computeNorm} method.
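
To illustrate what storing a statistic "in a lossy way" can mean, the sketch below squeezes a field length into a single byte. It assumes a simple linear quantization of 1/sqrt(length). The class LossyNorm and its encode and decode methods are hypothetical and chosen only for illustration. This is not the encoding used by any particular Similarity implementation.

```java
/** Hypothetical lossy encoding of a field length into one byte; not Lucene's actual scheme. */
public class LossyNorm {
    /** Quantizes 1/sqrt(fieldLength), which lies in (0, 1], onto 1..255; non-positive lengths map to 0. */
    public static byte encode(int fieldLength) {
        if (fieldLength <= 0) {
            return 0;
        }
        double norm = 1.0 / Math.sqrt(fieldLength);
        // Clamp to at least 1 so that very long fields stay distinguishable from "no value".
        return (byte) Math.max(1, Math.round(norm * 255));
    }

    /** Recovers an approximate norm from the stored byte, treating it as unsigned. */
    public static double decode(byte stored) {
        return (stored & 0xFF) / 255.0;
    }

    public static void main(String[] args) {
        for (int length : new int[] {1, 4, 101, 102}) {
            byte stored = encode(length);
            System.out.println(length + " -> " + (stored & 0xFF) + " -> " + decode(stored));
        }
    }
}
```

With this scheme, nearby lengths such as 101 and 102 collapse to the same byte, so the exact length cannot be recovered at search time. That is the trade-off a one-byte norm makes in exchange for compact storage.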

Additional user-supplied statistics can be added to the document as DocValues fields and accessed via {@link org.apache.lucene.index.AtomicReader#getNumericDocValues}.