Package org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer
- java.lang.Object
-
- org.apache.lucene.analysis.Analyzer
-
- org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer
-
- All Implemented Interfaces:
Closeable, AutoCloseable
public final class QueryAutoStopWordAnalyzer extends org.apache.lucene.analysis.Analyzer
An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.
For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million doc index which had a term in around 50% of docs and was causing TermQueries for this term to take 2 seconds.
Use the various "addStopWords" methods in this class to automate the identification and addition of stop words found in an already existing index.
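The idea of identifying stop words from an existing index can be illustrated without Lucene. The following is a minimal sketch, not the class's actual implementation: the class name StopWordSketch, the method findStopWords, and the in-memory map of document frequencies are hypothetical stand-ins for the statistics that would be read from an IndexReader.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, Lucene-free illustration of frequency-based stop word selection.
public class StopWordSketch {

    /**
     * Returns the terms whose document frequency is strictly greater than maxDocFreq.
     * The map stands in for per-term statistics that a real index would provide.
     */
    public static Set<String> findStopWords(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for a single field of a 1000-document index.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("the", 950);
        docFreq.put("lucene", 120);
        docFreq.put("of", 700);
        docFreq.put("analyzer", 40);

        System.out.println(findStopWords(docFreq, 400)); // prints [of, the]
    }
}
```

Only terms above the threshold are dropped, so rare and therefore selective query terms pass through unchanged.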
-
-
Field Summary
Modifier and Type Field Description
static float
defaultMaxDocFreqPercent
-
Constructor Summary
Constructor Description
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate)
Deprecated. Stopwords should be calculated at instantiation using one of the other constructors
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, float maxPercentDocs)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, int maxDocFreq)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, float maxPercentDocs)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, int maxDocFreq)
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
-
Method Summary
Modifier and Type Method Description
int
addStopWords(org.apache.lucene.index.IndexReader reader)
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader)
int
addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs)
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float)
int
addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq)
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int)
int
addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, float maxPercentDocs)
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float)
int
addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, int maxDocFreq)
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int)
org.apache.lucene.index.Term[]
getStopWords()
Provides information on which stop words have been identified for all fields
String[]
getStopWords(String fieldName)
Provides information on which stop words have been identified for a field
org.apache.lucene.analysis.TokenStream
reusableTokenStream(String fieldName, Reader reader)
org.apache.lucene.analysis.TokenStream
tokenStream(String fieldName, Reader reader)
-
-
-
Field Detail
-
defaultMaxDocFreqPercent
public static final float defaultMaxDocFreqPercent
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
QueryAutoStopWordAnalyzer
@Deprecated public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate)
Deprecated. Stopwords should be calculated at instantiation using one of the other constructors
Initializes this analyzer with the Analyzer object that actually produces the tokens
- Parameters:
delegate
- The choice of Analyzer that is used to produce the token stream which needs filtering
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
- Parameters:
matchVersion
- Version to be used in StopFilter
delegate
- Analyzer whose TokenStream will be filtered
indexReader
- IndexReader to identify the stopwords from
- Throws:
IOException
- Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, int maxDocFreq) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
matchVersion
- Version to be used in StopFilter
delegate
- Analyzer whose TokenStream will be filtered
indexReader
- IndexReader to identify the stopwords from
maxDocFreq
- Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
IOException
- Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, float maxPercentDocs) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
matchVersion
- Version to be used in StopFilter
delegate
- Analyzer whose TokenStream will be filtered
indexReader
- IndexReader to identify the stopwords from
maxPercentDocs
- The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
- Throws:
IOException
- Can be thrown while reading from the IndexReader
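The relationship between the percentage-based and count-based thresholds can be made concrete with a little arithmetic. The sketch below assumes the fraction is turned into an absolute document count by multiplying it by the number of documents and truncating; that conversion, along with the names ThresholdSketch, toMaxDocFreq, and isStopWord, is an assumption made for illustration and is not confirmed by this page.

```java
// Hypothetical illustration of converting a fractional threshold to a document count.
public class ThresholdSketch {

    /**
     * Converts a fraction of the index (0.0 to 1.0) into an absolute document count.
     * Assumption: the product is truncated toward zero, as a plain int cast does.
     */
    public static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    /** A term counts as a stop word only when its frequency exceeds the threshold. */
    public static boolean isStopWord(int docFreq, int maxDocFreq) {
        return docFreq > maxDocFreq;
    }

    public static void main(String[] args) {
        int maxDocFreq = toMaxDocFreq(1000, 0.5f);
        System.out.println(maxDocFreq);                  // prints 500
        System.out.println(isStopWord(500, maxDocFreq)); // prints false
        System.out.println(isStopWord(501, maxDocFreq)); // prints true
    }
}
```

Under this assumption, a term found in exactly the threshold number of documents is kept, because the comparison is "greater than" rather than "greater than or equal to".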
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, float maxPercentDocs) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
- Parameters:
matchVersion
- Version to be used in StopFilter
delegate
- Analyzer whose TokenStream will be filtered
indexReader
- IndexReader to identify the stopwords from
fields
- Selection of fields to calculate stopwords for
maxPercentDocs
- The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
- Throws:
IOException
- Can be thrown while reading from the IndexReader
-
QueryAutoStopWordAnalyzer
public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate, org.apache.lucene.index.IndexReader indexReader, Collection<String> fields, int maxDocFreq) throws IOException
Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
- Parameters:
matchVersion
- Version to be used in StopFilter
delegate
- Analyzer whose TokenStream will be filtered
indexReader
- IndexReader to identify the stopwords from
fields
- Selection of fields to calculate stopwords for
maxDocFreq
- Document frequency that a term must exceed in order to be treated as a stopword
- Throws:
IOException
- Can be thrown while reading from the IndexReader
-
-
Method Detail
-
addStopWords
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader)
Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent
- Parameters:
reader
- The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
- Returns:
- The number of stop words identified.
- Throws:
IOException
-
addStopWords
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int)
Automatically adds stop words for all fields with terms exceeding the maxDocFreq
- Parameters:
reader
- The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxDocFreq
- The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
- Returns:
- The number of stop words identified.
- Throws:
IOException
-
addStopWords
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float)
Automatically adds stop words for all fields with terms exceeding the maxPercentDocs
- Parameters:
reader
- The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxPercentDocs
- The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
- Returns:
- The number of stop words identified.
- Throws:
IOException
-
addStopWords
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, float maxPercentDocs) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float)
Automatically adds stop words for the given field with terms exceeding the maxPercentDocs
- Parameters:
reader
- The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName
- The field for which stopwords will be added
maxPercentDocs
- The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
- Returns:
- The number of stop words identified.
- Throws:
IOException
-
addStopWords
@Deprecated public int addStopWords(org.apache.lucene.index.IndexReader reader, String fieldName, int maxDocFreq) throws IOException
Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int)
Automatically adds stop words for the given field with terms exceeding the maxDocFreq
- Parameters:
reader
- The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName
- The field for which stopwords will be added
maxDocFreq
- The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
- Returns:
- The number of stop words identified.
- Throws:
IOException
-
tokenStream
public org.apache.lucene.analysis.TokenStream tokenStream(String fieldName, Reader reader)
- Specified by:
tokenStream
in class org.apache.lucene.analysis.Analyzer
-
reusableTokenStream
public org.apache.lucene.analysis.TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException
- Overrides:
reusableTokenStream
in class org.apache.lucene.analysis.Analyzer
- Throws:
IOException
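The two token stream methods are where the wrapped analyzer's output gets filtered. As a rough, Lucene-free sketch of that behaviour, the code below removes a field's stop words from a list of already-analyzed tokens. The names QueryFilterSketch and filter are hypothetical, and passing tokens through unchanged for a field with no recorded stop words is an assumption, not something this page states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of per-field stop word filtering at query time.
public class QueryFilterSketch {

    /**
     * Returns the tokens that are not stop words for the given field.
     * Assumption: a field with no recorded stop words is left unfiltered.
     */
    public static List<String> filter(String fieldName, List<String> tokens,
                                      Map<String, Set<String>> stopWordsPerField) {
        Set<String> stopWords = stopWordsPerField.get(fieldName);
        if (stopWords == null) {
            return new ArrayList<>(tokens);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> stopWordsPerField = new HashMap<>();
        stopWordsPerField.put("body", new HashSet<>(Arrays.asList("the", "of")));

        List<String> tokens = Arrays.asList("the", "history", "of", "lucene");
        System.out.println(filter("body", tokens, stopWordsPerField));  // prints [history, lucene]
        System.out.println(filter("title", tokens, stopWordsPerField)); // prints [the, history, of, lucene]
    }
}
```

Keeping stop words per field matters because a term that is very common in one field may be rare and useful in another.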
-
getStopWords
public String[] getStopWords(String fieldName)
Provides information on which stop words have been identified for a field
- Parameters:
fieldName
- The field for which stop words identified in "addStopWords" method calls will be returned
- Returns:
- the stop words identified for a field
-
getStopWords
public org.apache.lucene.index.Term[] getStopWords()
Provides information on which stop words have been identified for all fields
- Returns:
- the stop words (as terms)
-
-