Class CompoundWordTokenFilterBase

  • All Implemented Interfaces:
    Closeable, AutoCloseable
  • Direct Known Subclasses:
    DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter

    public abstract class CompoundWordTokenFilterBase
    extends org.apache.lucene.analysis.TokenFilter
    Base class for decomposition token filters.

    You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have a LowerCaseFilter before this filter in your analysis chain. For optimal performance, use the latter setup (lowercased analysis chain with a case-sensitive CharArraySet), as this filter performs many dictionary lookups. Be aware: if you supply arbitrary Sets to the constructors or String[] dictionaries, they will automatically be transformed to case-insensitive!
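
    A minimal sketch of the recommended setup, assuming Lucene 3.6-era APIs (Version.LUCENE_36, WhitespaceTokenizer, LowerCaseFilter, and the DictionaryCompoundWordTokenFilter subclass listed above); the subclass constructor shape, the class name CompoundChainExample, and the sample dictionary entries are illustrative assumptions, not taken from this page:

      import java.io.StringReader;
      import java.util.Arrays;

      import org.apache.lucene.analysis.CharArraySet;
      import org.apache.lucene.analysis.LowerCaseFilter;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.WhitespaceTokenizer;
      import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
      import org.apache.lucene.util.Version;

      public class CompoundChainExample {
        public static TokenStream buildChain(String text) {
          // The entries are already lowercased, so the set can stay case-sensitive
          // (ignoreCase = false); the LowerCaseFilter below makes the tokens match.
          CharArraySet dict = new CharArraySet(Version.LUCENE_36,
              Arrays.asList("donau", "dampf", "schiff"), false);

          TokenStream stream = new WhitespaceTokenizer(Version.LUCENE_36, new StringReader(text));
          stream = new LowerCaseFilter(Version.LUCENE_36, stream);
          // Uses the default min/max word and subword sizes documented in Field Detail below.
          return new DictionaryCompoundWordTokenFilter(Version.LUCENE_36, stream, dict);
        }
      }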

    • Field Detail

      • DEFAULT_MIN_WORD_SIZE

        public static final int DEFAULT_MIN_WORD_SIZE
        The default minimum length a word must have to be decomposed
        See Also:
        Constant Field Values
      • DEFAULT_MIN_SUBWORD_SIZE

        public static final int DEFAULT_MIN_SUBWORD_SIZE
        The default minimum length of subwords that are propagated to the output of this filter
        See Also:
        Constant Field Values
      • DEFAULT_MAX_SUBWORD_SIZE

        public static final int DEFAULT_MAX_SUBWORD_SIZE
        The default maximum length of subwords that are propagated to the output of this filter
        See Also:
        Constant Field Values
      • dictionary

        protected final org.apache.lucene.analysis.CharArraySet dictionary
      • minWordSize

        protected final int minWordSize
      • minSubwordSize

        protected final int minSubwordSize
      • maxSubwordSize

        protected final int maxSubwordSize
      • onlyLongestMatch

        protected final boolean onlyLongestMatch
      • termAtt

        protected final org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
      • offsetAtt

        protected final org.apache.lucene.analysis.tokenattributes.OffsetAttribute offsetAtt
    • Constructor Detail

      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              String[] dictionary,
                                              int minWordSize,
                                              int minSubwordSize,
                                              int maxSubwordSize,
                                              boolean onlyLongestMatch)
      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              String[] dictionary,
                                              boolean onlyLongestMatch)
      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              Set<?> dictionary,
                                              boolean onlyLongestMatch)
      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              String[] dictionary)
      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              Set<?> dictionary)
      • CompoundWordTokenFilterBase

        protected CompoundWordTokenFilterBase​(org.apache.lucene.util.Version matchVersion,
                                              org.apache.lucene.analysis.TokenStream input,
                                              Set<?> dictionary,
                                              int minWordSize,
                                              int minSubwordSize,
                                              int maxSubwordSize,
                                              boolean onlyLongestMatch)
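
      These constructors are protected, so they are only reached via super(...) from concrete decomposers such as the known subclasses above. A minimal sketch of such a call using the documented DEFAULT_* sizes; the class name is hypothetical, and it is left abstract because the decomposition hook a concrete subclass must implement is not shown in this excerpt:

        import java.util.Set;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
        import org.apache.lucene.util.Version;

        // Hypothetical subclass, shown only to illustrate the protected constructors;
        // it stays abstract because no decomposition logic is sketched here.
        public abstract class MyCompoundWordTokenFilter extends CompoundWordTokenFilterBase {

          protected MyCompoundWordTokenFilter(Version matchVersion, TokenStream input, Set<?> dictionary) {
            // Delegate to the fully parameterized constructor, passing the documented
            // defaults and disabling onlyLongestMatch so all matching subwords are kept.
            super(matchVersion, input, dictionary,
                  DEFAULT_MIN_WORD_SIZE, DEFAULT_MIN_SUBWORD_SIZE, DEFAULT_MAX_SUBWORD_SIZE,
                  false);
          }
        }
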
    • Method Detail

      • makeDictionary

        @Deprecated
        public static org.apache.lucene.analysis.CharArraySet makeDictionary​(org.apache.lucene.util.Version matchVersion,
                                                                             String[] dictionary)
        Deprecated.
        Only available for backwards compatibility.
      • incrementToken

        public final boolean incrementToken()
                                     throws IOException
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • reset

        public void reset()
                   throws IOException
        Overrides:
        reset in class org.apache.lucene.analysis.TokenFilter
        Throws:
        IOException
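
      Because incrementToken() and reset() follow the standard org.apache.lucene.analysis.TokenStream contract, the emitted tokens (the original words plus any decomposed subwords) are consumed with the usual attribute-based loop. A sketch under that assumption; the class and method names are hypothetical:

        import java.io.IOException;

        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

        public class ConsumeExample {
          // Prints every emitted token together with its character offsets.
          public static void dump(TokenStream stream) throws IOException {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
            try {
              stream.reset();                      // propagate reset() through the filter chain
              while (stream.incrementToken()) {    // advance to the next token, if any
                System.out.println(term.toString()
                    + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
              }
              stream.end();                        // finalize offset state after the last token
            } finally {
              stream.close();                      // release the underlying resources
            }
          }
        }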