Class DictionaryCompoundWordTokenFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class DictionaryCompoundWordTokenFilter
    extends CompoundWordTokenFilterBase
    A TokenFilter that decomposes compound words found in many Germanic languages.

    "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a brute-force algorithm to achieve this.
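    The brute-force idea described above can be sketched in plain Java. This is an illustrative stand-alone sketch, not the Lucene implementation: the class name CompoundSplitSketch and the method decompose are hypothetical, the size bounds are treated as inclusive, and "longest match" is assumed to mean the longest dictionary hit per start offset. The size values used in main are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch of brute-force compound splitting.
// It does not use or reproduce the Lucene classes documented here.
final class CompoundSplitSketch {

    // Tries every start offset and every allowed length, and keeps the
    // substrings whose lowercased form is in the dictionary.
    // Assumption: all size bounds are inclusive in this sketch.
    static List<String> decompose(String word, Set<String> lowerCaseDictionary,
                                  int minWordSize, int minSubwordSize,
                                  int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> subwords = new ArrayList<>();
        if (word.length() < minWordSize) {
            return subwords; // too short to be treated as a compound
        }
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            String longest = null;
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length(); len++) {
                String candidate = word.substring(start, start + len);
                // Look up the lowercased form, but emit the original casing.
                if (!lowerCaseDictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // len only grows, so later hits are longer
                } else {
                    subwords.add(candidate);
                }
            }
            if (longest != null) {
                subwords.add(longest);
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // prints [Donau, dampf, schiff]
        System.out.println(decompose("Donaudampfschiff", dict, 5, 2, 15, false));
    }
}
```

    Every start offset is paired with every allowed length, so the work per token grows with the token length times the subword size range, with one dictionary lookup per candidate. That is why the cost of the dictionary lookup matters for this filter.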

    You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have LowerCaseFilter before this filter in your analysis chain. For optimal performance (this filter does many dictionary lookups), you should use the latter setup: a LowerCaseFilter in the analysis chain and a CharArraySet of lowercased entries. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they are automatically converted to case-insensitive sets!
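    The two dictionary setups described above can be illustrated with standard collections. This is a minimal sketch under stated assumptions: java.util sets stand in for CharArraySet, and the class DictionaryCaseSketch with its two methods is hypothetical. It only shows how lookups behave in each setup; it says nothing about Lucene's internal data structures.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the two dictionary setups; java.util sets are
// used here in place of Lucene's CharArraySet.
final class DictionaryCaseSketch {

    // Setup A: a case-insensitive dictionary. Lookups succeed regardless of
    // the casing of the incoming token.
    static Set<String> caseInsensitive(String[] words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        for (String w : words) {
            set.add(w);
        }
        return set;
    }

    // Setup B: a case-sensitive dictionary holding only lowercased entries.
    // This works only if tokens are lowercased before the lookup, which is
    // the role LowerCaseFilter plays in the analysis chain.
    static Set<String> lowerCased(String[] words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(w.toLowerCase(Locale.ROOT));
        }
        return set;
    }
}
```

    With setup B, a lookup for "Schiff" misses while "schiff" hits, which is why that setup depends on the tokens being lowercased earlier in the chain.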

    • Constructor Detail

      • DictionaryCompoundWordTokenFilter

        @Deprecated
        public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                 String[] dictionary,
                                                 int minWordSize,
                                                 int minSubwordSize,
                                                 int maxSubwordSize,
                                                 boolean onlyLongestMatch)
        Parameters:
        input - the TokenStream to process
        dictionary - the word dictionary to match against
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • DictionaryCompoundWordTokenFilter

        @Deprecated
        public DictionaryCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                 Set dictionary,
                                                 int minWordSize,
                                                 int minSubwordSize,
                                                 int maxSubwordSize,
                                                 boolean onlyLongestMatch)
        Parameters:
        input - the TokenStream to process
        dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and contain only lowercase strings.
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • DictionaryCompoundWordTokenFilter

        @Deprecated
        public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                 org.apache.lucene.analysis.TokenStream input,
                                                 String[] dictionary,
                                                 int minWordSize,
                                                 int minSubwordSize,
                                                 int maxSubwordSize,
                                                 boolean onlyLongestMatch)
        Deprecated.
        Use the constructors taking Set
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        dictionary - the word dictionary to match against
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • DictionaryCompoundWordTokenFilter

        @Deprecated
        public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                 org.apache.lucene.analysis.TokenStream input,
                                                 String[] dictionary)
        Deprecated.
        Use the constructors taking Set
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        dictionary - the word dictionary to match against
      • DictionaryCompoundWordTokenFilter

        public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                 org.apache.lucene.analysis.TokenStream input,
                                                 Set<?> dictionary)
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        dictionary - the word dictionary to match against.
      • DictionaryCompoundWordTokenFilter

        public DictionaryCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                 org.apache.lucene.analysis.TokenStream input,
                                                 Set<?> dictionary,
                                                 int minWordSize,
                                                 int minSubwordSize,
                                                 int maxSubwordSize,
                                                 boolean onlyLongestMatch)
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        dictionary - the word dictionary to match against.
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream