Class HyphenationCompoundWordTokenFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class HyphenationCompoundWordTokenFilter
    extends CompoundWordTokenFilterBase
    A TokenFilter that decomposes compound words found in many Germanic languages.

    "Donaudampfschiff" becomes Donau, dampf, schiff so that you can find "Donaudampfschiff" even when you only enter "schiff". It uses a hyphenation grammar and a word dictionary to achieve this.
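
    The decomposition described above can be illustrated with a minimal plain-Java sketch. It is not the filter's actual implementation: the class name CompoundSketch, the decompose method, and the hard-coded break points are hypothetical stand-ins. In the real filter the candidate break points come from the hyphenation grammar; here they are supplied by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class CompoundSketch {

    /**
     * Returns every substring between two candidate break points that is
     * found in the dictionary. breakPoints must be sorted ascending and
     * include 0 and word.length(). The dictionary holds lowercased entries.
     */
    public static List<String> decompose(String word, int[] breakPoints,
                                         Set<String> lowercaseDictionary) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < breakPoints.length; i++) {
            for (int j = i + 1; j < breakPoints.length; j++) {
                String candidate = word.substring(breakPoints[i], breakPoints[j]);
                // Lowercase the candidate so it can match the lowercased dictionary.
                if (lowercaseDictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    parts.add(candidate);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("donau", "dampf", "schiff"));
        // Assumed break points, standing in for what a hyphenator would report.
        int[] breaks = {0, 5, 10, 16};
        System.out.println(decompose("Donaudampfschiff", breaks, dictionary));
        // prints [Donau, dampf, schiff]
    }
}
```

    Because "schiff" is emitted as its own subword, a query for "schiff" can match a document that only contains "Donaudampfschiff".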

    You must specify the required Version compatibility when creating CompoundWordTokenFilterBase:

    • As of 3.1, CompoundWordTokenFilterBase correctly handles Unicode 4.0 supplementary characters in strings and char arrays provided as compound word dictionaries.

    If you pass in a CharArraySet as dictionary, it should be case-insensitive unless it contains only lowercased entries and you have LowerCaseFilter before this filter in your analysis chain. For optimal performance (this filter does many dictionary lookups), you should use the latter combination of analysis chain and CharArraySet. Be aware: if you supply arbitrary Sets or String[] dictionaries to the constructors, they will be automatically transformed to case-insensitive sets!
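
    The two dictionary setups described above can be contrasted in a short plain-Java sketch. It uses standard collections as stand-ins: the TreeSet with a case-insensitive comparator plays the role of a case-insensitive CharArraySet, and the explicit toLowerCase call plays the role of a LowerCaseFilter placed earlier in the chain. The class and method names are hypothetical, and the sketch says nothing about the relative speed of the real Lucene classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
public class DictionaryCaseSketch {

    /** Option 1: a case-insensitive dictionary; mixed-case tokens match directly. */
    public static Set<String> caseInsensitive(String... words) {
        Set<String> set = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        set.addAll(Arrays.asList(words));
        return set;
    }

    /** Option 2: an exact-match dictionary that holds only lowercased entries. */
    public static Set<String> lowercasedExact(String... words) {
        Set<String> set = new HashSet<>();
        for (String w : words) {
            set.add(w.toLowerCase(Locale.ROOT));
        }
        return set;
    }

    public static void main(String[] args) {
        Set<String> insensitive = caseInsensitive("Donau", "Dampf", "Schiff");
        Set<String> exact = lowercasedExact("Donau", "Dampf", "Schiff");

        String token = "Schiff";
        System.out.println(insensitive.contains(token)); // true
        // Without lowercasing the token first, the exact-match set misses it.
        System.out.println(exact.contains(token)); // false
        // Lowercasing first (the job of LowerCaseFilter in a real chain) fixes that.
        System.out.println(exact.contains(token.toLowerCase(Locale.ROOT))); // true
    }
}
```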

    • Constructor Detail

      • HyphenationCompoundWordTokenFilter

        @Deprecated
        public HyphenationCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                  org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  String[] dictionary,
                                                  int minWordSize,
                                                  int minSubwordSize,
                                                  int maxSubwordSize,
                                                  boolean onlyLongestMatch)
        Deprecated.
        Use the constructors taking Set
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • HyphenationCompoundWordTokenFilter

        @Deprecated
        public HyphenationCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                  org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  String[] dictionary)
        Deprecated.
        Use the constructors taking Set
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against
      • HyphenationCompoundWordTokenFilter

        public HyphenationCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                  org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  Set<?> dictionary)
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against.
      • HyphenationCompoundWordTokenFilter

        public HyphenationCompoundWordTokenFilter(org.apache.lucene.util.Version matchVersion,
                                                  org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  Set<?> dictionary,
                                                  int minWordSize,
                                                  int minSubwordSize,
                                                  int maxSubwordSize,
                                                  boolean onlyLongestMatch)
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        matchVersion - Lucene version to enable correct Unicode 4.0 behavior in the dictionaries if Version > 3.0. See CompoundWordTokenFilterBase for details.
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against.
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • HyphenationCompoundWordTokenFilter

        @Deprecated
        public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  String[] dictionary,
                                                  int minWordSize,
                                                  int minSubwordSize,
                                                  int maxSubwordSize,
                                                  boolean onlyLongestMatch)
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
      • HyphenationCompoundWordTokenFilter

        @Deprecated
        public HyphenationCompoundWordTokenFilter(org.apache.lucene.analysis.TokenStream input,
                                                  HyphenationTree hyphenator,
                                                  Set<?> dictionary,
                                                  int minWordSize,
                                                  int minSubwordSize,
                                                  int maxSubwordSize,
                                                  boolean onlyLongestMatch)
        Creates a new HyphenationCompoundWordTokenFilter instance.
        Parameters:
        input - the TokenStream to process
        hyphenator - the hyphenation pattern tree to use for hyphenation
        dictionary - the word dictionary to match against. If this is a CharArraySet, it must have been created with ignoreCase=false and must contain only lowercase strings.
        minWordSize - only words longer than this get processed
        minSubwordSize - only subwords longer than this get to the output stream
        maxSubwordSize - only subwords shorter than this get to the output stream
        onlyLongestMatch - Add only the longest matching subword to the stream
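
    How minWordSize, minSubwordSize, maxSubwordSize and onlyLongestMatch interact can be shown with a plain-Java sketch. It is not the filter's implementation. The class name SubwordSizeSketch and the hand-supplied break points are hypothetical. The comparisons follow the parameter descriptions literally (strictly longer, strictly shorter); whether the real filter treats the bounds as inclusive is not stated here. Keeping the longest match per starting break point is likewise an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
public class SubwordSizeSketch {

    /** breakPoints must be sorted ascending; the dictionary holds lowercased entries. */
    public static List<String> decompose(String word, int[] breakPoints, Set<String> dictionary,
                                         int minWordSize, int minSubwordSize,
                                         int maxSubwordSize, boolean onlyLongestMatch) {
        List<String> parts = new ArrayList<>();
        // Assumption: only words strictly longer than minWordSize are processed.
        if (word.length() <= minWordSize) {
            return parts;
        }
        for (int i = 0; i < breakPoints.length; i++) {
            String longest = null;
            for (int j = i + 1; j < breakPoints.length; j++) {
                int length = breakPoints[j] - breakPoints[i];
                if (length >= maxSubwordSize) {
                    break; // later break points only make the candidate longer
                }
                if (length <= minSubwordSize) {
                    continue; // too short, try the next break point
                }
                String candidate = word.substring(breakPoints[i], breakPoints[j]);
                if (!dictionary.contains(candidate.toLowerCase(Locale.ROOT))) {
                    continue;
                }
                if (onlyLongestMatch) {
                    longest = candidate; // a later match is always longer
                } else {
                    parts.add(candidate);
                }
            }
            if (longest != null) {
                parts.add(longest);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(
                Arrays.asList("donau", "dampf", "schiff", "dampfschiff"));
        int[] breaks = {0, 5, 10, 16}; // assumed hyphenation points
        System.out.println(decompose("Donaudampfschiff", breaks, dictionary, 5, 2, 15, false));
        // prints [Donau, dampf, dampfschiff, schiff]
        System.out.println(decompose("Donaudampfschiff", breaks, dictionary, 5, 2, 15, true));
        // prints [Donau, dampfschiff, schiff]
    }
}
```

    With onlyLongestMatch set to true, "dampf" is dropped in favor of "dampfschiff" because both start at the same break point.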