Class ShingleMatrixFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    @Deprecated
    public final class ShingleMatrixFilter
    extends org.apache.lucene.analysis.TokenStream
    Deprecated.
    Will be removed in Lucene 4.0. This filter is unmaintained and might not behave correctly if used with custom Attributes, i.e. Attributes other than the ones located in org.apache.lucene.analysis.tokenattributes. It also uses hardcoded payload encoders which makes it not easily adaptable to other use-cases.

    A ShingleMatrixFilter constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token.

    For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
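
    The example above can be reproduced with a minimal sketch in plain Java. It does not use the Lucene API; the class name SentenceShingles, the whitespace tokenization and the single-space separator are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceShingles {
    /** Joins every run of `size` consecutive whitespace-separated tokens with a single space. */
    static List<String> shingles(String sentence, int size) {
        String[] tokens = sentence.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        // Slide a window of `size` tokens over the sentence.
        for (int start = 0; start + size <= tokens.length; start++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the five two-token shingles listed in the example sentence above.
        System.out.println(shingles("please divide this sentence into shingles", 2));
    }
}
```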

    Using a shingle filter at both index and query time can in some instances replace phrase queries, especially those with 0 slop.

    Without a spacer character, it can be used to handle composition and decomposition of words, such as searching for "multi dimensional" instead of "multidimensional". This is a rather common problem at query time in several languages, notably those of the North Germanic branch.
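
    The effect of omitting the spacer can be sketched as follows. This is plain Java rather than the filter itself; the helper name join is hypothetical, and treating a null spacer as "no separator" follows the spacerCharacter parameter description of the constructors.

```java
import java.util.List;

class SpacerJoin {
    /** Joins token texts, placing the spacer between them unless it is null. */
    static String join(List<String> tokens, Character spacer) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0 && spacer != null) {
                sb.append(spacer.charValue());
            }
            sb.append(tokens.get(i));
        }
        return sb.toString();
    }
}
```

    With a null spacer, the query tokens "multi" and "dimensional" join into "multidimensional", which can then match the compound form; with '_' they join into "multi_dimensional".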

    Among other things, shingles are also known to help solve problems in spell checking, language detection and document clustering.

    This filter is backed by a three-dimensional, column-oriented matrix. It creates permutations of the second dimension, the rows, and leaves the third, the z-axis, for multi-token synonyms.

    In order to use this filter you need to define a way of positioning the input stream tokens in the matrix. This is done using a ShingleMatrixFilter.TokenSettingsCodec. There are three simple implementations for demonstration purposes; see ShingleMatrixFilter.OneDimensionalNonWeightedTokenSettingsCodec, ShingleMatrixFilter.TwoDimensionalNonWeightedSynonymTokenSettingsCodec and ShingleMatrixFilter.SimpleThreeDimensionalTokenSettingsCodec.

    Consider this token matrix:

      Token[column][row][z-axis]{
        {{hello}, {greetings, and, salutations}},
        {{world}, {earth}, {tellus}}
      };
     
    It would produce the following shingles of 2 to 3 tokens:
     "hello_world"
     "greetings_and"
     "greetings_and_salutations"
     "and_salutations"
     "and_salutations_world"
     "salutations_world"
     "hello_earth"
     "and_salutations_earth"
     "salutations_earth"
     "hello_tellus"
     "and_salutations_tellus"
     "salutations_tellus"
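
    The listing above can be reproduced with a small sketch in plain Java. It is not the filter's implementation: the class name MatrixShingleSketch, the String[][][] layout mirroring the matrix notation, the choice to advance the first column fastest, and the LinkedHashSet used to drop duplicates are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MatrixShingleSketch {
    /** matrix[column][row] holds the consecutive tokens of one row. */
    static List<String> shingles(String[][][] matrix, int min, int max, String spacer) {
        Set<String> seen = new LinkedHashSet<>(); // remembers shingles already produced
        int[] rowIndex = new int[matrix.length];  // currently selected row per column
        while (true) {
            // Flatten the current row permutation into one token list.
            List<String> tokens = new ArrayList<>();
            for (int c = 0; c < matrix.length; c++) {
                for (String token : matrix[c][rowIndex[c]]) {
                    tokens.add(token);
                }
            }
            // Emit every window of min..max tokens; the set ignores repeats.
            for (int start = 0; start < tokens.size(); start++) {
                for (int size = min; size <= max && start + size <= tokens.size(); size++) {
                    seen.add(String.join(spacer, tokens.subList(start, start + size)));
                }
            }
            // Advance to the next permutation, first column fastest.
            int c = 0;
            while (c < matrix.length && ++rowIndex[c] == matrix[c].length) {
                rowIndex[c] = 0;
                c++;
            }
            if (c == matrix.length) {
                break; // every permutation has been visited
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        String[][][] matrix = {
            {{"hello"}, {"greetings", "and", "salutations"}},
            {{"world"}, {"earth"}, {"tellus"}}
        };
        // Prints the twelve shingles from the listing above, one per line.
        shingles(matrix, 2, 3, "_").forEach(System.out::println);
    }
}
```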
      

    This implementation can be rather heap demanding if (maximum shingle size - minimum shingle size) is large and the stream contains many columns, or if each column contains a large number of rows.

    The problem is that, in order to avoid producing duplicates, the filter needs to keep track of every shingle already produced and returned to the consumer. There is a bit of resource management to handle this, but it would of course be much better if the filter were written so that it never created the same shingle more than once in the first place.
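
    A rough feel for how quickly the work grows can be had from the arithmetic below. It is a back-of-the-envelope sketch, not a measurement of the filter's heap use; the class and method names are hypothetical. The number of row permutations is the product of the row counts of all columns, and each permutation of n tokens yields at most (n - size + 1) shingle candidates per shingle size.

```java
class ShingleGrowthSketch {
    /** Number of row permutations: the product of the row counts of all columns. */
    static long rowPermutations(int[] rowsPerColumn) {
        long product = 1;
        for (int rows : rowsPerColumn) {
            product *= rows;
        }
        return product;
    }

    /** Shingle candidates in one permutation of `tokens` tokens, for sizes min..max. */
    static long candidatesPerPermutation(int tokens, int min, int max) {
        long count = 0;
        for (int size = min; size <= max; size++) {
            count += Math.max(0, tokens - size + 1);
        }
        return count;
    }
}
```

    For the token matrix above, rowPermutations(new int[] {2, 3}) is 6, and the four-token permutation "greetings and salutations world" gives candidatesPerPermutation(4, 2, 3) = 5. Ten columns of three rows each already give 59049 permutations.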

    The filter also has basic support for calculating weights for the shingles based on the weights of the tokens from the input stream, output shingle size, etc. See calculateShingleWeight(org.apache.lucene.analysis.Token, java.util.List, int, java.util.List, java.util.List).

    • Field Detail

      • defaultSpacerCharacter

        public static Character defaultSpacerCharacter
        Deprecated.
      • ignoringSinglePrefixOrSuffixShingleByDefault

        public static boolean ignoringSinglePrefixOrSuffixShingleByDefault
        Deprecated.
    • Constructor Detail

      • ShingleMatrixFilter

        public ShingleMatrixFilter(ShingleMatrixFilter.Matrix matrix,
                                   int minimumShingleSize,
                                   int maximumShingleSize,
                                   Character spacerCharacter,
                                   boolean ignoringSinglePrefixOrSuffixShingle,
                                   ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
        Deprecated.
        Creates a shingle filter based on a user-defined matrix. The filter will delete columns from the input matrix! You will not be able to reset the filter if you use this constructor. TODO: don't touch the matrix; use a boolean, set the input stream to null or something similar, and keep track of the current position in the matrix.
        Parameters:
        matrix - the input base for creating shingles. Does not need to contain any information until incrementToken() is called the first time.
        minimumShingleSize - minimum number of tokens in any shingle.
        maximumShingleSize - maximum number of tokens in any shingle.
        spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
        ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
        settingsCodec - codec used to read input token weight and matrix positioning.
      • ShingleMatrixFilter

        public ShingleMatrixFilter(org.apache.lucene.analysis.TokenStream input,
                                   int minimumShingleSize,
                                   int maximumShingleSize)
        Deprecated.
        Creates a shingle filter using default settings.
        Parameters:
        input - stream from which to construct the matrix
        minimumShingleSize - minimum number of tokens in any shingle.
        maximumShingleSize - maximum number of tokens in any shingle.
        See Also:
        defaultSpacerCharacter, ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec
      • ShingleMatrixFilter

        public ShingleMatrixFilter(org.apache.lucene.analysis.TokenStream input,
                                   int minimumShingleSize,
                                   int maximumShingleSize,
                                   Character spacerCharacter)
        Deprecated.
        Creates a shingle filter using default settings.
        Parameters:
        input - stream from which to construct the matrix
        minimumShingleSize - minimum number of tokens in any shingle.
        maximumShingleSize - maximum number of tokens in any shingle.
        spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
        See Also:
        ignoringSinglePrefixOrSuffixShingleByDefault, defaultSettingsCodec
      • ShingleMatrixFilter

        public ShingleMatrixFilter(org.apache.lucene.analysis.TokenStream input,
                                   int minimumShingleSize,
                                   int maximumShingleSize,
                                   Character spacerCharacter,
                                   boolean ignoringSinglePrefixOrSuffixShingle)
        Deprecated.
        Creates a shingle filter using the default ShingleMatrixFilter.TokenSettingsCodec.
        Parameters:
        input - stream from which to construct the matrix
        minimumShingleSize - minimum number of tokens in any shingle.
        maximumShingleSize - maximum number of tokens in any shingle.
        spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
        ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
        See Also:
        defaultSettingsCodec
      • ShingleMatrixFilter

        public ShingleMatrixFilter(org.apache.lucene.analysis.TokenStream input,
                                   int minimumShingleSize,
                                   int maximumShingleSize,
                                   Character spacerCharacter,
                                   boolean ignoringSinglePrefixOrSuffixShingle,
                                   ShingleMatrixFilter.TokenSettingsCodec settingsCodec)
        Deprecated.
        Creates a shingle filter with ad hoc parameter settings.
        Parameters:
        input - stream from which to construct the matrix
        minimumShingleSize - minimum number of tokens in any shingle.
        maximumShingleSize - maximum number of tokens in any shingle.
        spacerCharacter - character to use between texts of the token parts in a shingle. null for none.
        ignoringSinglePrefixOrSuffixShingle - if true, shingles that only contain permutations of the first or the last column will not be produced as shingles. Useful when adding boundary marker tokens such as '^' and '$'.
        settingsCodec - codec used to read input token weight and matrix positioning.
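
        The ignoringSinglePrefixOrSuffixShingle parameter can be illustrated with a sketch in plain Java. It shows one reading of the parameter description, not the filter's code: it assumes a flat stream in which every token is its own column, and it assumes that the shingles suppressed are the single-token ones taken from the first or the last column. The class name BoundaryShingles is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryShingles {
    static List<String> shingles(List<String> tokens, int min, int max,
                                 char spacer, boolean ignoreSinglePrefixOrSuffix) {
        List<String> result = new ArrayList<>();
        int last = tokens.size() - 1;
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = min; size <= max && start + size <= tokens.size(); size++) {
                // Assumption: a "single prefix or suffix shingle" is a one-token
                // shingle made of the first or the last token only.
                boolean boundaryOnly = size == 1 && (start == 0 || start == last);
                if (ignoreSinglePrefixOrSuffix && boundaryOnly) {
                    continue;
                }
                result.add(String.join(String.valueOf(spacer),
                                       tokens.subList(start, start + size)));
            }
        }
        return result;
    }
}
```

        For the tokens "^", "hello", "world", "$" with sizes 1 to 2 and spacer '_', passing true yields "^_hello", "hello", "hello_world", "world", "world_$". Passing false additionally yields the lone boundary markers "^" and "$".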
    • Method Detail

      • reset

        public void reset()
                   throws IOException
        Deprecated.
        Overrides:
        reset in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • incrementToken

        public final boolean incrementToken()
                                     throws IOException
        Deprecated.
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • updateToken

        public void updateToken(org.apache.lucene.analysis.Token token,
                                List<org.apache.lucene.analysis.Token> shingle,
                                int currentPermutationStartOffset,
                                List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows,
                                List<org.apache.lucene.analysis.Token> currentPermuationTokens)
        Deprecated.
        Final touch of a shingle token before it is passed on to the consumer from method incrementToken(). Calculates and sets type, flags, position increment, start/end offsets and weight.
        Parameters:
        token - Shingle token
        shingle - Tokens used to produce the shingle token.
        currentPermutationStartOffset - Start offset in parameter currentPermutationTokens
        currentPermutationRows - index to Matrix.Column.Row from the position of tokens in parameter currentPermutationTokens
        currentPermuationTokens - tokens of the current permutation of rows in the matrix.
      • calculateShingleWeight

        public float calculateShingleWeight(org.apache.lucene.analysis.Token shingleToken,
                                            List<org.apache.lucene.analysis.Token> shingle,
                                            int currentPermutationStartOffset,
                                            List<ShingleMatrixFilter.Matrix.Column.Row> currentPermutationRows,
                                            List<org.apache.lucene.analysis.Token> currentPermuationTokens)
        Deprecated.
        Evaluates the new shingle token weight.

          for (shingle part token in shingle)
            weight += shingle part token weight * (1 / sqrt(all shingle part token weights summed))

        This algorithm gives a slightly greater score for longer shingles and is rather penalising to great shingle token part weights.
        Parameters:
        shingleToken - token returned to consumer
        shingle - the tokens used to produce the shingle token.
        currentPermutationStartOffset - start offset in parameter currentPermutationRows and currentPermutationTokens.
        currentPermutationRows - an index of the matrix row in which each token in parameter currentPermutationTokens exists.
        currentPermuationTokens - all tokens in the current row permutation of the matrix. A sub list (parameter offset, parameter shingle.size) equals parameter shingle.
        Returns:
        weight to be set for parameter shingleToken
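
        The documented formula can be sketched in plain Java as follows. This is an illustration of the pseudocode in the description, not the method itself: the class name ShingleWeightSketch is hypothetical, and the part weights are passed in directly as a double[] instead of being read from tokens through a TokenSettingsCodec.

```java
class ShingleWeightSketch {
    /** Sums each part weight scaled by 1 / sqrt(total of all part weights). */
    static float shingleWeight(double[] partWeights) {
        double total = 0d;
        for (double w : partWeights) {
            total += w;
        }
        // Note: a total of 0 makes the factor infinite and the result NaN.
        double factor = 1d / Math.sqrt(total);
        double weight = 0d;
        for (double w : partWeights) {
            weight += w * factor;
        }
        return (float) weight;
    }
}
```

        Algebraically the result equals the square root of the summed part weights. Two parts of weight 1.0 give about 1.41 and three give about 1.73, so longer shingles score slightly higher, while a single part of weight 4.0 gives only 2.0, which is the penalty on large part weights.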
      • getMinimumShingleSize

        public int getMinimumShingleSize()
        Deprecated.
      • setMinimumShingleSize

        public void setMinimumShingleSize(int minimumShingleSize)
        Deprecated.
      • getMaximumShingleSize

        public int getMaximumShingleSize()
        Deprecated.
      • setMaximumShingleSize

        public void setMaximumShingleSize(int maximumShingleSize)
        Deprecated.
      • getSpacerCharacter

        public Character getSpacerCharacter()
        Deprecated.
      • setSpacerCharacter

        public void setSpacerCharacter(Character spacerCharacter)
        Deprecated.
      • isIgnoringSinglePrefixOrSuffixShingle

        public boolean isIgnoringSinglePrefixOrSuffixShingle()
        Deprecated.
      • setIgnoringSinglePrefixOrSuffixShingle

        public void setIgnoringSinglePrefixOrSuffixShingle(boolean ignoringSinglePrefixOrSuffixShingle)
        Deprecated.