Class JapaneseTokenizer

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public final class JapaneseTokenizer
    extends org.apache.lucene.analysis.Tokenizer
    Tokenizer for Japanese that uses morphological analysis.

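    This tokenizer sets a number of additional attributes:

      • BaseFormAttribute containing base form for inflected adjectives and verbs.
      • PartOfSpeechAttribute containing part-of-speech.
      • ReadingAttribute containing reading and pronunciation.
      • InflectionAttribute containing additional part-of-speech information for inflected forms.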

    This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compound (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after applying penalties to long tokens. If one exists and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
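
    The least-cost search and the long-token penalty described above can be illustrated with a toy dynamic-programming segmenter. This is a minimal sketch and not the tokenizer's actual implementation: the class ToyViterbiSegmenter, its word costs, the unknown-character cost, and the maxLen and longPenalty parameters are all hypothetical, and it ignores the connection costs between adjacent tokens that a real morphological analyzer would also score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy least-cost segmenter (illustration only; all names and costs are hypothetical). */
public class ToyViterbiSegmenter {
    // Assumed cost for a single character that is not in the toy dictionary.
    private static final int UNKNOWN_CHAR_COST = 1000;

    // Hypothetical word costs: lower cost means a more likely token.
    private final Map<String, Integer> wordCost = new HashMap<>();

    public ToyViterbiSegmenter add(String word, int cost) {
        wordCost.put(word, cost);
        return this;
    }

    /**
     * Returns the least-cost segmentation of text. Any candidate token longer
     * than maxLen characters has longPenalty added to its cost, which mimics
     * penalizing long (possibly compound) tokens.
     */
    public List<String> segment(String text, int maxLen, int longPenalty) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: least cost of segmenting text[0, i)
        int[] back = new int[n + 1];   // back[i]: start index of the last token on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Long.MAX_VALUE) {
                    continue; // position not reachable
                }
                String candidate = text.substring(start, end);
                Integer cost = wordCost.get(candidate);
                if (cost == null) {
                    if (end - start != 1) {
                        continue; // unknown multi-character spans are not tokens
                    }
                    cost = UNKNOWN_CHAR_COST; // fall back to a single unknown character
                }
                long penalty = candidate.length() > maxLen ? longPenalty : 0;
                long total = best[start] + cost + penalty;
                if (total < best[end]) {
                    best[end] = total;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the path.
        List<String> tokens = new ArrayList<>();
        for (int pos = n; pos > 0; pos = back[pos]) {
            tokens.add(text.substring(back[pos], pos));
        }
        Collections.reverse(tokens);
        return tokens;
    }
}
```

    With the hypothetical entries "abcdef" (cost 250) and "ab", "cd", "ef" (cost 100 each), segment("abcdef", 2, 0) keeps the single long token, while segment("abcdef", 2, 500) prefers the three shorter tokens. The second result plays the role of the alternate segmentation that search mode would emit in addition to the original token.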

    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  JapaneseTokenizer.Mode
      Tokenization mode: this determines how the tokenizer handles compound and unknown words.
      • Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

        org.apache.lucene.util.AttributeSource.AttributeFactory, org.apache.lucene.util.AttributeSource.State
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static JapaneseTokenizer.Mode DEFAULT_MODE
      Default tokenization mode.
      • Fields inherited from class org.apache.lucene.analysis.Tokenizer

        input
    • Method Summary

      Modifier and Type Method Description
      void end()  
      boolean incrementToken()  
      void reset()  
      void reset(Reader input)
      void setGraphvizFormatter(GraphvizFormatter dotOut)
      Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.
      • Methods inherited from class org.apache.lucene.analysis.Tokenizer

        close, correctOffset
      • Methods inherited from class org.apache.lucene.util.AttributeSource

        addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString
    • Constructor Detail

      • JapaneseTokenizer

        public JapaneseTokenizer(Reader input,
                                 UserDictionary userDictionary,
                                 boolean discardPunctuation,
                                 JapaneseTokenizer.Mode mode)
        Create a new JapaneseTokenizer.
        Parameters:
        input - Reader containing text
        userDictionary - Optional user dictionary; may be null.
        discardPunctuation - true if punctuation tokens should be dropped from the output.
        mode - tokenization mode.
    • Method Detail

      • setGraphvizFormatter

        public void setGraphvizFormatter(GraphvizFormatter dotOut)
        Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.
      • reset

        public void reset(Reader input)
                   throws IOException
        Overrides:
        reset in class org.apache.lucene.analysis.Tokenizer
        Throws:
        IOException
      • reset

        public void reset()
                   throws IOException
        Overrides:
        reset in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException
      • end

        public void end()
        Overrides:
        end in class org.apache.lucene.analysis.TokenStream
      • incrementToken

        public boolean incrementToken()
                               throws IOException
        Specified by:
        incrementToken in class org.apache.lucene.analysis.TokenStream
        Throws:
        IOException