Class FSTLookup


  • @Deprecated
    public class FSTLookup
    extends Lookup
    Deprecated.
    Use FSTCompletionLookup instead.
    Finite state automata based implementation of Lookup query suggestion/ autocomplete interface.

    Implementation details

    The construction step in build(TermFreqIterator) works as follows:

    • A set of input terms (String) and weights (float) is given.
    • The range of weights is determined and then all weights are discretized into a fixed set of values (buckets). Note that this means that minor changes in weights may be lost during automaton construction. In general, this is not a big problem because the "priorities" of completions can be split into a fixed set of classes (even as rough as: very frequent, frequent, baseline, marginal). If you need exact, fine-grained weights, use TSTLookup instead.
    • All terms in the input are preprended with a synthetic pseudo-character being the weight of that term. For example a term abc with a discretized weight equal '1' would become 1abc.
    • The terms are sorted by their raw value of utf16 character values (including the synthetic term in front).
    • A finite state automaton (FST) is constructed from the input. The root node has arcs labeled with all possible weights. We cache all these arcs, highest-weight first.

    At runtime, in lookup(CharSequence, boolean, int), the automaton is utilized as follows:

    • For each possible term weight encoded in the automaton (cached arcs from the root above), starting with the highest one, we descend along the path of the input key. If the key is not a prefix of a sequence in the automaton (path ends prematurely), we exit immediately. No completions.
    • Otherwise, we have found an internal automaton node that ends the key. The entire subautomaton (all paths) starting from this node form the key's completions. We start the traversal of this subautomaton. Every time we reach a final state (arc), we add a single suggestion to the list of results (the weight of this suggestion is constant and equal to the root path we started from). The tricky part is that because automaton edges are sorted and we scan depth-first, we can terminate the entire procedure as soon as we collect enough suggestions the user requested.
    • In case the number of suggestions collected in the step above is still insufficient, we proceed to the next (smaller) weight leaving the root node and repeat the same algorithm again.

    Runtime behavior and performance characteristic

    The algorithm described above is optimized for finding suggestions to short prefixes in a top-weights-first order. This is probably the most common use case: it allows presenting suggestions early and sorts them by the global frequency (and then alphabetically).

    If there is an exact match in the automaton, it is returned first on the results list (even with by-weight sorting).

    Note that the maximum lookup time for any prefix is the time of descending to the subtree, plus traversal of the subtree up to the number of requested suggestions (because they are already presorted by weight on the root level and alphabetically at any node level).

    To order alphabetically only (no ordering by priorities), use identical term weights for all terms. Alphabetical suggestions are returned even if non-constant weights are used, but the algorithm for doing this is suboptimal.

    "alphabetically" in any of the documentation above indicates utf16 codepoint order, nothing else.

    • Field Detail

    • Constructor Detail

      • FSTLookup

        public FSTLookup()
        Deprecated.
      • FSTLookup

        public FSTLookup​(int buckets,
                         boolean exactMatchFirst)
        Deprecated.
    • Method Detail

      • get

        public Float get​(CharSequence key)
        Deprecated.
        Get the (approximated) weight of a single key (if there is a perfect match for it in the automaton).
        Returns:
        Returns the approximated weight of the input key or null if not found.
      • lookup

        public List<Lookup.LookupResult> lookup​(CharSequence key,
                                                boolean onlyMorePopular,
                                                int num)
        Deprecated.
        Lookup autocomplete suggestions to key.
        Specified by:
        lookup in class Lookup
        Parameters:
        key - The prefix to which suggestions should be sought.
        onlyMorePopular - Return most popular suggestions first. This is the default behavior for this implementation. Setting it to false has no effect (use constant term weights to sort alphabetically only).
        num - At most this number of suggestions will be returned.
        Returns:
        Returns the suggestions, sorted by their approximated weight first (decreasing) and then alphabetically (utf16 codepoint order).
      • store

        public boolean store​(OutputStream output)
                      throws IOException
        Deprecated.
        Description copied from class: Lookup
        Persist the constructed lookup data to a directory. Optional operation.
        Specified by:
        store in class Lookup
        Parameters:
        output - OutputStream to write the data to.
        Returns:
        true if successful, false if unsuccessful or not supported.
        Throws:
        IOException - when fatal IO error occurs.
      • load

        public boolean load​(InputStream input)
                     throws IOException
        Deprecated.
        Description copied from class: Lookup
        Discard current lookup data and load it from a previously saved copy. Optional operation.
        Specified by:
        load in class Lookup
        Parameters:
        input - the InputStream to load the lookup data.
        Returns:
        true if completed successfully, false if unsuccessful or not supported.
        Throws:
        IOException - when fatal IO error occurs.