Sequence data normally needs a lot of memory. To be able to
handle thousands of sequences, we implemented online
compression. All data is kept compressed most of the time and is
only uncompressed on demand. As a user, you will only notice
smaller database files.
Without understanding the data, the program can compress it only
by a limited factor. With the help of a tree, aligned sequences
can be compressed much better by storing only the differences
to a consensus sequence.
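
The idea of storing only the differences to a consensus can be
illustrated with a minimal Python sketch. It is not the program's
actual implementation: the function names are hypothetical, the
sequences are assumed to be aligned to equal length, and a single
consensus is built for one group of sequences (how the program
derives groups and consensus sequences from the tree is not
described here).

```python
from collections import Counter

def consensus(seqs):
    """Build a consensus: the most common character in each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def diff_encode(seq, cons):
    """Keep only the (position, character) pairs where seq differs from cons."""
    return [(i, c) for i, (c, k) in enumerate(zip(seq, cons)) if c != k]

def diff_decode(diffs, cons):
    """Rebuild the original sequence from the consensus and the stored differences."""
    chars = list(cons)
    for i, c in diffs:
        chars[i] = c
    return "".join(chars)

# Hypothetical aligned sequences that would sit close together in a tree.
seqs = ["ACGU-ACGU", "ACGU-ACGA", "ACGUUACGU"]
cons = consensus(seqs)
encoded = [diff_encode(s, cons) for s in seqs]
```

The more similar the grouped sequences are, the fewer differences
have to be stored, which is why a tree that groups related
sequences helps.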
Once a sequence is compressed using a tree, it keeps this
better compression method until the sequence is changed. After
that, only the older method (without a tree) is used for it.
As long as you change only a few (up to 100) sequences, the
database won't grow very much.
To compress the entire database, the program needs a tree
that covers most of the sequences. The larger and better
the tree, the better the compression.