When a file is annexed, a key is generated from its content and/or metadata. The file checked into git symlinks to the key. This key can later be used to retrieve the file's content (its value).
Multiple pluggable key-value backends are supported, and a single repository can use different ones for different files.
These are the recommended backends to use.
SHA256E -- The default backend for new files, combines a 256 bit SHA-2 hash of the file's content with the file's extension. This allows verifying that the file content is right, and can avoid duplicates of files with the same content. Its need to generate checksums can make it slower for large files.
SHA256 -- SHA-2 hash that does not include the file extension in the key, which can lead to better deduplication but can confuse some programs.
SHA512, SHA512E -- Best SHA-2 hash, for the very paranoid.
SHA384, SHA384E, SHA224, SHA224E -- SHA-2 hashes for people who like unusual sizes.
SHA3_512, SHA3_512E, SHA3_384, SHA3_384E, SHA3_256, SHA3_256E, SHA3_224, SHA3_224E -- SHA-3 hashes, for bleeding edge fun.
SKEIN512, SKEIN512E, SKEIN256, SKEIN256E -- Skein hash, a well-regarded SHA3 hash competition finalist.
BLAKE2B160, BLAKE2B224, BLAKE2B256, BLAKE2B384, BLAKE2B512, BLAKE2B160E, BLAKE2B224E, BLAKE2B256E, BLAKE2B384E, BLAKE2B512E -- Fast Blake2 hash variants optimised for 64 bit platforms.
BLAKE2S160, BLAKE2S224, BLAKE2S256, BLAKE2S160E, BLAKE2S224E, BLAKE2S256E -- Fast Blake2 hash variants optimised for 32 bit platforms.
BLAKE2SP224, BLAKE2SP256, BLAKE2SP224E, BLAKE2SP256E -- Fast Blake2 hash variants optimised for 8-way CPUs.
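As a rough illustration of how an E backend differs from its plain counterpart, the sketch below builds a SHA256E-style key from a file's size, its SHA-256 digest and its last extension. The function name sha256e_key is hypothetical, the key shape is assumed from the SHA256-s<size>--<hash> form visible in symlink targets, and git-annex's own rules for which extensions it keeps are not reproduced, so treat the result as an approximation rather than what git-annex would generate. It also assumes the coreutils sha256sum tool is installed.

```bash
#!/usr/bin/env bash
# Build a SHA256E-style key: SHA256E-s<size>--<sha256 hex><.extension>
# Assumption: only the last extension of the file name is kept.
sha256e_key() {
    local file=$1 name size hash ext=""
    name=${file##*/}                              # file name without directories
    size=$(wc -c < "$file" | tr -d ' ')           # content size in bytes
    hash=$(sha256sum < "$file" | cut -d' ' -f1)   # hex digest of the content
    # Keep an extension only when the name has both a stem and a suffix.
    case "$name" in
        ?*.?*) ext=".${name##*.}" ;;
    esac
    printf 'SHA256E-s%s--%s%s\n' "$size" "$hash" "$ext"
}

# Example: sha256e_key music/test.ogg
```

In this sketch the file's name, path and modification time never enter the key, so two copies with the same content and the same extension produce the same key. The extension is compared as written, so photo.jpg and photo.JPG end up with different keys.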
The backends below do not guarantee cryptographically that the content of an annexed file remains unchanged.
SHA1, SHA1E, MD5, MD5E -- Smaller hashes than SHA256 for those who want a checksum but are not concerned about security.
WORM ("Write Once, Read Many") -- This assumes that any file with the same filename, size, and modification time has the same content. This is the least expensive backend, recommended for really large files or slow systems.
URL -- This is a key that is generated from the url to a file. It's generated when using e.g. git annex addurl --fast, when the file content is not available for hashing.
If you want to be able to prove that you're working with the same file contents that were checked into a repository earlier, you should avoid using the non-cryptographically-secure backends, and will need to use signed git commits. See using signed git commits for details.
Retrieval of WORM and URL from many special remotes is prohibited for security reasons.
Note that the various 512 and 384 length hashes result in long paths, which are known to not work on Windows. If interoperability on Windows is a concern, avoid those.
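To see why the longer hashes are a problem, here is a back-of-the-envelope calculation of the object path length. It assumes the .git/annex/objects/XX/YY/KEY/KEY layout seen in symlink targets in a Linux repository, a hex-encoded digest, and a hypothetical helper name annex_path_length; the layout used on Windows may differ, so the numbers are only indicative.

```bash
#!/usr/bin/env bash
# Estimate the length of .git/annex/objects/XX/YY/KEY/KEY for a SHA-2 key.
# Usage: annex_path_length HASH_BITS FILE_SIZE_IN_BYTES
annex_path_length() {
    local bits=$1 size=$2
    local prefix=".git/annex/objects/fX/pz/"   # two 2-character hash directories
    local key_head="SHA${bits}-s${size}--"     # backend name, size field, separator
    local hex_len=$(( bits / 4 ))              # one hex digit encodes 4 bits
    local key_len=$(( ${#key_head} + hex_len ))
    # The key appears twice: once as a directory, once as the file name.
    echo $(( ${#prefix} + key_len + 1 + key_len ))
}

for bits in 224 256 384 512; do
    printf 'SHA%s: %s characters\n' "$bits" "$(annex_path_length "$bits" 71983)"
done
```

For a 71983-byte file this gives 184 characters for SHA256 and 312 for SHA512, counted from the top of the repository. The SHA512 figure is already past the 260-character MAX_PATH limit of the classic Windows file APIs before the repository's own location is added.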
The annex.backend git-config setting can be used to configure the default backend to use when adding new files.
For finer control of what backend is used when adding different types of files, the .gitattributes file can be used. The annex.backend attribute can be set to the name of the backend to use for matching files.
For example, to use the SHA256E backend for sound files, which tend to be smallish and might be modified or copied over time, while using the WORM backend for everything else, you could set in .gitattributes:
* annex.backend=WORM
*.mp3 annex.backend=SHA256E
*.ogg annex.backend=SHA256E
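One way to check which backend a path would pick up from .gitattributes is plain git check-attr, which reports attribute values without involving git-annex. The wrapper name backend_for is made up for this sketch, and it only shows the attribute; it does not account for the annex.backend git-config setting or a --backend option.

```bash
#!/usr/bin/env bash
# Print the annex.backend attribute that .gitattributes assigns to each path.
# Run inside the repository; the paths do not have to exist yet.
backend_for() {
    local path
    for path in "$@"; do
        # check-attr prints "<path>: annex.backend: <value>"; keep the value.
        git check-attr annex.backend -- "$path" | sed 's/^.*: annex\.backend: //'
    done
}

# Example: backend_for song.mp3 notes.txt
```

With the .gitattributes shown above, song.mp3 should report SHA256E and notes.txt should report WORM, because a later matching line overrides an earlier one. A path matched by no rule at all would be reported as unspecified.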
It turns out that (at least on x86-64 machines) SHA512 is faster than SHA256. In some benchmarks I performed [1], SHA256 was 1.8–2.2x slower than SHA1 while SHA512 was only 1.5–1.6x slower. SHA224 and SHA384 are effectively just truncated versions of SHA256 and SHA512 so their performance characteristics are identical.
[1] time head -c 100000000 /dev/zero | shasum -a 512
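The footnote's one-liner can be repeated across hash sizes with a small loop. This sketch assumes bash and the coreutils sha1sum, sha224sum, sha256sum, sha384sum and sha512sum tools rather than shasum, and the function name bench_sha is invented here. Absolute timings depend heavily on the CPU and on how the tools were built, so only the ratios are worth comparing.

```bash
#!/usr/bin/env bash
# Time each coreutils SHA tool over the same stream of zero bytes.
# Usage: bench_sha [BYTES]   (defaults to 100000000, as in the footnote)
bench_sha() {
    local bytes=${1:-100000000} algo elapsed
    local TIMEFORMAT=%R   # make bash's `time` keyword print elapsed seconds only
    for algo in 1 224 256 384 512; do
        # `time` reports on stderr, so capture the group's stderr.
        elapsed=$( { time head -c "$bytes" /dev/zero | "sha${algo}sum" >/dev/null; } 2>&1 )
        printf 'SHA%s\t%ss\n' "$algo" "$elapsed"
    done
}

# Example: bench_sha 100000000
```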
In case you came here looking for the URL backend.
The URL backend
Several documents on the web refer to a special "URL backend", e.g. Large file management with git-annex [LWN.net]. Historical content will never be updated yet it drives people to living places.
Why a URL backend?
It is interesting because you can have git-annex rest on the fact that some documents are available as extra copies at any time (but from something that is not a git repository).
How/Where now?
git-annex used to have a URL backend. It seems that the design changed into a "special remote" feature, not limited to the web. You can now track files available through plain directories, rsync, webdav, some cloud storage, etc, even clay tablets. For details see special remotes.
It's a bit confusing to read that SHA256 does not include the file extension, from which I can deduce that SHA256E does include it. What else does it include? I used to "seed" my git-annex with locally available data by "git-annex add"-ing it in a temporary folder without doing a commit, and then initiating a copy from the slow remote annex repo. My theory was that the remote copy sees the pre-seeded files and does not need to copy them again.
But does this theory hold true for different file names, extensions, modification dates, and full paths? Maybe you could also link to the code that implements the different backends so that curious readers can check for themselves.
Thank you!
I'd really like to have a SHA256e backend -- same as SHA256E but making sure that extensions of the files in .git/annex are converted to lower case. I normally try to convert filenames from cameras etc to lower case, but not all people that I share annex with do so consistently. In my use case, I need to be able to find duplicates among files and .jpg vs .JPG throws git annex dedup off. Otherwise E backends are superior to non-E for me. Thanks, Michael.
Related to the question posed in http://git-annex.branchable.com/forum/switching_backends/ can git annex be told to use the existing backend for a given file?
The use case for this is that you have an existing repo that started out e.g. with SHA256, but new files are being added with SHA256E since that's the default now.
But I was doing:
And was expecting it to show no changes for existing files, but it did, it would be nice if that was not the case.
Use git annex add --backend=SHA256 to temporarily override the backend.
the SHA* backends generate too-complicated paths:
lrwxrwxrwx 1 root root 193 Apr 22 2009 test.ogg -> ../../../.git/annex/objects/fX/pz/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890/SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890
I don't want the additional directory. What is it for?? It contains exactly one file and adds a couple of disk seeks to file lookup.
@Matthias, that directory structure is not controlled by the backend. It is explained in internals.
Probably in many cases MD5SUM might be sufficient to cover the space of the available load. Or is use of an MD5SUM hash really not recommended for non-encryption-critical cases too?
I've added MD5 and MD5E. Of course, if you choose to use these, or the WORM backend, you give up the cryptographic verification that the content currently in your repository is the same content that was in it before. Whether that matters in your application is up to you.
It's not explicit, but 'git annex info $FILE' tells you the key, which has the backend as its first component:
I don't think there are any situations where the first component of the key isn't the backend, but don't hold me to that, please :)
Or I could not be an idiot and tell you the command specifically looking up a key for a file: lookupkey
So to get the backend (if the first component is always the backend):
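A sketch of that lookup in shell, assuming (as the comment above does) that the backend is always the text before the first hyphen of the key. git annex lookupkey is the real command; the two function names are only for illustration.

```bash
#!/usr/bin/env bash
# Print the backend part of a git-annex key: everything before the first "-".
backend_of_key() {
    printf '%s\n' "${1%%-*}"
}

# Look up the key of an annexed file, then print its backend.
# Needs to run inside a git-annex repository.
backend_of_file() {
    local key
    key=$(git annex lookupkey "$1") || return 1
    backend_of_key "$key"
}

# Key taken from the symlink listing earlier on this page; prints SHA256.
backend_of_key "SHA256-s71983--4a55ff578b4c592c06a1f4d9e0f8a6949ea9961d9717fc22e7b3c412620ac890"
```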
Hi,
I'd like to be able to verify the consistency of the files on an rsync remote without having access to the git repository or the gpg key. This can easily be done with unencrypted files by running "sha256sum filename". Is there a way to do the same thing with encrypted files?
Thank you very much!
@junk, this page is not really the place to ask such an unrelated question. Please use the forum for such questions.
(Anyway, git-annex uses gpg to encrypt data, so you can perhaps use gpg to check the embedded checksum, but I have never done it, and git-annex certainly doesn't support doing it.)
Hello,
TL;DR: I second Michael's wish for hashing backends that align extensions to lowercase.
Context: files with same content, extensions have different case
I realized a moment ago that git-annex basically automatically deduplicates with file granularity, which is very nice... unless duplicates have varying case, which does happen. For some cameras, if you download files through a cable you get one file name with one case, if you read the card directly with a card reader you get another case (and another filename, by the way).
I invite anyone interested to drop a line here.
Workaround
I understand I can align case after the fact with a bash shell command like below. Beware: the man page of rename says there exist other versions that don't check for the destination file, so the line below in some specific case (two files with same name, different content, file name only differs in the case of the extension) might cause you to lose some information. Or perhaps other cases. Make sure you know what you do, I'm not responsible.
If you prefer to align to upper-case, replace ,, with ^^. This is bash syntax.
Please consider SHA256e backend (and others).
Anyway the shell command above is a workaround. A case-insensitive hashing backend seems a natural thing to do. It would bring the best of both worlds: deduplicate efficiently while not confusing programs that depend on the symlink target having a particular extension.
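A minimal sketch of the lowercase-extension workaround described in this comment, assuming bash 4 or later for the ,, expansion. It is not the commenter's original command: the function name lowercase_ext is made up, it uses mv rather than rename, and it skips a file instead of overwriting when the lowercase name is already taken. Inside a git repository you would probably want git mv instead of mv.

```bash
#!/usr/bin/env bash
# Lowercase the last extension of each given file, never overwriting.
lowercase_ext() {
    local src base ext dst
    for src in "$@"; do
        # Only handle names that have both a stem and an extension.
        case "${src##*/}" in
            ?*.?*) ;;
            *) continue ;;
        esac
        base=${src%.*}          # path without the last extension
        ext=${src##*.}          # last extension, without the dot
        dst="$base.${ext,,}"    # ,, lowercases; ^^ would uppercase instead
        [ "$src" = "$dst" ] && continue
        if [ -e "$dst" ]; then
            echo "skipping $src: $dst already exists" >&2
            continue
        fi
        mv -- "$src" "$dst"
    done
}

# Example: lowercase_ext photos/*.JPG
```

On a case-insensitive filesystem the existence check sees the source file itself, so every file is skipped there; the sketch is only meaningful on case-sensitive filesystems.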