Maybe you had a lot of files scattered around on different drives, and you added them all into a single git-annex repository. Some of the files are surely duplicates of others.

While git-annex stores the file contents efficiently, it would still help in cleaning up this mess if you could find, and perhaps remove, the duplicate files.

Here's a command line that will show duplicate sets of files grouped together:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --all-repeated=separate -f1 | \
    sed 's/ [^ ]*$//'
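
For example, the output might look something like this (filenames made up; each blank-line-separated group shares a single key, i.e. identical content):

music/foo.mp3
music/jazz/foo.mp3

photos/img0001.jpg
photos/backup/img0001.jpg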

Here's a command line that will remove one of each duplicate set of files:

git annex find --include '*' --format='${file} ${escaped_key}\n' | \
    sort -k2 | uniq --repeated -f1 | sed 's/ [^ ]*$//' | \
    xargs -d '\n' git rm
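
Two caveats. First, uniq --repeated prints just one line per duplicate group, so if a set has more than two copies you may need to rerun the pipeline until nothing is left to remove. Second, git rm only removes files from the tree; the annexed content itself stays in .git/annex/objects. After committing the removals, something along these lines should reclaim the space (a sketch; pick the numbers to drop from what unused actually reports):

git annex unused
git annex dropunused 1    # repeat for each number that unused reported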

--Joey

Very nice :) Just for reference, here's my Perl implementation. As per this discussion, it would be interesting to benchmark these two approaches and see if one is substantially more efficient than the other w.r.t. CPU and memory usage.
Comment by http://adamspiers.myopenid.com/ Fri Dec 23 19:16:50 2011
Note that the sort -k2 doesn't work right for filenames with spaces in them. On the other hand, git rm doesn't seem to like the escaped names from ${escaped_file}.
Comment by bremner Wed Sep 5 02:12:18 2012
problems with spaces in filenames

Spaces and other special chars can make filename handling ugly. If you don't need to keep the exact filenames, it might be easiest just to get rid of the problematic chars.

#!/bin/bash

# Strip characters that give the shell pipelines above trouble
# ([ ] ' and spaces) from all filenames in the repository.
function process() {
    dir="$1"
    echo "processing $dir"
    pushd "$dir" >/dev/null 2>&1

    for fileOrDir in *; do
        # skip the literal '*' glob when the directory is empty;
        # -L keeps dangling annex symlinks from being skipped
        [ -e "$fileOrDir" ] || [ -L "$fileOrDir" ] || continue
        nfileOrDir=$(echo "$fileOrDir" | sed -e 's/\[//g' -e 's/\]//g' -e 's/ /_/g' -e "s/'//g")
        if [ "$fileOrDir" != "$nfileOrDir" ]; then
            echo "renaming $fileOrDir to $nfileOrDir"
            git mv "$fileOrDir" "$nfileOrDir"
        else
            echo "skipping $fileOrDir, no need to rename."
        fi
    done

    # recurse into the (possibly just renamed) subdirectories
    find ./ -mindepth 1 -maxdepth 1 -type d | while IFS= read -r d; do
        process "$d"
    done
    popd >/dev/null 2>&1
}

process .

Maybe you can run something like this before checking for duplicates.
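
For example (a sketch; the script path and commit message are hypothetical):

cd /path/to/annex
bash ~/bin/sanitize-names.sh
git commit -m 'strip problematic characters from filenames'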

Comment by mhameed Wed Sep 5 08:38:56 2012
Ironically, a previous renaming to remove spaces, plus some syncing, is how I ended up with these duplicates. For what it is worth, aspiers' Perl script worked out for me with a small modification: I just printed only the duplicates with spaces in them (quoted).
Comment by bremner Sun Sep 9 19:33:01 2012
Since the keys are sure to have no spaces in them, putting them first makes working with the output simpler for tools like sort, uniq, and awk.
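
For instance, something like this groups duplicates even when filenames contain spaces (a sketch; it uses a tab separator, so it still assumes no tabs or newlines in filenames):

git annex find --include '*' --format='${key}\t${file}\n' | \
    sort | awk -F'\t' '
        $1 == prev { if (first != "") { print first; first = "" }; print $2 }
        $1 != prev { prev = $1; first = $2 }
    '

This prints every member of each duplicate set; the first file of a set is held back until a second one with the same key confirms it is a duplicate.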

Is there any simple way to search for files with a given key?

At the moment, the best I've come up with is this:

git annex find --include '*' --format='${key} ${file}\n' | grep <KEY>

where <KEY> is the key. This seems like an awfully long-winded approach, but I don't see anything in the docs indicating a simpler way to do it. Am I missing something?
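
Using a fixed-string match is slightly more robust, since the dots in a key aren't then treated as regex metacharacters (a sketch; the key shown is made up):

KEY='SHA256E-s1048576--0123456789abcdef.mp3'
git annex find --include '*' --format='${key} ${file}\n' | grep -F "$KEY"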

@Chris I guess there's no really easy way because searching for a given key is not something many people need to do.

However, git does provide a way. Try git log --stat -S $KEY

Comment by http://joeyh.name/ Mon May 13 18:42:14 2013