Work on git-annex is crowdfunded. Joey blogs about his progress here on a semi-daily basis.

Spent a few hours improving gcrypt in some minor ways, including adding a --check option that the assistant can use to find out if a given repo is encrypted with gcrypt, and also tell if the necessary gpg key is available to decrypt it. Also merged in a fix to support subkeys, developed by a git-annex user who is the first person I've heard from who is using gcrypt. I don't want to maintain gcrypt, so I am glad its author has shown up again today.

Got mostly caught up on backlog. The main bug I was able to track down today is git-annex using a lot of memory in certain repositories. This turns out to have happened when a really large file was committed right into the git repository (by mistake or on purpose). Some parts of git-annex buffer file contents in memory while trying to work out if they're git-annex keys. Fixed by making it first check if a file in git is marked as a symlink. Which was really hard to do!
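
A minimal sketch of that check, in Python rather than git-annex's Haskell. Git records a mode for each tree entry, and symlinks use mode 120000, so the mode can be inspected before any file content is read. The function names and the sample `git ls-tree` lines are illustrative assumptions, not git-annex's actual code.

```python
# Sketch only: decide from a `git ls-tree` line whether an entry could be
# an annexed file (a symlink) before reading any of its content.
SYMLINK_MODE = "120000"  # mode git uses for symlink tree entries


def parse_ls_tree_line(line):
    """Split '<mode> <type> <sha>\\t<path>' into its four fields."""
    meta, path = line.split("\t", 1)
    mode, objtype, sha = meta.split(" ")
    return mode, objtype, sha, path


def might_be_annexed(line):
    """Only symlinks can point at git-annex keys, so skip everything else."""
    mode, _objtype, _sha, _path = parse_ls_tree_line(line)
    return mode == SYMLINK_MODE


# Hypothetical entries: a large file committed directly, and a symlink.
big_file = "100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad\tbig.iso"
symlink = "120000 blob 9daeafb9864cf43055ae93beb0afd6c7d144bfa4\tphoto.jpg"
```

With this ordering, the content of `big.iso` is never loaded, since its mode already rules it out.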

At least 4 people ran into this bug, which makes me suspect that lots of people are messing up when using direct mode (probably due to not reading the documentation, or having git commit -a hardwired into their fingers), and forcing git to commit large files into their repos, rather than having git-annex manage them. Implementing direct mode guard seems more urgent now.


Today's work was sponsored by Amitai Schlair.

Posted Thu Sep 19 21:10:44 2013

Spent basically all of today getting the assistant to be able to handle gcrypt special remotes that already exist when it's told to add a USB drive. This was quite tricky! And I did have to skip handling gcrypt repos that are not git-annex special remotes.

Anyway, it's now almost easy to set up an encrypted sneakernet using a USB drive and some computers running the webapp. The only part that the assistant doesn't help with is gpg key management.

Plan is to make a release on Friday, and then try to also add support for encrypted git repositories on remote servers. Tomorrow I will try to get through some of the communications backlog that has been piling up while I was head down working on gcrypt.

Posted Wed Sep 18 20:12:41 2013

I decided to keep gpg key generation very simple for now. So it generates a special-purpose key that is only intended to be used by git-annex. It hardcodes some key parameters, like RSA and 4096 bits (maximum recommended by gpg at this time). And there is no password on the key, although you can of course edit it and set one. This is because anyone who can access the computer to get the key can also look at the files in your git-annex repository. Also because I can't rely on gpg-agent being installed everywhere. All these simplifying assumptions may be revisited later, but are enough for now for someone who doesn't know about gpg (so doesn't have a key already) and just wants an encrypted repo on a removable drive.
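
For illustration, the hardcoded choices could be expressed as a GnuPG unattended key generation parameter file. This Python sketch only builds the text; the function name and the `no_protection_directive` switch are assumptions, and it is not how git-annex itself drives gpg.

```python
# Sketch only: build a parameter file for gpg's unattended key generation,
# with the algorithm and size fixed as described above.
def keygen_parameters(name, no_protection_directive=False):
    """Return batch parameters for an RSA 4096 key with no passphrase."""
    lines = [
        "Key-Type: RSA",
        "Key-Length: 4096",
        "Name-Real: " + name,
        "Expire-Date: 0",
    ]
    # No Passphrase line is emitted. Newer GnuPG versions additionally
    # expect %no-protection before they will create an unprotected key.
    if no_protection_directive:
        lines.append("%no-protection")
    lines.append("%commit")
    return "\n".join(lines) + "\n"


params = keygen_parameters("git-annex special-purpose key")
```

The resulting text would be written to a file and passed to `gpg --batch --gen-key`.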

Put together a simple UI to deal with gpg taking quite a while to generate a key ...

genkey.png

repoinfo.png

Then I had to patch git-remote-gcrypt again, to have a per-remote signingkey setting, so that these special-purpose keys get used for signing their repo.

Next, need to add support for adding an existing gcrypt repo as a remote (assuming it's encrypted to an available key). Then, gcrypt repos on ssh servers..


Also dealt with build breakage caused by a new version of the Haskell DNS library.


Today's work was sponsored by Joseph Liu.

Posted Wed Sep 18 00:08:38 2013

Now the webapp can set up encrypted repositories on removable drives.

encryptdrive.png

This UI needs some work, and the button to create a new key is not wired up. Also if you have no gpg agent installed, there will be lots of password prompts at the console.

Forked git-remote-gcrypt to fix a bug. Hopefully my patch will be merged; for now I recommend installing my forked version.

Today's work was sponsored by Romain Lenglet.

Posted Mon Sep 16 20:47:36 2013

Fixed a typo that broke automatic youtube video support in addurl.


Now there's an easy way to get an overview of how close your repository is to meeting the configured numcopies settings (or when it exceeds them).

# time git annex status . 
[...]
numcopies stats: 
    numcopies +0: 6686
    numcopies +1: 3793
    numcopies +3: 3156
    numcopies +2: 2743
    numcopies -1: 1242
    numcopies -4: 1098
    numcopies -3: 1009
    numcopies +4: 372

This does make git annex status slow when run on a large directory tree, so --fast disables that.
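
The stats above amount to a histogram of how far each file's copy count is from the numcopies setting. A rough Python sketch, assuming a single numcopies value for all files (git-annex can also configure it per file) and made-up copy counts:

```python
# Sketch only: tally files by the difference between their number of
# known copies and the configured numcopies.
from collections import Counter


def numcopies_stats(copies_per_file, numcopies):
    """Count files by how far their copy count is from numcopies."""
    return Counter(copies - numcopies for copies in copies_per_file)


def format_stats(stats):
    """Render variances with the most common first, as in the output above."""
    return ["numcopies %+d: %d" % (variance, count)
            for variance, count in stats.most_common()]


# Hypothetical data: number of known copies for each of six files.
lines = format_stats(numcopies_stats([2, 2, 2, 3, 3, 1], numcopies=2))
```

Every file has to be looked up to build the tally, which is why the cost grows with the size of the directory tree.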

Posted Sun Sep 15 23:23:18 2013

Implemented git annex forget --drop-dead, which is finally a way to remove all references to old repositories that you've marked as dead.

I've still not merged in the forget branch, because I developed this while slightly ill, and have not tested it very well yet.

Posted Sat Sep 14 03:11:06 2013

John Millikin came through and fixed that haskell-gnutls segfault on OSX that I developed a reproducible test case for the other day. It's a bit hard to test, since the bug doesn't always happen, but the fix is already deployed for Mountain Lion autobuilder.

However, I then found another way to make haskell-gnutls segfault, more reliably on OSX, and even sometimes on Linux. Just entering the wrong XMPP password in the assistant can trigger this crash. Hopefully John will work his magic again.


Meanwhile, I fixed the sync-after-forget problem. Now sync always forces its push of the git-annex branch (as does the assistant). I considered but rejected having sync do the kind of uuid-tagged branch push that the assistant sometimes falls back to if it's failing to do a normal sync. It's ugly, but worse, it wouldn't work in the workflow where multiple clients are syncing to a central bare repository, because they'd not pull down the hidden uuid-tagged branches, and without the assistant running on the repository, nothing would ever merge their data into the git-annex branch. Forcing the push of synced/git-annex was easy, once I satisfied myself that it was always ok to do so.

Also factored out a module that knows about all the different log files stored on the git-annex branch, which is all the support infrastructure that will be needed to make git annex forget --drop-dead work. Since this is basically a routing module, perhaps I'll get around to making it use a nice bidirectional routing library like Zwaluw one day.

Posted Sat Sep 14 03:11:06 2013

Now I can build git-annex twice as fast! And a typical incremental build is down to 10 seconds, from 51 seconds.

Spent a productive evening working with Guilhem to get his encryption patches reviewed and merged. Now there is a way to remove revoked gpg keys, and there is a new encryption scheme available that uses public key encryption by default rather than git-annex's usual approach. That's not for everyone, but it is a good option to have available.

Posted Sat Sep 14 03:11:06 2013

About half way done with a gcrypt special remote. I can initremote it (the hard part to get working), and can send files to it. Can't yet get files back, or remove files, and only local repositories work so far, but this is enough to know it's going to be pretty nice!

Did find one issue in gcrypt that I may need to develop a patch for: https://github.com/blake2-ppc/git-remote-gcrypt/issues/3

Posted Sat Sep 14 03:11:06 2013

I've been out sick. However, some things kept happening. Mesar contributed a build host, and the linux and android builds are now happening, hourly, there. (Thanks as well to the two other people who also offered hosting.) And I made a minor release to fix a bug in the test suite that I was pleased three different people reported.

Today, my main work was getting git-annex to notice when a gcrypt remote located on some removable drive mount point is not the same gcrypt remote that was mounted there before. I was able to finesse this so it re-configures things to use the new gcrypt remote, as long as it's a special remote it knows about. (Otherwise it has to ignore the remote.) So, encrypted repos on removable drives will work just as well as non-encrypted repos!

Also spent a while with rsync.net tech support trying to work out why someone's git-annex apparently opened a lot of concurrent ssh connections to rsync.net. Have not been able to reproduce the problem though.

Also, a lot of catch-up to traffic. Still 63 messages backlogged however, and still not entirely well..

Posted Sat Sep 14 03:11:06 2013

I try hard to keep this devblog about git-annex development and not me. However, it is a shame that what I wanted to be the beginning of my first real month of work funded by the new campaign has been marred by my home's internet connection being taken out by a lightning strike, and by illness. Nearly back on my feet after that, and waiting for my new laptop to finally get here.

Today's work: Finished up the git annex forget feature and merged it in. Fixed the bug that was causing the commit race detection code to incorrectly fire on the commit made by the transition code. Few other bits and pieces.

Posted Sat Sep 14 03:11:06 2013

Worked to get git-remote-gcrypt included in every git-annex autobuild bundle. (Except Windows; running a shell script there may need some work later..)

Next I want to work on making the assistant easily able to create encrypted git repositories on removable drives. Which will involve a UI to select which gpg key to use, or creating (and backing up!) a gpg key.

But, I got distracted chasing down some bugs on Windows. These were quite ugly; more direct mode mapping breakage which resulted in files not being accessible. Also fsck on Windows failed to detect and fix the problem. All fixed now. (If you use git-annex on Windows, you should certainly upgrade and run git annex fsck.)

As with most bugs in the Windows port, the underlying cause turned out to be stupid: isSymlink always returned False on Windows. Which makes sense from the perspective of Windows not quite having anything entirely like symlinks. But failed when that was being used to detect when files in the git tree being merged into the repository had the symlink bit set..

Did bug triage. Backlog down to 32 (mostly messages from August).

Posted Sat Sep 14 03:11:06 2013

Woke up with a pretty solid plan for gcrypt. It will be structured as a separate special remote, so initremote will be needed, with a gitrepo= parameter (unless the remote already exists). git-annex will then set up the git remote, including pushing to it (needed to get a gcrypt-id).

Didn't feel up to implementing that today. Instead I unexpectedly spent the day doing mostly Windows work, including setting up a VM on my new laptop for development. Including a ssh server in Windows, so I can script local builds and tests on Windows without ever having to touch the desktop. Much better!

Posted Sat Sep 14 03:11:06 2013

gcrypt is fully working now. Most of the examples in fully encrypted git repositories with gcrypt should work.

A few known problems:

  • git annex sync refuses to sync with gcrypt remotes. some url parsing issue.
  • Swapping two drives with gcrypt repositories on the same mount point doesn't work yet.
  • http urls are not supported

Posted Sat Sep 14 03:11:06 2013

Yesterday I spent making a release, and shopping for a new laptop, since this one is dying. (Soon I'll be able to compile git-annex fast-ish! Yay!) And thinking about wishlist: dropping git-annex history.

Today, I added the git annex forget command. It's currently been lightly tested, seems to work, and is living in the forget branch until I gain confidence with it. It should be perfectly safe to use, even if it's buggy, because you can use git reflog git-annex to pull out and revert to an old version of your git-annex branch. So if you've been wanting this feature, please beta test!


I actually implemented something more generic than just forgetting git history. There's now a whole mechanism for git-annex doing distributed transitions of whatever sort is needed.

There were several subtleties involved in distributed transitions:

First is how to tell when a given transition has already been done on a branch. At first I was thinking that the transition log should include the sha of the first commit on the old branch that got rewritten. However, that would mean that after a single transition had been done, every git-annex branch merge would need to look up the first commit of the current branch, to see if it's done the transition yet. That's slow! Instead, transitions are logged with a timestamp, and as long as a branch contains a transition with the same timestamp, it's been done.
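
A rough Python sketch of the timestamp comparison. The log format (one `<transition> <timestamp>` pair per line) and the transition name are assumptions made for illustration, not the real on-disk format.

```python
# Sketch only: a transition counts as done on a branch when that branch's
# transition log contains the same (name, timestamp) pair.
def parse_transition_log(text):
    """Parse lines of '<transition> <timestamp>' into a set of pairs."""
    entries = set()
    for line in text.splitlines():
        if line.strip():
            name, timestamp = line.rsplit(" ", 1)
            entries.add((name, timestamp))
    return entries


def pending_transitions(local_log, remote_log):
    """Transitions recorded locally that the remote branch has not done."""
    return parse_transition_log(local_log) - parse_transition_log(remote_log)


# Hypothetical logs: the local branch has transitioned, the remote has not.
local_log = "ForgetGitHistory 1379000000\n"
remote_log = ""
todo = pending_transitions(local_log, remote_log)
```

Comparing two small logs is a set difference, with no need to walk back to the first commit of either branch.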

A really tricky problem is what to do if the local repository has transitioned, but a remote has not, and changes keep being made to the remote. What it does so far is incorporate the changes from the remote into the index, and re-run the transition code over the whole thing to yield a single new commit. This might not be very efficient (once I write the more full-featured transition code), but it lets the local repo keep up with what's going on in the remote, without directly merging with it (which would revert the transition). And once the remote repository has its git-annex upgraded to one that knows about transitions, it will finish up the transition on its side automatically, and the two branches will once again merge.

Related to the previous problem, we don't want to keep trying to merge from a remote branch when it's not yet transitioned. So a blacklist is used, of untransitioned commits that have already been integrated.
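
The bookkeeping for that blacklist might look something like this Python sketch. The function names and the short commit ids are hypothetical, and the actual staging and transition steps are left out.

```python
# Sketch only: remember which untransitioned remote commits have already
# been folded in, so they are not integrated again on the next merge.
def refs_to_integrate(remote_heads, blacklist):
    """Skip remote commits whose changes were already incorporated."""
    return [sha for sha in remote_heads if sha not in blacklist]


def integrate(remote_heads, blacklist):
    """Return the commits handled now and the updated blacklist."""
    todo = refs_to_integrate(remote_heads, blacklist)
    # A real implementation would stage each commit's changes into the
    # index and re-run the transition here; this only tracks bookkeeping.
    return todo, blacklist | set(todo)


# Hypothetical run: two remote commits are seen, then a third appears.
first, seen = integrate(["a1", "b2"], set())
second, seen_after = integrate(["a1", "b2", "c3"], seen)
```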

One really subtle thing is that when the user does a transition more complicated than git annex forget, like the git annex forget --dead that I need to implement to forget dead remotes, they're not just telling git-annex to forget whatever dead remotes it knows right now. They're actually telling git-annex to perform the transition one time on every existing clone of the repository, at some point in the future. Repositories with unfinished transitions could hang around for years, and at some future point when git-annex runs in the repository again, it would merge in the current state of the world, and re-do the transition. So you might tell it to forget dead remotes today, and then the very repository you ran that in later becomes dead, and a long-slumbering repo wakes up and forgets about the repo that started the whole process! I hope users don't find this massively confusing, but that's how the implementation works right now.


I think I have at least two more days of work to do to finish up this feature.

  • I still need to add some extra features like forgetting about dead remotes, and forgetting about keys that are no longer present on any remote.

  • After git annex forget, git annex sync will fail to push the synced/git-annex branch to remotes, since the branch is no longer a fast-forward of the old one. I will probably fix this by making git annex sync do a fallback push of a unique branch in this case, like the assistant already does. Although I may need to adjust that code to handle this case, too..

  • For some reason the automatic transitioning code triggers a "(recovery from race)" commit. This is certainly a bug somewhere, because you can't have a race with only 1 participant.


Today's work was sponsored by Richard Hartmann.

Posted Sat Sep 14 03:11:06 2013

Started work on gcrypt support.

The first question is, should git-annex leave it up to gcrypt to transport the data to the encrypted repository on a push/pull? gcrypt hooks into git nicely to make that just work. However, if I go this route, it limits the places the encrypted git repositories can be stored to regular git remotes (and rsync). The alternative is to somehow use gcrypt to generate/consume the data, but use the git-annex special remotes to store individual files. Which would allow for a git repo stored on S3, etc. For now, I am going with the simple option, but I have not ruled out trying to make the latter work. It seems it would need changes to gcrypt though.

Next question: Given a remote that uses gcrypt, how do I determine the annex.uuid of that repository? I found a nice solution to this. gcrypt has its own gcrypt-id, and I convert it to a UUID in a reproducible, and even standards-compliant way. So the same encrypted remote will automatically get the same annex.uuid wherever it's used. Nice. Does mean that git-annex cannot find a uuid until git pull or git push has been used, to let gcrypt get the gcrypt-id. Implemented that.
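
One standard way to get a reproducible UUID from a string is a name-based (version 5) UUID. The Python sketch below assumes that approach; the namespace URL and the sample gcrypt-id are made up for illustration and are not the values git-annex uses.

```python
# Sketch only: derive a stable UUID from a gcrypt-id using a name-based
# (version 5) UUID, so every clone computes the same value.
import uuid

# Hypothetical namespace, itself derived from a URL. Any fixed namespace
# gives the same reproducibility property.
GCRYPT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.com/gcrypt")


def uuid_from_gcrypt_id(gcrypt_id):
    """Map a gcrypt-id to the UUID that would be used as annex.uuid."""
    return uuid.uuid5(GCRYPT_NAMESPACE, gcrypt_id)


# The gcrypt-id here is an invented example.
remote_uuid = uuid_from_gcrypt_id(":id:exampleGcryptId")
```

Since the result depends only on the namespace and the gcrypt-id, no coordination between clones is needed.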

The next step is actually making git-annex store data on gcrypt remotes. And it needs to store it encrypted of course. It seems best to avoid needing a git annex initremote for these gcrypt remotes, and just have git-annex automatically encrypt data stored on them. But I don't know. Without initializing them like a special remote is, I'm limited to using the gpg keys that gcrypt is configured to encrypt to, and cannot use the regular git-annex hybrid encryption scheme. Also, I need to generate and store a nonce anyway to HMAC encrypt keys. (Or modify gcrypt to put enough entropy in gcrypt-id that I can use it?)
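
To show why a stored secret is needed, here is a small Python sketch of using an HMAC to hide key names on the remote. The digest choice, secret size, function names, and sample key are assumptions for illustration only.

```python
# Sketch only: replace each key name with an HMAC of it, so the remote
# cannot learn the original names without the secret.
import hashlib
import hmac
import os


def new_secret(size=32):
    """Generate the random value that must be stored for later lookups."""
    return os.urandom(size)


def obscure_key(secret, key):
    """Return the name under which `key` would be stored on the remote."""
    return hmac.new(secret, key.encode("utf-8"), hashlib.sha1).hexdigest()


secret = new_secret()
stored_name = obscure_key(secret, "SHA256-s1024--examplekey")
```

The same secret always maps a key to the same stored name, which is what makes retrieval possible, and a different secret yields unrelated names.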

Another concern I have is that gcrypt's own encryption scheme is simply to use a list of public keys to encrypt to. It would be nicer if the full set of git-annex encryption schemes could be used. Then the webapp could use shared encryption to avoid needing to make the user set up a gpg key, or hybrid encryption could be used to add keys later, etc.

But I see why gcrypt works the way it does. Otherwise, you can't make an encrypted repo with a friend set as one of the participants and have them be able to git clone it. Both hybrid and shared encryption store a secret inside the repo, which is not accessible if it's encrypted using that secret. There are use cases where not being able to blindly clone a gcrypt repo would be ok. For example, you use the assistant to pair with a friend and then set up an encrypted repo in the cloud for both of you to use.

Anyway, for now, I will need to deal with setting up gpg keys etc in the assistant. I don't want to tackle full gpgkeys yet. Instead, I think I will start by adding some simple stuff to the assistant:

  • When adding a USB drive, offer to encrypt the repository on the drive so that only you can see it.
  • When adding a ssh remote make a similar offer.
  • Add a UI to add an arbitrary git remote with encryption. Let the user paste in the url to an empty remote they have, which could be to eg github. (In most cases this won't be used for annexed content..)
  • When the user has no gpg key, prompt to set one up. (Securely!)
  • Maybe have an interface to add another gpg key that can access the gcrypt repo. Note that this will need to re-encrypt and re-push the whole git history.

Posted Sat Sep 14 03:11:06 2013

Got git annex sync working with gcrypt. So went ahead and made a release today. Lots of nice new features!

Unfortunately the linux 64 bit daily build is failing, because my build host only has 2 gb of memory and it is no longer enough. I am looking for a new build host, ideally one that doesn't cost me $40/month for 3 gb of ram and 15 gb of disk. (Extra special ideally one that I can run multiple builds per day on, rather than the current situation of only building overnight to avoid loading the machine during the day.) Until this is sorted out, no new 64 bit linux builds..

Posted Sat Sep 14 03:11:06 2013

I've started a new page for my devblog, since I'm not focusing exclusively on the assistant and so keeping the blog here increasingly felt wrong. Also, my new year of crowdfunded development formally starts in September, so a new blog seemed good.

Posted Sat Sep 14 03:11:06 2013