RFC: git going forward

Bonsoir,

as promised, here is the second, trickier part: how the Radicle project thinks
about evolving git (or, more generally, VCS) support, and how I could contribute
to that.

To recap, we ended up with basically two issues:

  1. Radicle (the daemon) is not aware of the IPFS object representing the latest
    heads, so it doesn’t pin it, and consequently rad replicate doesn’t replicate
    the git portion of a project.

  2. When trying to replicate git data in a “fully p2p” fashion, there isn’t much
    choice but to do that on the storage layer (as devised by IPFS). The downside
    is that this entails storing git objects naively as uncompressed loose
    objects: uncompressed because we want to preserve the objects’ SHA-1 hashes
    on the IPFS “merkle DAG” (see the sketch after this list), and loose because
    the process by which git creates packfiles is not deterministic, and there is
    no remote server with which to negotiate the minimal delta to transfer over
    the network.

    Unfortunately, this approach doesn’t scale very well, as even modest git
    histories (nominally) contain thousands of objects.
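
To make the “uncompressed” part concrete, here is a small Haskell sketch (using
the cryptonite package; this is illustrative, not code from our tree) of how git
derives a blob’s object id. Because the id is the SHA-1 of the uncompressed
header and content, only storing exactly those bytes lets the IPFS block line up
with git’s own hash:

    import qualified Crypto.Hash as H
    import qualified Data.ByteString.Char8 as BS

    -- A blob's object id is the SHA-1 over "blob <size>\0" followed by the
    -- *uncompressed* content; storing those bytes verbatim on IPFS is what
    -- keeps git's hashes addressable on the merkle DAG.
    gitBlobSha1 :: BS.ByteString -> H.Digest H.SHA1
    gitBlobSha1 content = H.hashWith H.SHA1 (header <> content)
      where
        header = BS.pack ("blob " <> show (BS.length content) <> "\0")

For a given file, this yields the same id that git hash-object prints.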

We discussed a few potential solutions on and off, of which I think the two most
promising ones are:

a. Introduce a notion of an external reference in the language. This could be a
URL, or a hash, or a combination of both. For now, we are only considering
references to data which is stored on IPFS, and on the same IPFS network as
the RSM.

Upon evaluating an expression, the interpreter replaces the reference with
its content, deferring resolution to the storage layer. Upon persisting an
expression, the storage layer must resolve all references, and return an
error if that fails. The references are also pinned recursively, applying the
same pinning policy as for the RSM.
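
To make the evaluation/persistence split a bit more concrete, here is a toy
Haskell sketch. The names (Expr, Ref, resolveAll) are invented for illustration
and are not part of the existing radicle code:

    newtype Cid = Cid String deriving (Eq, Show)

    data Expr
      = Atom String
      | List [Expr]
      | Ref Cid            -- external reference, e.g. an IPFS hash
      deriving (Show)

    -- Persisting requires every Ref to resolve; a single unresolvable
    -- reference fails the whole write. `fetch` stands for whatever lookup
    -- the storage layer provides.
    resolveAll :: Applicative m => (Cid -> m (Maybe Expr)) -> Expr -> m (Maybe Expr)
    resolveAll fetch = go
      where
        go (Atom a)  = pure (Just (Atom a))
        go (Ref cid) = fetch cid
        go (List xs) = (fmap List . sequence) <$> traverse go xs

The same traversal, run with the pinning policy attached to `fetch`, would take
care of pinning the referenced objects recursively.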

The git remote helper would now need to update the RSM state after a
(successful) push, i.e. send an expression to the daemon. Alternatively, we
could expose the (“smart”) HTTP-based git protocol as part of the Radicle
daemon’s API. This would simplify distribution and installation, and could be
used to improve the perceived performance for the user (replication would
still be slow for large histories, but the UI wouldn’t block).
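
Purely to illustrate the shape of that interaction (the endpoint path and the
expression sent are made up for the example, not an existing API), the post-push
step of the remote helper could look roughly like this:

    import Network.HTTP.Client
    import Network.HTTP.Types (methodPost)
    import qualified Data.ByteString.Lazy.Char8 as LBS

    -- Hypothetical: after a successful push, tell the daemon about the new
    -- heads object so the RSM can reference (and pin) it.
    notifyDaemon :: String -> String -> String -> IO ()
    notifyDaemon daemonUrl machineId headsCid = do
      mgr <- newManager defaultManagerSettings
      initReq <- parseRequest (daemonUrl <> "/machines/" <> machineId <> "/send")
      let expr = "(update-heads \"" <> headsCid <> "\")"   -- made-up expression
          req  = initReq { method = methodPost
                         , requestBody = RequestBodyLBS (LBS.pack expr) }
      _ <- httpLbs req mgr
      pure ()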

Note that this solves 1. above, but not 2.

b. We already implemented code collaboration as a patch-based workflow, where
patches are stored “on-chain”. Providing git push/pull can be seen as a
convenience. Building upon that, we could materialise a git repo by simply
applying all patches, oldest first, to an empty repo.
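
As a very rough sketch of that materialisation step (shelling out to git; the
patch file format and helper are assumptions, not the current implementation):

    import System.Process (callProcess)

    -- Replay on-chain patches, oldest first, onto a freshly initialised repo.
    -- Assumes each patch is available locally as a git-format-patch style file.
    materialise :: FilePath -> [FilePath] -> IO ()
    materialise repoDir patchesOldestFirst = do
      callProcess "git" ["init", repoDir]
      mapM_ (\p -> callProcess "git" ["-C", repoDir, "am", p]) patchesOldestFirst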

Although individual patches might be large, the overall storage requirements
would be reduced due to less redundancy. Again, the daemon could maintain the
git repo “behind the scenes”, and expose a read-only subset of the git
protocol as part of its API.

This solves both 1. and 2., the latter at least to some extent.

Perhaps there are further optimisations possible for very large histories.
For example, cloning a repo with tens of thousands of commits (like git itself) could be
made faster by periodically storing a snapshot of the repo’s state as a
packfile on chain (potentially using the reference mechanism from a.).
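
As an illustration of that snapshot idea (again only a sketch, and using a git
bundle, i.e. a packfile plus the refs it covers, as the snapshot format):

    import System.Process (callProcess)

    -- Every n patches, store a full snapshot so that fresh clones can start
    -- from the latest snapshot and only replay the patches on top of it.
    snapshotEvery :: Int -> Int -> FilePath -> FilePath -> IO ()
    snapshotEvery n patchCount repoDir out
      | patchCount `mod` n == 0 =
          callProcess "git" ["-C", repoDir, "bundle", "create", out, "--all"]
      | otherwise = pure ()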

I would be leaning towards b., as it addresses both concerns and is more
general. I might, however, be missing some details, or people may have ideas for
alternative approaches, which I’d like to hear about.

  • K

I’d just like to chip in and say that, by the looks of it, the IPFS team really wants git objects to be stored in the Merkle DAG as natively as possible, and for that to be efficient. So before throwing this design out for something better in the short term, I’d try to assess whether in the long term this will be the right design, and whether we would benefit from doing things the Way of the Dag, because it will only get better with time.

···

On Mon, Mar 18, 2019 at 10:33 PM Kim Altintop kim@monadic.xyz wrote:


Well, yeah, maybe, although I have a hard time imagining how they want to solve that.

However, I think the more important point is that push/pull is kind of orthogonal to code collaboration - it’s just a way to store your changes somewhere else. Also, patches can be generalised to work with other VCSes.

···

On Tue, Mar 19, 2019 at 10:32 AM Alexis Sellier alexis@monadic.xyz wrote:
