Radicle Collaboration Tools

mmassi · February 17, 2020, 8:55pm

Implementing collaboration tools (issue tracker, code reviews, and more) in Radicle

Collaboration data representation: the merge problem

By “collaboration tools” we initially mean an issue tracker and a code review system.

Over time more tools could be added (like a wiki, a more complex CMS, a project tracking system…).

The implementation strategy is to build these tools on top of git.

This means handling git as a database which the tools use as backing storage.

Since the tools are collaborative one of the main issues is how to avoid (or automatically resolve) conflicts at the git level when multiple users modify the data.

One way of doing it would be to use CRTDs to encode the tools data.

This is doable, and is the approachn taken by git-bug.

However this has the downside that the storage format is not “natural”: the git objects need to be operations and the semantic object content needs to be reconstructed by applying them sequentially.

This makes the implementation complex; also making ergonomic CLI UIs becomes harder.

A different approach

For this reason we are exploring another approach: using barebones git plumbing commands as a data storage and transmission layer and resolving eventual conflicts at the application level.

To understand this better, let’s consider this scenario: project P has two users, U1 and U2, each with his own view of the project (P1 and P2).

At some point user U1 creates an issue (I) with a comment C1.

The issue is therefore committed by U1 in P1 so we can call it P1/I.

User U2 tracks P therefore they will get a copy of P1, which means they will see P1/I. They will then “merge” I inside P2, including its comment C1.

Then U2 adds another comment to I, C2 and pushes it into P2.

Eventually, thanks to replication, user U1 will also receive P2 and therefore P2/I, and they will be able to merge C2 into P2/I (no conflict here).

Now, imagine that U2 tried to modify comment C1. This operation would be illegal because C1 has been authored by U1. In principle the Radicle UI should not allow U2 to do this but let’s say they did it anyway. At this point U1 will see that P2/I has C1 in an invalid state (edited by U2) and will refuse the merge.

The general observation is the following: in git the stotage content is supposed to be source code, which can in principle be modified by anyone and is merged with a line oriented algorithm.

However when implementing a collaboration tool (like an issue tracker) there are “business logic” rules about modifying data that should be implemented and enforced. Therefore it makes sense not to use the git merge algorithm and instead implement an ad hoc semantic merge that will abide to those business rules and ensure that each participant observes them.

When handling the branches that store “collaboration data” Radicle should get fetch them but never git merge them: the logical merge should happen at the application layer and the resulting data should be then committed to the local repo.

The intention is to have a merge algorithm that never generates conflicts.

This is easy to achieve if every individually editable item is never merged “partially” (like source code, line by line).

Instead the semantics must be that the new revision wins over the previous one and replaces it entirely.

This works well if the items that can be modified are kept small, like individual comments inside an issue.

And note that in an issue comments are generally apended, with no conflict at all; we have potential conflicts only if a comment is modified.

Moreover, in most cases each “editable item” should have a single author who can modify it, making the risk of conflicts even smaller. For instance, in an issue tracker the only “shared editable” items are an issue description, its title, and its set of labels (where each label counts as an individual item).

Data propagation in the P2P mesh

The idea is that inside a radicle project collaboration data is a sort of “shared CMS”, viewable and editable by each Radicle user (but of course project members could have different permissions on specific items).

While for source code merge operations are explicit, for this CMS they should be implicit and happen as soon as a git fetch operation receives new data from a peer, provided that the received data abides to the business rules.

The radicle P2P gossip replication (based on git fetches) will disseminate the data on all repositories, and since the merge algorithm will be deterministic and conflict free the system will be eventually consistent because it should naturally converge to the same state in which each user has the same data (including every update from every other user).

In practice these “automatic merges” will move data from the remote branches into the local “master CMS” branch, which will in turn be mirrored into the remote branches of other peers.

Actual primitives

As a sort of guideline, we could use the following building blocks to represent collaboration data:

As leaves, files containing a combination of YAML, TOML, JSON and Markdown text (whatever we find more ergonomic to represent the needed data). These files could represent, for instance, issue comments or descriptions. Ideally each individual file should be an atomically editable entity (as described above when modified they will not be merged but fully replaced). Importantly, each of these items could be signed to prove its author (see the discussion about identity for a reasoning about signatures).
To contain leaves, “tree objects” (directory trees) with a specific structure. for instance, the comments of an issue could be files with numbers as names all collected in the same directory, and the issue could be a directory with a file named description and a directory named comments).
Groups of objects that should be generally downloaded together should be collected in a tree pointed to by a specific branch. For instance, each issue should have its own branch that represents the evolution of the issue, and to download the issue it would be enough to fetch that branch.
A “section” of this “shared CMS” would then be a group of branches, easy to organize because branches can be namespaced; for instance, the issue tracher could be implemented by branches with the name radicle-cms/issues/issue-xyz.

kim · February 18, 2020, 4:54pm

I very much like the idea of decoupling conflict resolution (i.e. merging) from the storage format. Also, a locally-committed history seems attractive – much like in patch-based VCSs.

I’m not sure, however, I understand how ordering is represented. Don’t we need to embed some kind of references into the leaf datatypes to be able to recover causal ordering?

For example:

Let {C1, C2, C3} be comments in issue I, each made at points in time {T1, T2, T3} respectively. Due to the system being eventually consistent, a local merge may see any of the comments in any order. If later ones reference earlier ones (in the payload), it will be obvious to a human when something is missing, but not to a machine. The latter is however desirable for UX reasons, recovering a hint as to how to resolve the missing piece of data on the network, and to be able to construct a (merkle) proof of the state of the object group (e.g. when submitting a checkpoint on-chain).

It’s entirely possible to embed such pointers into the leaf objects themselves. For example, we could say an issue comment is represented in a structure like this:

{
    "issue": "rad://<project-id>/issue/<issue-id>",
    "in-reply-to": "rad://<project-id>/issue/<issue-id>#<comment-id>",
    "comment": "lorem ipsum"
}

Now, how do we construct <issue-id> and <comment-id>? They need to somehow be stable, so they don’t change when the local commit is made. They also need to be content-addressable – say I come across C2, which references C1, but I’m not directly tracking the repo of the author of C1. The author of C2 must have seen C1, though, so I should be able to find the object in my object database – if the tooling to create a comment ensures that such links are included in the git tree object being committed.

fintohaps · February 18, 2020, 5:22pm

I’m also coming to the conclusion that a separation of conflict resolution and storage format would be a good way to go. But I do think that they need to meet to get the useful semantics we want.

My current thoughts, after bouncing some discussion off of @xla, are that any type of “collaboration” we work with there are two things that exist:

An artifact (or multiple artifacts, e.g. commits and comments on a Revision/PR)
A CRDT rule(s) to resolve the ordering of these artifacts

I kind of see these things as “Resolvable Timelines”. So to demonstrate here’s some examples of what I mean. I’ll try and highlight artifacts and terminology in bold

Issue

We have the root of the issue which is first created by an author and has a title and a body. The root is then followed by comments by users. We then pair CRDT rules with title and body, for example title would be last-write-wins, body would be something akin to Google Docs, we want to merge freely if pieces of text don’t affect each other, otherwise handle the conflict as best as possible. The comments are ordered by some lamport-clock logic and a timestamp, akin to git-bug.

Revision

Same as an issue but comments can include code, commits are also included in the timeline

Project wiki

The artifacts are pieces of text and the CRDT rules are, again, akin to Google Docs.

etc.

This is a good point and I think this means that we need have causality rules (vector clocks) when it comes to referencing artifacts. This means that if C2 references C1, the user pulling C2 will see they are missing an entry for C1 and will have search for it to be resolved (possibly having fall back logic if it doesn’t get resolved quick enough).

Also a great point I agree, I think references should be transitively replicated if they aren’t local to a project.

Some links of my research into CRDTs:

CRDTs and the Quest for Distributed Consistency by Martin Klepmann
CRDTs explained - supercharge your serverless with CRDTs at the Edge
Got a quick overview of CRDTs from wikipedia Conflict-free replicated data type - Wikipedia

kim · February 18, 2020, 10:06pm

This is actually a quite tricky problem in general. We don’t have to support real-time collaborative editing, and there are very few objects which should be mutable (and then mostly only by the author). So why not fall back to just a normal three-way merge (or whatever the underlying VCS is capable of)?

Within the context of a single “section” (issue), I think it is almost exclusively about parent-child relationships — surely any URL could be embedded, but then it would be up to the application layer to interpret them. Comments, attachments, or pointers to source locations should converge on a lower layer, imho. Using git primitives, there is simply no way to reference something which doesn’t exist, and that something will be published along with the branch the reference was made on.

Put differently: some body element may contain any kind of link, and it would be up to the rendering engine to decide what to do with them. We could introduce special link URIs to point to objects in the current tree, e.g. [click here](oid:asdfqwerdeadbeef), but there’s no guarantee this object exists (unless we parse all body elements for hyperlinks, and discard the edit if we find dead ones). That’s fine for attachment data, but not so fine for causal ordering or referencing source code.

Otoh, a commit parent will establish an ordering unambiguously, but then the idea of locally committing edits made by others becomes messy: every time collaborator A commits something from B, B will have to commit that commit, and vice versa ad infinitum.

That’s why I was thinking of this in terms of the SSB model: everyone only ever commits to their own branch. For these special branches, we would reject non-fast-forwards, so it’s log-structured if you wish.

Any update with in-reply-to semantics would add an additional commit parent referencing the last-seen state of the item being replied to whether or not this commit is on the local or a foreign branch. Viewing the issue would entail finding all remote tracking branches with the same name (minus the remote’s name), and performing an octopus merge with all their heads as the parents. Here, we need to ensure that this never creates a merge conflict proper – which is where @mmassi’s semantic merge rules come into play. The difference would be that this would only be a “virtual” merge – the merge commit would never be published to the network. Only at the very end of the issue lifecycle – when the issue is closed – a merge commit would be created, which is only allowed if done by one of the project maintainers. This merge commit would serve as a merkle root for everything that has been said on the topic and was acknowledged by the maintainers.

mmassi · February 19, 2020, 9:20am

This captures the spirit of my proposal.

I would try to stay away from CRDTs if at all possible, not because I don’t like them (they are lovely!), but because of their complexity.

That complexity is needed in a real time collaborative editor of something where merges are not trivial.

IMHO we are not in that case.

And then there’s the problem that with CRDTs you don’t store a plain “expanded” entities anymore, you only store operations, but in a UI you still need to represent the entities, which adds even more complexity to the implementation (which is why git-bug has a caching layer for the expanded representations).

In an issue tracker probably the only entities for which some form of collaborative editing would be desirable are issue descriptions.

Even there if we use markdown either the trivial “last version wins” or a line-by-line merge strategy would work well enough, the complexity of CRDTs is not justified, at least not at this stage (IMHO).

mmassi · February 19, 2020, 9:22am

We have the problem that IDs need to be unique without a centralized ID generator.

Note that in this context IDs need only to be locally unique.

Like, issue 1 of project foo does not collide with issue 1 of project bar, and at the same way comment ``C1of issue1does not collide with commentC1of issue2`.

In general, if we represent these “local collections” as directories, these IDs should be unique just like filenames inside a directory.

We also have the requirement of signing entities, which leads to the requirement of timestamping them (see the revocation issue on the identity thread).

My proposal is therefore to use the creation timestamp as a starting point for the entity filename.

This has the added benefit of sorting naturally by creation time, which is what we need for issue comments.

Of course since the system is decentralized there is still the possibility of having timestamp collisions.

I see two main ways to deal with this:

We add a suffix that makes the name unique; the most effective one would be a cryptographic hash of the initial entity contents.
We handle collisions at the semantic merge level and add simple, ad hoc short suffixes there.

The second solution is more complex to implement but the resulting filenames would be much more ergonomic.

We should handle this specific tradeoff in a separated thread-discussion.

About references to issues not already obtained, see my next post.

mmassi · February 19, 2020, 9:30am

For each project we have a “CMS” section of the repo that stores the collaborative data.

The model I have in mind is the following:

Each user commits to his own rad-cms/master branch to mutate the CMS (add an issue, add a comment, edit a comment…).
Each user tracks, directly or indirectly, every other other “cms” instance of the same project using git fetch operations and storing them in the respective remote branches.
The radicle tooling applies the semantic merge so that the contents of the remote tracked repos are applied to the local “master” cms.

Operating like this it is not possible to have a revision of the CMS with “dangling references”.

If user U adds a comment C2 referencing comment C1, well, this user needed to have comment C1 in its CMS in the first place.

More precisely: users “perceive” the work of other users only after the semantic merge operation has copied them from the remote tracking branches into their “master” branch.

Then each user add their work on top of that, and this is subsequently propagated to other users (SSB style).

Dangling references are not possible.

However there are still two separated “grey” areas here:

SSB Propagation Stability

In point 3 above, how can we make so that the semantic merge result converges so that the SSB gossip stops because everyone agrees on a final result?

The goal is to avoid the case where each node keeps adding “merge commits” that are in turn replicated and applied without adding more contents.

I start with an observation: in this CMS contents matter more than history.

More explicitly: let’s say in an issue two independent comments are added, C1 and C2, respectively by users U1 and U2.

The order in which they were added does not really matter.

What matters is that the resulting issue has both of them.

We will likely represent them in timestamp order (and giving them names that start with their timestamps helps with this).

But in principle, the history of the CMS for use U1 could have C1 first, and for U2 it could have C2 first.

Both histories should be perfectly valid.

Here that fact that we are assuming git as the underlying storage confuses the problem because in git commits are not commutative.

This gives us two choices to make the semantic merge converge to a stable result:

Explicitly ignore the history, and rely on the git tree hash to decide if some more operation is needed or not (pure content addressing).
Make so that the semantic merge also applies a “canonical rebasing”, so that different independent commits will always have a deterministic order in the history.

Both approaches can work, both have pros and cons… the second one is more “git friendly”.

SSB Propagation Scalability

In point 2 above, how can we logically track every other project instance and scale reasonably?

More precisely: for a project with 10k active contributors, how many copies of the CMS instance do we store on every device?.

How many “remote tracking branches” does each device need?

Perhaps the right answer is “all of them”, because in the end every contributor’s work must be properly tracked.

The space requirement will be minimal because in the underlying git object DB all those trees will be essentially the same (one more reason to apply the “semantic rebase” when doing the semantic merge: not only the git trees will converge, but the git histories will converge as well!).

The tricky part will be making the SSB-like gossip work with a reasonable number of connections, propagating CMS changes through a sparsely connected graph of contributors instead of trying to connect each contributor with each other with a direct network link.

fintohaps · February 19, 2020, 9:59am

I don’t buy the argument that they’re complex From what I’ve seen they avoid complexity by focusing on required properties. Since CRDTs are broken down into two types there are two sets of properties:

Operation-based
- Must be commutative
- Not expected to be idempotent, so duplication has to be avoided (probably the most complex part)
- Are update-only state
State-based
- Must be commutative
- Must be associative
- Must be idempotent
- Monotonically increasing state

These properties allow us to effectively test anything we use/build. So we’re guided, avoiding complexity.

I also don’t think it’s true that they are only needed for real time collaboration. They’re useful for any kind of distributed/decentralized collaboration where we expect an eventual [strong] consistent state. Which I think is where our use case lies.

Having a look at issues on GitHub, users may edit fields such as labels, assignees, the title, the body, adding reactions to posts. So I think there’s a few hidden details that we can consider “editable”, other than the description.

I do agree here that a lot of cases can be solved with last-version-wins or using git’s merge strategy (although the case of diverging edits is a bit of a puzzle…).

All in all, I don’t think we should write-off CRDTs because I don’t see their complexity, but maybe you can show me the way in which they are

kim · February 19, 2020, 10:14am

These can all be modelled in an append-only fashion. I also don’t think it improves usability overall if we present a merge conflict if admins A and B made concurrent, conflicting edits to an issue body.

They are not easily generalisable to arbitrary data. Particularly free-form text would require to re-invent pijul on top of git.

alexgood · February 19, 2020, 10:31am

I’m mostly just lurking here (and finding the conversation fascinating) but I feel like I should jump in with a quick note on CRDTs. I’m currently working on a rust implementation of Automerge, which does allow you to synchronize data of an arbitrary shape.

fintohaps · February 19, 2020, 10:35am

Hey @alexgood! I was actually lurking on your project yesterday after watching Martin’s talk about automerge What’s the state of your project? I got the feeling it was pretty early stages, but I’d love to take part even if we don’t go down a CRDT path in the end

fintohaps · February 19, 2020, 10:46am

And what if someone removes a label or reaction? Or is that append-only in the sense of the operation that happens? Like add :eyes:, remove :eyes:, for example?

Fair, but they seem to be well defined for JSON from what I’ve seen. Free-form text is trickier for sure. I wonder if there’s some kind of sweet spot in there for us.

alexgood · February 19, 2020, 10:52am

It’s very much a WIP but I’m working pretty hard on it at the moment and it looks like I’m going to be able to put in a few months of full time work on it with Ink and Switch (the folks who have developed it so far). We’re working on building a rust implementation which can be shipped as a WASM package for the javascript implementation and which can be used via FFI from other languages. This should make it easy to add language specific frontends without having to re-implement the logic of the CRDT everywhere.

I’m still working on timelines but my objective is to build an application in Kotlin and Rust which talk to each other over (probably) libp2p and synchronize changes with Automerge within the next few months.

But yeah, I worry that I’m hijacking this quite important conversation, I’m in the Automerge slack if you want to know more and also on Matrix as @alex:memoryandthought.me

kim · February 19, 2020, 11:43am

I’m still not convinced, maybe you can elaborate more?

Say we have .rad/issues/issue-42 (the monotonicity of issue numbers is not of concern here).

The final tree looks like:

|-- body
`-- comments
      |-- comment-1
      |-- comment-2
      `-- comment-3

Let’s say the semantic merge requires the local history to be linear (i.e. every commit can have only one parent). The comments are each introduced in separate commits.

Say we have two collaborators on this issue, A and B. A creates comment-1, which B receives, applies to her local history, and starts writing comment-2. Meanwhile, A has written comment-3 before receiving comment-2.

So, the histories would look like:

A: a,b
B: a,1

We would like them to converge to:

a,1,b'

This could work, provided:

We preserve all commit headers, and strip any commit signatures
The ordering is stable.

Say comment-2 and comment-3 have the same timestamp incidentally. Since they also have the same parent, it is impossibe to tell if the commit sequence should be a,1,b' or a,b,1'.

There is also the issue that inclusion proofs must now be constructed from scratch – we do want this as a means of attribution (“I contributed to project XYZ. qed”). You can argue that we have to do this anyway, e.g. because pijul doesn’t provide this directly, and that it shouldn’t be too expensive because items like issues are small in general.

Only the ones you choose to. An edit to an issue must include its entire history as seen by the editor – whether this is through commit parents or via history rewriting doesn’t matter. So if A tracks B, and B includes an update from C, A will receive all data (also if C includes data from D, which neither A nor B track).

Iow, the object dbs will eventually converge, but not necessarily the remotes.

The tricky part is “only” how A becomes aware of E, who is completely disconnected from the social graph. The only way I can think of to solve this in an asynchronous network is to introduce friendly supernodes who will bridge the gap between strongly connected components of the graph. I think this is fine, given that, in our domain, it is actually a feature to let A (a maintainer) decide if they want to pay attention to random person E.

kim · February 19, 2020, 11:46am

Well, you can only add or remove your own eyes, so to ensure remove doesn’t come before add, the problem can still be reduced to a simple parent-child relationship.

fintohaps · February 19, 2020, 11:53am

:blind:… So this “parent-child relationship” is still an append-only sequence of events though ya? Or maybe I’m missing your point (likely).

kim · February 19, 2020, 11:53am

Do you have some material on how automerge differs from the JSON CRDT paper? The README says this will be published “in the future”…

kim · February 19, 2020, 12:03pm

Yeah, like a linked list: remove : add : []. If we could both edit the same thing, it’d be more complicated, because we could end up with two concurrent add and remove operations and need to decide which one wins.

fintohaps · February 19, 2020, 12:07pm

For organisation of the work going forward I see different tasks this can be broken down into:

Data definitions - Defines the data types, starting with issues
a. This is useful for local testing (can be stored in a temp directory for now or something along those lines)
b. Will allow for create, edit, view of issues
Storage - How do we store this data in the git model, can it be generalised (open question)
Tracking & Networking - How does the tracking work, how do we replicate these to others and receive replicas from others
Merging - How do we merge related data into a consistent view

Does that make sense as a high level set of tasks that we can start carving out and collaborating with application on?

fintohaps · February 19, 2020, 12:12pm

Ok I’m with you. The editing I was referring to here was the fact that these add and removes are actually viewed as a counter in an emoji map.

kim reacts with eyes and the counter is 1, fintan clicks and agrees this is very eyes worthy upping the counter to 2, but then kim revokes his eyes and the final states ends with a counter of 1.

That was the point I was trying to make. There are metadata in issues, PRs, whatever that are shared and can be edited by users, which we would like to have reach a consistent state via some sort of set of rules, be it CRDTs or whatever we decide to use.