Radicle Collaboration Tools

A couple of things I'd like to bring up, though I think this conversation is going in the right direction:

  • The approach @mmassi proposed reminds me of DAT's multiwriter proposal: DEP-0008: Multi-Writer - Dat Protocol
  • I wonder what the advantage of merging is at all. What if we never merge histories and simply display them at the application layer in a certain order? This would mean that the order can change if new comments are discovered, and it's possible to "fill in" holes in the timeline after the fact, as the state converges (see the sketch after this list). The way this could work would be similar to patch-based DVCSs, which apply patches in an order "on-the-fly".
  • We would eventually like to support threaded discussions (like reddit) – it may actually be easier to model comments as a tree, since that doesn't require a total order. Just something to think about, as there may be a nice synergy between the product and the underlying data structure here.
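
To make the second point concrete, here's a minimal sketch (all names hypothetical): the displayed timeline is a pure function of the set of known comments, so nothing is ever merged or appended – discovering a new comment just re-sorts.

```haskell
import Data.List (sortOn)

-- Hypothetical comment record; 'hash' stands in for a git object id.
data Comment = Comment
  { hash      :: String
  , timestamp :: Integer
  , body      :: String
  }

-- The timeline is a pure function of the *set* of known comments:
-- sort by timestamp, falling back to the hash to break ties
-- deterministically. Discovering a new comment can change the order,
-- "filling in" holes after the fact, as the state converges.
timeline :: [Comment] -> [Comment]
timeline = sortOn (\c -> (timestamp c, hash c))
```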

That's what I mean by "virtual merge": bringing them into an order is effectively a merge operation, and it might be a good idea to commit it to disk so as not to have to walk the entire history every time. It's not clear, however, why this committed state should be announced to the network – since git commits change their identity when reordered, it seems to me that this creates more problems than it solves.

Ah, gotcha. Yeah it should definitely be cached, but the key property would be that it can re-organize when new inputs are discovered, so the ordering should always be a function of the total set of known inputs, and thus not be append-only.

This is always the case: for @mmassi's "semantic rebase" proposal to work, the issue branch would need to be rewritten every time new inputs are discovered, with the guarantee that everyone eventually arrives at the same sequence of events. This means that the git commit history is mostly meaningless, because the commits constantly change their identities. The only reason to preserve it would be to retain edits of mutable entities without having to store the history in the entities themselves.

If everything is represented as filesystem objects, the git tree would then serve as the merkle root at any point in time.

I think this is very smart, but I'm worried that some interactions would require embedding state in those objects (à la state-based CRDTs). I'm also worried that, whenever new inputs are discovered, the entire branch history has to be re-evaluated (because there is no way to know where the inputs should fit).

Ah I see, that's great! I do think that indeed you may have to re-evaluate the entire history every time a new input is discovered (though again, maybe Pijul has some clever tricks around this) – however, if we use one branch per issue, this should be very fast.

That's true, and that's really cool! It's still unfortunate that the git tree can't generate lightweight proofs as-is, given that it isn't base-2.

Related question: Is there a sort of top level root object/hash which encompasses all branches? Or would we have to construct this separately, to have a single hash that represents the whole project?

Afaiu, the trick is (quoting the pijul manual):

In Pijul, files are generalised to arbitrary graphs of lines, with different kinds of labels on the edges.

Additionally, patches may depend on other patches. So it's not like patches can be applied in random order – except when they are completely unrelated. Therefore, you don't need to recompute everything when a new patch comes in: you "simply" connect the edges and dependencies and try to find a solution which doesn't result in a cycle (i.e. a conflict). Order can also be imposed artificially through patch dependencies (although I believe they have something else in mind for the future).
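
As a rough illustration of that (hypothetical types, nothing like pijul's actual data model): a patch becomes applicable once its declared dependencies are present, so unrelated patches commute, and already-applied work never needs to be redone when a new patch arrives.

```haskell
import Data.List (partition)
import qualified Data.Set as Set

-- A hypothetical patch with explicit dependencies on other patches.
data Patch = Patch
  { patchId :: String
  , deps    :: Set.Set String
  }

-- A patch is applicable once all of its dependencies are applied;
-- completely unrelated patches therefore commute.
applicable :: Set.Set String -> Patch -> Bool
applicable applied p = deps p `Set.isSubsetOf` applied

-- Apply whatever is applicable, repeatedly. Patches whose dependencies
-- are missing (or cyclic, i.e. a conflict) simply remain pending; we
-- never recompute what was already applied.
applyAll :: [Patch] -> ([String], [Patch])
applyAll = go Set.empty []
  where
    go applied done pending =
      case partition (applicable applied) pending of
        ([], rest)    -> (reverse done, rest)
        (ready, rest) ->
          go (foldr (Set.insert . patchId) applied ready)
             (reverse (map patchId ready) ++ done)
             rest
```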

I think @mmassi's intuition is this: if you can guarantee that patches are always unrelated (comment1 is a different file than comment2), it does not matter in which order you apply them – the result is always a tree which represents the current state of the issue. Except that the order does matter if we allow mutable objects (text edits, emoji reactions, what have you). I'm becoming more and more convinced that if we can't rely on commit identity (which pijul can), we have to go down the CRDT road, which I'm not so enthusiastic about.

Ya true… maybe this point is moot then.

Well, whose branches? You can take a hash over everything under refs/ as key-value pairs pretty cheaply, but that would only be your local view of the network.

Heh, good question. I'm trying to figure out what can be checkpointed on chain, need to think about this more.

What about partial/causal order, like SSB? Basically just include the id of some previous object, and it will be ordered after it. What would CRDTs buy us over this?

Yes yes. If you take a number of "feeds" – i.e. branches with a single writer, but with the ability to reference "parents" which may logically belong to other "feeds" – you can "merge" them into a single, topologically ordered timeline (provided a secondary total ordering, like timestamps, with lexicographic ordering of commit hashes as a last resort). Then right-fold some user function over this which determines the semantics of the payload (messages in SSB, something analogous in our case), and you have a "semantic merge" in @mmassi's thinking. The accumulator of the fold is the state of your "database", which could be represented as just a filesystem tree, and stored separately from the "feed" branches.
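
A sketch of that shape in Haskell (names hypothetical, assuming the secondary total ordering is timestamp-then-id as described):

```haskell
import Data.List (sortOn)
import qualified Data.Set as Set

-- One entry in a single-writer "feed"; parents may point into other
-- feeds, giving a partial (causal) order across all of them.
data Event a = Event
  { eventId   :: String
  , parents   :: [String]
  , timestamp :: Integer
  , payload   :: a
  }

-- Deterministic linearisation: repeatedly emit the smallest ready
-- event (ordered by timestamp, then event id as a last resort), where
-- "ready" means all of its parents have already been emitted.
linearise :: [Event a] -> [Event a]
linearise = go Set.empty . sortOn (\e -> (timestamp e, eventId e))
  where
    go _ [] = []
    go emitted evs =
      case break (all (`Set.member` emitted) . parents) evs of
        (_, [])           -> []  -- remaining events miss parents we know of
        (before, e:after) ->
          e : go (Set.insert (eventId e) emitted) (before ++ after)

-- The "semantic merge": right-fold a user-supplied function over the
-- timeline (reversed to newest-first, so the oldest event is applied
-- innermost, i.e. first); the accumulator is the "database" state.
semanticMerge :: (a -> b -> b) -> b -> [Event a] -> b
semanticMerge step z = foldr (step . payload) z . reverse . linearise
```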

It is a little bit tricky to make it so you don't have to re-evaluate this right-fold over the entire timeline each time you discover new inputs. But intuitively, it should be possible if you checkpoint the results. Also, the timelines we're talking about are scoped: to evaluate the state of a single issue, you don't have to take a user's entire activity feed into account.

Maybe the Merkle-DAG CRDTs are indeed the formalisation of what we're talking about. I still haven't had time to read the paper properly, maybe @fintohaps can give us a rundown :slight_smile:

@cloudhead:

So after finishing the paper: Merkle-DAG CRDTs can be seen as a "free" structure for having ordered and immutable events. To put it another way, it gives you a free logical clock, as long as your events/payloads form a partial order. All the children in the DAG happened earlier than their parents, since new events are added as root nodes.

The recommendation is to add an operation-based CRDT as the payload. The benefit is that you:

  1. Get a logical clock for your operations for free
  2. Get de-duplication of updates, since equal events hash to the same node

What this would buy us is an already formalised format for one part of our CRDTs, but we'd still have to formalise the operation-based semantics ourselves – for which we can look at git-bug for inspiration, and possibly automerge (the paper with its sideways semantics page :smile:).

I need to read back over the limitations again though. They tend to involve the DAG-Syncer and Broadcaster components, which in our case would be git – which I didn't think of until after I finished that section :slight_smile:
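
For concreteness, a Merkle-clock node in the paper's sense could be sketched like this (hypothetical types – the paper itself builds on IPLD, afaiu):

```haskell
import Data.List (nub)

type Hash = String  -- stand-in for a content hash / CID

-- A Merkle-clock node: the payload plus the hashes of the previous
-- heads. The children happened strictly before this node, and a node's
-- hash transitively names its entire causal past, so equal events
-- deduplicate for free.
data Node op = Node
  { children :: [Hash]  -- previous heads of the clock
  , op       :: op      -- e.g. an operation-based CRDT payload
  }

-- Merging two replicas is just taking the union of their current
-- heads; the DAG itself is the logical clock.
mergeHeads :: [Hash] -> [Hash] -> [Hash]
mergeHeads a b = nub (a ++ b)
```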

I was going to point out merkle-clocks, but I see that @fintohaps has already brought up the Merkle-CRDTs paper.

I would also like to point out the assumption that using CRDTs for issue comments / summaries implies (near) real-time text editing. While that is usually the case, it's not a requirement: e.g. in the cited JSON-CRDT library you can represent an issue summary as an atomic value (which is the default, unless you use the Text CRDT), which avoids most of the mentioned complexity.

It is also worth pointing out that the mentioned automerge library gives you a way to resolve conflicts differently, meaning you can add additional logic: e.g. you could drop edits from non-authors, or present them differently (maybe something like Google Docs suggestion edits).

I think it is also worth considering that, while CRDT libraries like automerge use a specific encoding for operations, it is still possible to encode them in a more human-readable form, like markdown files – which, from what I understand, is what is desired here.

I should also point out the Textile Threads v2 paper, "A protocol & event-sourced database for decentralized user-siloed data", which utilises merkle-clocks for a more general (than CRDTs) event-sourcing (a.k.a. kappa) collaboration primitive (which one could use to encode op-based CRDTs).

I feel like we're getting a bit lost in trying to describe the problem from different angles, but not solving the problem at hand.

What we're talking about can be generalised to this:

foldr :: Foldable t => (a -> b -> b) -> b -> t a -> b

We know that, considering git as the storage and messaging layer, t must be a Commit, for some definition of Commit. What we're not sure about is how we go about a and b.

  • In @mmassi's proposal, t a is instantiated to Commit (Tree a), where a describes tree-structured documents at the leaves. Additionally, b ≡ a, i.e. the output is also a committed Tree a. A nice property of this is that the git history and snapshot state look just like any other git history, i.e. git log as well as a checked-out Tree a make sense to a human.

    This is similar to the delta-state model, except that git doesn't logically operate on deltas, but snapshots.
    It could still work, but the merge operation has to consider the entire Tree a at each step. It is also a problem that we cannot straightforwardly share b, because the commit identity is not solely determined by Tree a and the previous commit, but also by the committer identity and local time. This is not an issue in a patch-based system.

  • In the operational model, a contains only the description of one or more "patch" operations; that is, git log will look funny because it will show a lot of deletions. It also doesn't make sense for a human to check out this history.

    It is also unclear what to do with b. Other systems seem to resort to local storage separate from the transmission layer. This allows the use of BTree-style databases for fast queries, but comes with the additional complexity of reasoning about this component (e.g. invalidating state when new inputs are discovered).

  • It is also conceivable to employ a hybrid of those two approaches: every commit contains strictly data (not operations), but only the delta-state wrt the previous commit. The output state is stored separately, comprising the entire Tree a. I'm not entirely sure, but my intuition is that, if the output state references the inputs from which it was computed, we might be able to arrive at a shareable history regardless of git's hash-linked model.

I might be overlooking something, but it seems to me like the third option is probably what we want. Regardless of the choice, however, we want the reducer function to be supplied by the application – as far as radicle-link is concerned, a should be opaque. The choice of b is more relevant – if the output state is in external local storage, we're done here; otherwise it should be on radicle-link to provide an appropriate abstraction.
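
To make the third option a bit more concrete, here is a sketch under the assumptions above (all names hypothetical):

```haskell
type Hash = String  -- stand-in for a git object id

-- Each input commit carries only the delta wrt the previous commit;
-- the payload 'a' stays opaque to radicle-link.
data Input a = Input
  { inputId :: Hash
  , delta   :: a
  }

-- The output state lives outside the input histories and records the
-- inputs it was computed from, which is what could make it shareable
-- despite git commit identities not being stable under reordering.
data Output b = Output
  { outState     :: b
  , computedFrom :: [Hash]
  }

-- The reducer is supplied by the application; inputs arrive
-- oldest-first and are applied in that order.
reduce :: (a -> b -> b) -> b -> [Input a] -> Output b
reduce step z inputs = Output
  { outState     = foldr (step . delta) z (reverse inputs)
  , computedFrom = map inputId inputs
  }
```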

I'm not sure I understand this. Wouldn't the operational model have a new blob per event? Or am I not thinking of this the right way?

I believe the point of operation-based semantics is that the state is only updated, so if you ever receive an update then your old state can always be thrown away.

Ya I guess keeping all event blobs in the tree would also work. Instead of parent commits, there could be parent blobs – so… the order in which commits are applied to a tree doesn't matter :thinking:

Yeah, but you don't want to recompute the whole state every time an update is received – only from the most recent state just before where the update would causally apply. So, the database needs to support snapshot/rollback to b at each step of the fold.
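
In other words, something like this (a sketch; 'checkpoints' and 'resume' are purely hypothetical names):

```haskell
import Data.List (scanl', foldl')

-- Snapshots of b after each prefix of the (oldest-first) timeline;
-- these are the rollback points.
checkpoints :: (a -> b -> b) -> b -> [a] -> [b]
checkpoints step = scanl' (flip step)

-- When a new input causally belongs at position i, roll back to the
-- checkpoint taken just before i and re-fold only the suffix (the new
-- input plus everything that previously came after it).
resume :: (a -> b -> b) -> [b] -> Int -> [a] -> b
resume step cps i suffix = foldl' (flip step) (cps !! i) suffix
```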

I have been considering a design like that, but I fear some consequences.

A "parent blob" must be implemented as an entry in a tree object, which, when checked out, would result in a directory on the file system.

This has the following consequences:

  • Checking out one blob would also check out its entire "history".

  • Long "parent-child" chains (long "histories") would result in very deep directory trees, leading to very long paths. IIRC there are OSs with path length limitations.

  • In git, branch histories have a compact on-disk representation because identical blobs are shared in the object DB. However, representing a "history" like this and then checking it out would likely use a new file for every shared blob (I doubt git checkout would use hard links, and even if it did, hard links to directories are not a sane option).

Long story short: the idea is clever, but its real-world engineering consequences could be bad.

Or I'm missing something! :slight_smile:

Hmmm, maybe my confusion is as to how commits carry the information of a. Is the information just going to be embedded in the commit message, and we parse it out? Whether a is Tree a or some operation-based semantics.

I don't think anyone proposed to use git commit descriptions to store actual data.
We should use plain git objects.

We have these "building blocks" at our disposal (note that here "reference" means "hash of an object"):

  • commits, which in turn contain:
    • one or more references to parent commits
    • a reference to an object (tree or blob)
    • the author, a description and other metadata
  • tree objects, which work like file system directories and are essentially dictionaries where the values are references to other objects
  • blob objects, which are just byte arrays; we will likely store structured data inside them, and we could even put references (hashes) there, but the git GC will not see them, so we should do this only if we know that the referenced object will be kept alive through other references (inside trees or commits)

Thinking in LEGO terms, these are the bricks we have.
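
In code, the bricks could be modelled roughly like this (a sketch, not git's actual on-disk format):

```haskell
import qualified Data.ByteString as BS
import Data.Map (Map)

type Hash = String  -- a "reference": the hash of an object

-- The three kinds of bricks described above.
data Object
  = CommitObj Commit
  | TreeObj   Tree
  | BlobObj   BS.ByteString  -- just bytes; hashes hidden in here are
                             -- invisible to the git GC

data Commit = Commit
  { parents :: [Hash]  -- one or more parent commits
  , root    :: Hash    -- a tree or blob
  , meta    :: String  -- author, description, other metadata
  }

-- Trees work like directories: names mapped to references.
newtype Tree = Tree (Map FilePath Hash)
```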

We generally think about using commit chains to represent histories (timelines) because this is what git normally does.

And when expressing operations in CRDTs it is generally useful to also have a reference to the previous operation (or "state"), so that it is clear what the operation must be applied to.

So it would seem natural to express CRDT chains (timelines) with commit chains in git, except for these two potential problems:

  • looking at git diffs between commits would not make sense anymore (git expects to diff state but we are storing operations)
  • we would eventually need to reorder commits because of merges, and if the "parent commit" represents the "previous state", this reordering potentially alters the intention of the author of the given commit (it effectively changes the "declared base" on which the operation should be applied)

To avoid these issues it would be tempting to build the CRDT timeline chain inside plain references (in tree objects) instead of using the commit chain.
At the data structure level this can work, but then we would have the issues I described in my previous comment: doing a git checkout would become impractical or impossible.
