CoCo Jammin' on DAGs

Continuing the discussion from Radicle Collaboration Tools:

Jammin’

The @coco team decided to have a jamboard session over hangouts. We wanted to flesh out the questions around the representation and consistency of users’ issue trackers. No easy feat, as you may be able to tell from the previous back and forth in the above conversation.

In this thread I want to cover what we discussed and the outcomes, so that we can drive the conversation forward and help inform the final implementation. We want to focus the conversation on these new outcomes.

Our conversation started off with an overview of what ideas we have explored, whether it was on discourse or over hangout sessions. The overview is shown in the image below:

We split the conversations into “Single Feed” and “User Feed”. “Single Feed” is where we explored options for using CRDTs to create and maintain a single, shared issue feed.

On the flip side we have “User Feed”, which is more reminiscent of SSB’s social feed. Each user holds their own version of the feed, merging in other feeds as they discover them. Users define their own terms for what they see, and merges happen based on a timeline, using references and timestamps.

In a user-based feed, the model in the .git database would map to a branch per person. So, for example, in my .git folder I could see the following if I were following massi and kim:

# My issue tracker
refs/heads/.rad/issues/issue-1234

# My copy of massi's tracker
refs/remotes/massi/.rad/issues/issue-1234

# My copy of kim's tracker
refs/remotes/kim/.rad/issues/issue-1234

You get a Merkle! And You get a Merkle!

We noticed that there was a relation between this user based model and the Merkle-CRDT model posted previously:

So we decided to explore what a user-based feed would look like as a set of DAGs.

Users each have their own DAG of commits that represents their view, and by merging them based on the rules outlined in the paper we should, theoretically, always arrive at a consistent view of the issue. This view would be a “Staging View” that is built from the individual feeds. Let’s break this idea down a little further with some more drawings!

On the far left of the diagram we visualise what an issue thread might look like. To personalise the story we have k, m, and f for Kim, Massi, and Fintan. The node in the graph m1 denotes Massi’s first comment. The Initial_k comment is the start of the issue and the main thread.

The main thread is made up of Initial_k, f1, m1, m3, and k3. The first thread is branched off of f1 and is followed by k1 and m2. Finally, we also note that m2 references m1.

In the bottom right we show the cluster of k1, m1, m2, and m3. The point here is that m1 and k1 are both parents of m2: k1 is a parent because it happened before m2 in the thread, and m1 is a parent because m2 references m1, which implies that m1 must exist beforehand. m1 is also the parent of m3 because they appear in sequence on the main thread.

In the next pane over from the issue graph we have a timeline of the comments made by each person, where the grey, dashed lines denote the relationships across the users’ timelines. This shows that a user’s timeline could be its own DAG of commits – the user-based feeds – referencing other users’ commits when they are aware of them.

Thus, the timelines on the right of the diagram would be merged into the “Staging Branch”, which is our graph on the left of the diagram. The merging rules should be simple. Comments are ordered by their insertion into a thread. If comments reference other comments, they are linked together, creating a causal order. If comments happen concurrently under a thread comment, they are merged by causal order where possible, falling back to a total order, e.g. based on timestamp. We don’t mind the exact order in this case, since the comments will most likely be in the context of the existing conversation anyway.
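To make those merge rules concrete, here is a minimal sketch in Rust (every name is hypothetical; none of this is radicle-link API): each comment carries its parent hashes and a timestamp, and the staging view is a topological sort of the combined DAG in which concurrent comments fall back to (timestamp, hash) order.

use std::collections::{BTreeMap, HashMap};

// A comment in a user feed. All names here are hypothetical.
#[derive(Clone)]
struct Comment {
    hash: String,         // content address of the comment
    parents: Vec<String>, // causal predecessors: thread parent + references
    timestamp: u64,       // author-supplied, used only to break ties
    body: String,
}

// Merge several user feeds into one totally ordered "staging view":
// causal (topological) order first, falling back to (timestamp, hash)
// for comments that are concurrent with each other.
fn merge_feeds(feeds: &[Vec<Comment>]) -> Vec<Comment> {
    let all: HashMap<String, Comment> = feeds
        .iter()
        .flatten()
        .map(|c| (c.hash.clone(), c.clone()))
        .collect();

    // Number of not-yet-emitted parents per comment (Kahn's algorithm).
    let mut pending: HashMap<String, usize> = all
        .values()
        .map(|c| {
            let n = c.parents.iter().filter(|p| all.contains_key(*p)).count();
            (c.hash.clone(), n)
        })
        .collect();

    // Comments whose parents have all been emitted, keyed by
    // (timestamp, hash) so concurrent comments come out deterministically.
    let mut ready: BTreeMap<(u64, String), String> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(h, _)| ((all[h].timestamp, h.clone()), h.clone()))
        .collect();

    let mut out = Vec::new();
    while let Some((_, hash)) = ready.pop_first() {
        out.push(all[&hash].clone());
        // Unlock any comment whose last remaining parent was just emitted.
        for c in all.values().filter(|c| c.parents.contains(&hash)) {
            let n = pending.get_mut(&c.hash).unwrap();
            *n -= 1;
            if *n == 0 {
                ready.insert((c.timestamp, c.hash.clone()), c.hash.clone());
            }
        }
    }
    out
}

Because the fallback key is deterministic, any two peers holding the same set of comments produce the same staging view, regardless of the order in which they fetched the feeds.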

At this point we marry the DAGs with an operation-based CRDT, as recommended by the previously mentioned Merkle-CRDT paper. When we reach the leaves of the graph, we produce a snapshot of the total state, and this forms the view for the application.
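As a sketch of what that snapshot step could look like (the operation set here is invented for illustration; the real semantics are defined by the layer above the protocol): walking the merged order and folding each operation into an accumulator yields the view the application renders.

// Hypothetical issue operations; the real semantics belong to the
// layer above the protocol.
enum Op {
    Comment { id: u64, body: String },
    EditComment { id: u64, body: String },
    Close,
    Reopen,
}

// The materialised view the application renders.
#[derive(Default)]
struct Snapshot {
    open: bool,
    comments: Vec<(u64, String)>, // (id, body) in merged order
}

// Fold the merged, totally ordered operation log into a snapshot.
// Every peer folding the same set of ops arrives at the same view.
fn snapshot(ops: impl IntoIterator<Item = Op>) -> Snapshot {
    let init = Snapshot { open: true, ..Default::default() };
    ops.into_iter().fold(init, |mut s, op| {
        match op {
            Op::Comment { id, body } => s.comments.push((id, body)),
            Op::EditComment { id, body } => {
                if let Some(c) = s.comments.iter_mut().find(|(i, _)| *i == id) {
                    c.1 = body;
                }
            }
            Op::Close => s.open = false,
            Op::Reopen => s.open = true,
        }
        s
    })
}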

Outcome

To wrap up, the outcomes of the discussion were the following take-homes:

  • We will have one feed per user
  • A feed will consist of operations that will form an op-based CRDT
  • We merge feeds into a staging view based on two rules, applied in order:
    1. Causality
    2. Fallback to a total order, e.g. timestamp or lexicographical order of hashes
  • We create a snapshot of the merged view
    • Operation semantics will be needed here
  • Two merged views can be compared cheaply by their root hashes to see if they are equal; if they are not, a user can identify which comments they are missing (see the sketch below).
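A rough sketch of that last point, again with hypothetical types: because the merged views are content-addressed, equality is a single hash comparison, and the missing comments are a set difference over the reachable nodes.

use std::collections::HashSet;

// A merged view: its root hash plus the set of reachable comment
// hashes -- hypothetical, for illustration only.
struct View {
    root: String,
    reachable: HashSet<String>,
}

// Compare two merged views by root hash; if they differ, report
// which comment hashes the local view is missing.
fn missing_from<'a>(local: &'a View, remote: &'a View) -> Vec<&'a String> {
    if local.root == remote.root {
        return Vec::new(); // identical views, nothing to fetch
    }
    remote.reachable.difference(&local.reachable).collect()
}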

The protocol will only manage the DAGs and merging, whereas a layer above that will define the op-based semantics.

Follow Up

The next stage of research will be around the operation semantics. We will look at git-bug and their great work in this area for inspiration, while also being informed by our current research in CoCoDa: Issues 🗒 #codecollab.

In git-bug we’ll specifically be looking at what they are doing for their caching layer, which will help us think about our snapshots, and also at the query types they found useful.


Thanks for posting this summary @fintohaps, I like the direction it’s headed. I do, however, have a few questions and remarks:

  1. I see how all this would work if all participants pull from each other. However, what I do not understand is how, say, a stranger would report or comment on an issue. I am particularly curious about this, as it has been an open problem in the Dat / Beaker ecosystem, specifically around dat://fritter.hashbase.io/, where new users posting messages would find themselves in an echo chamber since their “user feeds” were not known to existing users.

  2. It seems like it would be very useful if updates from all participants could be fetched from a single participant. Unless I am missing something here, feeds being in different remotes implies that it would require fetching from each one separately. Is that the case, or am I missing something?

  3. It is worth considering that the user:repo relation is 1:n, at least on GitHub. (There is an argument to be had here that an m:n relation might be better, but let’s box that for now.) That is to suggest that it might be better to represent a user as a repo that has branches corresponding to feeds per project. This would allow the user interface to show user activity beyond a specific project. It would also allow for, say, a “user profile” branch for data that is shared across the different projects a user participates in, e.g. profile photo, public key, (Keybase-like) identity proofs, etc… For what it’s worth, it is also the direction Beaker is headed.

  4. What happens if a user, for whatever reason, chooses to rebase their feed?

  5. Is any of this going to use cryptography to provide Zero Knowledge Architecture (ZKA), so that you could have projects which use public infrastructure but maintain privacy? For what it’s worth, that was one of my primary goals when collaborating on Textile threads, and I would be happy to discuss this further if relevant.

Thanks

Great summary @fintohaps! I’m frantically researching stylus options for our next jam.

@gozala:

Our replication model is based on SSB: data travels along the edges of the social graph. This is remarkably effective, and it is trivial to reason about the economics, compared to DHT-based systems f.ex.

Just like every other peer-to-peer network, we need entry nodes for new participants, bridge nodes between disjoint subgraphs, and potentially supernodes which not only aid in replication, but also provide bandwidth in a reality where home DSL lines have their egress capped at ridiculously low rates. I don’t see this as a limitation – much the opposite: interesting economic and social models can emerge if we include the possibility of several network topologies – ranging from meshed through federated to closed – to co-exist.

Also note that we are not changing the wire protocol of git at all: any git hosting capable infrastructure can contribute to data availability and bandwidth. Additionally, any protocol which can be mapped to git via its transport hooks (e.g. remote helpers) can be used in conjunction with radicle.

I would also like to reiterate what I said elsewhere many times before: it is a feature to be able to ignore other participants, both from a scaling and a sybil mitigation perspective. Because our system is not inherently centered around users, but projects, we can apply such preferences at much finer granularity (user, user/project, user/project/thread, or any combination thereof).

git remotes are not normally considered by the git pull porcelain – because, in the traditional client-server model, git servers do not themselves have remotes. Remote tracking branches are, however, advertised in the upload-pack / receive-pack protocols, and can trivially be fetched just like any other branch. We use this to replicate remotes up to a factor of n describing the maximum distance in the follow graph, just like SSB. Since we are, again, not centering around users, n is not necessarily a global system parameter.
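For illustration, a sketch of what that bounded replication could look like (the follow-graph representation is invented here and is not radicle-link’s actual model): a breadth-first walk that stops at distance n and yields the peers whose remote tracking branches we would fetch.

use std::collections::{HashMap, HashSet, VecDeque};

// Collect all peers within distance `n` of `me` in the follow graph.
// `follows` maps a peer to the peers it directly follows; names are
// hypothetical, for illustration only.
fn replication_set(
    me: &str,
    follows: &HashMap<String, Vec<String>>,
    n: usize,
) -> HashSet<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue = VecDeque::from([(me.to_string(), 0usize)]);
    while let Some((peer, dist)) = queue.pop_front() {
        if dist >= n {
            continue; // do not replicate beyond the maximum distance
        }
        for next in follows.get(&peer).into_iter().flatten() {
            if seen.insert(next.clone()) {
                queue.push_back((next.clone(), dist + 1));
            }
        }
    }
    seen
}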

Perhaps you want to reconsider the second option: this is how many are using the centralised platforms today – and is absolutely foundational from a privacy and information sovereignty perspective.

There is also a scaling consideration: there is no good reason the protocol should impose downloading unrelated user activity, when the expressed interest is in a project, not a person. Have you ever re-opened your Patchwork installation after some days or weeks of inactivity?

That being said, there is nothing which prevents you from ascribing personal feed semantics to a project. This has several UX benefits, including (but not limited to) discovery of content and activity through a small number of well-connected users (see also the first point). Please note, however, that these semantics are defined entirely on the application, not the protocol layer.

If retracted events have been strongly linked to from other feeds (i.e. via git parents), they will continue to exist. Otherwise, we may reject non-fast-forwards for branches which have special semantics ascribed to them, but that would only be conventional, and wouldn’t even detect every history rewrite.

In other words: finality is where radicle-link ends, and radicle-registry starts.

I would suggest to start a new thread on this complex topic, and to put a little bit more substance behind it. In particular, I would like to understand:

  • What are your use cases?

    Please be precise – recall that we’re exploring a well-defined use case already, that of code collaboration. Although there are elements from each which might seem similar, we’re not designing a messaging system, file storage solution, or microblogging network.

  • What are the security properties you’d expect from those use cases?

    Again, please be precise – I have difficulties with the undifferentiated use of the term “zero knowledge”, as it misleadingly treats very different classes of cryptosystem as being the same somehow, and even suggests generic applicability of such systems. Privacy in particular is also not a binary property of any cryptosystem, and very much depends on the concrete use case.

    Please also consider real world computational resource constraints. If I have this amazing MPC, but every interaction takes several seconds at least, what are the use cases exactly which would justify going through such pain?

  • What economic model are you basing the claim of public infrastructure on?


👏

Great progress – if I understand correctly, we are using SSB’s user-feed model in the underlying layer – i.e. single-writer feeds – combined with CRDTs at the “merge”/“application” layer, where the individual feeds are designed with the merge layer in mind. I’d be curious to see what these ops look like, because it sounds like the user feeds themselves are also modeled as CRDTs.

Another question I have is: where does the Merkle come in? Does it basically allow us to prove membership of an object within the thread?

I think we’re still weighing “plain data” (where the interpreter chooses the op semantics) vs. ops + data (a la git-bug): Issue Tracker Operations · Issue #54 · radicle-dev/radicle-link · GitHub

I think this particular type of query would still be better served by a base-2 merkle tree computed on the output tree. Perhaps we should find a place to capture exactly what properties we want – I’m thinking that we could potentially abuse commit headers to stick the root hash there, and so have a link between the git commit, its normal tree object, and the base-2 representation of it, which can be verified locally.
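To illustrate (a toy sketch, with std’s non-cryptographic hasher standing in for a real hash function): pair the leaf hashes up level by level until a single root remains. The siblings along a leaf’s path to that root then form its membership proof, and the root is what could be stashed in the commit header and re-verified locally.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for a cryptographic hash; a real implementation would use
// e.g. SHA-256 over a well-defined leaf encoding.
fn h(bytes: &[u8]) -> u64 {
    let mut s = DefaultHasher::new();
    bytes.hash(&mut s);
    s.finish()
}

// Compute the root of a base-2 merkle tree over the leaves of the
// output tree. An odd node at the end of a level is carried up as-is.
fn merkle_root(leaves: &[Vec<u8>]) -> Option<u64> {
    let mut level: Vec<u64> = leaves.iter().map(|l| h(l)).collect();
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| match pair {
                [a, b] => h(&[a.to_be_bytes(), b.to_be_bytes()].concat()),
                [a] => *a,
                _ => unreachable!(),
            })
            .collect();
    }
    level.into_iter().next()
}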