Upstream collaboration with event logs

geigerzaehler · April 29, 2022, 2:59pm

I would like to shed some light on how we intend to implement peer-to-peer collaboration for shared entities like patches with event logs.

An example of collaboration (and our motivation) is users commenting on patches and updating their status (e.g. closed, reviewed, merged). Our intention is to implement this functionality using event logs.

I’ll give a brief overview in this post and will follow up with more details and updates in replies as things evolve.

Event logs

We take inspiration from event sourcing and from Secure Scuttlebutt, which uses feeds to implement a peer-to-peer social network.

Every action a peer takes on an entity will be represented by an event that is stored in the Radicle monorepo (more on the storage below). These events are replicated between peers and allow peers to construct a meaningful representation of the concerned entity by iterating over all relevant events.

For instance, if a maintainer wants to mark a patch as reviewed, they will publish an event conveying that intention and referencing the patch. Any peer that replicates that event can show that the patch has been reviewed by the maintainer.

The maintainer may also revoke their review by publishing another event referencing the previous event. Other peers will see this event, realize that it is more recent than the previous event that marked the patch as reviewed and display the patch as not reviewed.
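To make the review example concrete, here is a minimal sketch of folding an ordered list of events into state. The `Event` type, the event type names `review` and `revoke-review`, and the rule that only the original reviewer can revoke are assumptions made for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    id: str                       # stand-in for the commit hash of the event
    author: str                   # peer that published the event
    type: str                     # hypothetical types: "review", "revoke-review"
    patch: str                    # patch the event refers to
    target: Optional[str] = None  # for "revoke-review": id of the revoked event


def reviewers(events: list[Event], patch: str) -> set[str]:
    """Fold an ordered event list into the set of peers whose review still stands."""
    active: dict[str, str] = {}  # review event id -> author
    for event in events:
        if event.patch != patch:
            continue
        if event.type == "review":
            active[event.id] = event.author
        elif event.type == "revoke-review" and event.target in active:
            # Assumption: only the original reviewer may revoke their review.
            if active[event.target] == event.author:
                del active[event.target]
    return set(active.values())


log = [
    Event("e1", "maintainer", "review", "patch-1"),
    Event("e2", "contributor", "review", "patch-1"),
    Event("e3", "maintainer", "revoke-review", "patch-1", target="e1"),
]
print(sorted(reviewers(log[:2], "patch-1")))  # before the revocation
print(sorted(reviewers(log, "patch-1")))      # after the revocation
```

Because the state is derived from the log, the fold can change later without touching the stored events.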

Event ordering

In many cases it is necessary to order events to determine the state of an entity. For example, if multiple events close and reopen a patch, we want to know which event is the most recent to determine whether the patch is open or closed.

Similarly, we want to put the comments on a patch in the order in which they were created.

We can always put a total order on events by including a timestamp in the event data. This works well for a lot of cases, but the problem is that these timestamps are generated by the peers and cannot be verified by other peers. A peer could therefore manipulate its timestamps and, ultimately, how state is presented on other peers.

To defend against this, events will reference one or more previous events by hash.
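The following sketch illustrates why hash links constrain ordering in a way timestamps cannot. It uses SHA-256 over a JSON payload as a stand-in for Git's commit hashing; `event_id` and `causal_order` are hypothetical helpers.

```python
import hashlib
import json


def event_id(data: dict, parents: list[str]) -> str:
    """Hash the event data together with the ids of the events it builds on."""
    payload = json.dumps({"data": data, "parents": sorted(parents)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def causal_order(events: dict[str, list[str]]) -> list[str]:
    """Return event ids so that every event comes after all of its parents."""
    ordered: list[str] = []
    seen: set[str] = set()

    def visit(eid: str) -> None:
        if eid in seen:
            return
        seen.add(eid)
        for parent in events[eid]:
            visit(parent)
        ordered.append(eid)

    for eid in sorted(events):
        visit(eid)
    return ordered


close = event_id({"type": "close", "timestamp": 200}, [])
# The reopen event claims an earlier timestamp, but it embeds the hash of
# `close`, so it can only have been created after `close` existed.
reopen = event_id({"type": "reopen", "timestamp": 100}, [close])

order = causal_order({close: [], reopen: [close]})
print(order == [close, reopen])
```

Events that do not reference each other remain unordered relative to one another, so timestamps or application rules are still needed to break ties between concurrent events.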

Storage

The simplest way to represent hash-linked events in Git is through commits where each event corresponds to a commit. Whenever a peer publishes a new event they create a new commit that contains the event data and links to the commits for previous events. We’ll discuss how these commits are referenced in the next section.

We could store event data in trees associated with commits or in the commit message. The former may provide more flexibility but we don’t know of a concrete advantage now. The latter is a lot less complex.

For the encoding of events in commits we have the following requirements:

Self-identifying: The commit should contain a hint that conveys that it is encoding an event.
Upgradable: If we change the encoding, new code should be able to handle existing event logs with the legacy encoding.
Compatible with Git tooling: The encoding in commit messages should not break existing Git tools. Ideally we can use existing Git tooling to debug event logs.

Our proposal for a commit message is the following:

radicle upstream event: <event type>

content-type: radicle-upstream-event.v1
content: <JSON encoded event envelope>

The first line will be ignored when parsing. It’s there to help with debugging with Git tools.

We’re using Git trailers to encode structured data in the commit message. We’re using the content-type field to identify the encoding of content. This allows us to change the encoding in the future (to base64 CBOR, for example).
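As a rough sketch of this format, the following encodes an envelope into a commit message and parses it back. The function names are made up for illustration, and the line-based parser is a simplification of how Git itself parses trailers.

```python
import json

CONTENT_TYPE = "radicle-upstream-event.v1"


def encode_message(event_type: str, envelope: dict) -> str:
    """Render the commit message for an event."""
    # json.dumps emits a single line by default, so `content` stays a one-line trailer.
    content = json.dumps(envelope)
    return (
        f"radicle upstream event: {event_type}\n"
        "\n"
        f"content-type: {CONTENT_TYPE}\n"
        f"content: {content}\n"
    )


def decode_message(message: str) -> dict:
    """Extract the envelope, rejecting encodings this code does not know."""
    fields = {}
    # Skip the first line: it only exists for humans reading the log.
    for line in message.splitlines()[1:]:
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    if fields.get("content-type") != CONTENT_TYPE:
        raise ValueError(f"unsupported content-type: {fields.get('content-type')!r}")
    if "content" not in fields:
        raise ValueError("missing content trailer")
    return json.loads(fields["content"])


envelope = {"event": {"type": "comment", "data": {"body": "Looks good: ship it"}}}
message = encode_message("comment", envelope)
print(message)
print(decode_message(message) == envelope)
```

Dispatching on `content-type` is what makes the encoding upgradable: a future version would add another branch instead of changing how existing logs are read. Since the fields are ordinary trailers, `git interpret-trailers --parse` should also be able to extract them when debugging.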

References

Since events published by peers are stored as commits we need to make them accessible through Git references.

We propose the following scheme for references in the monorepo for events published by the local peer:

refs/namespace/<project-id>/refs/upstream-events/<topic>

For remote peers we use the following scheme:

refs/namespace/<project-id>/refs/remotes/<peer id>/upstream-events/<topic>

This means peers have different event logs scoped by project and a topic. We use scoping by project since replication is currently on a per-project basis so we can replicate events with few adjustments. Scoping by project and topic also allows peers to selectively replicate events they are interested in. For example, to get a peer’s comments on a patch a peer does not need to replicate all the events the peer ever published.
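A small sketch of how the two reference schemes could be built from their parts. `event_log_ref` is a hypothetical helper; the path layout is copied from the schemes above.

```python
from typing import Optional


def event_log_ref(project_id: str, topic: str, peer_id: Optional[str] = None) -> str:
    """Build the reference name of an event log; `peer_id=None` means the local peer."""
    base = f"refs/namespace/{project_id}/refs"
    if peer_id is None:
        return f"{base}/upstream-events/{topic}"
    return f"{base}/remotes/{peer_id}/upstream-events/{topic}"


print(event_log_ref("<project-id>", "<topic>"))
print(event_log_ref("<project-id>", "<topic>", peer_id="<peer id>"))
```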

Authenticity

We probably want to sign individual events instead of relying on rad/signed_refs. I’ll leave this for a follow-up post.

fintohaps · May 3, 2022, 10:49am

What happens in the case of events that reference the same previous events? For example, two peers propose a conflicting event: p1 marks the patch as merged, p2 marks the patch as closed (without merge).

How do you handle an event commit referencing a commit that you have not replicated? Say, by a peer outside of your tracking graph.

I’m also wondering how this model interacts with your patches implementation, as in how are they bridged? Does it all happen at the application layer?

geigerzaehler · May 3, 2022, 1:42pm

These are some notes from our discussion.

Signature and replay attacks

To prevent replay attacks the signed payload must include the previous event. To prevent replay attacks between different topics, the first commit in the history of a topic references the name of the topic and the identity it concerns.

Race conditions

We need to avoid race conditions when publishing an event. Assume that we have two processes or requests that publish an event concurrently. We want to avoid a situation where we publish one event to the network and immediately afterwards publish a second event that references not the first event but a common ancestor. This would mean that pushing the reference for the event log topic would not be a fast-forward.

To the best of my knowledge libgit2 uses file locking for atomic updates to references. We need to investigate which APIs we would need to use for this.
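One way to think about the requirement is as a compare-and-swap on the reference: the update only succeeds if the reference still points at the commit the new event was built on. The sketch below models this with an in-memory store; `RefStore` and `publish` are hypothetical and do not correspond to any libgit2 API.

```python
import threading
from typing import Callable, Optional


class RefStore:
    """In-memory stand-in for a reference store with atomic updates."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._refs: dict[str, str] = {}

    def get(self, name: str) -> Optional[str]:
        with self._lock:
            return self._refs.get(name)

    def compare_and_swap(self, name: str, expected: Optional[str], new: str) -> bool:
        """Point `name` at `new` only if it still points at `expected`."""
        with self._lock:
            if self._refs.get(name) != expected:
                return False
            self._refs[name] = new
            return True


def publish(store: RefStore, ref: str, make_commit: Callable[[Optional[str]], str]) -> str:
    """Create an event commit on top of the current head, retrying on a lost race."""
    while True:
        head = store.get(ref)
        commit = make_commit(head)  # the new event references `head` as its parent
        if store.compare_and_swap(ref, head, commit):
            return commit
```

A publisher that loses the race rebuilds its event on top of the new head, so the log stays linear and every push of the reference is a fast-forward.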

`notes` directory

Now librad signs all references irrespective of their directory. This means there’s no need to put the event log references under refs/notes. I’ve updated the initial post to use refs/upstream-events.

geigerzaehler · May 3, 2022, 1:53pm

This is a question we want to answer at the application level. For instance, for a single maintainer, only they can merge and their status will override the closed status by p2. For multiple maintainers we could use a quorum or consider the latest state as canonical.

The reason why we separate this from the storage layer is that this allows us to update the logic and presentation without changing the data that was stored. We need maximum flexibility on how to fold the events into a state because this is complicated to get right, will evolve quickly and is highly dependent on the use case.

Since we’re using git fetch under the hood all referenced commits should be replicated, even if you’re not explicitly tracking the peer.

At the moment the data is completely separate and only the frontend puts it together to present it in a unified manner. For upcoming features we’ll need to think about how to store patch information in events. This will likely result in existing patches being lost unless we can come up with a cost-effective compatibility layer.

rudolfs · May 3, 2022, 2:22pm

There was also a point raised about storing the JSON in some canonical form to prevent different renderings of the same data from producing different commit SHAs. E.g.:

{a: 1, b: 2}

vs

{b: 2, a: 1}
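A quick sketch of the concern: the two renderings decode to the same data but hash differently unless they are serialized in a fixed form first. Sorting keys and using compact separators is only a simple stand-in here, not a complete canonicalization scheme and not the link-canonical implementation.

```python
import hashlib
import json

first = '{"a": 1, "b": 2}'
second = '{"b": 2, "a": 1}'


def canonical(text: str) -> bytes:
    """Re-serialize JSON with sorted keys and no optional whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":")).encode()


same_data = json.loads(first) == json.loads(second)
same_raw_hash = hashlib.sha1(first.encode()).digest() == hashlib.sha1(second.encode()).digest()
same_canonical = canonical(first) == canonical(second)
print(same_data, same_raw_hash, same_canonical)
```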

fintohaps · May 3, 2022, 3:03pm

But what does “latest state” mean if they were concurrent? I imagine you would just end up reporting a conflict that needs to be resolved by human interaction.

Ya, this makes sense, and I have heard similar reasoning in discussions before. As far as I remember, the problem then becomes “what’s the canonical fold?”

Ah, so when you link these events, the previous commits become parent commits?

fintohaps · May 3, 2022, 3:05pm

link-canonical (or librad::canonical) has a canonical JSON type for this

geigerzaehler · May 5, 2022, 9:48am

In this specific instance we could consider the most recent (topologically) commit that is an ancestor of all maintainer commits. That means we wouldn’t consider any state that the maintainers have not resolved yet. Or, as you said, we could tell the user that we have concurrent updates.
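As a sketch of that rule, assuming the event history is available as a map from commit id to parent ids, the function below returns the latest commits that every maintainer head descends from. `ancestors` and `latest_agreed` are hypothetical names, and the result can contain more than one commit for histories with several independent common ancestors.

```python
def ancestors(parents: dict[str, list[str]], commit: str) -> set[str]:
    """All commits reachable from `commit`, including `commit` itself."""
    seen: set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def latest_agreed(parents: dict[str, list[str]], heads: list[str]) -> set[str]:
    """Most recent commits that are ancestors of every maintainer head."""
    common = set.intersection(*(ancestors(parents, head) for head in heads))
    # Drop every common commit that is a strict ancestor of another common commit.
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Two maintainers act concurrently on the same "open" event.
history = {"open": [], "close": ["open"], "merge": ["open"]}
print(latest_agreed(history, ["close", "merge"]))
```

In the example the conflicting `close` and `merge` events are both ignored and the patch is presented as open until one maintainer publishes an event that descends from both.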

Yes. I just realized that I did not explicitly state this in the initial post: to enforce causal ordering, an event references previous events by making their commits parents of its own commit.

geigerzaehler · May 5, 2022, 9:53am

I don’t think this is an issue for us. I can’t see why we would need the same data to have the same commit SHAs. Even if we have the same JSON data the commit hashes will be different because the event will have a different parent, timestamp, or topic.

This also isn’t a problem for signatures: We’ll want to sign the whole commit and not just the JSON payload to avoid replay attacks. So it does not matter how the JSON is serialized.

geigerzaehler · May 13, 2022, 2:18pm

An event declares which peer it was authored by and we want to validate that claim. We achieve this by signing the event with the peer’s secret key.

We’ll use Git’s builtin commit signing system with the field radicle-ed25519. The signature will be encoded using the minicbor encoding of link_crypto::Signature.

Additional envelope fields

To prevent replay attacks the signed data must include everything that determines the event’s meaning. This means it must include the concerned identity and topic. To achieve this we store this data in the event envelope which now looks like this:


{
  "identity": "hnrk8ueib11sen1g9n1xbt71qdns9n4gipw1o",
  "topic": "patch/hyn5r6yejjco8r77yf7gu6gqsetgjsqt5oitpzu5eu791wej6p3xz6/patch-name",
  "peer_id": "hyn5r6yejjco8r77yf7gu6gqsetgjsqt5oitpzu5eu791wej6p3xz6",
  "event": {
    "type": "comment",
    "data": {}
  }
}

As an alternative to this approach, we considered storing the identity and topic in the root commit of the event log. With that approach it is harder to analyze whether it defends against replay attacks. It is also harder to implement, and it is unclear how it would work when we reference events from other topics. Its only advantage is that it is more storage efficient. This, however, is something we can fix later.

Validation

We’ll validate event log refs when fetching them from a seed. Validation consists of the following checks:

The commit message must be formatted properly and the payload can be deserialized into the event envelope.
The commit must be signed with the public key specified in the event envelope.

We don’t validate that the topic and identity fields of the event envelope match the identity and topic of the event log we’re validating. This is ok since we see no reason to disallow referencing events from other topics.
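The two checks could look roughly like the sketch below. Python's standard library has no Ed25519 support, so signature verification is passed in as a callable; `validate_event`, `Verifier` and the list of required fields are assumptions made for this example.

```python
import json
from typing import Callable

CONTENT_TYPE = "radicle-upstream-event.v1"
REQUIRED_FIELDS = ("identity", "topic", "peer_id", "event")

# Hypothetical signature check: (peer_id, signed_bytes, signature) -> bool.
Verifier = Callable[[str, bytes, bytes], bool]


def validate_event(message: str, signed_bytes: bytes, signature: bytes, verify: Verifier) -> dict:
    """Run both checks and return the envelope, or raise ValueError."""
    trailers = dict(
        line.split(": ", 1) for line in message.splitlines()[1:] if ": " in line
    )
    if trailers.get("content-type") != CONTENT_TYPE:
        raise ValueError("missing or unsupported content-type")
    # json.JSONDecodeError is a subclass of ValueError.
    envelope = json.loads(trailers.get("content", ""))
    if not isinstance(envelope, dict) or any(f not in envelope for f in REQUIRED_FIELDS):
        raise ValueError("payload is not an event envelope")
    if not verify(envelope["peer_id"], signed_bytes, signature):
        raise ValueError("signature does not match peer_id")
    # Deliberately no check that `identity` and `topic` match the log being validated.
    return envelope
```

A caller would supply a real Ed25519 verifier and the signed bytes of the commit; the function itself never looks at which event log the commit was fetched from.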

alexgood · May 13, 2022, 8:43pm

What does this refer to exactly? Do you mean the

-----BEGIN PGP SIGNATURE-----
...
-----END PGP SIGNATURE-----

block that Git encapsulates signatures in, as documented? (Although I guess you would be using BEGIN SSH SIGNATURE for ed25519 keys.) If so, I assume that this would be a signature over everything preceding the signature text, including the parents of the commit?

geigerzaehler · May 20, 2022, 7:53am

We’re indeed using a scheme that is adapted from Git commit signatures. This means we sign the whole commit object (excluding the signature header) and put the signature in the commit header. The difference from Git commit signatures is that we’re using a different header (radicle-ed25519) and a different format for the signature. The signature is the base64-encoded, CBOR-encoded Signature.
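To illustrate what “the whole commit object excluding the signature header” could mean in practice, here is a sketch that recovers the signed bytes from a raw commit object. The position of the header, the handling of continuation lines, and the `signed_payload` helper are assumptions; the signature value is a placeholder, not real CBOR.

```python
def signed_payload(raw_commit: bytes, header: bytes = b"radicle-ed25519") -> bytes:
    """Return the raw commit with the signature header (and its continuation lines) removed."""
    head, sep, body = raw_commit.partition(b"\n\n")
    kept = []
    skipping = False
    for line in head.split(b"\n"):
        if line.startswith(header + b" "):
            skipping = True
            continue
        if skipping and line.startswith(b" "):
            continue  # continuation line of a multi-line header value
        skipping = False
        kept.append(line)
    return b"\n".join(kept) + sep + body


unsigned = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"parent " + b"1" * 40 + b"\n"
    b"author A <a@example.com> 1651830000 +0000\n"
    b"committer A <a@example.com> 1651830000 +0000\n"
    b"\n"
    b"radicle upstream event: comment\n"
)
# Placeholder signature value inserted as the last header.
signed = unsigned.replace(b"\n\n", b"\nradicle-ed25519 c2lnbmF0dXJl\n\n", 1)
print(signed_payload(signed) == unsigned)
```

Since the parent hashes are part of the signed bytes, a signature cannot be replayed on top of a different history.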