RFC: Entity Storage

mmassi · June 16, 2020, 10:24am

I’d like to add this RFC to the radicle-link repo, but I’m posting it here first.

RFC: Entity Storage

Author: @massimiliano-mantione
Date: 2020-06-16
Status: draft
Community discussion: here…

Motivation

The identity resolution RFC specifies how, on a radicle device, all known entities are stored in the same monorepo.

It also states that each namespace should refer to every entity needed to verify the namespace entity so that it is guaranteed that none of them is missing, and that all of them can be acquired with a single git fetch operation.

The goal of this RFC is to specify the details of how this happens.

Entity revision history

Each entity revision is stored in one git commit with the following properties:

the commit refers to a tree object containing a single file named id, which is the signed entity metadata
if the entity revision is n with n > 1, the first parent of the commit must refer to the commit containing revision n - 1

It is therefore possible to traverse the entity revision history in reverse order by walking the list of parent commits of any given entity commit.
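As a minimal sketch of that traversal, the following models commits in memory rather than reading a real git repository. The `EntityCommit` type, the `store` dictionary and `walk_history` are hypothetical names introduced for illustration only, and the tree with the `id` file is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Tuple


@dataclass(frozen=True)
class EntityCommit:
    """Hypothetical in-memory stand-in for a git commit holding one entity revision."""
    oid: str                        # commit OID (hex)
    revision: int                   # entity revision number, starting at 1
    parents: Tuple[str, ...] = ()   # parent OIDs; for n > 1 the first is revision n - 1


def walk_history(store: Dict[str, EntityCommit], tip: str) -> Iterator[EntityCommit]:
    """Yield revisions from `tip` back to revision 1 by following first parents."""
    current = store[tip]
    while True:
        yield current
        if current.revision == 1:
            return  # revision 1 has no previous-revision parent
        if not current.parents:
            raise ValueError(f"{current.oid}: revision {current.revision} has no parent")
        previous = store[current.parents[0]]
        if previous.revision != current.revision - 1:
            raise ValueError(
                f"{current.oid}: first parent is not revision {current.revision - 1}"
            )
        current = previous
```

Any parents after the first are ignored here; they only matter once certifiers come into play.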

Entity certifiers reachability

For entities that are signed by other entities (referred to as certifiers), the entity commit must contain one additional parent for each direct certifier, referring to the commit containing the certifier revision that was used to sign the given entity.

The order of these parents is deterministic, and it is the lexicographical order of their git hashes (OIDs) as byte arrays.

Failure to comply with these storage format requirements must be handled as an entity verification error.
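A possible way to express the parent layout and its check is sketched below. It assumes OIDs are handled as raw `bytes`, that the previous revision (when there is one) comes before the sorted certifier parents, and that the certifier list has no duplicates. The function names are hypothetical.

```python
from typing import List, Optional


def expected_parents(previous: Optional[bytes], certifiers: List[bytes]) -> List[bytes]:
    """Previous revision first (if any), then certifier commits sorted by raw OID bytes."""
    head = [previous] if previous is not None else []
    # Python compares bytes objects lexicographically, byte by byte.
    return head + sorted(certifiers)


def check_parents(
    actual: List[bytes], previous: Optional[bytes], certifiers: List[bytes]
) -> None:
    """Raise if the commit's parent list deviates from the mandated layout."""
    if actual != expected_parents(previous, certifiers):
        raise ValueError("entity verification error: malformed parent list")
```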

How entity transfer and verification exploit these storage properties

The reachability property ensures that fetching a given entity commit without limiting the fetch depth will fetch every additional entity needed to verify the current entity (at all their needed revisions).

Mandating determinism in the storage of the commit parents ensures that the OID of each entity revision is also deterministic.

This happens because a signed entity is itself immutable, and its set of certifiers is fixed.

Its commit must therefore be prepared with the exact set of needed “direct” certifiers (which is the set of certifiers in the entity metadata).

What is nice is that every reference to an entity revision in the monorepo is simply a reference to its commit, and it is not possible for different OIDs to refer to the same entity revision.
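To illustrate why the parent order matters for the OID, here is a small sketch that hashes a commit object the way git does for SHA-1 repositories (`commit <size>\0` followed by the commit body). Git also hashes the author, committer and message fields, so the sketch pins them to placeholder values; `FIXED_IDENT`, the default message and `commit_oid` are assumptions made for this example, not something this RFC specifies.

```python
import hashlib
from typing import Iterable

# Placeholder identity and timestamp: an assumption for this sketch only.
FIXED_IDENT = "radicle <radicle@localhost> 0 +0000"


def commit_oid(tree: str, parents: Iterable[str], message: str = "entity revision\n") -> str:
    """Compute the SHA-1 OID git would assign to a commit with these fields."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {FIXED_IDENT}", f"committer {FIXED_IDENT}", "", message]
    body = "\n".join(lines).encode()
    header = b"commit %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


def canonical_parents(certifiers: Iterable[str]) -> list:
    """Sort hex OIDs by their raw byte values, as mandated for certifier parents."""
    return sorted(certifiers, key=bytes.fromhex)
```

With everything else held constant, two peers that sort the same certifier set end up with the same OID, while an unsorted parent list produces a different one.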

Entity verification can use a cache stating the status of every known entity revision (and their git OIDs).

When new entity revisions are fetched, they are initially referenced by the git refs under the peers hierarchy (because they have been received but not verified yet).

Verification can happen by inspecting the entity objects referenced by those “untrusted” refs: if the verification passes, new revisions (and possibly entirely new entities) will acquire the “verified” status.

When this happens, the refs of the current device can be updated to directly refer to the commits that have been verified.

The fact that git OIDs are canonical guarantees that if the same entity revision is seen again from another peer it will directly point to the same git object used by the current peer, so that the system will never have to choose between two different OIDs that could represent the same entity revision.

Therefore the “local view” of the entities (the ref/id reference under each namespace) can simply be updated to point to the git OID of the highest known verified revision for that entity.
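The cache and the “local view” update could look roughly like the sketch below. The `Status` values, the shape of the cache (OID mapped to entity id, revision number and status) and `highest_verified` are all hypothetical; the RFC does not define the cache layout.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Status(Enum):
    UNVERIFIED = "unverified"
    VERIFIED = "verified"
    REJECTED = "rejected"


# Hypothetical cache: commit OID -> (entity id, revision number, status)
Cache = Dict[str, Tuple[str, int, Status]]


def highest_verified(cache: Cache, entity: str) -> Optional[str]:
    """Return the OID of the highest verified revision of `entity`, if any."""
    best: Optional[Tuple[int, str]] = None
    for oid, (owner, revision, status) in cache.items():
        if owner == entity and status is Status.VERIFIED:
            if best is None or revision > best[0]:
                best = (revision, oid)
    return best[1] if best else None
```

The returned OID is what the reference under the entity's namespace would be updated to point at.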

Recursive verification implementation

Verifying an entity at revision n requires having completed the verification of its revision n - 1 and of each of its certifiers.

This algorithm is logically recursive; however, its implementation should avoid recursive invocations because large (but legal) entity graphs could cause stack overflows.

Since entity revision numbers are finite the verification of each entity is guaranteed to terminate.
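One way to avoid recursion is an explicit stack, as in the sketch below. It assumes the dependency graph is given as a dictionary from commit OID to parent OIDs (previous revision and certifiers) and that a caller-supplied `check` function validates a single commit; both are hypothetical simplifications. It relies on the graph being acyclic, which holds for git commit graphs.

```python
from typing import Callable, Dict, List, Set


def verify_iteratively(
    deps: Dict[str, List[str]],
    root: str,
    check: Callable[[str], bool],
) -> Set[str]:
    """Verify `root` and everything it depends on without recursion.

    `check` is called on a commit only after all of its dependencies
    have been verified. Raises ValueError on the first failure.
    """
    verified: Set[str] = set()
    stack: List[str] = [root]
    while stack:
        oid = stack[-1]
        if oid in verified:
            stack.pop()  # already handled through another path
            continue
        pending = [p for p in deps[oid] if p not in verified]
        if pending:
            stack.extend(pending)  # verify dependencies first
            continue
        if not check(oid):
            raise ValueError(f"entity verification error at {oid}")
        verified.add(oid)
        stack.pop()
    return verified
```

Because the work list lives on the heap, a revision chain with tens of thousands of entries does not run into the interpreter's recursion limit.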

The only kind of “DoS attack” by complexity that we can imagine is building a graph where each certifier in turn needs another certifier to be verified, growing the number of needed certifiers.

However, the graph cannot be cyclic: if entity A requires certifier B, and B in turn requires A, then B must require a previous revision of A.

This is mandatory because when B was signed with A’s key, A must have been at a previous revision, otherwise the signature would have been invalid.

This is a nice property because if the git fetch of the certifier graph terminated successfully, the verification is also guaranteed to terminate.

And the mandated parent encoding guarantees that a single fetch of a given entity revision will fetch the whole certifier graph.

Thanks to the git packing magic, only the needed entities will actually be transferred, but at the end of the operation all the needed ones will be present in the object database.

Conflict with identity resolution RFC

In that RFC the <namespace>/refs/ hierarchy was introduced to help in identifying and transferring all the entities needed to verify a given namespace.

However with the proposed commit parents storage scheme it is likely that those refs are not needed at all:

fetching a single revision is enough to transfer all the needed ones
the verification algorithm can follow the commit parents instead of looking up the <namespace>/refs/ references

Not needing to set up those references could be beneficial, both for implementation simplicity and because having fewer refs in the repository will decrease the complexity of git operations (like globbing in the refs hierarchy).

Open issues

One problematic point is still what to do if, for any reason, the system meets two alternative representations of the same entity revision.

The simplest approach is to reject both.

There are only three scenarios that could lead to this situation:

1. one or more keys have been compromised
2. the user is maliciously spreading alternative “versions” of those revisions
3. the user made an incredibly gross mistake updating the entity (combined with a tricky situation of devices being unreachable at critical times)

Ruling out case 3 (it is so hard to do that it resembles case 2), and since both 1 and 2 are essentially malicious, rejecting the entity from that revision onwards seems a reasonable approach.

The problem is “from that revision onwards”.

Previous revisions cannot be thrown away (they were legit after all).

But the offending, “duplicated” revision could appear very late in time, after lots of other revisions and certifications have happened, have been exchanged and are part of the network of entities.

The problem is logically always the same: only with a blockchain is it possible to guarantee that forks will not happen and the past will never change.

Without radicle-registry, radicle-link cannot offer these guarantees.

Users know this; it’s part of the deal, and it’s the real differentiation between the radicle-link and radicle-registry offerings.

Nevertheless I still consider how to handle this case a tricky issue.

The “reject both revisions” approach risks confusing users because they could see entities “disappear”.

Probably the best approach would be to “taint” them, with a verification status of ForkTainted: the revision by itself is valid, but an equally valid branch has been created.

The UI should not make those revisions disappear, but it should clearly flag them as “tainted”.
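A rough sketch of how such tainting could be computed is shown below. It takes revisions that each passed verification on their own, and marks everything from the first forked revision of an entity onwards as `ForkTainted`. Tainting later revisions as well is only one possible reading of “from that revision onwards”; the `VerificationStatus` enum and `classify` function are hypothetical.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, Set, Tuple


class VerificationStatus(Enum):
    VERIFIED = "Verified"
    FORK_TAINTED = "ForkTainted"


def classify(
    revisions: Iterable[Tuple[str, int, str]],
) -> Dict[str, VerificationStatus]:
    """Map each commit OID to a status, given (entity id, revision, oid) triples."""
    seen: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for entity, revision, oid in revisions:
        seen[(entity, revision)].add(oid)
    # Lowest forked revision per entity: two distinct OIDs for the same revision.
    first_fork: Dict[str, int] = {}
    for (entity, revision), oids in seen.items():
        if len(oids) > 1:
            first_fork[entity] = min(revision, first_fork.get(entity, revision))
    result: Dict[str, VerificationStatus] = {}
    for (entity, revision), oids in seen.items():
        tainted = entity in first_fork and revision >= first_fork[entity]
        for oid in oids:
            result[oid] = (
                VerificationStatus.FORK_TAINTED if tainted else VerificationStatus.VERIFIED
            )
    return result
```

Revisions before the fork keep their verified status, matching the observation that they cannot be thrown away.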