Rambles on updating the Radicle Registry

Once the Radicle Registry goes live, integrating updates and breaking changes will no longer be a trivial matter solved by hard-resetting the chain, but a problem we need to get our heads around, ensuring that our public network stays compatible version after version.

While we, developers, always build and run the latest versions of our software, the story is usually different when it comes to users. Users don’t know how the internals work, error messages might not be as clear for them, they are not in the loop about new versions as eagerly as we are, etc.

For that reason, thinking about the process we have to follow to update our software is also a matter of thinking about how to keep the user community engaged.

We currently provide two binaries, radicle-registry-node and radicle-registry-cli. The binary providing the biggest challenges is the first, as it involves the ledger itself, or runtime as we call it at a technical level. There are three categories of updates to our public software: updates to the CLI, updates to the node, and updates to the runtime. Let’s start with the last.

Updates to the runtime

The most challenging chunk. Within this category, we see another three different categories, ordered by increasing complexity.

Implementation changes

Changes that affect neither the business logic of the ledger nor its chain state. Say, a minor dependency update, some refactoring work, etc.

The first question that comes to mind is: how can we define objectively what is “just” an implementation change? One possible answer is that the tests are left untouched and code coverage does not decrease. Another is that the CLI remains compatible without any changes.

Since this is the simplest form of runtime update, we want to start here. How do we go about updating the runtime executed by the running nodes without stopping the world? We will do that by compiling a new WASM version and broadcasting it to all nodes via a transaction, which will be captured in the block that includes it. Here, we need to build a tool to prepare and submit this transaction.

We figured that users who download and install the latest version of our radicle-registry-node binary should, by default, run the latest native version of the runtime, which - at that point - is the same as the WASM counterpart. That is helpful due to the improved performance that a native implementation offers. However, once a new WASM version is received, that version should be used to produce blocks instead of the native one, which becomes obsolete at that point.
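The selection logic described above can be sketched roughly like this. It is a simplified stand-in, not our actual code: the `spec_version` field and the function names are illustrative, borrowed loosely from Substrate conventions.

```rust
// Hypothetical sketch: decide which runtime build a node executes.
// Names are illustrative, not the registry’s real API.

#[derive(Debug, PartialEq)]
enum Execution {
    Native,
    Wasm,
}

/// Use the faster native runtime only when it matches the runtime
/// currently stored on-chain; otherwise fall back to the on-chain
/// Wasm so that all nodes agree on how blocks are produced.
fn select_runtime(native_spec_version: u32, onchain_spec_version: u32) -> Execution {
    if native_spec_version == onchain_spec_version {
        Execution::Native
    } else {
        Execution::Wasm
    }
}

fn main() {
    // Freshly installed node: the native build matches the chain.
    assert_eq!(select_runtime(7, 7), Execution::Native);
    // A newer Wasm runtime arrived via transaction: native is obsolete.
    assert_eq!(select_runtime(7, 8), Execution::Wasm);
}
```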

Logic changes

These are changes that affect the behaviour of the ledger. Say, a new validation step applied to incoming transactions, for example.

Such changes to the runtime should result in changes to the tests, be it adding, adjusting, or deleting some. Just like implementation changes, though, these shouldn’t require any change to the CLI. However, since such updates might break previously successful user patterns, we will have a publicly announced changelog describing them and how to re-adjust. The documentation provided at registry.radicle.xyz/docs will also be updated.

State changes

Here lives the hairiest of the beasts. Say we move from storing the members of an organization as account ids to user ids. Such a change fundamentally impacts the way different versions parse and handle data from the chain. For those new to blockchains, as I am, this is an issue because rewriting the chain state is not an option, unlike running a migration on a PostgreSQL instance.

We could decide, in this early stage, to hard-fork when faced with such a change, meaning that we would run a new, parallel network where these breaking changes would start making their way on fresh ground. However, that is not going to cut it once we go mainnet, so we are focused on learning how to do this properly as early as possible.

In other words, we need to be backwards-compatible concerning the chain state. We need to devise a way to deserialize and support older forms of the data that might still be on-chain, as well as the new one. Given the example above, when deserializing Orgs, we need to handle both the old and the new variant: the one where members were account ids and the one where they are now user ids. To help users adjust to new realities, we must provide methods for them to “migrate their data” to its latest form, not only giving control to users but also keeping everything tracked in verifiable transactions.
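To make the Org example concrete, here is a minimal sketch of version-tagged decoding, assuming a leading version byte selects the shape of the record. The types and the byte layout are simplified stand-ins, not the registry’s real codec.

```rust
// Illustrative sketch of keeping old chain state readable: a leading
// version byte tags which shape of `Org` follows. The types and the
// byte layout are simplified stand-ins, not the registry’s real codec.

use std::convert::TryInto;

type AccountId = u64; // placeholder for the real account id type
type UserId = u64;    // placeholder for the real user id type

#[derive(Debug, PartialEq)]
enum StoredOrg {
    /// Legacy layout: members stored as account ids.
    V1 { members: Vec<AccountId> },
    /// Current layout: members stored as user ids.
    V2 { members: Vec<UserId> },
}

/// Decode an org record, accepting both the old and the new layout.
fn decode_org(bytes: &[u8]) -> Option<StoredOrg> {
    let (&version, rest) = bytes.split_first()?;
    // Both layouts happen to encode members as little-endian u64s here.
    let members: Vec<u64> = rest
        .chunks(8)
        .map(|c| Some(u64::from_le_bytes(c.try_into().ok()?)))
        .collect::<Option<_>>()?;
    match version {
        1 => Some(StoredOrg::V1 { members }),
        2 => Some(StoredOrg::V2 { members }),
        _ => None, // unknown future layout: refuse rather than misread it
    }
}

fn main() {
    // A legacy record: version byte 1 followed by one account id (42).
    let mut old = vec![1u8];
    old.extend_from_slice(&42u64.to_le_bytes());
    assert_eq!(decode_org(&old), Some(StoredOrg::V1 { members: vec![42] }));

    // An unknown version byte is rejected instead of being misread.
    assert_eq!(decode_org(&[9u8]), None);
}
```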

This category might introduce breaking changes to the CLI. To cover this, we need to have our client ensure that the version of the runtime run by the node it is connected to is not greater than the one it supports. That means that state changes which impact the CLI require a new version of the CLI to be installed. Not all state changes will lead to this path though, as some types stored on-chain aren’t necessarily the types exposed to our client.

Changes to the CLI

Changes to the radicle-registry-cli live on the simpler side of the spectrum. We will provide new binary versions when appropriate via our releases page, which might require updates to our documentation. The docs will need to point out which versions of the CLI and node they refer to, making any observable deviations in behaviour explicit to our users. However, users won’t be checking the docs or the releases page all the time, so we are considering a friendly announcement from within the CLI itself to let you know that an update is available.

Changes to the node

Much of what applies to the CLI applies to the node as well. The one type of change to the radicle-registry-node that would require further consideration is a change to the consensus mechanism. Such a change would require a hard fork, however, which isn’t something we are considering for the foreseeable future.


Nice criteria, but I imagine that’s only if you have good confidence in the coverage in the first place. Does the registry use any code coverage tooling at the moment? I’ve briefly used cargo tarpaulin to ensure I was covering small parts of the radicle-surf library.

Is this in a similar vein to how Tezos updates their code and chain? Not that I actually know much on that front; I just have a shallow level of knowledge on the matter.

What does this mean in practice? Do their actions on the chain suddenly start getting rejected, and are they prompted to update?

Heh, so Registry and CoCo are thinking of similar problems. I was just discussing with @mmassi about data migration when it comes to code collaboration data. Maybe there’s room for the two teams to exchange ideas :slight_smile:

Finally, thanks for throwing your thoughts down here. Super helpful to get an insight into Registry’s work from a developer stand-point :heart: :seedling:


We are not using any code coverage tool at the moment. The tests this bit focuses on are the behavioural and integration tests, where the submission and output of transactions are tested, not so much the smaller unit tests. For instance, testing whether registering an org actually stores said new org. These tests must be in place to verify what is defined in the Registry specification.

Having this in mind, the idea with that criterion is that if a change is purely implementational within the runtime target, we shouldn’t need to add, remove, or adjust any tests.

Will have a look at tarpaulin, thanks :slight_smile:

First time I hear about Tezos, I believe. Will also check it out!

That’s a good question! Yes, some actions that previously worked might get rejected, accompanied by a useful error message. We have two upcoming examples:

  1. #397 introduces a new validation step when registering orgs, where the transaction author must have an associated registered user in the ledger. Users who are currently able to register orgs without having their key pair associated with a User in the ledger will fail to register new orgs until they establish that association.

  2. #397 changes the Org::members field from Vec<AccountId> to Vec<Id> (where Id is a user id). Users who have registered orgs before this change gets released will be offered a way to migrate their orgs to the latest version of the ledger. We could achieve that with a command that maps each AccountId to its associated UserId, if one is present (not guaranteed, given that’s a requirement only introduced in the also-upcoming change linked above). If all the members of the org have an associated user id, the migration can and will take place, should the user choose to do it, which is naturally incentivized since obsolete orgs could become unusable.
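A minimal sketch of the migration step in (2), assuming a lookup table from account ids to registered user ids; all names and types here are hypothetical stand-ins for the real ledger API:

```rust
// Hypothetical sketch of the migration in (2): an org’s member list
// migrates only if *every* member account id maps to a registered
// user id. Names and types are illustrative.

use std::collections::HashMap;

type AccountId = u64; // placeholder
type UserId = u64;    // placeholder

/// Returns the migrated member list, or `None` if any member has no
/// associated registered user, in which case migration must not run.
fn migrate_members(
    members: &[AccountId],
    registered_users: &HashMap<AccountId, UserId>,
) -> Option<Vec<UserId>> {
    members
        .iter()
        .map(|account| registered_users.get(account).copied())
        .collect()
}

fn main() {
    let mut users = HashMap::new();
    users.insert(1, 100); // account id 1 is registered as user id 100
    users.insert(2, 200);

    // Every member has an associated user: migration can take place.
    assert_eq!(migrate_members(&[1, 2], &users), Some(vec![100, 200]));
    // Member 3 has no associated user: the org cannot migrate yet.
    assert_eq!(migrate_members(&[1, 3], &users), None);
}
```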

Heh, let’s do that, absolutely. @igor has pointed out serde adjacently tagged enums as a way to tackle the storage side of things.

It’s delightful to hear that, thanks! :yellow_heart: :weight_lifting_man:

This is a somewhat surprising conclusion, as you said earlier that the chain state must evolve in a backwards-compatible way. I think the problem is not really blockchain-specific: during an upgrade, one needs to be prepared to handle both the old and the new version for some window of time. If that’s the case, then all components need to be forward-compatible as well — up to a limit perhaps, after which one may want to force people into upgrading their software. But that’s ideally controlled by policy, not the software suddenly throwing errors.

Yaaa, one of the many issues with serde is that those tags are all strings — this means that a rename (or a typo) can break things in unexpected ways, and it’s hard to guard against this. Consider that there may exist third-party clients at some point.

The other problem I see is that sum types (enums) are not so obvious to evolve in a forward-compatible way: one needs to maintain a variant for the unknown case, with enough information attached to provide meaningful error messages. I don’t think serde supports this without manual impls.
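Hand-rolled (i.e. without serde), the pattern being described might look like the following; the message shapes and tag strings are made up for illustration:

```rust
// Sketch of forward-compatible decoding with an explicit catch-all:
// an unrecognized tag is kept around instead of being a hard error,
// so callers can produce a meaningful message.

#[derive(Debug, PartialEq)]
enum Message {
    RegisterOrg { org: String },
    UnregisterOrg { org: String },
    /// Catch-all for tags this client doesn’t know yet; the raw tag
    /// is preserved so an error can say *what* wasn’t understood.
    Unknown { tag: String },
}

fn parse(tag: &str, payload: &str) -> Message {
    match tag {
        "register-org" => Message::RegisterOrg { org: payload.to_string() },
        "unregister-org" => Message::UnregisterOrg { org: payload.to_string() },
        other => Message::Unknown { tag: other.to_string() },
    }
}

fn main() {
    // A newer peer sent a variant we don’t know; fail informatively.
    let msg = parse("freeze-org", "monadic");
    assert_eq!(msg, Message::Unknown { tag: "freeze-org".to_string() });
}
```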


We are on the same page here. Note that it says “… is not greater than the one it supports”. Say client v0 depends on runtime v0; we might manage to keep it compatible with runtime v1, v2, v3, etc. However, there might come a time when that “window of time” runs out, at which point the CLI will need to be updated to run the latest client and therefore be compatible with the latest running nodes in the network.

Can you expand?

Having been a Java developer many years back, I understand the pain you are alluding to rather well (cough Spring Boot cough). However, in this case, we would be (de)serializing raw content from disk. How could we be type-safer?

I mean that there is a value (say min_client_version or whatever) which the client reads from the node, and if that’s greater than the client’s version, the client will print a friendly ASCII cow and otherwise refuse to function. Until then, it does what it used to do, but may be able to hand through deprecation warnings to the user.

This way, the chain state decides when exactly the time has come.
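A minimal sketch of that gate, with `min_client_version` being the made-up name from the discussion for the value the client reads from the node:

```rust
// Sketch of the policy-driven gate described above: the node exposes
// a minimum supported client version, and the client compares it to
// its own before doing anything else.

fn client_gate(client_version: u32, min_client_version: u32) -> &'static str {
    if client_version < min_client_version {
        // This is where the friendly ASCII cow would be printed.
        "refuse: please upgrade your client"
    } else {
        // Proceed as usual, possibly surfacing deprecation warnings.
        "proceed"
    }
}

fn main() {
    assert_eq!(client_gate(3, 2), "proceed");
    assert_eq!(client_gate(1, 2), "refuse: please upgrade your client");
}
```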

We can’t be type-safer, but we can be speling-safer by using a numeric tagging system like protocol buffers and its descendants (along with their schema evolution conventions, of course).

Yes, that is a good switch of responsibilities. Thanks for the input!

Thanks, we will keep this in mind.

Thanks for posting this write-up. I have a couple of comments and questions to clarify our understanding.

I’d consider the definition the following: A change is an implementation change if executing the runtime before the change and after the change on any block results in the same state. In addition to thorough code reviews, unchanged tests, and an unchanged CLI, we can also validate that a change is an implementation change by running it alongside the Wasm “reference implementation” or other native versions for a while.
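The “run it alongside the reference implementation” check could be sketched like this, modelling a runtime as a plain state-transition function. Everything here is a simplified stand-in for the real execution machinery.

```rust
// Sketch of the validation idea: replay blocks through two runtime
// builds and check they produce identical state after every block.
// A “runtime” is modelled as a plain state-transition function.

type State = u64;
type Block = u64;

fn same_behaviour(
    run_a: impl Fn(State, Block) -> State,
    run_b: impl Fn(State, Block) -> State,
    blocks: &[Block],
) -> bool {
    let (mut a, mut b) = (0 as State, 0 as State);
    for &block in blocks {
        a = run_a(a, block);
        b = run_b(b, block);
        if a != b {
            return false; // builds diverge: not an implementation change
        }
    }
    true
}

fn main() {
    let v1 = |state: State, block: Block| state.wrapping_add(block);
    // A pure refactor: different code, same behaviour.
    let v1_refactored = |state: State, block: Block| block.wrapping_add(state);
    // A logic change: behaviour diverges.
    let v2 = |state: State, block: Block| state.wrapping_add(block * 2);

    assert!(same_behaviour(v1, v1_refactored, &[1, 2, 3]));
    assert!(!same_behaviour(v1, v2, &[1, 2, 3]));
}
```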

What is “that point” at which the on-chain wasm runtime and the native version are the same? Because after releasing a new node with an implementation change to the runtime the native version will be different from the on-chain runtime until the latter has been updated.

Could you expand on this in light of my comment above? Is the new on-chain runtime different in behavior or is it just an implementation update?

This sounds like logic changes are a subset of implementation changes but this is not correct. The logic changes you describe are distinct from implementation changes because they indeed affect the behavior.

CLI version compatibility

@kim raised some good points regarding forwards compatibility. I think it applies more to the app in the future than the CLI, though. The CLI is intended foremost as a developer and experimentation tool. Especially in the early stage it is easier for us to manage a simpler compatibility strategy for the CLI.

Thank you as well :slight_smile:

As pointed out here, this criterion can be fallacious, given that it is virtually impossible to actually test it against every possible block.

By “that point” I mean when the node is freshly installed.

Well, the idea was to pick up the latest WASM runtime independently of the type of change it introduces. It could be made optional for implementation changes, but it would have to be mandatory for logic and state changes.

Right, poor wording, I didn’t mean to suggest that. I will update the original.

Thanks for your input!

At that point, the native runtime version will not be the same as the on-chain Wasm runtime.

I think the question is: do we want to upgrade the on-chain runtime if we change the implementation? And if the implementation version is the same for on-chain and native, which one do we run? We definitely need to run the updated runtime for changes to the semantics (logic and state changes).

Ah! There I had a wrong understanding. I was under the assumption that the new wasm would be generated when building the node binary, although now that I read this, that would not make any sense, while using the latest wasm on-chain does. Thanks!

I guess the question you are pondering is: why do we need to submit a new runtime containing only implementation changes if, in practice, there is no added value to the network, since said new version would do exactly the same as the one already running?

I’d say that continuous integration is a reason to release implementation changes. Say we keep merging changes of that nature to master for a month, and then we finally merge a semantic change and release that. Now the risk of integrating it is much higher, and it is more difficult to roll back or pinpoint the cause if something goes wrong.

So, in principle, the network doesn’t per se benefit from new implementation changes being released, but it doesn’t hurt either, and it helps us keep a healthy, less error-prone CI workflow in place.