CLI-runtime compatibility

geigerzaehler · June 2, 2020, 10:25am

radicle-registry-cli is linked against a specific version of radicle-registry-client which is in turn linked against a specific version of the registry runtime. For users of the CLI this leads to compatibility issues if the on-chain runtime of the network the CLI interacts with does not match the CLI’s runtime version.

With respect to the compatibility we want to achieve the following:

1. The CLI should be compatible with as many runtime versions as possible
2. The CLI should give meaningful errors if it is incompatible with the runtime

(The same goals can be formulated for the client library. For now we’ll concentrate on the CLI which will tell us more about the requirements for the client library. We can then collect more requirements from the application team and design a solution for client-runtime compatibility.)

The contract between the CLI and the runtime consists of the transaction types and the storage types (and their storage mechanism). Any change to these may result in incompatibility between the CLI and the runtime. I’ll provide more details and examples on this in a follow-up post.

As a baseline solution I suggest we tackle (2) very narrowly first. This solves the immediate problem that incompatibility is currently not properly detected and may result in the CLI working with garbage data or raising unrelated errors. Once we learn more about the compatibility we can extend the compatibility range. Concretely, I suggest that the CLI will ask the node what runtime version is active and abort if the spec version is different from the version the CLI was compiled with.
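A minimal sketch of this baseline check, assuming the spec version reported by the node is already available as a `u32`. The constant `COMPILED_SPEC_VERSION`, the error type `IncompatibleRuntime`, and the function `check_spec_version` are hypothetical names used for illustration and are not part of the CLI or the client library.

```rust
use std::fmt;

/// Spec version of the runtime this CLI was compiled against.
/// Hypothetical value; a real build would take it from the linked runtime crate.
pub const COMPILED_SPEC_VERSION: u32 = 10;

/// Error raised when the on-chain runtime differs from the compiled-in one.
#[derive(Debug, PartialEq, Eq)]
pub struct IncompatibleRuntime {
    pub compiled: u32,
    pub on_chain: u32,
}

impl fmt::Display for IncompatibleRuntime {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "incompatible runtime: CLI was built for spec version {}, but the node runs spec version {}",
            self.compiled, self.on_chain
        )
    }
}

impl std::error::Error for IncompatibleRuntime {}

/// Baseline policy: accept only an exact match of the spec version.
/// `on_chain` is the version reported by the node; fetching it is out of scope here.
pub fn check_spec_version(on_chain: u32) -> Result<(), IncompatibleRuntime> {
    if on_chain == COMPILED_SPEC_VERSION {
        Ok(())
    } else {
        Err(IncompatibleRuntime {
            compiled: COMPILED_SPEC_VERSION,
            on_chain,
        })
    }
}
```

The CLI would call `check_spec_version` once before running any command and print the error message on failure instead of continuing with data it may not be able to decode.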

igor · June 2, 2020, 1:57pm

We’ve discussed the runtime update matter and we came up with a few conclusions.

When a client compiled for runtime version N tries to talk to a node with runtime version N+1:

Sending any messages is dangerous (even if the data schema is the same, the semantics can change)
Receiving state data is a mixed bag. The old state format will be understood well, but the newer versions will fail to decode, leading to an incomplete view of the global state.

This leads to the conclusion that a client should agree to communicate only with a runtime it knew at compile time, to prevent silent failures. We need two ways of checking compatibility:

The client should verify the runtime version
The runtime should verify the transaction version. Restoration of the CheckVersion extension is crucial!

There’s a huge problem with the runtime version transition. In order for the client to smoothly go through the update it should support both the old and the new one. This leads to two problems:

There is a need to support two separate data formats, which may lead to manually copy-pasting big chunks of old versions of substrate and the runtime.
The application-facing API of the client library must somehow account for both versions of the runtime. It may cause an implicit alteration of the contract at the moment of the update if the semantics of messages have changed. It will also cause a breaking change in the function and structure signatures if the content of the message is changed.

But how important is it to provide a smooth update? How often is the runtime going to be updated after the release of the mainnet? The scope and functionality of the registry is well defined and quite limited, so it may happen extremely rarely, like once every few years or just never.

igor · June 2, 2020, 5:32pm

A followup discussion resulted in a proposal of how the client library can support a set of runtime versions.

First of all the client must fetch the runtime version from the runtime and check if it’s supported. If so, this version should be copied into every transaction to satisfy the CheckVersion extension.
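A rough sketch of that first step, under the assumption that the fetched runtime version is a plain `u32`. `SUPPORTED_SPEC_VERSIONS`, `Session`, and `Transaction` are hypothetical stand-ins, not the client library's or Substrate's actual types; the version field only illustrates the value a runtime-side check such as CheckVersion would compare.

```rust
/// Runtime spec versions this client build understands (hypothetical values).
pub const SUPPORTED_SPEC_VERSIONS: &[u32] = &[10, 11];

/// Simplified stand-in for a transaction; not the Substrate type.
#[derive(Debug, PartialEq, Eq)]
pub struct Transaction {
    /// Version the transaction was built for. A runtime-side version check
    /// would reject the transaction if this does not match the active runtime.
    pub spec_version: u32,
    pub payload: Vec<u8>,
}

/// Client session bound to the runtime version observed on the node.
#[derive(Debug, PartialEq, Eq)]
pub struct Session {
    spec_version: u32,
}

impl Session {
    /// `on_chain` is the version fetched from the node; fetching is out of scope here.
    pub fn connect(on_chain: u32) -> Result<Session, String> {
        if SUPPORTED_SPEC_VERSIONS.contains(&on_chain) {
            Ok(Session { spec_version: on_chain })
        } else {
            Err(format!(
                "runtime spec version {} is not supported by this client",
                on_chain
            ))
        }
    }

    /// Copies the observed runtime version into every transaction.
    pub fn build_transaction(&self, payload: Vec<u8>) -> Transaction {
        Transaction {
            spec_version: self.spec_version,
            payload,
        }
    }
}
```

Binding the version to a session object means the application cannot build a transaction without having passed the support check first.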

If there are no changes in the subset of the API touched by the client, that’s it.

If the changes can be accounted for without being observable in the client library API, the client should do that automatically and shield the application from this knowledge.

If there are changes to the storage, the client will be able to decode all the versions supported by a runtime against which it was compiled. In practice it means all versions that were ever used. The application may receive data in a converted form:

in a specific version (newest?)
in a custom specialized structure
in an enum able to contain any version (the least friendly solution)
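To illustrate the first and the last option together, here is a sketch of an enum that can hold any stored version plus a conversion to the newest one. The `ProjectV1` and `ProjectV2` layouts are invented for the example and do not reflect the registry's actual state types; the conversion rule is likewise an assumption.

```rust
/// Hypothetical project state as stored by an older runtime.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProjectV1 {
    pub name: String,
    pub description: String,
}

/// Hypothetical newer layout: `description` was removed, `metadata` was added.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProjectV2 {
    pub name: String,
    pub metadata: Vec<u8>,
}

/// Enum able to contain any version found in the state.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StoredProject {
    V1(ProjectV1),
    V2(ProjectV2),
}

impl StoredProject {
    /// Converts whatever version was read into the newest one.
    pub fn into_latest(self) -> ProjectV2 {
        match self {
            StoredProject::V1(old) => ProjectV2 {
                name: old.name,
                // Assumption for illustration: the old description is
                // carried over as metadata bytes.
                metadata: old.description.into_bytes(),
            },
            StoredProject::V2(current) => current,
        }
    }
}
```

An application that only cares about the newest shape calls `into_latest`; one that needs to distinguish versions can match on `StoredProject` directly.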

If there are changes to data carried in a message, the client library API can accept this data in a different form, which will be then converted into a structure expected by the runtime:

in a specific version (newest?)
in a custom structure containing all the details needed to construct every supported version
in a structure containing every version in a separate field

The last option is especially useful when there are semantic changes. The application can then construct completely different data for each runtime version.
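A sketch of that last option, assuming two supported runtime versions. All type names, fields, and the version numbers 10 and 11 are hypothetical; the point is only that the application supplies one value per version and the client picks the one matching the runtime it observed.

```rust
/// Hypothetical message payload as expected by runtime version 10.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SetCheckpointV10 {
    pub project: String,
    pub checkpoint: u64,
}

/// Hypothetical payload for runtime version 11, which needs an extra field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SetCheckpointV11 {
    pub project: String,
    pub checkpoint: u64,
    pub reason: String,
}

/// Input accepted by the client API: every supported version in its own field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SetCheckpointInput {
    pub v10: SetCheckpointV10,
    pub v11: SetCheckpointV11,
}

/// The message actually sent, tagged with the version it was built for.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SetCheckpointMessage {
    V10(SetCheckpointV10),
    V11(SetCheckpointV11),
}

/// Picks the payload matching the observed runtime version.
/// Returns `None` for a version this client does not support.
pub fn select_message(
    spec_version: u32,
    input: SetCheckpointInput,
) -> Option<SetCheckpointMessage> {
    match spec_version {
        10 => Some(SetCheckpointMessage::V10(input.v10)),
        11 => Some(SetCheckpointMessage::V11(input.v11)),
        _ => None,
    }
}
```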

If there are changes to the set of available messages, the client library API must expose the last observed runtime version. The application can then alter its functionalities based on what’s possible to be done with the runtime.

geigerzaehler · June 3, 2020, 1:17pm

There are two subtleties regarding the compatibility we need to keep in mind. First, a transaction message (or “call” in substrate terminology) is represented by the Call enum. This enum and its variants are generated by the decl_module! and construct_runtime! macros. The client code constructs these Call values from the message types that the client defines. For example

runtime::Call(registry::Call::set_checkpoint(params))

If a runtime update now adds or removes a message handler, the Call enum variants change, and with them possibly the tag that identifies a variant when it is encoded. For example, whereas the set_checkpoint variant is encoded with tag 4 in version 10 of the runtime, it may be encoded with tag 3 in version 11 of the runtime. This means that the client needs to take this into account when it is linked against version 11 of the runtime but submits transactions to version 10 of the runtime.
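The tag problem can be sketched as a lookup keyed by runtime version. This is a simplification under stated assumptions: the actual encoding of a call also involves a module index and SCALE-encoded parameters, `RegistryCall`, `call_tag`, and `encode_call` are hypothetical names, and the tag for `RegisterProject` is invented. Only the tags 4 and 3 for set_checkpoint come from the example above.

```rust
/// Hypothetical, simplified set of registry calls.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegistryCall {
    RegisterProject,
    SetCheckpoint,
}

/// Returns the variant tag a given runtime version uses for `call`,
/// or `None` if the version is unknown to this client.
pub fn call_tag(spec_version: u32, call: RegistryCall) -> Option<u8> {
    match (spec_version, call) {
        // Invented tag, assumed stable across both versions.
        (10, RegistryCall::RegisterProject) | (11, RegistryCall::RegisterProject) => Some(0),
        // Tags taken from the example in the post.
        (10, RegistryCall::SetCheckpoint) => Some(4),
        (11, RegistryCall::SetCheckpoint) => Some(3),
        _ => None,
    }
}

/// Prefixes already-encoded parameters with the version-specific tag.
pub fn encode_call(spec_version: u32, call: RegistryCall, params: &[u8]) -> Option<Vec<u8>> {
    let tag = call_tag(spec_version, call)?;
    let mut encoded = Vec::with_capacity(1 + params.len());
    encoded.push(tag);
    encoded.extend_from_slice(params);
    Some(encoded)
}
```

The same parameters produce different bytes depending on the target runtime, which is why the client cannot rely on the tags of the runtime it was linked against.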

The second point to consider with respect to the interface between client and runtime is transaction events. Events are a special kind of state. They are stored in the state after a block is executed and cleared when the execution of a new block starts. In particular, events are only ever written by the runtime and, unlike other state, we don’t need a mechanism for reading old events in the runtime. In fact, how event types are declared and used in substrate is very different from other state, and the policies for state compatibility don’t apply.

However, a client needs to decode and handle two versions of events if it wants to support two runtime versions. How we solve this in practice remains to be seen. Since substrate recently changed the event type we need to tackle this in #463.

igor · June 3, 2020, 6:51pm

I’m not sure if the Call enum discriminants update is a separate problem. If we are able to update the content of the Call enum and use both versions interchangeably, we’ll need both of them defined as separate types in the client library anyway. The discriminants problem will then either be solved the same way or will solve itself as a side effect.

igor · June 3, 2020, 7:11pm

I don’t think that we’ve discussed it, but the multiple versions of Call and other RPC DTOs may come for free in terms of labor. Cargo allows importing different versions of the same crate. The client library should be able to import both the current and a previous version of the runtime and use them in parallel by renaming one of them. The old version would have to be fetched from git, e.g. by a tag. The downside: compilation time and binary size, especially if they use different versions of substrate.

kim · June 6, 2020, 1:17pm

There might be reasons why this is impractical, but have you considered versioning datatypes and RPCs separately?

Iiuc, any datatype which ends up in permanent storage must be backwards as well as forwards compatible (ie. fields can only be added, never removed, and must be optional). RPCs may or may not be able to handle changed payload schemas, or the payload stays the same, but the semantics have changed. By introducing a new RPC whenever compatibility is not-so-obvious, old clients cannot (easily) trigger that code path, and the old RPC implementation (in the new version) becomes the translation layer, if possible, or returns an error.

igor · June 7, 2020, 8:07pm

have you considered versioning datatypes and RPCs separately?

Oh yes, that was one of the first things to consider. Unfortunately the RPC relies heavily on the data types. We can’t update an important data type without making this change visible on the RPC.

Iiuc, any datatype which ends up in permanent storage (…)

That’s more or less what we’re aiming for, the RPC is versioned with a runtime. Additionally the state data types inside must be versioned separately by wrapping in an enum to allow fully breaking changes like removal of a field. The clients can then reuse the old paths for the old enum variants and add new paths (or translation to old ones) for the new ones.

geigerzaehler · June 8, 2020, 7:42am

@igor Could we spin your comment off into a separate topic about testing? I will answer there then.

igor · June 8, 2020, 8:03am

Sure, it’s here

geigerzaehler · June 8, 2020, 8:26am

As @igor mentioned, we did consider this. I can shed more light onto why this doesn’t solve all issues and is impractical.

The RPC API of a substrate node has no knowledge about the runtime or the runtime types. Specifically, it does not try to interpret the data stored in the chain state. The only method that the RPC API provides with respect to the state is something like get_value(key: Vec<u8>) -> Vec<u8> which reads the raw data stored under a key in the state. This means that any translation into domain objects (as Rust data types) happens on the client side. This also means that the contract of the RPC API doesn’t change with a runtime update.
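As a rough illustration of where the decoding happens, the sketch below models the node as something that only returns raw bytes and puts the interpretation in a client-side function. The trait `RawState`, the in-memory `FakeNode`, and the assumed layout (a `u64` stored as 8 little-endian bytes) are all hypothetical; the signature also differs slightly from the `get_value` above by using slices and an `Option` for missing keys.

```rust
use std::collections::HashMap;
use std::convert::TryInto;

/// Stand-in for the node's state RPC: raw key in, raw bytes out.
/// The node does not interpret the value.
pub trait RawState {
    fn get_value(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// In-memory fake node, used only for illustration.
pub struct FakeNode(pub HashMap<Vec<u8>, Vec<u8>>);

impl RawState for FakeNode {
    fn get_value(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.0.get(key).cloned()
    }
}

/// Client-side decoding of a domain value from raw state bytes.
/// Assumed layout: a u64 stored as exactly 8 little-endian bytes.
pub fn read_checkpoint_id(node: &impl RawState, key: &[u8]) -> Result<Option<u64>, String> {
    match node.get_value(key) {
        None => Ok(None),
        Some(raw) => {
            let bytes: [u8; 8] = raw
                .as_slice()
                .try_into()
                .map_err(|_| format!("expected 8 bytes, got {}", raw.len()))?;
            Ok(Some(u64::from_le_bytes(bytes)))
        }
    }
}
```

If a runtime update changes the layout, `get_value` keeps working unchanged and only the decoding function on the client side has to learn the new format.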

We looked at this and the devil is in the details with substrate and in a decentralized setup. With this approach the node binary would include code that provides an RPC API that exposes domain objects, that is transactions and state objects.

The first issue is that a node’s RPC API stops working if the runtime is updated to a newer version that the node does not know yet (that is it does not have a translation layer yet.) To see this consider the transaction case. If the runtime is updated the node does not know how to translate an old transaction to a new transaction. (Something similar holds true for retrieving state objects. Although there it is still possible for the node to detect whether it is reading old state that it knows to interpret.) As a consequence the RPC APIs of nodes become effectively disabled if a node is not upgraded.

The second issue is the additional complexity in introducing and maintaining this translation layer. We would need to create a completely new set of RPC APIs and deal with all the infrastructure for that. Then we would need to consider translating between any RPC version and any older runtime version (i.e. n*m/2).

Compared to what @igor proposed the only difference I see with this approach is that the compatibility is handled on the node side as opposed to the client side. Given all the issues and cost associated with the node-side approach we need to get good value out of that approach. But as far as I understand it the only problem it solves is that consumers don’t need to update their software to work with a newer version of the runtime. And here I’m actually ok with forcing users to upgrade if their software becomes outdated.

kim · June 10, 2020, 8:06am

Thanks for the explanation. I can guess the reasoning for this design (of Substrate) — anchoring the runtime version on-chain is always unambiguous — but still find it a bit disappointing. It’s not solving a problem, but moving it somewhere else.

But ok. Just to understand this better: is your approach for the clients to force a software update via the native package manager at a specific block height, or have some kind of interpreted mode (just like Substrate itself), or both?

geigerzaehler · June 10, 2020, 11:40am

For now we plan on implementing the following compatibility policy: Assume we plan to update the runtime from, say, version 5 to version 6. We will announce the date when the runtime update is submitted to the block chain and takes effect. (Runtime updates don’t happen at a predetermined fixed block height but when the update transaction is submitted.) Before this update date we will release a version of the client that is compatible with version 5, which is currently running on the chain, and the upcoming version 6. This gives users of the client library (at the moment only upstream) time to update the client and be prepared for the runtime update. When the update takes effect, old versions of the client that were compatible with runtime versions 4 and 5 will stop working. They will return an error which can be shown to the user by the client consumer, in our case upstream.

For the medium term we think this strategy is sufficient since the only software that uses the client is upstream. Wider compatibility in the client would only provide benefits if the compatibility is also achieved in upstream. Because of the number and size of the changes this seems unlikely.