Runtime updates: workflow and testing

igor · June 8, 2020, 8:02am

I’ve come up with a set of rules for developers to make the update process smooth. It’s tightly tied to this comment because it also touches on testing, and it slightly expands this design.

We introduce ffnet.wasm, which contains the current runtime published to the ffnet. It can either be stored in the repository or be provided by a script fetching it from the commit containing the last release, but this could affect the developer experience. We also introduce previous_spec.wasm, which contains the last implementation of the previous runtime spec. It probably can’t easily be fetched dynamically, so it must be stored in the repository.

The node always runs the WASM runtime to make its behavior more predictable no matter which network it’s running on. The only exception is local dev, which always uses native.

Every change to the runtime must result in an update of latest.wasm. If there’s a specification change, that update must additionally be preceded by copying latest.wasm into previous_spec.wasm. All spec changes between two ffnet releases should be clustered into a single version bump and a single previous_spec.wasm update.
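The file-handling rule above could be scripted roughly as follows. This is a minimal sketch under stated assumptions. The helper name `record_runtime_change` and its `spec_already_bumped` flag are hypothetical. The caller is assumed to know whether the change touches the spec and whether this release cycle has already bumped it.

```python
import shutil
from pathlib import Path


def record_runtime_change(runtime_dir: Path, new_wasm: Path, spec_changed: bool,
                          spec_already_bumped: bool = False) -> None:
    """Hypothetical helper: update the runtime blobs after a runtime change.

    On the first spec change since the last ffnet release, latest.wasm is
    preserved as previous_spec.wasm. Later spec changes in the same cycle
    are clustered into that bump and leave previous_spec.wasm alone.
    """
    latest = runtime_dir / "latest.wasm"
    previous_spec = runtime_dir / "previous_spec.wasm"
    if spec_changed and not spec_already_bumped:
        # Keep the last implementation of the outgoing spec.
        shutil.copyfile(latest, previous_spec)
    # Every runtime change, spec or implementation, produces a new latest.wasm.
    shutil.copyfile(new_wasm, latest)
```

The copy to previous_spec.wasm happens before latest.wasm is overwritten. Reversing the order would lose the previous spec's last implementation.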

The CI runs the e2e tests three times: once with latest.wasm, once with previous_spec.wasm and once with ffnet.wasm. This ensures that master can always be published to replace the current ffnet runtime after either an implementation change in the runtime (compatible with the previous spec and the current ffnet) or a specification change (compatible with the current ffnet).
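The three-runtime CI run, as originally proposed, could look like the following matrix. This is a hypothetical sketch: the test entry point `./scripts/run-e2e-tests` and its `--runtime` flag are assumptions, not part of the project. The runner is injectable so the matrix logic can be exercised without a node.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Sequence

# The three runtime blobs the proposal runs the e2e suite against.
RUNTIMES = ("latest.wasm", "previous_spec.wasm", "ffnet.wasm")


def run_e2e_with_script(wasm: Path) -> bool:
    # Hypothetical entry point: start a node with this runtime blob and run
    # the e2e suite against it; a zero exit code means the suite passed.
    result = subprocess.run(["./scripts/run-e2e-tests", "--runtime", str(wasm)])
    return result.returncode == 0


def run_e2e_matrix(runtime_dir: Path,
                   run_tests: Callable[[Path], bool] = run_e2e_with_script,
                   runtimes: Sequence[str] = RUNTIMES) -> Dict[str, bool]:
    """Run the e2e suite once per runtime blob and collect pass/fail results."""
    return {name: run_tests(runtime_dir / name) for name in runtimes}


def master_is_publishable(results: Dict[str, bool]) -> bool:
    # master may replace the ffnet runtime only if every run is green.
    return all(results.values())
```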

Some old e2e tests are guaranteed to fail with the new runtime, and some new tests will fail with the old runtime, due to breaking changes. To cope with that, some tests need to start with a runtime version check. If the version is one known to be incompatible, the test should terminate immediately with a success.
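One way to express such a version guard is a decorator. This is a hypothetical sketch: `skip_on_runtime_versions` and the `get_spec_version` callback are illustrative names. In practice the callback would ask the running node for its spec version, for instance through Substrate’s `state_getRuntimeVersion` RPC.

```python
import functools
from typing import Callable, Iterable


def skip_on_runtime_versions(incompatible: Iterable[int],
                             get_spec_version: Callable[[], int]):
    """Hypothetical decorator: end a test early, as a success, on known-incompatible runtimes."""
    skip_versions = frozenset(incompatible)

    def decorator(test: Callable[[], None]) -> Callable[[], None]:
        @functools.wraps(test)
        def wrapper() -> None:
            if get_spec_version() in skip_versions:
                # Returning without raising means the test counts as passed.
                return
            test()
        return wrapper
    return decorator
```

A success, rather than a "skipped" status, is used here because the proposal asks incompatible tests to terminate with a success.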

The devnet is updated whenever there’s a merge to master. The CI artifacts of the node and the CLI are considered canonical on the devnet. If there’s a change in the runtime, it must be published on the network immediately. This may require quickly updating the node and the clients, but that’s to be expected on the devnet and shouldn’t slow down development.

The ffnet is updated manually. Each update must be marked with a git tag and a GitHub release. If we decide to store ffnet.wasm in the repository, latest.wasm must be copied to ffnet.wasm.

If there’s only an implementation change in the runtime since the last published version, it can be published on the network immediately. Otherwise, the change must be announced to the community to give it time to upgrade, and it can be published on the network only after at least a week.
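The publication rule reduces to a small date calculation. This is a minimal sketch; the function name is hypothetical and the one-week period is taken from the rule above.

```python
from datetime import date, timedelta
from typing import Optional

# Minimum time between announcing a spec change and publishing it on the ffnet.
ANNOUNCEMENT_PERIOD = timedelta(weeks=1)


def earliest_publish_date(spec_changed: bool, announced_on: Optional[date],
                          today: date) -> Optional[date]:
    """Hypothetical helper: earliest date a runtime update may go to the ffnet.

    Returns None when a spec change has not been announced yet.
    """
    if not spec_changed:
        # Implementation-only changes can be published immediately.
        return today
    if announced_on is None:
        # Spec changes must be announced to the community first.
        return None
    return max(today, announced_on + ANNOUNCEMENT_PERIOD)
```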

geigerzaehler · June 8, 2020, 1:53pm

Thanks for the post, @igor. I’d like to add a bit more context and detail to this.

First, while you talk about specific files in the code base, we intend to replace them with a setup where we use build artifacts to get previous runtime blobs instead of checking them into the code (GH-492).

Now, we try to achieve two things with this proposal.

1) The client is compatible with the previous spec version of the runtime (in addition to the runtime version it is built against).
2) The runtime in the code is compatible with the runtime deployed on the ffnet.

On 1), it is important to point out that for state objects, compatibility is strictly enforced by the planned implementation. Since we reuse the runtime definitions of the state objects, and the runtime needs to be compatible with all previous storage versions, we get this for free. So 1) only provides a benefit for transaction compatibility. However, this might also become redundant if we enforce it in the implementation too. This might be possible if we also link the client against the previous definition of the runtime types, as discussed in this topic.

What 2) means requires a bit more unpacking. If the spec version of the runtime on the ffnet differs from the current runtime spec version in the code, then “compatibility” means the same as in 1). If only the implementation version differs (and the spec version is the same), then “compatibility” means that both versions are observationally equivalent: both runtimes result in the same state changes when run on any given block. However, the proposal merely checks that the tests pass against the runtime on the ffnet. This seems redundant, since that runtime was already tested in a previous commit as the “latest”, or current, runtime.
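Observational equivalence is a stronger property than "the tests pass". The following toy model illustrates it, under heavy simplifying assumptions. A runtime is modelled as a pure function from (state, block) to the next state, where a block is just a list of storage keys to increment. All names here are hypothetical. The check runs both runtimes over the same block sequence and compares the resulting states. Because it can only sample blocks, a positive result is evidence, not proof.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

State = Dict[str, int]
Block = List[str]  # toy model: each entry increments one storage key
Runtime = Callable[[State, Block], State]


def observationally_equivalent(a: Runtime, b: Runtime,
                               genesis: State, blocks: Iterable[Block]) -> bool:
    """Check that two runtimes produce identical state changes on every sampled block."""
    state = dict(genesis)
    for block in blocks:
        next_a, next_b = a(state, block), b(state, block)
        if next_a != next_b:
            return False
        state = next_a  # both agree, so continue from the shared state
    return True


def impl_loop(state: State, block: Block) -> State:
    new = dict(state)
    for key in block:
        new[key] = new.get(key, 0) + 1
    return new


def impl_counter(state: State, block: Block) -> State:
    # Different implementation, same observable behaviour as impl_loop.
    new = dict(state)
    for key, n in Counter(block).items():
        new[key] = new.get(key, 0) + n
    return new


def impl_buggy(state: State, block: Block) -> State:
    new = dict(state)
    for key in set(block):  # silently drops repeated transactions
        new[key] = new.get(key, 0) + 1
    return new
```

`impl_buggy` only diverges on blocks with repeated keys, which shows why a test suite that never produces such blocks would still pass against it.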

With that in mind, I would suggest we do not run the tests against the ffnet runtime.

igor · June 8, 2020, 2:29pm

Not testing against the ffnet runtime is fine with me. It has many small downsides and only one small upside: testing against exactly the runtime that is in the wild, which is quite paranoid, to be honest.

igor · June 8, 2020, 7:53pm

After deeper consideration, I think we should at least test against the previous implementation version to ensure that we didn’t accidentally introduce a breaking change.

geigerzaehler · June 10, 2020, 6:48am

I’m wondering what we get out of this. The previous implementation was at one point the latest one, and at that point it was tested and shown to work. If we simply test it again, we just verify what we already know. Maybe I’m missing something?

igor · June 10, 2020, 7:19am

It’s not enough to prove that the new implementation works; we must also ensure that it works exactly the same as the previous one. If we accidentally introduce a breaking change and mindlessly adjust the client to account for it, the tests will keep passing. But if we try to use this adjusted client with the previous implementation, it will fail, which is exactly what we want.

Think about this: we have a Spec1Impl1 (S1I1) runtime. We redesign the state and release S2I1. Next we upgrade Substrate, which introduces a breaking change to the transaction weight type. We haven’t had our coffee yet, so we release it as S2I2. Our e2e tests check that we’re compatible with S1I1 and S2I2, so they are all green! If we had tested against S2I1, we would have gotten a well-deserved failure.
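The scenario can be made concrete by computing which runtimes the suite would run against, with and without the extra "previous implementation of the current spec" target. This is a hypothetical sketch. `RuntimeVersion` here is an illustrative tuple, not Substrate’s type. The release history is assumed to be ordered oldest to newest, with the last entry being the runtime in the code.

```python
from typing import List, NamedTuple


class RuntimeVersion(NamedTuple):
    spec: int
    impl: int

    def __str__(self) -> str:
        return f"S{self.spec}I{self.impl}"


def select_test_runtimes(releases: List[RuntimeVersion],
                         include_previous_impl: bool) -> List[RuntimeVersion]:
    """Pick the runtimes the e2e suite should run against (hypothetical helper)."""
    current = releases[-1]
    targets = [current]
    # Last implementation of the previous spec.
    older_specs = [r for r in releases if r.spec < current.spec]
    if older_specs:
        targets.append(older_specs[-1])
    if include_previous_impl:
        # Previous implementation of the current spec: this catches a breaking
        # change released under an implementation-only version bump.
        same_spec = [r for r in releases[:-1] if r.spec == current.spec]
        if same_spec:
            targets.append(same_spec[-1])
    return targets


history = [RuntimeVersion(1, 1), RuntimeVersion(2, 1), RuntimeVersion(2, 2)]
print([str(r) for r in select_test_runtimes(history, include_previous_impl=False)])
print([str(r) for r in select_test_runtimes(history, include_previous_impl=True)])
```

Without the extra target, the suite only covers S2I2 and S1I1, so the broken S2I2 goes unnoticed. Adding S2I1 is what surfaces the failure.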

geigerzaehler · June 10, 2020, 11:25am

Thanks for explaining this in more detail. I get it now. What I was missing is that even though the runtime hasn’t changed, the client may have, so we need to test the new client against the old implementation version of the runtime.