Slashing Event in Kusama - Post Mortem

Summary

During era 2249 on Kusama, 115 validators operated by different staking providers received an "unapplied slash" for failing to send the "I'm online" signal. 45 of the 115 were P2P validators, for which the potential slash would have been around 0.6% of active stake.

An "unapplied" slash can still be reverted through governance action, which is exactly what happened in this case.

Customer Impact

All slashes were reverted and no customer funds were affected, thanks to the council members who supported the request and voted for motion 295 to cancel the slashing.

What Happened

Almost all active P2P Validator nodes were running version 0.9.0. Monitoring showed no performance problems, although we occasionally detected CPU spikes during the election.

On May 13, release notes for version 0.9.1 were published. The release had “Low (upgrade at your convenience)” upgrade priority. P2P started rolling this version out gradually, as this is our SOP for low-priority upgrades.

On May 14, we received a notification in the Kusama Validator Lounge chat: “Just a heads up that the 9010 runtime upgrade will happen in ~55 minutes at block 7,468,792. If your node is not on at least version v0.9.0, it will not be able to sync after that”

After runtime 9010 was applied, part of our monitoring, which is based on @polkadot-js/api, stopped working because our version of the library was older than the new runtime required. We had an alert that would have fired if all CPU cores had been at 95%+ load at once, but since only one core was overloaded, it did not fire. As a result, we did not realize that a number of our nodes were about to be slashed. It took about 40 minutes to understand what exactly had happened.
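
The gap in the CPU alert can be illustrated with a short sketch. The function names and the sample load values are hypothetical; only the 95% threshold comes from the alert described above. The first rule mirrors an "all cores" alert, the second fires as soon as any single core is overloaded.

```python
from typing import Sequence

THRESHOLD = 95.0  # percent; same threshold as the alert described above


def all_cores_overloaded(per_core_load: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """'All cores' rule: fires only if every core is at or above the threshold."""
    return len(per_core_load) > 0 and all(load >= threshold for load in per_core_load)


def any_core_overloaded(per_core_load: Sequence[float], threshold: float = THRESHOLD) -> bool:
    """'Any core' rule: fires as soon as one core is at or above the threshold."""
    return any(load >= threshold for load in per_core_load)


# Hypothetical sample: one saturated core, the others mostly idle.
sample = [99.0, 12.0, 8.0, 15.0]
print(all_cores_overloaded(sample))  # False
print(any_core_overloaded(sample))   # True
```

With a single saturated core, only the "any core" rule would have raised an alert. Per-core values could come from a tool such as psutil (`psutil.cpu_percent(percpu=True)`), which is an assumption here and not necessarily what our monitoring uses.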

See diagrams:

CPU time of node under v0.9.0 that experienced CPU load during the election.

CPU time of node under v0.9.1 - no spikes

Log records of nodes running v0.9.0

The logs make it clear that the nodes experienced performance issues after the runtime upgrade. We recorded 61780 events, which is an unusually large number.


During the CPU spikes, our validators were unable to send "heartbeat" events. Because a large number of validators were unresponsive at the same time, all 115 validators were slashed (see offence Level 2).
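
For context, the size of such a slash can be estimated with the unresponsiveness formula used by Substrate's im-online pallet, as we understand it: min(3 * (k - (n / 10 + 1)) / n, 1) * 0.07, where k is the number of offenders and n is the size of the active validator set. The sketch below assumes n = 900 for Kusama at the time; that value is an assumption and is not taken from this report's data.

```python
def unresponsiveness_slash_fraction(offenders: int, validators: int) -> float:
    """Estimate the slash fraction for concurrent unresponsiveness.

    Sketch of min(3 * (k - (n / 10 + 1)) / n, 1) * 0.07 with integer
    division for the threshold. No slash applies at or below the threshold.
    """
    threshold = validators // 10 + 1
    if offenders <= threshold:
        return 0.0
    scaled = min(3 * (offenders - threshold) / validators, 1.0)
    return scaled * 0.07


# 115 offenders (from this report), 900 active validators (assumed).
fraction = unresponsiveness_slash_fraction(offenders=115, validators=900)
print(f"{fraction:.2%}")  # 0.56%
```

Under these assumptions the result is roughly 0.56%, which is consistent with the figure of around 0.6% quoted in the summary.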

To avoid further slashing, the P2P team took the following steps:

May 14, 21:30:00

  • Monitoring service fixed

May 15, 00:30:00

  • Validation recovered on all P2P validators

May 15, 00:40:00

  • Investigation carried out in collaboration with other node operators and the Kusama dev team

May 15, 05:00:00

  • All Kusama nodes upgraded to v0.9.1, which resolved the CPU spike issue

May 15, 19:00:00

  • Set “--wasm-execution Compiled” flag for all Kusama nodes

----

The slashing was reverted after council member Raul Romanutti proposed cancelling the slash by submitting motion 295. No user funds were affected.

----

Lessons Learnt

  • We should have more resilient monitoring for our nodes and infrastructure
  • Follow polkadot-js releases to upgrade libraries on time
  • Look closely at network status during runtime upgrades
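
One way to make monitoring more resilient is to watch the monitor itself: if a check stops reporting, that silence should raise an alert. The following is a minimal sketch of such a staleness check. The class name, the check names, and the 120-second limit are hypothetical and do not describe our actual setup.

```python
import time
from typing import Dict, List, Optional


class StalenessWatchdog:
    """Tracks when each monitoring check last reported and flags silent ones."""

    def __init__(self, max_age_seconds: float) -> None:
        self.max_age_seconds = max_age_seconds
        self._last_seen: Dict[str, float] = {}

    def report(self, check_name: str, now: Optional[float] = None) -> None:
        """Record that a check has just produced a result."""
        self._last_seen[check_name] = time.time() if now is None else now

    def stale_checks(self, now: Optional[float] = None) -> List[str]:
        """Return the names of checks that have been silent for too long."""
        current = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if current - seen > self.max_age_seconds
        )


# Hypothetical timeline with explicit timestamps (in seconds).
watchdog = StalenessWatchdog(max_age_seconds=120)
watchdog.report("heartbeat-monitor", now=1000.0)
watchdog.report("cpu-monitor", now=1100.0)
print(watchdog.stale_checks(now=1150.0))  # ['heartbeat-monitor']
```

A check that breaks silently, for example after a library becomes incompatible with a new runtime, would show up in the stale list instead of going unnoticed.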

What’s Next

  • We are going to work on new infrastructure and on improvements to our monitoring services that will help us avoid such cases.