Near Incident - Post Mortem

TL;DR


Due to a sudden spike in transactions, our mainnet validator node ran out of disk space, which halted block production and corrupted the node state.

What happened?


Disk usage grew rapidly, and the mainnet pool p2p-org.poolv1.near stopped producing blocks at ~07:00 UTC on 16 January. The epoch ended at ~10:00 UTC, and by that time the node had exceeded the kick-out threshold for downtime within an epoch, so it was scheduled for a temporary kick-out for the next two epochs. While we were resolving the issue, the validator remained offline for the following epoch as well, resulting in one additional forfeited epoch. Overall, the node was offline for ~3.25 epochs, and the validator pool lost 4 full epochs of staking rewards.

What went wrong?


The potential impact of the issue was underestimated.

We expected to be able to rely on a recent state snapshot that would be readily available. In reality, we had no cold backup of a recent node state, and our backup nodes were subject to the same disk issue, so we lost access to any synced node. The official public backup archives turned out to be corrupted as well; we filed a GitHub issue about this afterwards.

Monitoring was insufficient.

Our Near validation infrastructure was undergoing an overhaul, and some monitoring facilities were offline. We expected disk usage to grow more or less linearly: with ~100 GiB of free space, the node should have lasted a month. Instead, the disk filled up in a matter of days, and disk monitoring was not configured to catch the spike and warn us in advance.
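
A useful guard against this failure mode is to alert on the projected time-to-full, derived from the recent growth rate, rather than on a static free-space threshold alone. Below is a minimal sketch of such a check in Python; the data path, thresholds, and alert hook are hypothetical and would need to be wired into real alerting:

```python
import shutil
import time

DATA_DIR = "/home/near/.near/data"  # hypothetical node data path
MIN_FREE_GIB = 50                   # static free-space floor
MIN_HOURS_TO_FULL = 72              # alert if disk projects full within 3 days
SAMPLE_INTERVAL_S = 600             # sample every 10 minutes

def free_gib(path: str) -> float:
    return shutil.disk_usage(path).free / 2**30

def alert(message: str) -> None:
    # Placeholder: wire this into PagerDuty, Telegram, etc.
    print(f"ALERT: {message}")

prev = free_gib(DATA_DIR)
while True:
    time.sleep(SAMPLE_INTERVAL_S)
    cur = free_gib(DATA_DIR)
    # GiB consumed per hour over the last sampling interval.
    growth_gib_per_h = (prev - cur) * 3600 / SAMPLE_INTERVAL_S
    if cur < MIN_FREE_GIB:
        alert(f"only {cur:.1f} GiB free on {DATA_DIR}")
    elif growth_gib_per_h > 0 and cur / growth_gib_per_h < MIN_HOURS_TO_FULL:
        alert(f"disk projected full in {cur / growth_gib_per_h:.0f} h "
              f"(consuming {growth_gib_per_h:.2f} GiB/h)")
    prev = cur
```

A check like this would have flagged the sudden change in growth rate days before the disk actually filled up, even while ~100 GiB still looked comfortable.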

What went well?


We were notified immediately when the node stopped producing blocks, and the root cause was identified almost as quickly. Quite a few validators were affected by the same issue, and the community was very helpful.

Impact on clients


All our Near delegators were affected and lost four epochs of staking rewards. To compensate our delegators in full and mitigate their loss, P2P has waived its fees until the end of February.

Lessons learned


We need better monitoring, with disk usage metrics collected from all nodes at all times, including the mainnet, backup, and Near RPC nodes. It is equally important to ensure that backup nodes are running and synced at all times. In addition, we need to establish a process for taking cold snapshots of the node state on a regular basis and extend this practice to all networks we operate on.
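
For illustration, a periodic cold-snapshot job could look roughly like the sketch below. The data path, snapshot destination, systemd unit name, and retention policy are all assumptions for the example, not a description of our production setup:

```python
import datetime
import pathlib
import subprocess

DATA_DIR = pathlib.Path("/home/near/.near/data")  # hypothetical node data path
SNAPSHOT_DIR = pathlib.Path("/backups/near")      # hypothetical cold-storage mount
KEEP_LAST = 7                                     # retention: last 7 snapshots

def take_cold_snapshot() -> pathlib.Path:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    archive = SNAPSHOT_DIR / f"near-data-{stamp}.tar.gz"
    # Stop the node so the on-disk state is consistent (hypothetical unit name).
    subprocess.run(["systemctl", "stop", "neard"], check=True)
    try:
        subprocess.run(
            ["tar", "-czf", str(archive), "-C", str(DATA_DIR.parent), DATA_DIR.name],
            check=True,
        )
    finally:
        # Restart the node even if archiving fails.
        subprocess.run(["systemctl", "start", "neard"], check=True)
    # Drop snapshots that fall outside the retention window.
    snapshots = sorted(SNAPSHOT_DIR.glob("near-data-*.tar.gz"))
    for old in snapshots[:-KEEP_LAST]:
        old.unlink()
    return archive

if __name__ == "__main__":
    print(f"snapshot written to {take_cold_snapshot()}")
```

A job like this would run from cron on a backup node during a low-traffic window, with the resulting archive shipped to off-site storage so that a synced state survives even when every live node is hit by the same issue.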

P2P takes full responsibility for the event that led to this degraded performance, and we are sorry for the inconvenience. Please be assured that P2P is taking action to eliminate even a small probability of such an event occurring in the future.


If you have any questions, feel free to join our Telegram chat; we are always open to communication.
Special thanks to Evgeny Kuzyakov & DenysK for providing a state snapshot and general support.