No mention of a full node failure and data loss!

opoloko

Member
10 hours ago my VPS went down, late at night here, and support said it was a node problem. 8 hours later it is still down, with support saying "sorry, we lost data, we need to do a full restore and we only have a backup up to such-and-such date, can we proceed?".

This is the worst thing that has happened in a decade with KH, there is no mention of it here, and that is VERY worrying. There are some databases with real-time data from a store, so transactions and other vital information will be lost.

There should be a full post-mortem report with an explanation, compensation should be offered, and most of all a system should be put in place to avoid such problems.
 
As a further update, after HOURS of keeping the server down waiting for a reply from me (not that I had any choice), support said that I had been moved to a new node (which is what I would expect if a node fails!) and so no restore was needed!

Once again, communication was poor. The VPS should have been moved to a new node, not left down for hours waiting for a reply from me about a restore (losing data) that in the end was not even needed!
 
As yet another surprise, they now say that NO, I actually do need a full restore because files were corrupted.
Once again, this is the worst thing that has ever happened at Knownhost; communication is incredibly bad and confusing.
 
Howdy opoloko,

Apologies for our admin on shift not posting the typical notice. Because the problems were a bit more than expected, he remained focused on getting everyone evacuated and things sorted, and forgot the post.

The long and short of the post-mortem is that the node experienced a partial hardware failure, which resulted in corruption of some data / user disks. After it was confirmed that the system was no longer stable, plans were enacted to move all containers off this equipment to online hot spares. In an effort to preserve as much data as possible, the existing data was moved first; only once that move is completed for a container can it be fully evaluated to determine the extent of the damage and whether a restore is needed.

So far we've only had to restore 2 containers. Unfortunately your container was one of them, but this is also why we take and maintain rigorous backups.

I do apologize that you got caught up in this. We do our best to ensure the reliability and redundancy of our equipment, including monitoring for any signs of failure, redundant arrays, etc., but even with all of that, things can and do fail. Specifically in this case, the reason corruption happens in this kind of crash is that any data still in memory cannot be written to disk, which causes the underlying corruption.

Regarding compensation, this will of course be covered under our SLA, so please reach out to our billing department and they'll get that sorted for you.

Again, we do apologize for the issues encountered. Our admins started corrective action immediately once it happened and are still engaged on the issue for any remaining customers.
 
Hi Daniel, I appreciate it, and I know these things happen. The worry is that I have other VPSs with you, and in those cases such a problem could cause big losses, so it worries me a bit.

I do understand it happens, and it seems I was one of the unlucky ones. I just thought that if everything is in arrays this would not happen, but I suppose it depends on where the hardware breakdown happens.

Anyway, I do appreciate your explanation. The explanations from support were a bit confusing, and it was probably unlucky that it happened late at night here, so I was unable to reply immediately to the restore-from-backup email.

Thanks for your reply.
 