No mention of a full node failure and data loss!

opoloko · Jan 31, 2025

10 hours ago my VPS went down, late at night here, and support said it was a node problem. 8 hours later it is still down, with support saying "sorry we lost data we need to fully restore and we have only backup up to such&such date, can we proceed?".

This is the worst thing that has happened in a decade with KH, there is no mention of it here, and that is VERY worrying. There are some databases with realtime data from a store, so transactions and other vital information will be lost.

There should be a full post-mortem report, an explanation, and compensation offered, and most of all a system put in place to avoid such problems.

opoloko · Jan 31, 2025

As a further update, after HOURS of keeping the server down waiting for a reply from me (not that I had any choice), support said that I had been moved to a new node (which is what I would expect when a node fails!) and so no restore was needed!

Once again, communication was poor. The VPS should have been moved to a new node, not left down for hours waiting for a reply from me about a restore (losing data) that in the end was not even needed!

opoloko · Jan 31, 2025

As another surprise, they now say that actually NO, I do need a full restore because files were corrupted.
Once again, this is the worst thing that has ever happened at Knownhost; communication is incredibly bad and confusing.

KH-DanielP · Jan 31, 2025

Howdy opoloko,

Apologies for our admin on shift not posting the typical notice. Because the problems were a bit more than expected he remained focused on getting everyone evacuated and things sorted and forgot the post.

The long and the short of the post mortem is that the node experienced a partial hardware failure, which resulted in corruption of some data / user disks. After it was confirmed the system was no longer stable, plans were enacted to move all containers off this equipment to online hot spares. In an effort to preserve as much data as possible, the existing data was moved first; only once that move is completed for a container can it be fully evaluated to determine the extent of the damage and whether a restore is needed.

So far we've only had to restore 2 containers. Unfortunately your container was one of them, but this is also why we take and maintain rigorous backups.

I do apologize that you got caught up in this. We do our best to ensure the reliability and redundancy of our equipment, including monitoring for any signs of failure, redundant arrays, etc., but even with all of that, things can and do fail. Specifically in this case, the reason corruption happens in this kind of crash is that any data still in memory cannot be written to disk, which causes the underlying corruption.
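For readers wondering what "data still in memory" means in practice, here is a minimal Python sketch, not a description of KnownHost's stack. A normal write sits in application and operating-system buffers for a while; `os.fsync` asks the OS to commit it to the device. The helper name `durable_write` and the file name are made up for illustration, and a failing controller can still lose data after this call, so it only narrows the window.

```python
import os
import tempfile


def durable_write(path, payload):
    """Write payload and ask the OS to push it to stable storage before returning."""
    with open(path, "wb") as handle:
        handle.write(payload)      # lands in Python's userspace buffer
        handle.flush()             # hands the bytes to the OS page cache (still RAM)
        os.fsync(handle.fileno())  # asks the OS to commit them to the device


# Example with a throwaway file; the directory and file name are placeholders.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "orders.log")
    durable_write(target, b"order 1001 paid\n")
    with open(target, "rb") as handle:
        print(handle.read())
```

Anything written without the flush and sync steps may exist only in RAM when a node crashes, which is one way files end up truncated or inconsistent.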

Regarding any compensation, this will of course be covered under our SLA so please reach out to our billing department and they'll get that sorted for you.

Again, we do apologize for the issues encountered. Our admins immediately started corrective action once it happened and are still engaged on the issue for any remaining customers.

opoloko · Jan 31, 2025

Hi Daniel, I appreciate this, and I know these things happen. The worry is that I have other VPSs with you where such a problem could cause big losses, so it worries me a bit.

I do understand it happens, and it seems I was one of the unlucky ones. I just thought that if everything is in arrays this would not happen, but I suppose it depends on where the hardware breakdown happens.

Anyway, I do appreciate your explanation: support's explanations were a bit confusing, and it was probably unlucky that it happened late at night here, so I was unable to reply immediately to the restore-from-backup email.

Thanks for your reply.

opoloko · Feb 10, 2025

BenjaminLewis said:
This is unacceptable. Data loss and downtime need real accountability now.

I do agree. I struggle to understand how a fully redundant setup for VPS (what I expect apart from backups) can actually lose data. Some downtime, fair enough, but not data loss.

Also, this is the second time I've had a problem with one of my VPSs because of a node failure. In one instance it went on for months (poor performance, apps killed for supposed RAM usage spikes, and random cPanel errors), and only at the end was there an admission that it was a node problem; I was moved to a new node and all was solved.

This time I lost data and, in common with the other time (two different VPSs and accounts), the monitoring system didn't really detect any anomaly. It was me who had to write in, only to learn from the reply that they already knew there was a node problem.

I think KH is still a fantastic service, but these node failures and the poor handling of them make me worry. Imagine the money lost to data loss in an ecommerce store. It's not acceptable.

KH-DanielP · Feb 10, 2025

Benjamin has been removed from this conversation as he is neither a customer nor a legitimate poster, with multiple IPs linked to known spam sources and VPNs.

I sincerely apologize for the data loss you experienced—while extremely rare, it did happen, and there's no undoing that.

Regarding your past issue, I'd need to review your previous tickets to provide a full response. That said, we’ve significantly improved our internal monitoring and balancing systems to better detect and mitigate node-level problems before they escalate.

We completely understand that downtime and data loss are unacceptable, especially for businesses relying on their VPS for critical operations. Our standard plans are designed to offer the best balance of performance, redundancy, and cost. However, for those requiring near-zero downtime and data loss mitigation, we offer higher-tier solutions with additional failover protections. These come at a higher cost, as true high-availability infrastructure requires greater investment.

We always aim to strike the best balance between cost and reliability, and while no system is infallible, we continuously refine our approach to minimize risk and improve response times.

opoloko · Feb 10, 2025

Hi Daniel, as usual thanks for this.

I can go on in private about the other incident, but most crucially I'm quite interested in these higher-tier solutions with additional failover protections for one of my VPSs.

Is it something we could chat about privately in DM, or shall I ask Billing something specific or check online? We're also using the legacy plans on some VPSs as we need them (better burst performance), so if there was a way to have something more tailored, that would be great.

KH-DanielP · Feb 10, 2025

opoloko said:
Is it something we could chat about privately in DM, or shall I ask Billing something specific or check online? We're also using the legacy plans on some VPSs as we need them (better burst performance), so if there was a way to have something more tailored, that would be great.

Easiest way is just hop on sales chat on the main site and ask for me, I've got time this AM so I can sync up with you there.

richard_s · Feb 11, 2025

opoloko said:
I do agree. I struggle to understand how a fully redundant setup for VPS (what I expect apart from backups) can actually lose data. Some downtime, fair enough, but not data loss.

Assuming a RAID5 array, two disk failures is one way.
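To make that concrete, here is a toy Python sketch of single-parity striping, the idea behind RAID5. It is only an illustration with made-up 4-byte blocks, not how a real controller lays out data. With one parity block per stripe, any one missing block can be rebuilt by XOR-ing the survivors, but two missing blocks cannot be separated.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks plus one parity block, like one stripe on a 4-disk array.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One disk lost (data[1]): the survivors plus parity rebuild it exactly.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Two disks lost (data[0] and data[1]): the survivors only yield the XOR of
# the two missing blocks, which is not enough to recover either one.
leftover = xor_blocks([data[2], parity])
print(leftover == xor_blocks([data[0], data[1]]))
```

Both checks print `True`: the first shows a successful rebuild, the second shows that after a double failure all you can compute is the two lost blocks mixed together.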

Obviously I'm not speaking for KH, but generally speaking a host's backup included with any hosting plan really isn't your backup; it's their backup for disaster recovery. It's not something to rely on because it's limited, out of date, etc. Say, for example, you move a bunch of forum posts into a forum with auto prune. You make your single backup, overwriting the old one. A week later you discover your mistake. Neither your backup nor the host's backup contains the missing posts. I won't mention the name of the idiot that did this 20 years ago, but it was the last time they lost data.

If you want a poor man's solution, take a look at AutoMySQLBackup:

GitHub - sixhop/AutoMySQLBackup: A fork and further development of AutoMySQLBackup from sourceforge. http://sourceforge.net/projects/automysqlbackup/


I only use the daily full backup, but it also does revisions, which is what you need; it can encrypt the files, send them to offsite storage, etc.

The daily backup is configured to do a full backup every day; on the 7th day it creates a weekly backup, and one is created each month. Older backups are rotated out. At any given point you have a snapshot for each of the last 6 days, one for each of the last 5 weeks, and one for each of the last 6 months. This is all configurable to your own needs.
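As a rough sketch of that rotation idea in Python (this is not AutoMySQLBackup's actual code): the tier rules below are assumptions chosen for illustration, namely that the monthly backup is the one taken on the 1st and the weekly is the one taken on Sunday. The function names `classify` and `retained` are hypothetical.

```python
from datetime import date, timedelta


def classify(day):
    """Assumed tiers: monthly on the 1st, weekly on Sundays, daily otherwise."""
    if day.day == 1:
        return "monthly"
    if day.isoweekday() == 7:
        return "weekly"
    return "daily"


def retained(backup_days, keep=None):
    """Return the set of backup dates that survive rotation."""
    keep = keep or {"daily": 6, "weekly": 5, "monthly": 6}
    kept = set()
    for tier, limit in keep.items():
        tier_days = sorted(d for d in backup_days if classify(d) == tier)
        # Keep only the newest `limit` backups in this tier.
        kept.update(tier_days[max(0, len(tier_days) - limit):])
    return kept


# A year of nightly backups ending on an arbitrary example date.
end = date(2025, 2, 21)
history = [end - timedelta(days=n) for n in range(365)]
survivors = retained(history)
print(len(survivors), "of", len(history), "backups kept")
```

With the limits described above, a long history collapses to 6 + 5 + 6 = 17 files, which is why the scheme stays cheap on disk while still reaching months back.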

I then use Windows Task Scheduler to fire off a WinSCP script nightly that automatically syncs the daily and weekly backups to my local computer; the monthly backups are just downloaded and all kept.

Same thing with site files: they are also synced to the local machine. I only have one critical directory; I don't sync that but instead just download new files.

opoloko · Feb 17, 2025

Hi, I totally agree, and I already use a full backup policy and program, but that's not the point. On a high-traffic ecommerce store, for example, a daily or even hourly backup is nowhere near enough at busy times. The problem I'm trying to address here is reliability at the source, which has definitely been poorer at KH recently (see my next post).

opoloko · Feb 17, 2025

KH-DanielP said:
Easiest way is just hop on sales chat on the main site and ask for me, I've got time this AM so I can sync up with you there.

Hi Daniel

I didn't have time for this, but I'm wondering if you are around this morning? Yesterday evening and last night, ANOTHER NODE had a problem (affecting one of my other VPSs).

There have been many recently, more than in all the previous 10 years or more I've been with you, so I have a genuine concern now as this keeps happening: your nodes' reliability has dropped a lot in the last few months.

This last one happened, you said it was fixed, and as usual it happened again within a few hours. I'm sure it will keep happening until you finally admit you need to move us to a new node.

KH-JonathanKW · Feb 17, 2025

Hi opoloko,

Let me preface this by apologizing for the events that you've experienced: two different incidents, and you happened to have a VPS on each of the nodes involved.

Hardware inevitably ages, and we make every attempt to replace failed hardware to ensure proper consistency moving forward, in tandem with replacing hardware as needed.

In this case, the issue stemmed from the RAID card dropping. We migrated all the drives to a different node, and initially, everything appeared to function properly. However, over time, it became clear that the underlying OS was still experiencing issues, requiring further intervention to stabilize the affected containers (which is ongoing at this time).

We don’t like these situations any more than you do, I assure you.

Since October, we've had three incidents—two were reported, and one didn’t make it to the forum (which this thread originally addressed).

It's not fun for us. It's never good when hardware that has seemingly been fine for days or months at a time suddenly drops off with no warning signs, resulting in immediate frustration not just for ourselves but for the customers who have to deal with it when it shouldn't have happened to begin with.

A drive fails? Two drives fail? That's within parameters, that's expected. It's planned for.

A RAID card just up and dying? That's not something you can particularly plan for beyond having spare hardware available and hoping the underlying RAID survived.

That said, we have a strong history of quick turnarounds. Most incidents are brief disruptions, and when hardware failures do occur, we act swiftly, as we always have.

Please contact the Billing department for compensation regarding this event, and reference this thread so that it crosses my desk.

opoloko · Feb 17, 2025

Hi Jonathan,

thanks for this, and I understand. It just seems a weird coincidence that I was involved every time, with three different VPSs in two different accounts.

More than compensation, I'm interested in knowing more about what Daniel said regarding higher-availability options for my VPSs. If it's possible to contact someone about it, that would be great.

richard_s · Feb 18, 2025

opoloko said:
On a high-traffic ecommerce store, for example, a daily or even hourly backup is nowhere near enough at busy times.

I used the wrong terminology in my post. It can also be configured to do differential backups, which capture the cumulative difference since the last full backup. I have not tested or used differential backups, but presumably you would set a cron job to run it every 5 minutes or whatever. There is an option to send the differential file via email, and it should be pretty small.

To restore, you would use the last full backup and then the last differential file. I'm sure there are better methods, but this one doesn't cost anything.
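The restore order can be sketched in Python like this. It only picks which files to apply and in what order; it does not talk to MySQL or to AutoMySQLBackup, and the `Backup` record, the `restore_plan` helper and the file names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str            # "full" or "diff"
    taken_at: datetime
    path: str


def restore_plan(backups):
    """Pick the newest full backup, then the newest differential taken after it."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    base = max(fulls, key=lambda b: b.taken_at)
    diffs = [b for b in backups if b.kind == "diff" and b.taken_at > base.taken_at]
    plan = [base]
    if diffs:
        # Differentials are cumulative, so only the latest one is needed.
        plan.append(max(diffs, key=lambda b: b.taken_at))
    return plan


catalog = [
    Backup("full", datetime(2025, 2, 17, 2, 0), "full_2025-02-17.sql.gz"),
    Backup("diff", datetime(2025, 2, 17, 12, 0), "diff_2025-02-17_1200.sql.gz"),
    Backup("full", datetime(2025, 2, 18, 2, 0), "full_2025-02-18.sql.gz"),
    Backup("diff", datetime(2025, 2, 18, 9, 0), "diff_2025-02-18_0900.sql.gz"),
    Backup("diff", datetime(2025, 2, 18, 9, 5), "diff_2025-02-18_0905.sql.gz"),
]
for step in restore_plan(catalog):
    print(step.path)
```

Because each differential holds everything changed since the full backup, older differentials can be skipped; an incremental scheme would instead need every file in the chain.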

opoloko · Feb 19, 2025

Incremental backups are something I do for data all the time, and they are indeed very useful. But when we talk about really big databases they're not feasible, even less so via email. The only real solution is database redundancy/sync, but that's a bit out of scope and budget for this specific project.

The point is, a service going down for a short time is a relatively small problem. The real problem that SHOULD NOT happen (and the worrying part is how unlucky I was to be on all the recent node incidents) is a node going down and data being LOST.

Redundancy on the same hardware (like the RAID that KH has) is good, but often not good enough if there's no system in place to avoid data loss when the RAID itself has a problem.

@KH-DanielP I'm still keen to have a chat about the higher-availability options you mentioned, if it's possible to continue that in DM or email.

richard_s · Feb 21, 2025

That script uses a differential backup: a single file containing the changes since the last full backup. Quick to restore, but requires more disk space for multiple differential backups. As I said, I haven't used that option; it's one of those things I have been meaning to test for 10+ years now. My data is not super critical; I'll only have a few grumbling forum members because of a few hours of lost posts.
