Comparing Failure Domains Between Traditional and Hyperconverged Infrastructure
Question: Many hyperconverged infrastructure options give IT organizations “one throat to choke” so that there is no finger pointing, which is a good thing. But does this also mean hyperconvergence increases the size of a failure when something goes wrong?
No. The size of the failure domain doesn’t necessarily change; rather, failures happen differently and have a different impact in a hyperconverged infrastructure. In reality, hyperconverged infrastructure has more targeted and predictable failure domains than traditional data center environments, at least when it comes to hardware failures. On the software side, the blast radius depends on the nature of the software fault itself.
In a traditional data center environment in which IT manages separate resource silos, the impact of a failure depends on which resource fails. If a single server fails, the result generally isn’t a user-impacting outage, since other servers can assume the load. However, if storage fails, the outage will almost certainly be felt across the organization. Although many companies implement redundancy in their storage environments, many others lack the budget for those protections. In traditional data center environments, IT works hard to implement highly available services on a per-resource basis to protect against the potential for major outages.
With most hyperconverged infrastructure vendors, failure is assumed and handled at the software layer. If a disk or a node fails, the impact is limited to that node, although all resources – compute and storage – of that node are affected. However, unlike traditional data center structures, only a piece of the overall storage cluster is affected; the rest of the cluster just keeps running. Further, because most hyperconverged solutions take a replica-based approach to data protection, the loss of a disk or a node does not affect the ability to access data. It can simply be retrieved from other nodes in the cluster.
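To make the replica-based approach concrete, here is a minimal sketch of how a cluster might place two copies of each block on distinct nodes so that a single node failure never makes data unreadable. The node names, replication factor, and placement policy are illustrative assumptions, not any specific vendor’s implementation.

```python
# Sketch: replica-based data protection in a small cluster.
# REPLICATION_FACTOR, node names, and placement are hypothetical.
REPLICATION_FACTOR = 2

class Cluster:
    def __init__(self, node_names):
        # Each node holds its own local map of block_id -> data.
        self.nodes = {name: {} for name in node_names}

    def write(self, block_id, data):
        # Place copies on REPLICATION_FACTOR distinct nodes.
        names = sorted(self.nodes)
        start = sum(ord(c) for c in block_id) % len(names)
        for i in range(REPLICATION_FACTOR):
            node = names[(start + i) % len(names)]
            self.nodes[node][block_id] = data

    def fail_node(self, name):
        # Simulate losing a node: its compute and local storage go away.
        del self.nodes[name]

    def read(self, block_id):
        # Any surviving replica satisfies the read.
        for blocks in self.nodes.values():
            if block_id in blocks:
                return blocks[block_id]
        raise KeyError(f"all replicas of {block_id} lost")

cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.write("vm-disk-001", b"payload")
cluster.fail_node("node-a")  # one node fails entirely
data = cluster.read("vm-disk-001")  # still served from a surviving replica
```

Because the two replicas always land on different nodes, losing any single node leaves at least one copy intact – which is why a node failure in such a design shrinks capacity but does not interrupt data access.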
All that said, if a hyperconverged vendor released faulty software, there would be the potential for an issue that could negatively impact the entire environment. Bear in mind that every hyperconverged infrastructure vendor still runs an industry-standard hypervisor on each of its nodes, but this hypervisor integrates with the software that manages the global file system upon which these solutions are based.
From a software perspective, that is perhaps the biggest risk when comparing traditional data center systems with those based on hyperconvergence. Because traditional data centers have silos of resources, a single vendor’s software failure would be less likely to have a widespread impact on the environment. Again, that depends on which resource is affected. If a software update brought down an organization’s only SAN, the impact would be widespread.