LogicMonitor says Complexity doesn’t belong in your datacenter
When designing infrastructure architecture, there is usually a trade-off between complexity and fault tolerance. It’s not a simple inverse relationship, however; it’s a curve. You want the minimum complexity possible to achieve your availability goals. You may even want to relax your availability goals in order to reduce your complexity (which can end up increasing your actual availability).
Basically, the rule to adopt is: if you don’t understand something well enough that it seems simple to you (or your staff), even in its failure modes, you are better off without it.
Back in the day, clever people suggested that most web sites would have the best availability by running everything – DB, web application, everything – on a single server. This was the simplest configuration, and the easiest to understand.
With no complexity – one of everything (one switch, one load balancer, one web server, one database, for example) – you can tolerate zero failures.
With 2 of everything, connected the right way, you can keep running with one failure.
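The payoff of “two of everything” can be put in rough numbers with standard serial/parallel availability arithmetic. The sketch below is illustrative only: the 99.9% per-component figure is an invented assumption, not a measurement from any real stack.

```python
# Hypothetical availability figures for illustration only.
# A chain of single components works only if every component works,
# so serial availability is the product of the individual figures.
# A redundant pair is down only when both members are down, so its
# availability is 1 - (1 - a)**2.

def serial(availabilities):
    """Availability of components in series (all must be up)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def redundant_pair(a):
    """Availability of two identical components in parallel."""
    return 1 - (1 - a) ** 2

# One of everything: switch, load balancer, web server, database,
# each assumed to be 99.9% available.
single = serial([0.999] * 4)

# Two of everything, each tier assumed to fail over cleanly.
paired = serial([redundant_pair(0.999)] * 4)

print(f"single stack: {single:.4%}")   # roughly 99.6%
print(f"paired stack: {paired:.4%}")   # roughly 99.9996%
```

The arithmetic assumes failover itself never fails, which is exactly the assumption the rest of this article pushes back on: every mechanism that makes the paired numbers achievable is itself a component that can break.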
So is it a good idea to add more connections and plan to tolerate multiple failures? Not usually. For example, with a redundant pair of load balancers, you can connect one load balancer to one switch, and the other load balancer to another switch. In the event of a load balancer failure, the surviving load balancer will automatically take over, and all is good. If a switch fails, it may be the one the active load balancer is connected to; this would also trigger a load balancer failover, and everything keeps running correctly. It would be possible to connect each load balancer to each switch, so that failure of a switch does not impact the load balancers at all, but is it worth it?
This would allow the site to survive two simultaneous unrelated failures (one switch and one load balancer), but the added complexity of engineering the multiple traffic paths increases the likelihood that something will go wrong in one of them. There are now 4 possible traffic paths instead of 2, so more testing is needed, more maintenance is needed on any change, and so on. The benefit seems outweighed by the complexity.
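The trade-off above can be made concrete with a small failure-enumeration sketch. The component names and the pairing model are invented for illustration; a traffic path is modeled as a (load balancer, switch) pair, and the site is up as long as at least one path has both of its components alive.

```python
from itertools import product

# Hypothetical model of the two wiring schemes: each traffic path is a
# (load balancer, switch) pair; the site stays up as long as at least
# one path has both of its components alive.
SIMPLE = [("lb1", "sw1"), ("lb2", "sw2")]        # 2 traffic paths
CROSS = [("lb1", "sw1"), ("lb1", "sw2"),
         ("lb2", "sw1"), ("lb2", "sw2")]         # 4 traffic paths

COMPONENTS = {"lb1", "lb2", "sw1", "sw2"}

def site_up(paths, failed):
    """True if any path survives the given set of failed components."""
    return any(lb not in failed and sw not in failed for lb, sw in paths)

# Every double failure of one load balancer plus one switch.
double_failures = [{lb, sw}
                   for lb, sw in product(("lb1", "lb2"), ("sw1", "sw2"))]

# Both schemes survive any single component failure.
assert all(site_up(SIMPLE, {c}) for c in COMPONENTS)
assert all(site_up(CROSS, {c}) for c in COMPONENTS)

# Only the cross-wired scheme survives every lb+switch double failure.
print(sum(site_up(SIMPLE, f) for f in double_failures), "of 4")  # 2 of 4
print(sum(site_up(CROSS, f) for f in double_failures), "of 4")   # 4 of 4
```

Note what the model leaves out: the extra paths are precisely what must be configured, tested, and re-verified on every change, which is the cost the paragraph above weighs against the extra double-failure coverage.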
The same concept of “if it seems complex, it doesn’t belong” can be applied to software, too. Load balancing, whether via an appliance such as a Citrix NetScaler or software such as HAProxy, is simple enough for most people nowadays. The same is not generally true of clustered file systems, or DRBD. If you truly need these technologies, you had better have a thorough understanding of them, invest the time to induce all the failure modes you can, and train your staff so that dealing with any of those failures is not complex for them.
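The claim that load balancing is now “simple enough” is easy to illustrate. A minimal HAProxy configuration for a round-robin web tier fits on one screen; the sketch below is illustrative only, with invented hostnames, ports, and addresses, not a production config:

```
# haproxy.cfg -- minimal sketch; names and addresses are hypothetical
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
```

A configuration this short can be read, reasoned about, and tested in its failure modes by a typical operations team, which is exactly the bar the rule above sets. Clustered file systems and DRBD rarely clear that bar with so little effort.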