"Plan for failure" is not a new mantra in information technology. Evaluating the worst-case scenario is part of defining system requirements in many organizations. The mistake many make when they start moving to the cloud is failing to re-evaluate their existing architecture and the economics of redundancy.
All organizations make trade-offs between cost and risk. A fully redundant architecture at every level of the system is usually seen as unduly expensive. Big areas of exposure, like databases and connectivity, get addressed, but some risk is usually accepted.
One of the things that changes with cloud architectures is that cost-and-risk equation. Failing to take that change into account, combined with the assumption that failover is a built-in component of the cloud, is what leads to downtime.
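To make the cost-and-risk equation concrete, here is a rough back-of-the-envelope sketch. Every name and number in it is hypothetical, and it assumes that failures of redundant copies are independent, which a real regional outage may well violate. Treat it as a way to frame the question, not as a pricing model.

```python
HOURS_PER_YEAR = 24 * 365


def expected_annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly loss from downtime at a given availability (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


def combined_availability(availability, copies):
    """Availability of N redundant copies, ASSUMING independent failures."""
    return 1.0 - (1.0 - availability) ** copies


def redundancy_pays_off(availability, cost_per_hour, extra_annual_cost, copies=2):
    """True if the expected downtime savings exceed the cost of redundancy."""
    single = expected_annual_downtime_cost(availability, cost_per_hour)
    redundant = expected_annual_downtime_cost(
        combined_availability(availability, copies), cost_per_hour
    )
    return (single - redundant) > extra_annual_cost


if __name__ == "__main__":
    # Hypothetical inputs: 99% availability, $1,000 lost per hour of downtime,
    # $50,000 per year to run a second copy elsewhere.
    print(redundancy_pays_off(0.99, 1000, 50000))
```

The point of the exercise is that the answer flips as the inputs change: when instances are cheap to duplicate and downtime is expensive, redundancy that was not worth buying in a traditional data center may be worth it in the cloud.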
Brian Heaton has published a great article, Securing Data in the Cloud, that walks through the Amazon cloud regional outage this past April. It contrasts organizations that planned poorly and were affected with those that planned well and weren't. It also lists six great rules for managing the risk of cloud outages:
1. Incorporate failover for all points in the system. Every server image should be deployable in multiple regions and data centers, so the system can keep running even if there are outages in more than one region.
2. Develop the right architecture for your software. Architectural nuances can make a huge difference to a system’s failover response. A carefully created system will keep the database in sync with a copy of the database elsewhere, allowing for a seamless failover.
3. Carefully negotiate service-level agreements. SLAs should provide reasonable compensation for the business losses you may suffer from an outage. Simply receiving prorated credit for your hosting costs during downtime won’t compensate for the costs of a large system failure.
4. Design, implement and test a disaster recovery strategy. One component of such a plan is the ability to draw on resources, like failover instances, at a secondary provider. Provisions for data recovery and backup servers are also essential. Run simulations and periodic testing to ensure your plans will work.
5. In coding your software, plan for worst-case scenarios. In every part of your code, assume that the resources it needs to work might become unavailable, and that any part of the environment could go haywire. Simulate potential problems in your code, so that the software will respond correctly to cloud outages.
6. Keep your risks in perspective, and plan accordingly. In cases where even a brief downtime would incur massive costs or impair vital government services, multiple redundancies and split-second failover can be worth the investment, but it can be quite costly to eliminate the risk of a brief failure.
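Rules 1 and 5 in the list above can be sketched in a few lines of code. This is a minimal illustration, not anything taken from the article: the function name `call_with_failover`, the `request` callable, and the endpoint names are all hypothetical. It assumes that an unavailable resource surfaces as a `ConnectionError` or `TimeoutError`.

```python
import time


class AllEndpointsFailed(Exception):
    """Raised when every endpoint has been tried and none responded."""


def call_with_failover(endpoints, request, retries_per_endpoint=2,
                       backoff_seconds=0.5, sleep=time.sleep):
    """Try each endpoint in order, retrying with backoff before moving on.

    `endpoints` is an ordered list (primary first); `request` is any
    callable that takes an endpoint and returns a result or raises.
    """
    errors = []
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return request(endpoint)
            except (ConnectionError, TimeoutError) as exc:
                # Assume the resource may be gone; record it and carry on.
                errors.append((endpoint, exc))
                if attempt < retries_per_endpoint - 1:
                    # Exponential backoff between retries on one endpoint.
                    sleep(backoff_seconds * (2 ** attempt))
    raise AllEndpointsFailed(errors)
```

Passing `sleep` in as a parameter lets you simulate an outage in a test without actually waiting, which is the kind of failure simulation rule 5 recommends: hand the function a `request` that raises for the primary region and check that the secondary answers.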
Another thing I see in that article and many others is that "cloud" doesn't mean "easy". Most organizations are writing middleware to sit between their established processes and procedures and their cloud deployments. Part of the reason is that the cloud enforces good virtualization practices, and I suspect many IT shops have taken shortcuts here and there. There are cloud-centric projects concentrating on configuration and deployment management, but no one size fits all, so expect to do some custom development as you migrate to the cloud.