28 December 2012, 16:03 | by WD Milner

The "Cloud" is supposed to be a flexible, resilient and redundant resource for computing power and data storage. But just how reliable is it?

The term cloud derives from the use of a cloud-shaped symbol on network diagrams to represent a large, diffuse, external network, often the Internet. In recent parlance, it refers to an interconnected group of computer systems and storage networks across which computing tasks or data storage are performed. These systems may be located locally or be geographically widespread, even global.

In 2007, Rackspace started losing cloud servers when a power outage forced the shutdown of a major portion of the HVAC system in their data centre. In April of 2011, Amazon suffered an overload in a backup network, which caused servers to start querying each other for copies of their data and to try rebuilding missing mirrors. This cascaded into an even larger failure as thousands of systems tried to restore lost data simultaneously.

More recently, in February of 2012, Azure suffered a cascading failure when a certificate server was unable to issue a proper certificate due to a leap year glitch, which prevented virtual machines from starting. The cloud host agent interpreted this as a hardware problem, moved the VMs to other servers and flagged the hardware unit. Of course, the agent on the new hardware saw the same problem and handed off to the next redundant hardware, and so on, creating a spreading tide of reported hardware failures and inoperable VMs. Then, in June of 2012, Amazon Web Services suffered a major outage.

Traditional data centre architecture is fairly diverse. This has had the effect of isolating hardware failures to single machines or a small group. In cloud computing, fault tolerance is an integral part of the design, and the recovery process has been moved from the hardware layer to the software layer, where management software identifies failures and either recovers from them or routes around them. Unfortunately, when this management layer breaks down, it can cause a major failure across thousands of systems.
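To make the failure mode concrete, here is a minimal sketch of how a recovery policy that always blames the hardware can turn one software fault into many reported hardware failures, loosely modelled on the leap year incident described earlier. The `Host` class, `place_vm` function and `software_fault` flag are hypothetical names invented for this illustration; this is not how any vendor's host agent is actually written.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    flagged: bool = False  # True once the agent marks the hardware as faulty


def vm_starts(software_fault: bool) -> bool:
    """A VM start fails on every host while the shared software fault persists."""
    return not software_fault


def place_vm(hosts: list[Host], software_fault: bool) -> tuple[bool, int]:
    """Try each unflagged host in turn and flag a host whenever the start fails.

    Returns (started, number_of_hosts_flagged).
    """
    flagged = 0
    for host in hosts:
        if host.flagged:
            continue
        if vm_starts(software_fault):
            return True, flagged
        # Naive policy (an assumption of this sketch): blame the hardware
        # and hand the VM off to the next host.
        host.flagged = True
        flagged += 1
    return False, flagged


hosts = [Host(f"host-{i}") for i in range(5)]
started, flagged = place_vm(hosts, software_fault=True)
print(started, flagged)  # False 5
```

Because the fault sits in a shared service rather than in any one machine, every hand-off fails in the same way, and the only lasting effect of the recovery logic is to take healthy hardware out of service.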

The realization is growing that even single points of failure are actually sequences of events, sometimes with remedial intervention by humans or automation adding to the problem. Cloud reliability is good on the whole, but needs to get better. When your data is effectively spread across hundreds or thousands of systems, potentially around the globe, you need to be sure it is safe and secure. All is not so dreary though. Some cloud vendors have achieved 99.99% uptime across their systems, which is far better than many enterprises can boast.
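For a sense of scale, an availability percentage can be converted into the downtime it permits. The short sketch below does that arithmetic, assuming a 365-day year and that availability is measured over the whole year; the function name is made up for this example.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# 99.99% over a 365-day year leaves a little under an hour of downtime.
print(round(allowed_downtime_minutes(99.99), 1))  # 52.6
```

By the same calculation, 99.9% availability would permit roughly 8.8 hours of downtime in a year.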

- 30 -

Keywords: cloud,reliability,storage,resource


