CLOUD RELIABILITY

28. December 2012, 16:03 | by WD Milner

The "Cloud" is supposed to be a flexible, resilient and redundant resource for computing power and data storage. But just how reliable is it?

The term cloud derives from the use of a cloud-shaped symbol on network diagrams to represent a large, diffuse, external network, often the Internet. In recent parlance, it refers to an interconnected group of computer systems and storage networks across which computing tasks or data storage are performed. These systems may be located locally or spread widely, even globally.

In 2007, Rackspace started losing cloud servers when a power outage forced the shutdown of a major portion of the HVAC system in their data centre. In April 2011, Amazon suffered an overload in a backup network, causing servers to start querying each other for copies of their data and trying to rebuild missing mirrors. This cascaded into an even larger failure as thousands of systems tried to restore lost data simultaneously.

More recently, in February 2012, Azure suffered a cascading failure when a certificate server was unable to issue a proper certificate due to a leap year glitch, which prevented virtual machines from starting. The cloud host agent interpreted this as a hardware problem, moved the VMs to other servers and flagged the hardware unit. Of course, the agent on the new hardware saw the same problem and handed off to the next redundant hardware, and so on, creating a spreading tide of reported hardware failures and inoperable VMs. Then, in June 2012, Amazon Web Services suffered a major outage.
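The hand-off pattern described above can be sketched in a few lines of Python. This is only a toy model, not how any vendor's host agent actually works: the names Host, start_vm and max_migrations are hypothetical, and the migration limit is an assumed safeguard added for comparison. It shows how a fault that travels with the VM gets blamed on each host in turn.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """A hypothetical physical server managed by a host agent."""
    name: str
    flagged_faulty: bool = False


def start_vm(hosts, vm_can_start, max_migrations=None):
    """Try each host in turn; return the names of hosts flagged as faulty.

    The agent assumes a failed start means bad hardware, so it flags the
    host and moves the VM on. max_migrations is an assumed safeguard that
    stops the hand-offs after a fixed number of moves.
    """
    flagged = []
    for migrations, host in enumerate(hosts):
        if max_migrations is not None and migrations > max_migrations:
            break  # give up and suspect the VM, not the hardware
        if vm_can_start:
            return flagged  # VM is running, nothing more to do
        host.flagged_faulty = True  # the agent blames the hardware
        flagged.append(host.name)
    return flagged


def make_hosts(count):
    return [Host(f"host-{i}") for i in range(count)]


# A VM that cannot start anywhere (for example, because of a bad certificate).
unbounded = start_vm(make_hosts(5), vm_can_start=False)
bounded = start_vm(make_hosts(5), vm_can_start=False, max_migrations=1)

print(len(unbounded), "hosts flagged without a limit")
print(len(bounded), "hosts flagged with a limit of one migration")
```

Without a limit, every host the VM touches is flagged even though none of them is actually broken. With the assumed limit, the damage stops after the first hand-off.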

Traditional data centre architecture is fairly diverse. This has had the effect of isolating hardware failures to single machines or a small group. Fault tolerance is an integral part of cloud computing, and the recovery process has been moved from the hardware layer to the software layer, where management software identifies failures and recovers from them or routes around them. Unfortunately, when this management breaks down, it can cause a major failure across thousands of systems.

The realization is growing that even single points of failure are actually sequences of events, sometimes with remedial intervention by humans or automation adding to the problem. Cloud reliability is good on the whole, but needs to get better. When your data is effectively spread across hundreds or thousands of systems, potentially around the globe, you need to be sure it is safe and secure. All is not so dreary though. Some cloud vendors have achieved 99.99% uptime across their systems, which is far better than many enterprises can boast.
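To put an uptime figure like 99.99% in perspective, it helps to convert it into allowed downtime. The short Python sketch below does that arithmetic, assuming a 365-day year and counting all downtime together. The function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def downtime_minutes_per_year(uptime_percent):
    """Convert an uptime percentage into minutes of downtime per year."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR * 60


for uptime in (99.0, 99.9, 99.99):
    minutes = downtime_minutes_per_year(uptime)
    print(f"{uptime}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.99%, that works out to a little under an hour of downtime per year, compared with more than three and a half days at 99%.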

- 30 -

Keywords: cloud,reliability,storage,resource
