Saturday 24 August 2013

Infrastructure collapse


Nasdaq crash triggers fear of data meltdown
Digital infrastructure exceeding limits of human control, industry experts warn


23 August 2013


A series of system crashes affecting Google, Amazon, Apple and Microsoft in the past fortnight has brought warnings that governments, banks and big business are over-reliant on computer networks that have become too complex.

The alarm was sounded by industry experts in the aftermath of a three-hour network shutdown that paralysed the operation of the Nasdaq stock market in New York on Thursday, on what should have been a quiet day of routine share trading on the exchange.

Jaron Lanier, the author and inventor of the concept of virtual reality, warned that digital infrastructure was moving beyond human control. He said: "When you try to achieve great scale with automation and the automation exceeds the boundaries of human oversight, there is going to be failure. That goes for governments, for consumer companies, for Google, or a big insurance company. It is infuriating because it is driven by unreasonable greed. In many cases, the systems that tend to fail, fail because of an attempt to make them run automatically with a minimal amount of human oversight."

The Nasdaq collapse was caused by a communication failure between its platform for processing quotes and trades and that of another party – reportedly the New York Stock Exchange. So serious was the fallout that it resulted in a third fewer shares being traded in the US on that day.

"These outages are absolutely going to continue," said Neil MacDonald, a fellow at technology research firm Gartner. "There has been an explosion in data across all types of enterprises. The complexity of the systems created to support big data is beyond the understanding of a single person and they also fail in ways that are beyond the comprehension of a single person."

From high-volume securities trading to the explosion in social media and the online consumption of entertainment, the amount of data being carried globally over private networks, such as stock exchanges, and over the public internet is placing unprecedented strain on websites and on the networks that connect them.

By 2017, an amount of data equivalent to all the films ever produced will be transmitted over the internet in a three-minute period, according to Cisco, a manufacturer of communications equipment.

Internet traffic per person is today measured in gigabytes, with six gigabytes of information exchanged per person per year. By 2017, that figure will have risen to 16 gigabytes. By then, global data will be counted in zettabytes – roughly one trillion gigabytes.
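For readers checking the arithmetic, the zettabyte-to-gigabyte conversion quoted above can be verified in a few lines (a minimal sketch, assuming decimal SI units, since the article does not specify decimal or binary prefixes):

```python
# Unit check: a zettabyte is 10^21 bytes and a (decimal) gigabyte is
# 10^9 bytes, so one zettabyte is indeed about one trillion gigabytes.

BYTES_PER_GB = 10**9    # decimal (SI) gigabyte
BYTES_PER_ZB = 10**21   # decimal (SI) zettabyte

gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB
print(gb_per_zb)        # 1000000000000, i.e. one trillion
```

Using binary prefixes (gibibytes and zebibytes) would give a figure of the same order, so "roughly one trillion" holds either way.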

High-frequency trading, in which computers owned by hedge funds and banks automatically buy and sell high volumes of shares, has triggered and magnified the impact of IT failures on stock markets. In May 2010, $862bn (£553bn) was erased from the value of US shares in 20 minutes when one company triggered a cascade of selling.

"You get under the covers and high frequency trading algorithms are beyond understanding," said MacDonald. "Sub-millisecond trades taking place, tens of thousands per second, and when that fails it fails spectacularly. That is what you are seeing manifested in Nasdaq."

This month's spate of outages came to international attention with the two-hour failure of the New York Times website on 14 August, during which it resorted to publishing articles on its Facebook page. While a malicious attack was initially suspected, the problem was caused simply by scheduled system maintenance.

On the same day, Microsoft customers began to report email failures. The outage was traced to problems with the Exchange ActiveSync service which serves email to many of the world's smartphones. When Exchange hit a glitch, the sheer volume of phones trying to connect triggered a ripple effect that took three days to control.

On 16 August, many of Google's websites, from email to YouTube to its core search engine, suffered a rare four-minute global meltdown. The episode, the cause of which Google has not explained publicly, served to illustrate the sheer volume of traffic its servers process. During its outage, one monitor put the drop in global internet traffic at 40%.

Three days later, on 19 August, Amazon's North American retail site went down for about 49 minutes, with visitors greeted with the word "Oops". No explanation was given, but one estimate by Forbes put the cost to Amazon at nearly $2m in lost sales.

On 22 August, Apple's iCloud suffered a blackout that affected a small number of its customers but lasted 11 hours. Storing the collections of photos, music, documents and address books that would once have been kept on shelves at home, iCloud now has 300 million users.

"The volume of data overall is absolutely exploding," says Rachel Dines, senior analyst at Forrester. "This week has been especially bad for downtime. Because we are now so dependent on these high-profile services, we notice them more. The impact for the companies is huge, from lost revenue but also, more importantly, from reputation damage."

James Acres, whose company Netcraft monitors outages at data storage companies, says digital businesses are racing, not always successfully, to build the infrastructure needed to cope with the data that many consumers are gradually transferring into the cloud from the hard drives of their laptops or their collections of CD-Roms.

"More and more people are putting their data in the cloud," says Acres, "and to deal with this services are changing their back end to cope, and because it's all quite new they are experiencing some difficulties."

As well as selling books and music, Amazon is the largest provider of public digital storage space worldwide, and this side of the business was hit by an outage in 2012 despite upgrades designed to make its servers less likely to collapse.

"The outage at Amazon last year was traced back to some of the processes and technologies they had put in place to make it more resilient," said MacDonald. "It is almost like an auto-immune disease, where the systems they created to make it more resilient actually spread the failure more rapidly."

Lanier, whose book Who Owns the Future? details the concentration of power among organisations with the largest computers, said outages would increase until human oversight was improved. "We don't yet have a design for society that can run this technology well. We haven't figured out what the right human roles should be."
