A little while ago, we wrote a blog post about IT’s biggest downfall: the people involved. That post talked about the NHS mass email fiasco, and was itself inspired by 2 IT disasters: some printers in the US that started spewing anti-Semitic messages, and a Twitter AI bot that had a meltdown and ‘learnt’ how to be a racist, hate-filled internet citizen. I’m not going to tell you any more about those stories; you can go and read about them any time you like and have a good giggle. But now we have another, slightly more serious scenario that has really brought home to many people just how much of a risk the human factor is. I’m talking, of course, about BA.
On the 27th of May, over 75,000 travellers were left stranded and confused as an IT meltdown caused chaos for the airline. The IT system malfunction affected flights worldwide, with travel manifests disappearing and every single flight being cancelled. The problems prevented scheduled flights from taking off, incoming aircraft were unable to park at their gates and passengers were stranded on the tarmac. The airline was unable to rebook passengers onto other flights as a solution, leaving tens of thousands of travellers stranded. So what on earth happened to cause such a widespread mess?
An engineer disconnected a power supply somewhere in a datacentre, and when he plugged it back in, a power surge knocked everything out. That’s all it was. A guy in a datacentre unplugged the wrong wire, and over 75,000 people were thrown into turmoil.
The thing is, there are some glaring holes in this story. For starters, BA has two datacentres near Heathrow that are about a kilometre apart (we’ve done our homework). The point of having 2 datacentres is that if something goes wrong in one, the systems can fail over to the second and keep running as normal while the problems with the first are fixed. So how did a power surge affect both datacentres? Why did the UPS (uninterruptible power supply) not kick in? Even if that too had failed, there were plenty of backups and fail-safes in place, so why didn’t they do their job and pick up the slack? In a system that is tested for failure on a regular basis, it seems unbelievable that a simple power surge knocked out all of their systems in such an instant and widespread manner. Datacentres routinely test for power failure, and large centres usually have 4 or 5 generators all powered up and ready to take over within milliseconds of a failure. It’s part of the reason businesses dealing with large volumes of mission-critical data opt to use them: you get all of the hardware, security and backup for a fraction of the cost of a self-installation.
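For the curious, here is roughly what "fail over to the second" means, as a minimal Python sketch. To be clear, this is our own illustration and not BA's actual setup: the `Site` class, the `pick_active_site` function and the site names are all made up, and the `healthy` flag stands in for whatever real health checks a production system would run.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """A hypothetical datacentre; `healthy` stands in for a real health check."""
    name: str
    healthy: bool = True


def pick_active_site(sites):
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if site.healthy:
            return site
    return None


primary = Site("dc-primary")
secondary = Site("dc-secondary")
sites = [primary, secondary]

print(pick_active_site(sites).name)  # dc-primary

primary.healthy = False  # simulate a power failure at the primary site
print(pick_active_site(sites).name)  # dc-secondary

secondary.healthy = False  # both sites down: nothing left to fail over to
print(pick_active_site(sites))  # None
```

The last case is the interesting one. Failover only helps while at least one site stays up, which is exactly why a single power surge taking out everything is so hard to swallow.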
I don’t think we’ll ever know for sure if this is the real reason, or if there was more to it. Concerns about budget may have led BA to cut back on infrastructure, so that without the extra fail-safes in place such a simple thing could have wiped out their systems. Or it could have been a more calculated internal attack, designed to cause chaos. Only those in the room at the time know exactly what happened, so all we can do is guess, or trust their explanation. But what we do know is that that poor engineer has the wrath of more than 75,000 angry travellers coming his way.
At All Your Computers we love a good scandal, a chance to explore why the story might not be quite right and talk all about the fun IT bits in between. To keep up with our ramblings, follow our blog or subscribe to our newsletter. Or if you have any questions about making sure your business never fails thanks to a power surge, just get in touch with me and we can talk shop.