Born at the end of 1945, I do not come from a computer-literate generation, and so in my 40s I took some three to four years out of my working life to learn about Information Technology. I was eager to learn and took every opportunity to attend lectures and talks that would broaden my knowledge, even though sometimes I barely understood the topic. In time it all came together and led me to a new career in IT.
There was one talk I attended that had this opening statement: “There are only two kinds of people in this world, those who have lost data and those who are going to!”
The following is a simplified and deliberately broad-brush picture, but it is important to grasp in order to understand what might have gone wrong at British Airways over the Bank Holiday weekend.
The talk I attended was about data backups, but you can better think of them as “system restores”, because it isn’t just the user-generated data you are restoring but all the operating system and file-handling software too. In a single-server environment, the IT services are shut down while the restore takes place, but once it is completed, a quick reboot and you are back up and running again.
The next stage beyond this is a “mission critical” server, which might be running email or a database. “Mission critical” is best understood as an IT service that MUST be up and running 24/7. Things can stop working for all manner of reasons, so a typical solution is to have a pair of identical servers configured in “failover mode”: if one machine fails, the other takes over seamlessly and without any human intervention. It is totally automatic.
In my personal experience, the two common mistakes with these systems are the following:
With backups, nobody bothers to check whether the backups, which are automated, have actually run, or whether the integrity of the data was okay; nor do they ever practise a total restore from a backup to test both the system and themselves. I have found several sites in my time where, because of this, a company discovered after a system crash that it had lost data: the only valid backup was three months old, all the subsequent ones having failed without anybody noticing.
With failover servers, nobody has ever tested the pair by deliberately switching one server off to see whether the other takes over automatically. Worse still, they discover that yes, the failover does work, because it kicked in some time ago (nobody knows when) and one of the servers is actually dead, but again, nobody checked.
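The first mistake above is easy to automate away. Here is a minimal sketch of a nightly check, assuming a hypothetical setup where backups land as `.tar.gz` archives in a directory, each with a `.sha256` sidecar file recorded at creation time; the directory path and age threshold are illustrative, not a real BA configuration.

```python
import hashlib
import os
import time

BACKUP_DIR = "/var/backups/nightly"   # hypothetical location
MAX_AGE_HOURS = 26                    # alert if the newest backup is older than this


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_latest_backup(backup_dir=BACKUP_DIR, max_age_hours=MAX_AGE_HOURS):
    """Return (ok, message) for the newest .tar.gz archive in backup_dir.

    A backup passes only if it exists, is recent enough, AND its contents
    still match the checksum recorded alongside it when it was written --
    catching exactly the silent failures described above.
    """
    archives = [f for f in os.listdir(backup_dir) if f.endswith(".tar.gz")]
    if not archives:
        return False, "no backups found at all"

    newest = max(archives,
                 key=lambda f: os.path.getmtime(os.path.join(backup_dir, f)))
    path = os.path.join(backup_dir, newest)

    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        return False, f"newest backup {newest} is {age_hours:.0f} hours old"

    sidecar = path + ".sha256"
    if not os.path.exists(sidecar):
        return False, f"{newest} has no recorded checksum"
    recorded = open(sidecar).read().split()[0]
    if sha256_of(path) != recorded:
        return False, f"{newest} is corrupt: checksum mismatch"

    return True, f"{newest} is recent and intact"
```

Note that this only verifies the backup files themselves; it is no substitute for the full practice restore the author describes, which is the only real test of whether you can rebuild a system from them.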
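The second mistake can also be caught routinely. The sketch below assumes, purely for illustration, that each half of the pair can be probed with a plain TCP connection on a known port; the hostnames and port are invented placeholders, and a real drill would still involve deliberately powering off the primary as described above.

```python
import socket

# Hypothetical addresses for a primary/standby pair -- substitute your own.
PRIMARY = ("primary.example.internal", 5432)
STANDBY = ("standby.example.internal", 5432)


def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failover_drill(primary=PRIMARY, standby=STANDBY):
    """Report the health of both halves of the pair.

    Run this before and after deliberately switching the primary off:
    if the standby is already dead, a primary failure means a total
    outage -- exactly the silent failure described above.
    """
    states = {
        "primary": is_alive(*primary),
        "standby": is_alive(*standby),
    }
    if not states["standby"]:
        print("WARNING: standby is down -- a primary failure would be an outage")
    return states
```

Scheduling something like this to run every few minutes, and alerting on the result, turns “nobody checked” into a solved problem, at least for the simple dead-server case.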
In the Present World
An outfit like British Airways is global, and we know the systems crash stranded BA passengers worldwide, so their IT systems will be far more complex than my two examples above. Yet the principles are identical, even when you are running multiple data centres and software systems globally. Eventually what went wrong will come out, but whatever sugary words they use, the simple fact of the matter is that they were totally unprepared for what happened, and the blame for that must rest wholly with the management, because they did not do their job properly. By not doing their job properly in the first place, they will not only have cost their company some £150 million but also, in the process, totally trashed the BA brand.
The following is a quote from an article on ITPro: http://www.itpro.co.uk/strategy/28728/british-airways-ceo-blames-it-outage-on-power-supply-issue?
“On Sunday, Cruz offered an update, saying the IT teams had made progress trying to fix the problems. “Many of our IT systems are back up today and my colleagues across the airline are working very hard to build back our flight programme and get as many of our customers as possible away on their travels,” Cruz said on Sunday.
He told Sky News that the IT problems “have all been local issues around a local data centre”, denying union GMB’s claims that the problem could have been avoided had BA not outsourced IT jobs to India in cost-cutting measures.
However, BA has not detailed the nature of the power supply issue behind IT problems on one of the busiest weekends of the year, nor explained why backup systems weren’t in place to prevent such a major outage happening.”
I cannot in all honesty comment on what caused their problems, let alone how to fix them, except to say that for a business so totally dependent on these systems, they should never have been in this position, ever: whatever the cause, they clearly did not have a tested “Disaster Recovery Plan” in place. However, there is one very clear message that all airlines should perhaps think about, and that is communications. The biggest gripe from every passenger stranded at Heathrow who had a microphone stuck up their nose by a TV crew was that “there is no information”, and that becomes really annoying if you have a couple of children and are stuck in an overcrowded Departure Lounge going nowhere.
The problem for the BA staff on duty was quite simply that, because the whole system was down, they had no information to pass on to passengers either! The vast majority of people who can afford to fly will have smartphones with them, which they may even use to check in, so it would seem that although the airline runs an automated and fully integrated system, it really needs a separate communications system too.
It would require some imagination, but it would not be impossible, in a repeat of this type of situation, to reach out to individual customers via text, email or Facebook to keep them informed. People understand that shit happens; they don’t like it, but keeping them informed, letting them know that the company is aware of them as individuals, is crucial to retaining your brand ethos. Handle it the way BA did over this past weekend and you have trashed your brand and made your own redundancy highly likely.