People Problems at the NYSE

Recent newspaper articles tried to explain the 4-hour outage that the New York Stock Exchange (NYSE) experienced during the trading day on Wednesday, July 8, 2015. The knee-jerk reaction was that it was a coordinated cyber attack, since The Wall Street Journal home page and United Airlines had widely reported outages the same day. When that theory was ruled out, attention turned to how such a prolonged outage at the NYSE could have happened. After all, aren’t core financial systems backed up and ready to fail over instantaneously to redundant systems, and then to disaster recovery sites if the on-site backup systems don’t work?

Finally, some reporters concluded that the root cause of the outage was the Draconian firing practices of ICE (Intercontinental Exchange), which, on taking over NYSE Euronext, eliminated 40 percent of NYSE’s staff, including people with decades of experience running the NYSE’s systems. Apparently this loss of expertise meant that those remaining, with relatively less experience, had much greater difficulty bringing the systems back.

I can relate to all of the above. In the 1970s I worked for SIAC (Securities Industry Automation Corporation), beginning soon after its inception. SIAC was formed by merging the IT (then EDP) department of the NYSE with the IT department of its much smaller sibling, the Amex (American Stock Exchange). At that time, SIAC ran both exchanges’ computer systems, supporting the two trading floors and the back-office clearance and settlement systems. The NYSE had high-end, expensive, redundant trading-floor systems designed to fail over automatically to “hot” backup systems not once, but twice, with no loss of transactions. The Amex, in contrast, ran two independent, simpler and cheaper systems. If one failed, Amex operators threw a physical switch from one system to the other, and the couple of transactions not processed in the switchover were re-entered manually.

During the 1970s, computers were orders of magnitude less reliable than they are today. Hardly a day went by that the huge (for that time) IBM computers did not have to be IPLed (rebooted) following a component failure or a software abend (abnormal end). These procedures often meant that trading had to be halted until the systems could be brought back up. To overcome such outages, the NYSE installed what was called “large core storage,” or LCS. LCS, as the name implies, was high-speed memory comprising tiny donut-shaped magnetic cores with wires running through them. The idea was that LCS would be shared among all three active computer systems: the primary system, the hot backup and the warm backup. If the active system went down, the hot backup would pick up instantaneously, without the loss of a single transaction, using up-to-the-second data stored in the common LCS. If the hot backup also failed, the warm backup would be brought online and would likewise pick up the most recent transaction data from the LCS.

Much to everyone’s surprise, the NYSE experienced a whole series of outages during the same period in which the Amex had hardly any downtime. How could this be? It turned out that, from time to time, the active NYSE system corrupted the LCS data, causing itself to fail. The other two systems, in turn, picked up the corrupted data and failed as well. The simple answer was to isolate the systems and simplify recovery, as the Amex had done. Here simplicity trumped complexity.
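The failure mode above can be sketched in a few lines of code. This is a minimal, hypothetical illustration (not SIAC’s actual software): it assumes each replica validates the state it inherits on takeover, and contrasts the NYSE-style design, where every backup restarts from one shared memory image, with the Amex-style design, where each system keeps its own copy.

```python
class Replica:
    """A trading system that validates its working state on takeover."""
    def __init__(self, name):
        self.name = name

    def take_over(self, state):
        # A replica can only resume if the state it inherits is sane.
        if state.get("checksum") != sum(state.get("txns", [])):
            raise RuntimeError(f"{self.name}: inherited corrupt state")
        return f"{self.name} running from txn {len(state['txns'])}"

def failover_shared(replicas, shared_state):
    """NYSE-style: every replica restarts from the one shared LCS image."""
    for r in replicas:
        try:
            return r.take_over(shared_state)
        except RuntimeError:
            continue  # corrupt shared data fails every replica the same way
    return "total outage"

def failover_isolated(replicas, private_states):
    """Amex-style: each system holds its own state; a few lost
    transactions are simply re-entered by hand after the switch."""
    for r, state in zip(replicas, private_states):
        try:
            return r.take_over(state)
        except RuntimeError:
            continue
    return "total outage"

good = {"txns": [5, 7], "checksum": 12}
bad = {"txns": [5, 7], "checksum": 99}   # primary scribbled on shared memory

systems = [Replica("primary"), Replica("hot"), Replica("warm")]
print(failover_shared(systems, bad))                  # total outage
print(failover_isolated(systems, [bad, good, good]))  # hot takes over
```

The point of the sketch is that redundancy only helps against independent failures; a shared state store couples all three systems to the same fault.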

My other experiences supporting the claim that ICE had iced the experts who might have handled the recent outage more expeditiously came during my tenures as a senior IT executive at two financial services companies. It became very clear to me, in both cases, that the health and well-being of computer systems frequently depend on a handful of individuals, many of whom were involved in developing the original systems and had learned through bitter experience how to fix a host of problems. Most of that knowledge was in their heads, even though documentation on the applications and operational procedures usually existed. I learned to value those individuals and depend on them to deal quickly, effectively and selflessly with issues as they arose. They were the real heroes who kept the systems humming.
