Resilience Through Logging E035
A Bit of Security for September 12, 2024
Today we’re going to discuss logging, by diving into the world of trustworthy resilient computing. For a time while at IBM I was staff to the director of the Myers Corners Lab, aka the Poughkeepsie Programming Center. We developed the mainframe operating system MVS (now known as z/OS) along with subsystems for security, print, and error handling.
Thursday morning staff meetings were always vibrant, but for a while there was one theme that kept coming back. The IT department would report on their activities over the prior week. For more than a few weeks, they would report that an outage took the system down on Tuesday afternoon, but the system was successfully restored within three hours. “Good, but why did it come down?” “We’re still looking into it.” Next week, same story, but now they had the recovery time down from three hours to two hours, 30 minutes. “Why did it fail?” “Still working on it.”
The third time, the head of the recovery component said, “Can we see the logs?” IT said, “No, we’ve suspended logging on that system.” “Why?” “Well, the system generates so much log data that it fills up the drive, so we shut it off.” Simultaneously, the lab director and the head of recovery both said, “Give us a full day’s logs!”
Turns out, the system was generating so much log data because it was experiencing many, many machine checks. (Microsoft Windows users know a hard machine check as the Blue Screen of Death.) Why, you might ask? In those days, when a piece of hardware couldn’t be fixed in the field, it would be returned to Poughkeepsie and replaced. If the engineering team was able to repair the box, it would go back as a spare. In this case, the problem couldn’t be resolved, so the box was dumped in the junk drawer, which at that time meant it became the host for production in the programming center.
The machine running the programming center was generating about 50,000 hard machine checks per day, and it was recovering from 49,999 of them. That’s resilience. That’s part of why the mean time between failures on a mainframe is measured in years. And that’s also why having comprehensive, forensically durable logging is crucial to a successful security and resilience strategy.
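One way to avoid the trap in this story, where logging was switched off because it filled the drive, is to cap how much disk the logs may use instead of suspending them. Here is a minimal sketch in Python using the standard library's RotatingFileHandler. It is an illustration only, not what the programming center ran: the logger name "machine_checks", the function names, and the size limits are all assumptions made for this example.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_bounded_logger(path, max_bytes=1_000_000, backups=5):
    """Build a logger whose files stay within about max_bytes * (backups + 1).

    The logger name "machine_checks" and the default limits are hypothetical.
    """
    logger = logging.getLogger("machine_checks")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid stacking handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger

    # When the active file would exceed max_bytes, it is renamed to path.1,
    # older files shift to path.2 and so on, and the oldest is discarded.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


def recovery_rate(total, recovered):
    """Fraction of events the system recovered from."""
    return recovered / total
```

Calling make_bounded_logger("checks.log") and logging every machine check keeps the most recent records on disk without ever filling the drive. The trade-off is that the oldest records are discarded, so this bounds disk use but does not by itself make logs forensically durable; that would require shipping them somewhere else before they rotate out. With the figures from the story, recovery_rate(50_000, 49_999) works out to 99.998 percent.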
Resilience Through Logging E36 2024 09 12
What value can your logs give you? Listen to this -
Let me know what you think in the comments below or at wjmalik@noc.social
#cybersecuritytips #logging #Problemdiagnosis #troubleshooting #BitofSec