Indiscriminate Prose: Too much logging is almost enough

Not too long ago, I stumbled across Log Everything All the Time on the site High Scalability. The short version is that the classic method of turning on debug logging when needed to track down a problem is useless because it implies that logging was off when the problem actually happened. Anyone who's dealt with an intermittent problem (only happens two or three times day, or one out of a thousand transactions, or if at least six people are wearing blue jackets) in a complex system can attest to the futility of trying to divine which combination of the tens or hundreds of thousands of events flowing by per day actually blew up. This is the deadly and elusive Heisenbug, an easily startled creature which runs away from the debug switch like a cockroach from a light switch. This particular article was written from the perspective of writing a large scale application. Building a system in this way has the huge advantage that you (where you is the programmer) get to drop debug and analysis code in at any point in the program flow. Suspect that data structures aren't getting initialized right? Print out a dump of the data structure right before the initialization routine sends it back. Garbage strings showing up in the database? Trace the entire lifecycle of the data that go into it, with a before and after every piece of code that touches them. Now that's all well and good for you lucky guys are busy actually building systems on general purpose computers from code up. But what about us poor schmucks who are running the networks that let those web 2.0 apps fling AJAX across the globe? By comparison, we're fighting with one hand tied behind our back while blindfolded.

Most network devices are, for all intents and purposes, embedded devices. This means the code running on them is far more rigidly fixed than on, say a Linux server. Open source apps can be patched and recompiled, PHP pages can have extra print statements thrown in, and even fussy vendor binaries can be spied on with strace and tweaked with library preloading. With your typical network device, though, unless you've got a really good service contract, you're pretty much stuck with whatever you've got.
The built in even logging on most network devices tends to be a bit on the anemic side. At best, you'll typically get reports of what the device considers to be unusual events. At worst, you'll get nothing at all. I've worked with at least one switch that taunts you by by just logging when an event occurs. No details, not even what even occurred - just a timestamp.
The state of a network is almost all transitory. Transactions on servers, even without logging, tends to leave various breadcrumbs behind - database entries, file timestamps, emails, etc. Once a state table entry for a stateful firewall expires, though, no trace is left behind.

The best remaining option is to constantly squeeze snapshots of the state of the network and track them. Then, later on, when someone comes to you with a report of a problem from three days ago, you at least stand a fighting chance of piecing together the half dozen otherwise normal events that line up to point at the real problem. For example, we once had a professor who's machine kept falling off the network over weekends. All of the usual troubleshooting techniques - look for errors on the port, test out the cabling, etc - showed everything perfectly normal. Looking through our more extensive state monitoring logs, we found two sets of interesting events, however. On Saturday morning, the port briefly lost a link, and a different MAC address started showing up on the port. Saturday evening, the link flickered again, and the original MAC address showed up again. Looking up the address showed that it belonged to a grad student working for the professor. The professor's own assistant was unplugging his machine! Without logging all of the "normal" state of the network, we'd probably never have found the real cause. So what information should you log? Here are what we log for network troubleshooting.

Switch FDB: This gives you the record of where all of your machines have been. More importantly, it can tell whether or not a machine was talking on a port at a specific time, rather than relying on a users guestimate.
ARP Tables: Utterly indispensable for catching misconfigurations. We've all had that bozo who set his IP address to the local gateway, or a nearby server. With this info, you already have a record of the offender, rather than tracking it down manually.
DHCP Leases: If you don't have fixed addresses everywhere, then without tracking DHCP leases, it's much harder to tell what machine a given IP address belonged to at a given time.
DHCP Fingerprints: By capturing extra information about DHCP lease requests, such as the option request list and the VCID, it is possible to do passive OS identification using techniques like those used by PacketFence. This can be particularly useful when tracking down disallowed or problematic devices, such as SOHO NAT routers.
IP Flows: An invaluable record of who talked to who when. You can either grab it from your routers directly with flow exports, capture it using span ports and a package like Argus, or even better, both. By capturing snapshots of IP flows from multiple points and looking for discrepancies, you can narrow down exactly which one of the three routers and five switches in the data path threw away a few packets. Also, your security guys will love this for helping to track down patient zero in virus outbreaks.
SNMP Traps: These are often your best insight into what the network devices think are going on. BGP session flapping, port security violations, failed login attempts - often times you can find events like these being generated in traps that don't even show up in the device log files.

Indiscriminate Prose

Thursday, November 15, 2007

Too much logging is almost enough

No comments: