Friday, May 16, 2008

Successful Failures and Faulty Successes

Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin. — John von Neumann

So unless your job has nothing to do with IT, or you've been living under a rock somewhere out of Blackberry range, you've no doubt heard about the utterly terrifying Debian OpenSSL vulnerability, which reduced all those vast 4096 bit private keys to effectively 15 bit keys. (For those of you who don't speak crypto, this means that the bad guys can guess your private key in about 32 thousand guesses - pretty trivial for any modern computer.)

The problem, ironically enough, came about when Debian developers attempted to fix some compiler warnings about a function in OpenSSL that was using uninitialized memory. While reading uninitialized memory is normally a horrible idea, OpenSSL was using it as an additional source of entropy, to make the private keys it generated more random. Unfortunately, the actual result of the fix was to remove nearly all entropy from the pool, leaving behind only the roughly 15 bits supplied by the process ID.
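To see just how small 15 bits of entropy really is, here's a toy sketch - emphatically not OpenSSL's actual code; the function and the key derivation are made up for illustration - of a key generator seeded only by a process ID, and an attacker recovering the key by simple brute force:

```python
import hashlib
import random

# Toy illustration: derive a "key" from a PRNG seeded only by a process ID.
# Linux PIDs default to the range 0..32767, so there are at most 2**15
# possible seeds - and therefore at most 2**15 possible keys.
def weak_key(pid: int) -> bytes:
    rng = random.Random(pid)  # the entire entropy pool is just the PID
    return hashlib.sha256(rng.getrandbits(256).to_bytes(32, "big")).digest()

# A victim generates a key from some PID unknown to the attacker...
victim_key = weak_key(12345)

# ...and the attacker recovers it by trying every possible PID.
recovered = next(pid for pid in range(2**15) if weak_key(pid) == victim_key)
print(recovered)  # 12345
```

The brute-force loop finishes in well under a second on any modern machine, which is exactly why a 15 bit keyspace offers no protection at all.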

I think it's pretty safe to say that this was a catastrophic failure. While Debian has done an admirable job of releasing a fix for the tool that generates the weak certificates (though you still have to go back and replace already existing ones), the question remains - how in the world did Debian's OpenSSL exist in this blatant failure mode, completely undetected, for two years?

This problem is a beautiful illustration of part of why cryptography, and security in general, is so devilishly difficult to get right. In normal software testing, successes are successes, and they're good, and failures are failures, and they're bad. Simple enough, right?

When you're testing security, though, things are different. In security, you also have to make sure to test for what I like to call successful failures and faulty successes.

Take a firewall system, such as iptables or pf, for example. Without one, a client attempting to make a TCP connection expects to succeed. If the connection succeeds, the test succeeds; if it fails, the test fails. Once the firewall is in place and configured to block that connection, though, that connection damn well better fail! That counts as a successful failure - a case where you succeeded in making something like a TCP connection fail exactly where its success would be undesirable. Likewise, when a file is encrypted, unauthorized attempts to read it (or at least, to extract meaningful data from it) by anyone without the appropriate key are expected to fail.
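A "successful failure" test inverts the usual assertion: the test passes only when the operation fails. Here's a minimal sketch in Python (the helper name is made up, and rather than assume a firewall is running, it manufactures a port that is known to be closed):

```python
import socket

def expect_refused(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True when the connection is blocked - the 'successful failure'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # the connection went through: a faulty success
    except OSError:
        return True       # the connection failed, as the policy requires

# Find a port that is definitely closed: bind an ephemeral port, then free it.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()

print(expect_refused("127.0.0.1", closed_port))  # True
```

The point of the inversion is that a bare "can I connect?" test would report this exact situation as a failure, when from the security policy's point of view it is precisely the outcome you want.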

Likewise, you also have to test for the inverse case. Back to our firewall example, let's say that the admin carelessly mistyped the mask, leaving our service unprotected from ranges that we don't want to have access. Connection attempts will all of a sudden start succeeding where we don't want them to. We now have a faulty success. Even worse, we won't notice unless we happen to test from the tiny sliver of IP addresses that was erroneously granted access. Back in the crypto realm, this is what happened to OpenSSL: it failed to prevent success, in that private keys which should have been infeasible to guess could suddenly be recovered with trivial effort.
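The mistyped-mask scenario is easy to demonstrate with Python's ipaddress module (the subnet and the attacker's address here are made up for illustration): dropping a single digit from a /24 mask turns a 256-address allow rule into one that matches a quarter of the entire IPv4 space.

```python
import ipaddress

# Intended rule: allow only the admin subnet 10.0.5.0/24 (256 addresses).
intended = ipaddress.ip_network("10.0.5.0/24")

# Mistyped rule: one dropped digit makes it 10.0.5.0/2, which covers
# 0.0.0.0 through 63.255.255.255 - about a billion addresses.
mistyped = ipaddress.ip_network("10.0.5.0/2", strict=False)

attacker = ipaddress.ip_address("8.8.8.8")  # some host out on the internet
print(attacker in intended)  # False: the correct rule keeps them out
print(attacker in mistyped)  # True: the typo silently lets them in
```

Note that every connection test run from inside 10.0.5.0/24 still passes under both rules, which is exactly why this kind of faulty success goes unnoticed.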

These additional twists on defining success and failure help to make testing security software and configurations devilishly difficult. The OpenSSL bug didn't cause any visible changes in the test results. Everything still compiled; the output was still valid; data was encrypted and decrypted properly; no regression tests failed.

Hopefully someone someday will figure out a more reliable way to test this kind of code than the current method of having people who've forgotten more about math than most of us ever even heard of stare at it until drops of blood appear on their foreheads. Until then, we'll just have to be ready to roll out patches, scramble passwords, and revoke certificates when the next inevitable vulnerability or system compromise happens.