Wednesday, December 19, 2007

Coding and Complexity

First off, let me just make a quick confession - while my undergraduate degree was stamped with "Computer Science" as my major, I don't really consider myself to primarily be a programmer. Sure, I do actually spend a good number of my days mucking around with writing code (usually Perl, occasionally Ruby), but my job is really IT support, specifically networking. I deal with switches, routers, wireless, VPN, and a handful of Linux servers supporting the network with DNS, DHCP, etc. The code writing that I do is almost exclusively to support everything else, such as working on a host registration system or device monitoring scripts. The software I write is to directly address a need, rather than to be sold to address someone else's need.

That said, Steve Yegge's latest rant, Code's Worst Enemy, struck a chord with me when I read it.

I happen to hold a hard-won minority opinion about code bases. In particular I believe, quite staunchly I might add, that the worst thing that can happen to a code base is size.

Now, as someone who does not consider himself a spectacular coder by any means, I would certainly feel quite daunted by tackling a 500k line codebase by myself. On the other hand, as a professional coder, Stevey ought to be able to casually fling around great swaths of code, using advanced software repositories and indexing tools, right? But no - he feels that, all other things being equal, less is more.

One feature of large code bases that I think he gave short shrift to was the idea of complexity. He talks a little bit about how complexity certainly makes a given code base harder to work on, and about how some of the automated tools that try to deal with it, such as refactoring tools, just make the problem worse by bloating the code base even more.

This is something significant in his argument, I think. In this example, we have two code bases, before and after being run through the automatic refactoring tool. The initial state has a given level of functionality, size, and (for lack of a better word) "goodness". The final state has greater size, and therefore less goodness, but identical functionality! This mirrors his stated goal of taking his existing game and rewriting it with identical functionality but less than half the lines of code.

I think the explanation boils down to this: we can only fit so much in our brains at a time. Great programmers can mentally swap in more of the big picture at once, but everyone has their limit. This limit is why we decompose programs into manageable subroutines, each of which can be understood (at least partially) in isolation from the rest. It is why we hide massive chunks of functionality behind a handful of calls into a library. The smaller the chunk we're working on, the more likely we are to be able to fully understand it and not screw up.

From here, the trick to making sense of Stevey's size argument is realizing that there are two completely different kinds of complexity at play here. If you're writing code to do, say, an FFT, you've got to know the math behind it and how it works. That's a fair bit of complexity that you've got to hold in your head, and it's going to remain constant regardless of whether you're developing in Java, Ruby, C++, Assembly or BF.
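To make that concrete, here's a quick, throwaway Perl sketch (Perl being my usual hammer) of a naive discrete Fourier transform - the textbook O(N^2) definition, not an actual fast FFT. The particular code is beside the point; the exponential sum in the middle is the part you have to actually understand, and it has to show up in some form whether you write it in Perl, Java, C++, or assembly.

    use strict;
    use warnings;
    use Math::Complex;    # gives us the constants i and pi, and complex exp()

    # Naive O(N^2) discrete Fourier transform, straight from the definition.
    sub dft {
        my @x = @_;
        my $n = scalar @x;
        my @spectrum;
        for my $k (0 .. $n - 1) {
            my $sum = 0;
            # X[k] = sum over t of x[t] * e^(-2*pi*i*k*t/N)
            $sum += $x[$_] * exp(-2 * pi * i * $k * $_ / $n) for 0 .. $n - 1;
            push @spectrum, $sum;
        }
        return @spectrum;
    }

    # A tiny test signal: two full cycles across eight samples.
    print "$_\n" for dft(0, 1, 0, -1, 0, 1, 0, -1);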

This invariant portion of the complexity is what I call inherent complexity. (Please don't tell me if that term isn't original; I know it probably isn't, but I like to pretend.) It's the piece that you can't get away from, since it's what defines the actual problem you're trying to get that hunk of copper and silicon to solve for you. It's the tax code embodied in Quicken, the rules of mathematics in Mathematica, the graph theory in Garmin and TomTom. Remove the inherent complexity from a problem, and all you've got left is a very complex, boring video game with executables instead of high scores and compiler errors instead of health damage.

If the inherent complexity were all there was to it, then knowledge of the problem domain would be all that's required. You wouldn't need a programmer to write Mathematica, just a mathematician to sit down and tell the computer everything she knows about math. Easy, right?

Sadly (or fortunately, if you make a living as a programmer) this is not the case. The person coding has to know extra details that are outside of the problem domain, like the fact that the number 0.1 cannot be represented with absolute precision in a floating point number. Or that if you accidentally tell a computer to loop forever, it will do so. Or that three different sort routines will produce the same final product, but their memory and time requirements can vary by an order of magnitude or more - and not always in the same order, depending on the data set. Not to mention nitty-gritty language details, like dealing with pointers in C or "bless" in Perl.
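A throwaway Perl illustration of that first one, since it bites almost everybody sooner or later:

    use strict;
    use warnings;

    # 0.1 has no exact binary floating point representation, so adding it
    # ten times drifts ever so slightly away from the "obvious" answer.
    my $sum = 0;
    $sum += 0.1 for 1 .. 10;

    printf "%.20f\n", $sum;                       # 0.99999999999999988898...
    print $sum == 1 ? "equal\n" : "not equal\n";  # not equal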

All of these other layers upon layers of crap that get wrapped around the real problem are just extraneous complexity. I mean, let's be honest - learning object oriented design or unit testing may help you write code faster and with fewer bugs, but it won't help with bullet point one of the design requirements for an ERP (or online order system, or factory automation, or... ). It's all work that is, in the end, unquestionably important to creating a finished product, but any time spent working on that extraneous complexity is time not spent on the inherent complexity.

Or, to put it more bluntly, any time you spend appeasing your programming environment is time that you're not spending on solving the actual problem.

Based on this, the best development languages are ones that are fairly thin, succinct, and in general just get the hell out of your way and let you work. Go back a few decades, and compared to the alternatives of the time, this is what C was. The book that was for many years the definitive guide to C was under 300 pages long, and it let the programmer almost completely ignore the messy details of things like programming in assembly. Loops and conditionals suddenly had a simple, easily remembered syntax.

More recently, I think this "thinness" is a huge portion of the success of Ruby on Rails. Starting from a database schema, you can literally create a functional skeleton application in minutes with just a few commands, with all of the components already laid out and neatly organized, and slots already created for niceties such as porting to different databases, unit testing, and version control.

Sure, it's all stuff that any competent programmer can easily handle, but automating it frees up that many more brain cells to do whatever it is the client or employer wants to give you money for.

Monday, December 17, 2007

Frank's Law of Foreign Key Constraints

While bouncing around between a handful of typical LAMP style applications, I've come to a harsh realization of a brutal truth:

Those who do not learn proper foreign key constraints are doomed to create an incomplete, buggy implementation of them in their application.

Minus 50 million points to MySQL for creating an entire generation of web programmers who have only a vague, fuzzy idea of what constraints are, by shipping versions for so long that either didn't have them at all or defaulted to a table type without them.
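Just to show there's no deep magic involved, here's a rough sketch using Perl's DBI, with the database name, tables, and credentials all invented for the example. The one MySQL-specific gotcha is that you have to ask for the InnoDB engine; the long-time default, MyISAM, will happily parse the FOREIGN KEY clause and then silently ignore it.

    use strict;
    use warnings;
    use DBI;

    # Hypothetical database and credentials - adjust to taste.
    my $dbh = DBI->connect('DBI:mysql:database=shop', 'shop_user', 'secret',
                           { RaiseError => 1 });

    $dbh->do(q{
        CREATE TABLE customers (
            id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(100) NOT NULL
        ) ENGINE=InnoDB
    });

    # The FOREIGN KEY clause is the whole point: the database itself now
    # refuses orphaned orders and cleans up when a customer is deleted,
    # instead of every script re-implementing (or forgetting) that check.
    $dbh->do(q{
        CREATE TABLE orders (
            id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            customer_id INT UNSIGNED NOT NULL,
            FOREIGN KEY (customer_id) REFERENCES customers (id)
                ON DELETE CASCADE
        ) ENGINE=InnoDB
    });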

Wednesday, December 12, 2007

Blacklists and You

Blacklists. Whether they're for virus signatures, firewall rules, or spam filters, every security guy who's spent more than 15 minutes in the business knows them, loves them, and hates them. Coding Horror has a mostly-right article summing it up, titled quite simply, Blacklists Don't Work.

On the one hand, all of the downsides he lists are dead on. Most of the reasons that we frantically run around installing anti-virus software on Windows boxes are directly traceable to horribly shortsighted design decisions made as far back as MS-DOS. (Heck, search around, and you'll still occasionally find people having problems due to 8.3 filename restrictions!) And yes, blacklists are horribly inefficient, a royal pain to maintain, and often easily bypassed. After all, there's nothing whatsoever stopping our Evil Virus Author from taking his latest malware and running it through the dozen most popular virus scanners to make sure it slips by all of them.

But really, what are the other options? Are we to truly believe that there is some magic silver bullet waiting in the wings, parked next to the car that runs on water and an Eclipse plugin that can tell when you typed ">" but meant ">="? Jeff puts forward the same idea that Microsoft has been painfully pushing for years - forcing users to run as regular users instead of as administrators all of the time. Now, to be sure, this is absolutely something worth pursuing, both for security and for general reliability. Ask anyone who maintains an open lab on a college campus how much fun it is trying to keep the right printer drivers installed and working when anyone can do anything they want on the machines!

Even this idea falls short, though. Most of those lab computers and corporate desktops, where you have site administrators who can hoard admin privs to themselves, aren't the real problem. Those computers are the ones with people babying them already, making sure passwords are strong, patches are up to date, and virus scanners are running. Sadly, none of that helps with Aunt Millie. She will gleefully open that email from her anonymous new best friend, follow the directions to open the encrypted zip virus, and do whatever is necessary to firmly embed the virus deep in her computer.

Even if you take away administrative rights, within a few months those same hackers will start installing programs in My Documents and using the same startup mechanisms that legit apps do. After all, it's not like you really need full system control to send spam or participate in a DoS attack. And if you do, once you get a program running on the computer, there are usually plenty of privilege escalation bugs and attacks that can get you the rest of the way, regardless of what level the user launched the program at.

The problem is that, as bad as they are, it's not quite fair to say unconditionally that blacklists don't work. They're slow, annoying, and full of holes - in other words, they work quite horribly - and, like democracy, they also happen to work better than any other solution out there right now. I'll agree 100% that we need to start building systems where security is just as important a design goal as reliability and profitability, but until we figure out a way to divine the intent of a given program, some form of blacklisting will always be with us.

Sunday, December 2, 2007

Shared dedicated or dedicated shared?

I like having internet at home. Sure, it's not quite the same as multiple 30M+ pipes at work, but it's plenty fast enough to waste time on YouTube and settle arguments with Wikipedia. These days, most people have pretty much two options for home connections with decent speed: DSL over phone lines, or cable modem over CATV lines. (At this point, I'm not really counting FIOS yet.)

Now, the primary thing that you want from an ISP is a reliable, fast internet connection. All of the other fluffy, feel-good benefits like more free email addresses, little bits of web storage, etc. don't really count for much if your web pages take minutes to load. One of the little canards that DSL providers love to throw around, and that really, really bugs me, is "DSL is dedicated! Cable is shared!"

I'm a network guy. I build and maintain 'em for a living. Now, it's true that with cable modems, the bandwidth is shared per coaxial segment among all of the customers on that segment, while each DSL customer gets to use all of the available bandwidth on that particular dedicated pair of lines. But guess what all those dedicated lines do? That's right, they feed into a set of equipment (routers and uplinks) that is - horrors! - shared.

There isn't a network on this planet that doesn't do some level of oversubscription. Cable modem providers simply have to allocate enough bandwidth to each neighborhood loop to satisfy the actual demands, just like DSL providers have to do with their aggregation points.
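A little back-of-the-envelope Perl, with every number pulled out of thin air purely for illustration, makes the point - both camps sell more "dedicated" bandwidth than the shared gear behind it can carry at once:

    use strict;
    use warnings;

    # Every figure here is made up purely for illustration.
    my %pop = (
        'cable segment'  => { customers => 200, rate_mbps => 6, uplink_mbps => 38 },
        'DSL aggregator' => { customers => 500, rate_mbps => 3, uplink_mbps => 45 },
    );

    for my $name (sort keys %pop) {
        my $p     = $pop{$name};
        my $sold  = $p->{customers} * $p->{rate_mbps};
        my $ratio = $sold / $p->{uplink_mbps};
        printf "%-14s: %4d Mbps sold over a %2d Mbps uplink (%.1f:1 oversubscribed)\n",
               $name, $sold, $p->{uplink_mbps}, $ratio;
    }

Either way, the moment actual demand on a segment or an aggregation point approaches the total that was sold, everybody's "dedicated" connection slows down together.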

Now, when an ISP starts advertising with promises of no hidden BitTorrent filters, secret P2P filters, or anti-criticism termination clauses - in short, the things that the Net Neutrality people have been lobbying for - then I'll care.