Monday, February 28, 2011

Working around MySQL dain bramage

I'm currently in the process of porting a Perl app that was written for MySQL over to a PostgreSQL backend. Originally it was targeted at MySQL 3.x (yes, the app is that old), long before any kind of transactions or constraints were available, so data updates follow this general idiom, sketched in code after the list (those of you used to full-featured databases may want to avert your eyes):

  1. High level function locks tables
  2. Perform queries on database to check for consistency (uniqueness, business rule checks, etc)
  3. Pass parameters to lower level update function
    1. Perform global permissions checking
    2. Perform single update/insert/delete as appropriate
  4. Call additional lower level functions as needed
  5. Unlock tables
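
To make that concrete, here's a stripped-down sketch of the old idiom in Perl with DBI; the table names, columns, and checks are invented for illustration, not lifted from the real app.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=legacy', 'app', '',
                       { RaiseError => 1 });

# High level function: lock, check, delegate, unlock.
sub add_widget {
    my ($name, $owner_id) = @_;

    # 1. Lock every table the update (and its checks) will touch.
    $dbh->do('LOCK TABLES widgets WRITE, owners READ');

    # 2. Application-level consistency checks, since the schema has none.
    #    Note that a die here leaves the tables locked until the connection
    #    drops - part of what makes this idiom so fragile.
    my ($dup) = $dbh->selectrow_array(
        'SELECT COUNT(*) FROM widgets WHERE name = ?', undef, $name);
    die "duplicate widget name\n" if $dup;

    # 3./4. Hand the actual writes off to the lower level function(s).
    insert_widget($name, $owner_id);

    # 5. Unlock everything.
    $dbh->do('UNLOCK TABLES');
}

# Low level function: global permission checks (omitted here) plus a
# single INSERT.
sub insert_widget {
    my ($name, $owner_id) = @_;
    $dbh->do('INSERT INTO widgets (name, owner_id) VALUES (?, ?)',
             undef, $name, $owner_id);
}

add_widget('frobnicator', 42);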

Later on, in the 4.1 days, rudimentary transactions were added, though constraints were still somewhere between nonexistent and useless. Simple transaction support was shoehorned in by adding it to the lower level functions.

  1. High level function locks tables
  2. Perform queries on database to check for consistency (uniqueness, business rule checks, etc)
  3. Pass parameters to lower level update function
    1. Begin transaction
    2. Perform global permissions checking
    3. Perform update/insert/delete as appropriate
    4. Commit or rollback transaction
  4. Unlock tables

(I warned you not to keep reading, didn't I?)

So let's recap where we are now, thanks to sticking with what Celko describes as "not really a relational database, but does sequential processing behind a 'near-SQL dialect' language":

  • No constraints in the database, which makes direct edits (and let's face it, they always happen sooner or later) quite dangerous.
  • Because there are no constraints, all of the checking has to be done in the application layer.
  • Because the checking is done in the application, all involved tables still have to be locked - MVCC would actually make concurrency issues worse, not better.
  • Because transactions are handled in the lower level functions, they can't encompass multiple statements. What's the point? (Okay, to be fair they do cover auditing records that the lower level functions silently create, but that's all.)

Ugh. The ideal path going forward would be to move all of those constraints down into the database where they belong. This is far from a trivial task, however, and one that I probably don't have the resources for. The checks are scattered throughout the codebase, with odd conditions that are awkward to express in SQL. One common idiom is a column that references table A if the value is positive, table B if negative, or nothing at all if it's zero. If I could wave my magic keyboard, that's the road I'd go down, but it's out for repair, and I need a working app sooner rather than later.

Instead, I'm going to have to sprinkle begin/commits throughout the higher level functions, while still preserving the table locks everywhere. To make things even better, some of the higher level functions may call each other, so I need to simulate nested transactions, which PostgreSQL doesn't quite support (yeah, I know about savepoints). So now I'm looking at something roughly like this (sketched in code after the list):

  1. If not already in a transaction, begin one.
  2. Lock tables.
  3. Perform checks.
  4. Call low level functions, or other higher level functions as needed.
  5. If we began our own transaction, commit, else pass control back up.
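
Here's roughly the shape that takes in Perl; the depth counter, table names, and wrapper function are my own placeholders, not the real code.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=appdb', 'app', '',
                       { RaiseError => 1, AutoCommit => 1 });
my $txn_depth = 0;

# Run a block of work so that only the outermost caller owns the real
# transaction; nested calls simply ride along inside it.
sub with_txn {
    my ($work) = @_;

    my $outermost = ($txn_depth == 0);
    $dbh->begin_work if $outermost;     # 1. begin only if not already in one
    $txn_depth++;

    my $ok = eval { $work->(); 1 };
    $txn_depth--;

    if ($outermost) {
        if ($ok) { $dbh->commit }       # 5. we started it, so we finish it
        else     { $dbh->rollback; die $@ }
    }
    elsif (!$ok) {
        die $@;                         # 5. otherwise pass the failure back up
    }
}

# A high level function: lock, check, then call other functions.
sub update_order {
    my ($order_id) = @_;
    with_txn(sub {
        $dbh->do('LOCK TABLE orders, order_items IN SHARE ROW EXCLUSIVE MODE'); # 2
        # 3. consistency and business rule checks go here
        # 4. low level (or other high level) functions get called here
    });
}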

Yeah, it's still pretty fugly, but that's the design tax we're still paying for all of the features left out way back in MySQL 3. Thanks, guys.

Converting MySQL blob data to PostgreSQL bytea

I just went through several annoying iterations of preparing to load a MySQL data set into a PostgreSQL server, with somewhat less than helpful error messages (why on earth would a bytea column complain about charset errors anyway?) Now that it's working, I thought I'd share this little function. It's suitable for use in a script that's post-processing the txt files from running mysqldump with the -T option. It takes one argument, a binary string from a blob column, and returns the same data in an escaped form suitable for loading into a bytea column with COPY FROM.

May it save you from a few of the headaches I have suffered.

sub encode_bytea {
    my ($blob) = @_;

    # Unpack the blob into individual byte values and emit each one as a
    # backslash-escaped octal sequence (e.g. \\101) that COPY will accept
    # for a bytea column.
    return join('', map { '\\\\' . sprintf('%03o', $_) } unpack('C*', $blob));
}
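
For what it's worth, here's how I sanity checked it - slurp some binary file (the name here is made up) and eyeball the escaped output before trusting it with real data:

#!/usr/bin/perl
use strict;
use warnings;

# Assumes the encode_bytea sub above has been pasted in or required.
# Read an arbitrary binary file in raw mode and print the escaped form.
open my $fh, '<:raw', 'sample.blob' or die "sample.blob: $!";
my $blob = do { local $/; <$fh> };
print encode_bytea($blob), "\n";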

Saturday, February 19, 2011

How to Lose A Customer in 7 Seconds or WTF, Mozy?

Ah, Mozy. I first signed up with you a few years ago, when it was basically between you and Carbonite. Almost immediately, you saved my bacon (or, more specifically, a couple dozen gigs of photos and music) from a hard drive crash. Not long after that, I laughed at the poor Carbonite users, their backups trashed by a combination of faulty hardware and lack of cross data center replication.

But that was then, and this is now. Hard drives are cheap, bandwidth is plentiful, and everyone is doing "cloud stuff". So how do you respond to the flocks of competitors? Lowering prices? Adding innovative new features? (Yeah, backing up to local hard drives is a good new feature, but hardly one I'd call innovative.) No, not you! Instead you not only drop the unlimited option, you also raise prices!

I'm currently backing up about 75G of data from a single PC (man, those high megapixel cameras can crank up the file size!) This means the price for me doubled from about $50 to $100 per year. And what do I get for that extra $50? Nothing. Nothing at all. I can't run it on Linux or back up to friends' machines for free, like CrashPlan. I don't get to keep my costs down around $50, like Carbonite. I can't sync my backup set between multiple machines, like Livedrive. I can't play my music or view my photos online or on mobile devices, like SugarSync. I get nothing.

I've never been a big fan of paying something for nothing, so farewell Mozy. I've chosen to go with CrashPlan. The browser based file access, weekly status reports, and cross platform support hit all of my needs at the right cost. Although, to be fair, I never would have bothered to find CrashPlan if Mozy hadn't decided to shoot themselves in the foot. I'd still be giving Mozy my money, for fewer features. I guess that puts things in a different light.

Mozy, you may be doing your best to bleed customers, but your sacrifice shall not go completely unappreciated. As a formerly happy, but now happily former customer of yours, you have my gratitude.

Tuesday, August 24, 2010

DNS Sinkholes... That Work

One idea that's been ramping up in the last few months is the DNS sinkhole. Essentially, you configure your local client-facing resolvers to always return an IP address of your choice for a list of "known bad" domains. The list of domains will usually come from some trusted source, and the IP returned will either be a completely non-functional one (like a 127 address), or one that points at your local honeypot server. Assuming that you've got a good list of malicious domains, it can be a great way to block viruses and other malware without having to touch any of your clients.

The only problem, though, is that most of the examples I've found... well, kinda suck. They're either limited in scope, less efficient than they should be, or just don't work. So rather than complaining about them all individually, here's the set up that Works For Me®.

First, you need a minimal zone file. All of the malicious zones will get served out of this zone file, so you are restricted to records that will be common to all of them. This example uses 127.1.1.1 for the target IP address; substitute your own if you want to send the traffic somewhere else.

$TTL    600
@                       1D IN SOA       localhost root (
                                        42              ; serial
                                        3H              ; refresh
                                        15M             ; retry
                                        1W              ; expiry
                                        1D )            ; minimum

        IN  NS  localhost.
        IN  A 127.1.1.1
*       IN  A 127.1.1.1
  • The SOA and NS records are both just set to 'localhost', avoiding hitting any real records, but still resolving.
  • Pick any serial you like - it should never change, only the list of zones pointing at it.
  • The first 127.1.1.1 record causes BIND to return a record for the zone itself. For example, if you pointed "badstuff.com" at this zone file, it would allow any queries for badstuff.com to return 127.1.1.1.
  • The second 127.1.1.1 record handles anything within badstuff.com - www.badstuff.com, download.badstuff.com, etc. You need both records to get full coverage.

Take that zone file, and stuff a copy of it on all of your servers. Here, rather than bothering to set up the usual master/slave relationship, I just made all of my servers masters - much simpler, since none of the usual benefits of master/slave configurations apply.

Once it's in place, create a config block in your named.conf (or included file) for each zone that you want to filter (getting the list of zones is outside of the scope of this post):

zone "badstuff.com" {
    type master;
    notify no;
    file "Malware.zone";
};

This will cause your server to start answering all queries for badstuff.com and anything under it out of the file Malware.zone instead of issuing a recursive query for it. Make sure you include the 'notify no;' line - that will prevent your servers from trying to flood each other with useless notifies.
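With more than a handful of domains you won't want to maintain those blocks by hand, so here's a quick Perl sketch that generates them from a one-domain-per-line list; the file names are just whatever I happened to pick.

#!/usr/bin/perl
use strict;
use warnings;

# Turn a list of malicious domains (one per line) into a file of zone
# blocks, all pointing at the shared Malware.zone file, ready to be
# included from named.conf.
open my $list, '<', 'bad-domains.txt' or die "bad-domains.txt: $!";
open my $conf, '>', 'sinkhole.conf'   or die "sinkhole.conf: $!";

while (my $domain = <$list>) {
    chomp $domain;
    next if $domain =~ /^\s*(?:#|$)/;   # skip blank lines and comments
    print {$conf} <<"END_ZONE";
zone "$domain" {
    type master;
    notify no;
    file "Malware.zone";
};

END_ZONE
}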

Here are a few other links you should follow if you're serious about doing real DNS sinkholes.

Friday, May 1, 2009

Wow.

I hereby nominate this Schneier post for the Understatement Of The Year Award:

Nuclear war is not a suitable response to a cyberattack.

The very fact that such a response was needed is a depressing one indeed.

Tuesday, March 24, 2009

So long, and thanks for all the transactions...

Well, it's been a long time coming, but it's finally happening. The breakup will take quite a while, and it may never be a complete break. But in the end, after hundreds of tables and countless rows, I begin the process of replacing MySQL with PostgreSQL as our primary database.

But why, you ask? Not for any of the reasons you're probably thinking of. It's not because of any of the limitations of any of the foreign key implementations. It's not because of the cases where MySQL's serial data processing nature conflicts with what set theory says you should be able to do, such as modifying a table that's referenced again in a subquery. It's not because PostgreSQL is substantially closer to fully standards-compliant SQL, and therefore to a lot of the big commercial names such as Oracle. And no, we never had any catastrophic data loss due to any less-than-ACID aspects or a MySQL crash.

No, overall MySQL has been quite good to us. We're switching for the simple and inescapable reason that, unlike PostgreSQL, MySQL can not store IPv6 addresses in a usable form.

Now before you come up with some bright idea, take my word for it - I've already looked at it, thought it through, and utterly rejected it. Once you strip away the slightly funky formatting conventions, an IPv6 address is at its core simply a 128 bit integer. Too bad that MySQL only supports up to 64 bit integers, meaning that the IPv4 method (store it in a 32 bit integer) doesn't work. You can stick it in a text string, but then you get formatting normalization issues, and you can't efficiently perform any bit masking operations, which are required for things like determining which subnet a given address belongs to. Cramming it into a decimal type is a little bit closer, but sadly the bitwise operators such as '|' and '&' will silently truncate the output to 64 bits, making it - and anything else - utterly useless.

Quite simply, as comfortable as we are with MySQL, IPv6 addresses just don't fit.

And then, like a ray of sunshine, there's PostgreSQL! It actually has a pair of native data types (inet and cidr) that can not only cleanly store either an IPv4 or IPv6 address and subnet mask or prefix length, but also work with a number of functions for some of the most common operations, such as extracting the network address or calculating the broadcast address. What is currently a flat out impossibility in MySQL is not only possible, but trivial in PostgreSQL.
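
As a taste of what that buys you from the Perl side, here's a quick sketch using DBI; the hosts table and its addr column are made up for the example.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical table: hosts(addr inet, ...). The <<= operator ("is
# contained within or equals") works identically for IPv4 and IPv6.
my $dbh = DBI->connect('dbi:Pg:dbname=netdb', 'netadmin', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Which registered addresses fall inside a given IPv6 prefix?
my $hosts = $dbh->selectcol_arrayref(
    'SELECT host(addr) FROM hosts WHERE addr <<= ?::cidr',
    undef, '2001:db8:1234::/48');

print "$_\n" for @$hosts;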

My group is responsible for maintaining the network here. We live, breathe, and die by IP addresses. We register them, track them, and shuffle them around between databases, DNS, DHCP, and ACLs a hundred times a day. Imagine trying to sell a database to the phone company that can't store phone numbers, and you'll have a pretty decent feel for the position we're in with IPv6 addresses. IPv6 adoption may be slow now, but it's going to come sooner or later, and when it does come, it's probably going to hit critical mass and come fast and hard.

Sure, I'm not looking forward to the changeover. I need to worry about removing MySQL specific features from the schema, such as auto increment fields and set/enum data types. I have several gigs of absolutely mission critical data that has to be moved over without getting scrambled. And there are many, many lines of perl and SQL code that have to be tested thoroughly. But in the end, "difficult but possible" beats "not a chance" any day.

Thursday, March 19, 2009

Patents Vs. Innovation

If you follow the tech industry at all these days - and possibly even if you don't - you can't help but hear all about patents. Whether it's Microsoft suing TomTom over ridiculous FAT patents, or RedHat reluctantly assembling a defensive patent arsenal, the one thing that everyone agrees on is that patent law and practice has a huge impact on both the tech sector and the economy at large.

One aspect that the guys over at Techdirt have continued to hammer at is the difference between invention and innovation. Patents place a significant emphasis on protecting the "rights" of the inventor (rights which exist only because of patent law - how's that for circular reasoning?), to the point where innovation, the act of making the invention actually useful, is being harmed. I've come up with a simple analogy that I think helps clarify this argument.

Imagine that you're a manager with a couple dozen employees under you. (If you actually are a manager, this should be pretty easy.) Now, odds are that most of your employees are pretty decent. They're good at their jobs, you can rely on them to produce good quality work, but they don't often come up with game changing ideas.

Except for that one guy. You know, that one guy who, if you had to pick your replacement, you'd name in a heartbeat. He's the one who doesn't just come up with a way to make a work process faster, he shuffles things around and makes that entire process disappear, freeing up everyone's time to work more on things that make money.

Now picture he's come up with a new version of some form that everyone has to fill out fifty times a day. The new version is faster, more accurate, and will allow everyone to increase their client billable hours by 20%. Great! Everyone can start using it, the company makes more money, everyone gets raises, and everyone's happy.

Except that in this case, we're modeling our little company after the patent situation we're in. Now, the only person who can use the new form is your one go to guy. Everyone else is stuck with the old form. The company bottom line barely moves, no one gets the big raise, and everyone gets to just watch the one guy doing interesting work while they spend their day filling out the form they're not allowed to use because they didn't come up with it.

So I ask you this: as the manager, which set of rules would you rather try to run a company under?