Chainsaws and scalpels

I work in Perl – a lot. As a result, I’ve got used to working in Perl, and have maybe grown a little too reliant on it, and some of the features it provides…

I’ve got a pet project I’m working on. It is, somewhat surprisingly, not written in Perl (which is usually my weapon of choice, but writing a Windows native application in Perl is a little tricksy…), so I’m using C++ (for no other reason than I could do with the practice).

This project is processing text. Immediately, “regular expressions” springs to mind, and as the well known quote goes, I now have two problems. Well, three, if you count the fact that C++ has no built-in RE engine… 🙂

So, we’re back to good, old fashioned string manipulation. I remember doing it in assembly language (which certainly wasn’t fun), and C++ provides little more than the standard C library. I’m aiming for a wide range of supported platforms, so I’m not keen on having too many dependencies that might make it easier. Time to man up and deal with it.

As it turns out, string manipulation isn’t quite as bad as I remembered (or I’ve just got better at it over time). Some of the cases where I’d use a regular expression to extract information could quite easily be done using a couple of standard functions, which led me to think: am I relying too much on the (RE) chainsaw when I could be doing a lot more with a (string manipulation) scalpel?
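To make that a bit more concrete, here is a minimal sketch of the scalpel approach. It assumes a made-up "key = value" line format (not the data from my project), the sort of thing I'd normally hit in Perl with something like `/^\s*([^=]+?)\s*=\s*(.*?)\s*$/`. The names `trim` and `split_key_value` are hypothetical helpers invented for this illustration, and it sticks to plain `std::string` members so it should build on older compilers too.

```cpp
#include <string>

// Hypothetical helper: strip leading and trailing spaces/tabs.
std::string trim(const std::string& s) {
    const char* ws = " \t";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";  // empty, or nothing but whitespace
    }
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split an assumed "key = value" line at the first '='.
// Returns false (leaving key and value untouched) if there is no '=' or the
// key is empty.
bool split_key_value(const std::string& line,
                     std::string& key, std::string& value) {
    const std::string::size_type eq = line.find('=');
    if (eq == std::string::npos) {
        return false;
    }
    const std::string k = trim(line.substr(0, eq));
    if (k.empty()) {
        return false;
    }
    key = k;
    value = trim(line.substr(eq + 1));
    return true;
}
```

Calling `split_key_value("  name = Perl hacker  ", key, value)` should leave `key` holding `name` and `value` holding `Perl hacker`. Each intermediate step is an ordinary string that can be printed or inspected in a debugger, which is exactly what a one-line RE doesn't give you.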

In terms of code readability, string manipulation is slightly ahead (IMHO). I’ve got one particular RE that I’ve written that, when whitespaced and commented to give even the barest description of how it works, runs over 25 lines. Anyone trying to work with it would need to have a pretty solid understanding of both regular expressions and the data it’s trying to parse, and it’d be extremely easy to make a simple mistake and ruin the whole thing. If I’d done it with string manipulation (and I’m seriously considering going back and taking a swing at it) then I’m sure I wouldn’t have ended up with a screen full of line noise. I’d also be able to debug the processing far more easily (I could dump the string and its parts as they’re being worked on), and I’d lower the entry requirements for maintaining that particular function. Hells, even I’m scared of touching it, and I wrote the damned thing…

In terms of code execution times – I’d have to benchmark it, but I don’t imagine that several simple string manipulation functions are going to take that much longer than the RE engine firing up and doing its stuff. Even if they do, there’s still the question of the speed/maintainability tradeoff – would those extra microseconds of CPU time really outweigh the cost of someone staring at a screen full of RE for half a day trying to figure out what it’s doing and how to change it without breaking it?

So, true to my intention (it’s not a New Year’s Resolution as such) to venture out of my comfort zone a bit more, I think I’ll keep considering string manipulation first, with regular expressions as a backup plan (no point being a damned fool about it – sometimes you need to use the chainsaw). Having the option to throw a regular expression at a problem is nice, but I think the scales have tipped too far towards code brevity over maintainability, on the assumption that a single RE is always going to be “better” than several simple functions. Wikipedia has a pretty extensive page comparing string manipulation functions across different programming languages, so I think plain string handling may be a pretty useful tool to (re-)add to my arsenal, and I think that in future I’ll be writing a lot fewer regular expressions in my Perl code.
