Why would sed be slower than perl
By Delirium in Delirium's Diary Mon Mar 09, 2009 at 07:17:44 PM EST Tags: unix, text, sed, perl (all tags)
So I've been using sed for ages for basic text-stream munging, on the assumption that a pretty special-purpose utility would be about as fast as you can get for that sort of thing. I've recently discovered that almost everything I do runs much faster in Perl, though, despite Perl being a full general-purpose language and whatnot.
As the most dramatic example, I have some files, around 2GB in size, that are ASCII but all on one gigantic line. Fortunately, there's a short (3-char) ASCII sequence that doesn't occur anywhere else in the file and can serve as a field delimiter, so I can preprocess the files to split them into lines, which makes further processing more pleasant.
I used to use a one-line sed script to do this, where ABC is the field delimiter, and I want to retain A and C while replacing B with a newline:
s/ABC/A\nC/g
It took about 10 minutes, and was clearly CPU-bound.
So I replaced it with the exact same one-liner in Perl, called via "perl -pe", and it runs in about a minute and a half, IO-bound.
This is GNU sed, the default one that comes with Debian. There seem to be other seds, some of which claim to be faster for some things, but many of them won't even work on this task at all, because they have line-length limits much lower than 2 billion characters, presumably due to operating in a line-oriented manner where they load each line in turn into a fixed-size buffer.
If I had to guess what the issue was, I'd guess Perl is probably doing something smarter with reading in the line and applying the regex to it a block at a time, while sed is probably doing character-oriented I/O or using a very small buffer. In fact I had initially avoided using Perl because I thought it'd do something stupidly line-oriented like loading the first line into $_, taking up 2GB of memory before it even looked at the regex, whereas GNU sed promised to do no such thing. But it looks like Perl is the one that handles this more smartly.
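For what it's worth, here is a rough sketch of what "a block at a time" could look like for this particular job. It's in Python rather than sed or Perl, and it is not a claim about what either tool actually does internally. The function name replace_stream, the 1MB default block size, and the demo data are all made up for illustration. The only fiddly part is a delimiter that straddles a block boundary, which is handled by holding back the last couple of bytes of each block.

```python
import io


def replace_stream(src, dst, delim=b"ABC", replacement=b"A\nC", chunk_size=1 << 20):
    """Copy src to dst, replacing each delim with replacement, one block at a time.

    src and dst are binary file-like objects. Memory use is bounded by
    chunk_size plus a few bytes, no matter how long the single input "line" is.
    """
    if not delim:
        raise ValueError("delimiter must not be empty")
    # A delimiter cut in half by a block edge can leave at most this many
    # of its bytes at the end of the current block.
    keep = len(delim) - 1
    tail = b""
    while True:
        block = src.read(chunk_size)
        if not block:
            break
        parts = (tail + block).split(delim)
        # The last piece has no complete delimiter in it, but may end with
        # the start of one, so its final `keep` bytes are held back.
        last = parts.pop()
        for part in parts:
            dst.write(part)
            dst.write(replacement)
        cut = max(len(last) - keep, 0)
        dst.write(last[:cut])
        tail = last[cut:]
    # Whatever is still held back at end of input cannot be a delimiter.
    dst.write(tail)


# Tiny demo with a deliberately small block size so delimiters get split.
demo_in = io.BytesIO(b"field1ABCfield2ABCfield3")
demo_out = io.BytesIO()
replace_stream(demo_in, demo_out, chunk_size=4)
print(demo_out.getvalue())  # b'field1A\nCfield2A\nCfield3'
```

To run it as a filter you'd call replace_stream(sys.stdin.buffer, sys.stdout.buffer) after importing sys. I haven't timed this against either sed or Perl on the 2GB files, so treat it as an illustration of the idea rather than a benchmark.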