Kuro5hin.org: technology and culture, from the trenches

Why would sed be slower than perl

By Delirium in Delirium's Diary
Mon Mar 09, 2009 at 07:17:44 PM EST
Tags: unix, text, sed, perl (all tags)

So I've been using sed for ages for basic text-stream munging, because I assumed it was probably as fast as you can get for that sort of thing, being a pretty special-purpose utility. I've recently discovered that almost everything I do runs much faster in Perl, though, despite Perl being a full language and whatnot.


As the most dramatic example, I have some files, around 2GB or so, that are ASCII but all on one gigantic line. Fortunately, there's a short (3-char) ASCII sequence that can serve as a field delimiter that doesn't exist elsewhere in the file, so I can preprocess them to split them into lines, which makes further processing more pleasant.

I used to use a one-line sed script to do this, where ABC is the field delimiter, and I want to retain A and C while replacing B with the newline:

s/ABC/A\nC/g

It took about 10 minutes, and was clearly CPU-bound.

So I replaced it with the exact same one-liner in Perl, called via "perl -pe", and it runs in 1 1/2 minutes, IO-bound.

This is GNU sed, the default one that comes with Debian. There seem to be other seds, some of which claim to be faster for some things, but many of them won't even work on this task at all, because they have line-length limits much lower than 2 billion characters, presumably due to operating in a line-oriented manner where they load each line in turn into a fixed-size buffer.

If I had to guess what the issue was, I'd guess Perl is probably doing something smarter with reading in the line and applying the regex to it a block at a time, while sed is probably doing character-oriented I/O or using a very small buffer. In fact I had initially avoided using Perl because I thought it'd do something stupidly line-oriented like loading the first line into $_, taking up 2GB of memory before it even looked at the regex, whereas GNU sed promised to do no such thing. But it looks like Perl is the one that handles this more smartly.

Poll
Text-munging
o perl 71%
o awk 14%
o sed 42%
o tcl 0%
o python 14%
o other 28%

Votes: 7

Why would sed be slower than perl | 14 comments (14 topical, 0 editorial, 0 hidden)
oh man (none / 0) (#1)
by lostincali on Mon Mar 09, 2009 at 07:40:13 PM EST

i saw this when i was learning about regular expressions... i think sed/awk have a different kind of regular expression engine than perl on a syntactic level... i'd have to look it up.

"The least busy day [at McDonalds] is Monday, and then sales increase throughout the week, I guess as enthusiasm for life dwindles."

because perl has been optimised to hell and back (3.00 / 2) (#2)
by sholden on Mon Mar 09, 2009 at 07:43:30 PM EST

Making perl's C code harder to read than most perl code.

read the source for Perl_sv_gets in sv.c...

it even mentions sed in the comments as it gotos out of and back into a for loop to save some nanoseconds.

--
The world's dullest web page


GNU vs AT&T.... (none / 0) (#4)
by Del Griffith on Mon Mar 09, 2009 at 08:43:10 PM EST

I guess if you really wanted you could get some old version of SED and see how it performed vs the GNU one... your 2gig files would be the extreme example where subtle errors are revealed...

I guess you could find 32/V or v7.. well they are basically the same thing.. Anyways from my 32/V instance:

total 80
-rw-r--r-- 1 root      155 Feb  8 23:20 Makefile
-rwxr-xr-x 1 root    12572 Mar 26 04:22 sed
-rw-r--r-- 1 root     2347 Nov  5 23:34 sed.h
-rw-r--r-- 1 root    15104 Jan 19 00:28 sed0.c
-rw-r--r-- 1 root     9469 Nov  5 23:34 sed1.c

I'm sure the GNU version of sed requires a few dozen megabytes of 'padding' and other bullshit to do some simple substitutions...

There used to be a project called v7upgrade-0.0.2.tar.gz that would have the old sed... but alas it's gone.

-------
I...I like me. My wife likes me. My customers like me. Because I'm the real article. What you see is what you get. - Me


Interesting (none / 0) (#5)
by mybostinks on Mon Mar 09, 2009 at 09:09:55 PM EST

because I am writing an article about "sed, gawk, grep and sort".

Why don't you write it?

perl's control structures are relatively slow (none / 0) (#8)
by lemonjuicefake on Mon Mar 09, 2009 at 09:29:21 PM EST

some of the text processing stuff has been optimized however.

Would the 3 characters be in the set {A,T,C,G} (3.00 / 3) (#9)
by it certainly is on Mon Mar 09, 2009 at 09:33:09 PM EST

by any chance?

Perl very likely has a special case for this. The generic perl regexp, where you feed it an X-megabyte string to search/replace on and the replace modifies the size of the string, will choke up like Sebastian Droege knocking back a bucket of cocks.

Here's an analysis of Perl's general purpose regexp engine vs the awk/grep regexp engine.

kur0shin.org -- it certainly is

Godwin's law [...] is impossible to violate except with an infinitely long thread that doesn't mention nazis.

Why would anyone care? (none / 0) (#12)
by Ruston Rustov on Mon Mar 09, 2009 at 11:05:15 PM EST


I had had incurable open sores all over my feet for sixteen years. The doctors were powerless to do anything about it. I told my psychiatrist that they were psychosomatic Stigmata - the Stigmata are the wounds Jesus suffered when he was nailed to the cross. Three days later all my sores were gone. -- Michael Crawford
Maybe tomorrow. -- Michael Crawford
As soon as she has her first period, fuck your daughter. -- localroger

Because sed has finite buffer size (none / 0) (#13)
by GhostOfTiber on Tue Mar 10, 2009 at 07:17:59 AM EST

Remember, sed was written back when 128MB was a lot....

The other seds mostly play with the amount of buffers sed can use.

[Nimey's] wife's ass is my cocksheath. - undermyne

Well... (3.00 / 2) (#14)
by BJH on Tue Mar 10, 2009 at 07:29:07 AM EST

...both perl and sed are traditional NFA engines, so theoretically they shouldn't differ too greatly in their performance, but in reality perl has a large number of optimisations applied that make it perform well in common cases.

If you really want to know more, then this is the book to read, although I can tell you what his commentary on sed is:

POSIX standardized the workings of over 70 programs, including traditional regex-wielding tools such as awk, ed, egrep, expr, grep, lex and sed. Most of these tools' regex flavor had (and still have) the weak powers equivalent to a moped. So weak, in fact, that I don't find them interesting for discussing regular expressions. Although they can certainly be extremely useful tools, you won't find much mention of expr, ed or sed in this book.

And he is indeed correct, as that is the only mention of sed in the entire book.
--
Roses are red, violets are blue.
I'm schizophrenic, and so am I.
-- Oscar Levant
