How to make this sed script faster?


The best I was able to do with sed was this script:

s/[\s\t]*|[\s\t]*/|/g
s/[\s\t]*$//
s/^|/null|/

In my tests, this ran about 30% faster than your sed script. The increase in performance comes from combining the first two regexen and omitting the "g" flag where it's not needed.
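For reference, here's roughly how the combined script behaves on a sample line. This is a sketch, not the original poster's setup: the character class is written as `[ \t]` on the assumption that `[\s\t]` was meant to match spaces and tabs (inside a POSIX bracket expression, the backslash is literal, so `[\s\t]` doesn't actually match whitespace other than a plain `\t` under GNU sed):

```shell
# Sample line: pipe-delimited fields padded with stray spaces.
printf '  | a  |  b  \n' |
sed -e 's/[ \t]*|[ \t]*/|/g' \
    -e 's/[ \t]*$//' \
    -e 's/^|/null|/'
# -> null|a|b
```

The first expression collapses each pipe together with its surrounding blanks in a single pass, which is where the 30% saving comes from.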

However, 30% faster is only a mild improvement (it should still take about an hour and a half to run the above script on your 1GB data file). I wanted to see if I could do any better.

In the end, no other method I tried (awk, perl, and other approaches with sed) fared any better, except -- of course -- a plain ol' C implementation. As would be expected with C, the code is a bit verbose for posting here, but if you want a program that's likely going to be faster than any other method out there, you may want to take a look at it.

In my tests, the C implementation finishes in about 20% of the time it takes for your sed script. So it might take about 25 minutes or so to run on your Unix server.

I didn't spend much time optimizing the C implementation. No doubt there are a number of places where the algorithm could be improved, but frankly, I don't know if it's possible to shave a significant amount of time beyond what it already achieves. If anything, I think it certainly places an upper limit on what kind of performance you can expect from other methods (sed, awk, perl, python, etc).

Edit: The original version had a minor bug that caused it to possibly print the wrong thing at the end of the output (e.g. could print a "null" that shouldn't be there). I had some time today to take a look at it and fixed that. I also optimized away a call to strlen() that gave it another slight performance boost.


My testing indicated that sed can become CPU bound pretty easily on something like this. If you have a multi-core machine you can try spawning off multiple sed processes with a script that looks something like this:

#!/bin/sh
INFILE=data.txt
OUTFILE=fixed.txt
SEDSCRIPT=script.sed

# Split the input into ~20 chunks; int() keeps the line count integral,
# since split -l rejects a fractional argument.
SPLITLIMIT=`wc -l $INFILE | awk '{print int($1 / 20) + 1}'`
split -d -l $SPLITLIMIT $INFILE x_

# One background sed per chunk.
for chunk in x_??
do
  sed -f $SEDSCRIPT $chunk > $chunk.out &
done
wait

cat x_??.out > $OUTFILE
rm -f x_??
rm -f x_??.out


It seems to me from your example that you are cleaning up white space from the beginning and end of pipe (|) delimited fields in a text file. If I were to do this, I would change the algorithm to the following:

for each line
    split the line into an array of fields
    remove the leading and trailing white space from each field
    join the fields back together as a pipe-delimited line, handling the empty first field correctly

I would also use a different language such as Perl or Ruby for this.

The advantage of this approach is that the code that cleans up the lines now handles fewer characters for each invocation and should execute much faster even though more invocations are needed.
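As a sketch of that split/trim/join algorithm — written with awk here rather than Perl or Ruby, just to keep it runnable from the shell; the sample input is illustrative and `null` for an empty first field matches the `s/^|/null|/` rule above:

```shell
# Split on "|", trim each field, re-join with "|".
# (Reads stdin here; point it at the data file in practice.)
printf '  | a  |  b  \n' |
awk -F'|' -v OFS='|' '{
    for (i = 1; i <= NF; i++)
        gsub(/^[ \t]+|[ \t]+$/, "", $i)   # trim blanks per field
    if ($1 == "")                          # empty first field becomes "null"
        $1 = "null"
    print
}'
# -> null|a|b
```

Each `gsub` only scans one short field, which is the point of the approach: the per-invocation work shrinks even though the number of invocations grows.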