SED or AWK replace all with patterns from another file
Give this one a try. It should be fast.
$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
This formats the data of `patterns.txt` like below, without actually changing the real contents of patterns.txt:
$ printf 's/%s/%s/g\n' $(<patterns.txt)
s/1000000001/9000000003/g
s/1000000002/2000000001/g
s/1000000003/3000000001/g
s/1000000004/4000000001/g
s/1000000005/5000000001/g
All of the above is then fed via process substitution `<(...)` to a plain sed as a script file, using the `sed -f` switch (= read sed commands from file).
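If process substitution is unavailable (for example in a plain POSIX sh), the same idea works with an ordinary script file. A minimal, self-contained sketch; the file names and pattern data here are illustrative, not the OP's real files:

```shell
#!/bin/bash
# Illustrative pattern pairs, one "old new" pair per line.
printf '%s\n' 'foo bar' 'baz qux' > patterns.txt
printf '%s\n' 'a foo line' 'a baz line' > contents.txt

# Generate one s/old/new/g command per pair into a sed script file.
printf 's/%s/%s/g\n' $(cat patterns.txt) > script.sed

# Apply all substitutions in a single sed pass.
sed -f script.sed contents.txt
# prints: a bar line
#         a qux line
```

Note that this pair expansion relies on shell word splitting, so it breaks if patterns contain whitespace, glob characters, or sed metacharacters.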
$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg
9000000003dfdfds
XATSSSSFOO4dhana
UXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
Benchmarks for future reference
Test environment:
Using your sample files patterns.txt with 50,000 lines and contents.txt also with 50,000 lines. All lines of patterns.txt are loaded in all solutions, but only the first 1000 lines of contents.txt are examined.
The testing laptop is equipped with a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz and 4 GB RAM, running Debian 9 64-bit, GNU sed 4.4 and GNU awk 4.1.4.
In all cases the output is sent to a new file, to avoid the slow overhead of printing data to the screen.
Results:
1. RavinderSingh13 1st awk solution
$ time awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt

real    19m54.408s
user    19m44.097s
sys     0m1.981s
2. EdMorton 1st awk Solution
$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt

real    20m3.420s
user    19m16.559s
sys     0m2.325s
3. Sed (my sed) solution
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt

real    1m1.070s
user    0m59.562s
sys     0m1.443s
4. Cyrus sed solution
$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt

real    1m0.506s
user    0m59.871s
sys     0m1.209s
5. RavinderSingh13 2nd awk solution
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -n 1000 contents.txt) >newcontents.txt

real    0m25.572s
user    0m25.204s
sys     0m0.040s
For a small amount of input data like 1000 lines, the awk solution seems good. Let's run another test, with 9000 lines this time, to compare performance.
6. RavinderSingh13 2nd awk solution with 9000 lines
$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt <(head -9000 contents.txt) >newcontents.txt

real    22m25.222s
user    22m19.567s
sys     0m2.091s
7. Sed Solution with 9000 lines
$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt

real    9m7.443s
user    9m0.552s
sys     0m2.650s
8. Parallel Seds Solution with 9000 lines
$ cat sedpar.sh
s=$SECONDS
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +3001 contents.txt |head -3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +6001 contents.txt |head -3000) >newcontents3.txt &
wait
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $(($SECONDS-$s))"

$ time ./sedpar.sh
seconds elapsed: 309

real    5m16.594s
user    9m43.331s
sys     0m4.232s
Splitting the task across more commands, like three parallel seds, seems to speed things up.
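The manual three-way split can be generalized with split(1) and background jobs. A sketch only, assuming bash and GNU coreutils; the file names, chunk prefix `part_`, and sample data below are illustrative:

```shell
#!/bin/bash
# Illustrative input: one pattern pair and four content lines.
printf 'foo bar\n' > patterns.txt
printf 'a foo %d\n' 1 2 3 4 > contents.txt

# Build the sed script once and reuse it for every chunk.
printf 's/%s/%s/g\n' $(cat patterns.txt) > script.sed

# Split contents into 2 line-based chunks (GNU split) and process them in parallel.
split -n l/2 contents.txt part_
for f in part_*; do
  sed -f script.sed "$f" > "$f.out" &
done
wait

# Chunk names sort in input order, so concatenation preserves line order.
cat part_*.out > newcontents.txt
rm -f part_* script.sed
```

Adjusting the chunk count to the number of CPU cores is the usual tuning knob here.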
For those who would like to repeat the benchmarks on their own PC, you can download the files contents.txt and patterns.txt either from the OP's links or from my github:
Could you please try the following awk solutions and let me know if this helps you.
Solution 1st:
awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt sample_content.txt
Output will be as follows.
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg
9000000003dfdfds
XATSSSSFOO4dhana
UXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
Explanation of 1st solution: adding the explanation here now too:
awk '
FNR==NR{                             ##FNR==NR is a condition which will be TRUE only when the first Input_file, patterns.txt, is being read.
                                     ##FNR and NR both represent the line number of the Input_file(s); FNR is RESET when a new Input_file starts being read, while NR keeps increasing until all Input_file(s) are read.
  a[$1]=$2;                          ##Creating an array a whose index is the first field of the line and whose value is the 2nd field of the current line.
  next                               ##next will skip all further statements for now.
}
{
  for(i in a){                       ##Starting a for loop which traverses through all elements of array a.
    match($0,i);                     ##Using the match function of awk, which tries to match the index of array a present in variable i.
    val=substr($0,RSTART,RLENGTH);   ##Creating a variable named val which contains the substring of the current line, starting at RSTART with length RLENGTH.
    if(val){                         ##Checking whether variable val is NOT NULL; if so, do the following:
      sub(val,a[i])                  ##Using the sub function of awk to substitute the value of variable val with the value of array a at index i.
    }
  };
  print                              ##Using print here to print the current line, whether changed or not.
}' patterns.txt sample_content.txt   ##Mentioning the Input_file(s) name here.
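The two-file FNR==NR idiom explained above can be reproduced in miniature; the pattern pairs and content lines below are made up for illustration:

```shell
#!/bin/bash
# Tiny illustrative inputs (not the OP's real files).
printf '%s\n' 'old1 new1' 'old2 new2' > patterns.txt
printf '%s\n' 'x old1 y' 'z old2 w' > sample_content.txt

# First pass (FNR==NR) fills a[old]=new from patterns.txt;
# second pass replaces the first match found on each content line.
awk 'FNR==NR{a[$1]=$2;next} {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt sample_content.txt
# prints: x new1 y
#         z new2 w
```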
Solution 2nd: instead of traversing the whole array every time as in the first solution, this one comes out of the array loop as soon as a match is found:
awk '
FNR==NR{                             ##FNR==NR is a condition which will be TRUE only when the first Input_file, patterns.txt, is being read.
                                     ##FNR and NR both represent the line number of the Input_file(s); FNR is RESET when a new Input_file starts being read, while NR keeps increasing until all Input_file(s) are read.
  a[$1]=$2;                          ##Creating an array a whose index is the first field of the line and whose value is the 2nd field of the current line.
  next                               ##next will skip all further statements for now.
}
{
  for(i in a){                       ##Starting a for loop which traverses through all elements of array a.
    match($0,i);                     ##Using the match function of awk, which tries to match the index of array a present in variable i.
    val=substr($0,RSTART,RLENGTH);   ##Creating a variable named val which contains the substring of the current line, starting at RSTART with length RLENGTH.
    if(val){                         ##Checking whether variable val is NOT NULL; if so, do the following:
      sub(val,a[i]);print;next       ##Using the sub function of awk to substitute the value of variable val with the value of array a at index i, print the line and move on immediately.
    }
  };
}1' patterns.txt sample_content.txt  ##Mentioning the Input_file(s) name here.
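The early-exit behavior of this 2nd solution is easy to check with one matching and one non-matching line (made-up sample data): the matching line is printed by the explicit print before next, while the non-matching line falls through to the trailing 1, so every line is printed exactly once:

```shell
#!/bin/bash
# Illustrative inputs (not the OP's real files).
printf 'old1 new1\n' > patterns.txt
printf '%s\n' 'has old1 here' 'no match here' > sample_content.txt

awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt sample_content.txt
# prints: has new1 here
#         no match here
```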