SED or AWK replace all with patterns from another file


Give this one a try. It should be fast.

$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt

This formats the data of `patterns.txt` as shown below, without actually changing the real contents of patterns.txt:

$ printf 's/%s/%s/g\n' $(<patterns.txt)
s/1000000001/9000000003/g
s/1000000002/2000000001/g
s/1000000003/3000000001/g
s/1000000004/4000000001/g
s/1000000005/5000000001/g

All of the above is then handed via process substitution `<(...)` to a plain sed as a script file, using sed's `-f` switch (= read sed commands from file):

$ sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) contents.txt
288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg
9000000003dfdfds
XATSSSSFOO4dhana
UXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw
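The same technique can be tried on a pair of tiny throwaway files. This is a minimal sketch; the `demo_*` file names and their contents are invented here for illustration only:

```shell
# Invented sample data: "old new" pairs, one per line
printf '%s\n' 'foo bar' 'baz qux' > demo_patterns.txt
printf '%s\n' 'a foo line' 'a baz line' > demo_contents.txt

# Word-split the pattern file into old/new pairs, turn each pair into an
# s/old/new/g command, and feed the generated script to sed via -f:
sed -f <(printf 's/%s/%s/g\n' $(<demo_patterns.txt)) demo_contents.txt
# -> a bar line
#    a qux line
```

Note that this relies on word splitting of `$(<demo_patterns.txt)`, so it only works for whitespace-free patterns and replacements, exactly as in the original data.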


Benchmarks for future reference

Test environment:

Using your sample files patterns.txt with 50,000 lines and contents.txt also with 50,000 lines.

All lines from patterns.txt are loaded in all solutions but only the first 1000 lines of contents.txt are examined.

The testing laptop is equipped with a dual-core 64-bit Intel(R) Celeron(R) CPU N3050 @ 2.16GHz and 4 GB RAM, running Debian 9 64-bit, with GNU sed 4.4 and GNU awk 4.1.4.

In all cases the output is sent to a new file to avoid the slow overhead of printing data on the screen.

Results:

1. RavinderSingh13 1st awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}   {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    19m54.408s
user    19m44.097s
sys     0m1.981s

2. EdMorton 1st awk Solution

$ time awk 'NR==FNR{map[$1]=$2;next}{for (old in map) {gsub(old,map[old])}print}' patterns.txt <(head -n1000 contents.txt) >newcontents.txt

real    20m3.420s
user    19m16.559s
sys     0m2.325s

3. Sed (my sed) solution

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -n 1000 contents.txt) >newcontents.txt

real    1m1.070s
user    0m59.562s
sys     0m1.443s

4. Cyrus sed solution

$ time sed -f <(sed -E 's|(.*) (.*)|s/\1/\2/|g' patterns.txt) <(head -n1000 contents.txt) >newcontents.txt

real    1m0.506s
user    0m59.871s
sys     0m1.209s

5. RavinderSingh13 2nd awk solution

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -n 1000 contents.txt) >newcontents.txt

real    0m25.572s
user    0m25.204s
sys     0m0.040s

For a small amount of input data like 1000 lines, the awk solution looks good. Let's make another test, this time with 9000 lines, to compare performance.

6. RavinderSingh13 2nd awk solution with 9000 lines

$ time awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' patterns.txt  <(head -9000 contents.txt) >newcontents.txt

real    22m25.222s
user    22m19.567s
sys     0m2.091s

7. Sed Solution with 9000 lines

$ time sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -9000 contents.txt) >newcontents.txt

real    9m7.443s
user    9m0.552s
sys     0m2.650s

8. Parallel Seds Solution with 9000 lines

$ cat sedpar.sh
s=$SECONDS
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(head -3000 contents.txt) >newcontents1.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +3001 contents.txt |head -3000) >newcontents2.txt &
sed -f <(printf 's/%s/%s/g\n' $(<patterns.txt)) <(tail +6001 contents.txt |head -3000) >newcontents3.txt &
wait
cat newcontents1.txt newcontents2.txt newcontents3.txt >newcontents.txt && rm -f newcontents1.txt newcontents2.txt newcontents3.txt
echo "seconds elapsed: $(($SECONDS-$s))"

$ time ./sedpar.sh
seconds elapsed: 309

real    5m16.594s
user    9m43.331s
sys     0m4.232s

Splitting the task across more commands, like three parallel seds, seems to speed things up.
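The hard-coded head/tail offsets can also be generalized with `split -n l/N` from GNU coreutils. A rough sketch, on invented `demo_*` sample data rather than the real benchmark files:

```shell
# Invented sample: one pattern pair and six content lines
printf '%s\n' 'old new' > demo_patterns.txt
printf 'old %d\n' 1 2 3 4 5 6 > demo_contents.txt

# Split into 3 chunks without breaking lines (GNU split only)
split -n l/3 demo_contents.txt chunk.

# One background sed per chunk, all sharing the same generated script
for c in chunk.??; do
  sed -f <(printf 's/%s/%s/g\n' $(<demo_patterns.txt)) "$c" > "$c.out" &
done
wait                                     # let all background seds finish

cat chunk.??.out > demo_newcontents.txt  # glob order preserves line order
rm -f chunk.?? chunk.??.out
```

With 3 chunks the wall-clock time can approach one third of the serial run on a machine with enough cores, at the cost of roughly triple the total CPU time, as the sedpar.sh numbers above suggest.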

For those who would like to repeat the benchmarks on their own PC, you can download the files contents.txt and patterns.txt either via the OP's links or from my github:

contents.txt

patterns.txt


Could you please try the following awk solutions and let me know if this helps you.

First solution:

awk 'FNR==NR{a[$1]=$2;next}   {for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' patterns.txt  sample_content.txt

Output will be as follows.

288Y2RZDBPX9000000003dhana
JP2F64EI2000000001d
EU9V3IXI3000000001dfg
9000000003dfdfds
XATSSSSFOO4dhana
UXIBB7TF74000000001adf
10Q1W4ZEAV18LXNPSPGRTTIDHBN5000000001egw

Explanation of the first solution:

awk '
FNR==NR{                           ##FNR==NR is a condition which is TRUE only while the first Input_file, patterns.txt, is being read.
                                   ##FNR and NR both represent the line number of the Input_file(s); FNR is RESET when a new Input_file starts, while NR keeps increasing until all Input_file(s) are read.
  a[$1]=$2;                        ##Create an array a whose index is the first field of the line and whose value is the 2nd field of the current line.
  next                             ##next skips all further statements for now.
}
{
  for(i in a){                     ##Start a for loop which traverses all elements of array a.
    match($0,i);                   ##Use the match function of awk to try to match index i of array a in the current line.
    val=substr($0,RSTART,RLENGTH); ##Create a variable named val which contains the substring of the current line starting at RSTART with length RLENGTH.
    if(val){                       ##If variable val is NOT NULL then do the following:
      sub(val,a[i])                ##Use the sub function of awk to substitute the value of variable val with the value of a[i].
    }
  };
  print                            ##Print the current line, whether changed or not.
}' patterns.txt sample_content.txt ##Mention the Input_file(s) names here.
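The mapping behavior can be checked on a couple of tiny files; the `demo_*` names and their data below are invented for illustration:

```shell
# Invented sample: two "old new" pairs and three content lines
printf '%s\n' '111 AAA' '222 BBB' > demo_patterns.txt
printf '%s\n' 'x111y' 'z222w' 'plain' > demo_content.txt

# FNR==NR loads the map from the first file; each content line is then
# scanned against every map key, and matched substrings are replaced.
awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i])}};print}' demo_patterns.txt demo_content.txt
# -> xAAAy
#    zBBBw
#    plain
```

The third line demonstrates that lines with no matching pattern pass through unchanged, since val is empty when match() fails.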

Second solution: instead of traversing the whole array every time as in the first solution, it leaves the loop as soon as a match is found:

awk '
FNR==NR{                           ##FNR==NR is a condition which is TRUE only while the first Input_file, patterns.txt, is being read.
  a[$1]=$2;                        ##Create an array a whose index is the first field of the line and whose value is the 2nd field of the current line.
  next                             ##next skips all further statements for now.
}
{
  for(i in a){                     ##Start a for loop which traverses all elements of array a.
    match($0,i);                   ##Use the match function of awk to try to match index i of array a in the current line.
    val=substr($0,RSTART,RLENGTH); ##Create a variable named val which contains the substring of the current line starting at RSTART with length RLENGTH.
    if(val){                       ##If variable val is NOT NULL then do the following:
      sub(val,a[i]);print;next     ##Substitute val with a[i], print the line, and jump to the next line immediately.
    }
  }
}1' patterns.txt sample_content.txt ##The trailing 1 prints lines with no match; mention the Input_file(s) names here.
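A quick sanity check of the early-exit behavior, again on invented `demo_*` sample data: a line with a match is printed once by the print;next branch, and a line without any match falls through to the trailing 1:

```shell
printf '%s\n' '111 AAA' > demo_patterns.txt
printf '%s\n' 'x111y' 'plain' > demo_content.txt

awk 'FNR==NR{a[$1]=$2;next}{for(i in a){match($0,i);val=substr($0,RSTART,RLENGTH);if(val){sub(val,a[i]);print;next}};}1' demo_patterns.txt demo_content.txt
# -> xAAAy
#    plain
```

Note that next skips the trailing 1 for matched lines, so no line is printed twice; the speedup over the first solution comes from not scanning the remaining map keys once one has matched.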