
How to compare multiple lines in one file and output a combined entry


I suspect that this could be done pretty tidily with Pandas, much better than this, but I'm not very familiar with Pandas yet, so... submitted without debugging.

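    import csv
    from collections import defaultdict

    def longest_identical_substring(words):
        """Return the longest prefix shared by every string in words."""
        # Try the longest candidate first and shorten until all prefixes agree.
        for idx in range(len(words[0]), 0, -1):
            substrings = [w[:idx] for w in words]
            if max(substrings) == min(substrings):
                return substrings[0]
        return ''  # nothing in common

    transcripts = defaultdict(list)
    with open('myfile.csv') as infile:
        # csv.reader splits on commas by default; pass delimiter='\t' for a tab-separated file
        reader = csv.reader(infile)
        for row in reader:
            # a list cannot be a dict key, so convert the first three columns to a tuple
            transcripts[tuple(row[:3])].append(row[3])

    for (chrom, start, end), ts in transcripts.items():
        print(chrom, start, end, longest_identical_substring(ts))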
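For what it's worth, the standard library already ships a character-wise common-prefix helper, `os.path.commonprefix`, so the same grouping can be sketched without a hand-written loop. This is only an illustration: the `combine` name and the inline sample rows are made up for the example, and it assumes whitespace-separated columns as in the question's data.

```python
import os.path
from collections import defaultdict


def combine(lines):
    """Group rows by their first three columns and reduce the fourth
    column of each group to the longest prefix shared by all its values."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        groups[tuple(fields[:3])].append(fields[3])
    # os.path.commonprefix compares character by character, not by path component
    return {key: os.path.commonprefix(names) for key, names in groups.items()}


# Hypothetical sample: a few rows copied from the question's data
sample = [
    "chrI 128980 129130 F53G12.5b",
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 163220 163370 F56C11.2a",
    "chrI 163220 163370 F56C11.2b",
]
result = combine(sample)
for (chrom, start, end), name in result.items():
    print(chrom, start, end, name)
```

To run it on a real file, something like `combine(open('myfile.csv'))` should work, since iterating a file object yields its lines.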


One way with awk. You can pipe it to sort if needed.

Content of script.awk

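    # Key each record on the first three columns.
    {
        key = $1 " " $2 " " $3
    }
    key in a {
        # Key seen before: shrink the stored name to the prefix it shares with $4.
        prev = a[key]
        n = (length($4) < length(prev)) ? length($4) : length(prev)
        word = ""
        for (x = 1; x <= n; x++) {
            if (substr($4, x, 1) != substr(prev, x, 1))
                break    # stop at the first mismatch so only a true prefix is kept
            word = word substr($4, x, 1)
        }
        a[key] = word
        next
    }
    {
        # First time this key is seen: store the name as-is.
        a[key] = $4
    }
    END {
        for (x in a)
            print x, a[x]
    }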

Your file:

    $ cat file
    chrI    128980  129130  F53G12.5b
    chrI    132280  132430  F53G12.5c.2
    chrI    132280  132430  F53G12.5a
    chrI    132280  132430  F53G12.5b
    chrI    132280  132430  F53G12.5c.1
    chrI    133600  133750  F53G12.5c.2
    chrI    133600  133750  F53G12.5a
    chrI    133600  133750  F53G12.5b
    chrI    133600  133750  F53G12.5c.1
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2a
    chrI    163220  163370  F56C11.2b
    chrI    173900  174050  F56C11.6a
    chrI    173900  174050  F56C11.6b
    chrI    173900  174050  F56C11.6c
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2a
    chrI    190720  190870  Y48G1BL.2a

Output:

    $ awk -f script.awk file
    chrI 173900 174050 F56C11.6
    chrI 128980 129130 F53G12.5b
    chrI 182240 182390 F56C11.3
    chrI 139100 139250 F53G12.3
    chrI 136240 136390 F53G12.4
    chrI 132280 132430 F53G12.5
    chrI 163220 163370 F56C11.2
    chrI 184080 184230 Y48G1BL.2a
    chrI 190720 190870 Y48G1BL.2a
    chrI 133600 133750 F53G12.5
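If you would rather stay in Python, the same one-pass idea (keep a running common prefix per key instead of collecting every name first) could look roughly like the sketch below. The `shared_prefix` and `fold_prefixes` names and the sample rows are hypothetical, chosen only for this example. One difference from the awk version: a Python dict keeps insertion order, so keys come out in first-seen order, whereas the order of awk's `for (x in a)` is unspecified.

```python
def shared_prefix(a, b):
    """Return the leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break  # stop at the first mismatch
        n += 1
    return a[:n]


def fold_prefixes(lines):
    """Make one pass over the rows, keeping a running common prefix per key."""
    seen = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        key = " ".join(fields[:3])
        if key in seen:
            seen[key] = shared_prefix(seen[key], fields[3])
        else:
            seen[key] = fields[3]
    return seen


# Hypothetical sample: a few rows copied from the file above
rows = [
    "chrI 173900 174050 F56C11.6a",
    "chrI 173900 174050 F56C11.6b",
    "chrI 173900 174050 F56C11.6c",
    "chrI 184080 184230 Y48G1BL.2a",
]
folded = fold_prefixes(rows)
for key, name in folded.items():
    print(key, name)
```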


Without all of the debugging, here is a simple awk statement:

    awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    173900  174050  F56C11.6
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2

    awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort|uniq
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2

To sort by column, use sort -nrk: -n compares numerically, -r reverses the order, and -k selects the key columns, which in this case are 2 and 3.

    awk -F"." '{ trimmed=substr($2,1,1);print $1"."trimmed;}' test.txt |sort -nrk2,3|uniq
    chrI    190720  190870  Y48G1BL.2
    chrI    184080  184230  Y48G1BL.2
    chrI    182240  182390  F56C11.3
    chrI    173900  174050  F56C11.6
    chrI    163220  163370  F56C11.2
    chrI    139100  139250  F53G12.3
    chrI    136240  136390  F53G12.4
    chrI    133600  133750  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    128980  129130  F53G12.5
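If the sorting step ever needs to happen outside the shell, the intent of the pipeline above (drop duplicate rows, then order by columns 2 and 3 numerically, largest first) can be sketched in Python as follows. The `sort_rows` name and the sample rows are hypothetical, and this mirrors the intent rather than the exact tie-breaking rules of `sort`.

```python
def sort_rows(lines):
    """Drop duplicate rows, then order them by columns 2 and 3 as integers,
    largest first. Column 4 is only used to break ties deterministically."""
    unique = {tuple(line.split()) for line in lines if line.strip()}
    return sorted(unique, key=lambda r: (int(r[1]), int(r[2]), r[3]), reverse=True)


# Hypothetical sample of already-trimmed rows, including one duplicate
trimmed = [
    "chrI 132280 132430 F53G12.5",
    "chrI 190720 190870 Y48G1BL.2",
    "chrI 132280 132430 F53G12.5",
    "chrI 136240 136390 F53G12.4",
]
ordered = sort_rows(trimmed)
for row in ordered:
    print("\t".join(row))
```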

Updated to work on columns instead of splitting on the dot:

    awk  '{ if( match($4, /[0-9a-zA-Z]+\.[0-9a-zA-Z]/)) {  trimmed=substr($4,RSTART,RLENGTH); } else { trimmed=$4; } print $1"\t"$2"\t"$3"\t"trimmed;}' test.txt |sort|uniq
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2
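For comparison, here is a rough Python sketch of the same pattern-based trimming, reusing the regular expression from the awk command. The `trim_rows` name and the sample rows are hypothetical. Note that this approach truncates every name to "identifier, dot, one character", even when a row has no siblings, so `F53G12.5b` at 128980 becomes `F53G12.5`. That is different from a true common-prefix reduction, which would leave it as `F53G12.5b`.

```python
import re

# Same pattern as the awk command: alphanumerics, a dot, then one alphanumeric
NAME = re.compile(r"[0-9a-zA-Z]+\.[0-9a-zA-Z]")


def trim_rows(lines):
    """Truncate column 4 to the pattern match and return the unique rows, sorted."""
    out = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        m = NAME.search(fields[3])
        name = m.group(0) if m else fields[3]  # leave unmatched names unchanged
        out.add((fields[0], fields[1], fields[2], name))
    return sorted(out)


# Hypothetical sample: a few rows copied from the question's data
sample_rows = [
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 128980 129130 F53G12.5b",
    "chrI 184080 184230 Y48G1BL.2a",
]
trimmed_rows = trim_rows(sample_rows)
for row in trimmed_rows:
    print("\t".join(row))
```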