How to compare multiple lines in one file and output a combined entry
I suspect that this could be done pretty tidily with Pandas, much better than this, but I'm not very familiar with Pandas yet, so... submitted without debugging.
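    import csv
    from collections import defaultdict

    def longest_identical_substring(words):
        # Try the longest candidate prefix first and shrink it until every word agrees.
        for idx in range(len(words[0]), 0, -1):
            substrings = [w[:idx] for w in words]
            if max(substrings) == min(substrings):
                return substrings[0]
        return ""  # the words share no common prefix

    transcripts = defaultdict(list)
    with open('myfile.csv') as infile:
        # csv.reader splits on commas by default; pass delimiter=' ' or '\t'
        # if the file is whitespace-separated.
        reader = csv.reader(infile)
        for row in reader:
            # A list is unhashable, so the dictionary key has to be a tuple.
            transcripts[tuple(row[:3])].append(row[3])

    for (chrom, start, end), ts in transcripts.items():
        print(chrom, start, end, longest_identical_substring(ts))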
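As a possible simplification, the standard library's os.path.commonprefix already computes a character-by-character common prefix, so it could stand in for the helper above. This is only a minimal sketch, assuming whitespace-separated rows that are already in memory; the function name combine and the sample rows are illustrative, not part of the original answer.

```python
import os.path
from collections import defaultdict

def combine(lines):
    """Group rows by their first three columns and reduce the names to a common prefix."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows (an assumption about the input)
        groups[tuple(fields[:3])].append(fields[3])
    # os.path.commonprefix compares character by character, not by path component.
    return {key: os.path.commonprefix(names) for key, names in groups.items()}

# Illustrative rows taken from the sample data in this thread.
sample = [
    "chrI 128980 129130 F53G12.5b",
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 163220 163370 F56C11.2a",
    "chrI 163220 163370 F56C11.2b",
]

for (chrom, start, end), name in combine(sample).items():
    print(chrom, start, end, name)
```

A group with a single row keeps its full name, because the common prefix of one string is the string itself.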
One way with awk. You can pipe it to sort if needed.
Content of script.awk:

    (a[$1" "$2" "$3]) {
        t=0; word=""; delete w1; delete w2
        split($4,w1,"")
        split(a[$1" "$2" "$3],w2,"")
        t=(length($4)<length(a[$1" "$2" "$3]))?length($4):length(a[$1" "$2" "$3])
        for (x=1;x<=t;x++) {
            if (w1[x]==w2[x]) {
                word=word""w1[x]
            }
            a[$1" "$2" "$3]=word
        }
        next
    }
    {
        a[$1" "$2" "$3]=$4
    }
    END {
        for (x in a)
            print x,a[x]
    }
Your file:
    $ cat file
    chrI 128980 129130 F53G12.5b
    chrI 132280 132430 F53G12.5c.2
    chrI 132280 132430 F53G12.5a
    chrI 132280 132430 F53G12.5b
    chrI 132280 132430 F53G12.5c.1
    chrI 133600 133750 F53G12.5c.2
    chrI 133600 133750 F53G12.5a
    chrI 133600 133750 F53G12.5b
    chrI 133600 133750 F53G12.5c.1
    chrI 136240 136390 F53G12.4
    chrI 139100 139250 F53G12.3
    chrI 163220 163370 F56C11.2a
    chrI 163220 163370 F56C11.2b
    chrI 173900 174050 F56C11.6a
    chrI 173900 174050 F56C11.6b
    chrI 173900 174050 F56C11.6c
    chrI 182240 182390 F56C11.3
    chrI 184080 184230 Y48G1BL.2a
    chrI 190720 190870 Y48G1BL.2a
Output:
    $ awk -f script.awk file
    chrI 173900 174050 F56C11.6
    chrI 128980 129130 F53G12.5b
    chrI 182240 182390 F56C11.3
    chrI 139100 139250 F53G12.3
    chrI 136240 136390 F53G12.4
    chrI 132280 132430 F53G12.5
    chrI 163220 163370 F56C11.2
    chrI 184080 184230 Y48G1BL.2a
    chrI 190720 190870 Y48G1BL.2a
    chrI 133600 133750 F53G12.5
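Without all of the debugging, here is a simple awk statement. It splits each line on "." and keeps only the first character after the first dot, so it relies on the names having that shape: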
    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt
    chrI 128980 129130 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 136240 136390 F53G12.4
    chrI 139100 139250 F53G12.3
    chrI 163220 163370 F56C11.2
    chrI 163220 163370 F56C11.2
    chrI 173900 174050 F56C11.6
    chrI 173900 174050 F56C11.6
    chrI 173900 174050 F56C11.6
    chrI 182240 182390 F56C11.3
    chrI 184080 184230 Y48G1BL.2
    chrI 190720 190870 Y48G1BL.2

    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt | sort | uniq
    chrI 128980 129130 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 136240 136390 F53G12.4
    chrI 139100 139250 F53G12.3
    chrI 163220 163370 F56C11.2
    chrI 173900 174050 F56C11.6
    chrI 182240 182390 F56C11.3
    chrI 184080 184230 Y48G1BL.2
    chrI 190720 190870 Y48G1BL.2
To sort by columns, use sort -nrk: -n sorts numerically, -r reverses the order, and -k selects the key columns, which in this case are 2 through 3:
    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt | sort -nrk2,3 | uniq
    chrI 190720 190870 Y48G1BL.2
    chrI 184080 184230 Y48G1BL.2
    chrI 182240 182390 F56C11.3
    chrI 173900 174050 F56C11.6
    chrI 163220 163370 F56C11.2
    chrI 139100 139250 F53G12.3
    chrI 136240 136390 F53G12.4
    chrI 133600 133750 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 128980 129130 F53G12.5
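For readers who would rather do this step in Python, a rough equivalent of deduplicating the rows and ordering them by columns 2 and 3, numerically and in descending order, might look like the sketch below. The function name sort_rows and the sample rows are illustrative assumptions, and the sketch assumes every row has at least three whitespace-separated fields with integers in columns 2 and 3.

```python
def sort_rows(lines):
    """Deduplicate rows, then order them by columns 2 and 3, numerically, descending."""
    # dict.fromkeys drops duplicates while keeping first-seen order,
    # which keeps the result deterministic when two rows share a key.
    unique = list(dict.fromkeys(line.strip() for line in lines if line.strip()))

    def key(line):
        fields = line.split()
        return (int(fields[1]), int(fields[2]))

    return sorted(unique, key=key, reverse=True)

# Illustrative rows in the shape produced by the awk step above.
rows = [
    "chrI 128980 129130 F53G12.5",
    "chrI 132280 132430 F53G12.5",
    "chrI 132280 132430 F53G12.5",
    "chrI 190720 190870 Y48G1BL.2",
]

for row in sort_rows(rows):
    print(row)
```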
Updated version that works on the columns, matching the name in column 4 with a regular expression:
    awk '{ if( match($4, /[0-9a-zA-Z]+\.[0-9a-zA-Z]/)) { trimmed=substr($4,RSTART,RLENGTH); } print $1"\t"$2"\t"$3"\t"trimmed;}' test.txt | sort | uniq
    chrI 128980 129130 F53G12.5
    chrI 132280 132430 F53G12.5
    chrI 133600 133750 F53G12.5
    chrI 136240 136390 F53G12.4
    chrI 139100 139250 F53G12.3
    chrI 163220 163370 F56C11.2
    chrI 173900 174050 F56C11.6
    chrI 182240 182390 F56C11.3
    chrI 184080 184230 Y48G1BL.2
    chrI 190720 190870 Y48G1BL.2
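The same regular-expression idea can be sketched in Python with the re module. This is an illustration rather than a drop-in replacement: the function name truncate_names and the sample rows are assumptions, and unlike the awk one-liner, a name that does not match the pattern is kept unchanged instead of reusing the previous row's value.

```python
import re

# Same pattern as the awk command: alphanumerics, a dot, then one alphanumeric.
NAME_RE = re.compile(r"[0-9a-zA-Z]+\.[0-9a-zA-Z]")

def truncate_names(lines):
    """Trim column 4 to the matched prefix, then return the unique rows, sorted."""
    rows = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows (an assumption about the input)
        match = NAME_RE.search(fields[3])
        name = match.group(0) if match else fields[3]
        rows.add("\t".join(fields[:3] + [name]))
    return sorted(rows)

# Illustrative rows taken from the sample data in this thread.
sample = [
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 184080 184230 Y48G1BL.2a",
]

for row in truncate_names(sample):
    print(row)
```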