How to compare multiple lines in one file and output a combined entry

python perl unix awk bioinformatics

I suspect that this could be done pretty tidily with Pandas, much better than this, but I'm not very familiar with Pandas yet, so... submitted without debugging.

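    import csv
    from collections import defaultdict

    def longest_identical_substring(words):
        # Try prefixes of the first word from longest to shortest and
        # return the first one that every word shares.
        for idx in range(len(words[0]), 0, -1):
            substrings = [w[:idx] for w in words]
            if max(substrings) == min(substrings):
                return substrings[0]
        return ''

    transcripts = defaultdict(list)
    with open('myfile.csv') as infile:
        # csv.reader splits on commas by default; pass delimiter='\t' for tab-separated input
        reader = csv.reader(infile)
        for row in reader:
            # lists are not hashable, so the grouping key must be a tuple
            transcripts[tuple(row[:3])].append(row[3])

    for (chrom, start, end), ts in transcripts.items():
        print(chrom, start, end, longest_identical_substring(ts))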
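As a variation on the same idea, the standard library's `os.path.commonprefix` already computes a character-wise common prefix, so the helper can be dropped. This is only a sketch: the function name `combine_entries` and the sample rows are made up for illustration, and it assumes the input is whitespace-separated with the name in the fourth column.

```python
import os
from collections import defaultdict

def combine_entries(lines):
    """Group whitespace-separated rows by (chrom, start, end) and reduce
    each group's names to their longest common prefix."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        groups[tuple(fields[:3])].append(fields[3])
    # os.path.commonprefix compares character by character, not by path component
    return {key: os.path.commonprefix(names) for key, names in groups.items()}

# Hypothetical sample rows in the same layout as the question's file
sample = [
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 163220 163370 F56C11.2a",
    "chrI 163220 163370 F56C11.2b",
    "chrI 136240 136390 F53G12.4",
]

for (chrom, start, end), name in combine_entries(sample).items():
    print(chrom, start, end, name)
```

To run it on a real file, pass the open file object instead of `sample`, since iterating a file yields its lines.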


One way with awk. You can pipe it to sort if needed.

Content of `script.awk`

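    # Key already seen: shrink the stored name to the prefix it shares with $4.
    (($1" "$2" "$3) in a) {
        key = $1" "$2" "$3
        word = ""; delete w1; delete w2
        split($4, w1, "")          # splitting on "" into characters is a gawk extension
        split(a[key], w2, "")
        t = (length($4) < length(a[key])) ? length($4) : length(a[key])
        for (x = 1; x <= t; x++) {
            if (w1[x] != w2[x]) break   # stop at the first mismatch
            word = word w1[x]
        }
        a[key] = word
        next
    }
    # First time this key is seen: store the name as is.
    {
        a[$1" "$2" "$3] = $4
    }
    END {
        for (x in a) print x, a[x]
    }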

Your file:

    $ cat file
    chrI    128980  129130  F53G12.5b
    chrI    132280  132430  F53G12.5c.2
    chrI    132280  132430  F53G12.5a
    chrI    132280  132430  F53G12.5b
    chrI    132280  132430  F53G12.5c.1
    chrI    133600  133750  F53G12.5c.2
    chrI    133600  133750  F53G12.5a
    chrI    133600  133750  F53G12.5b
    chrI    133600  133750  F53G12.5c.1
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2a
    chrI    163220  163370  F56C11.2b
    chrI    173900  174050  F56C11.6a
    chrI    173900  174050  F56C11.6b
    chrI    173900  174050  F56C11.6c
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2a
    chrI    190720  190870  Y48G1BL.2a

Output:

    $ awk -f script.awk file
    chrI 173900 174050 F56C11.6
    chrI 128980 129130 F53G12.5b
    chrI 182240 182390 F56C11.3
    chrI 139100 139250 F53G12.3
    chrI 136240 136390 F53G12.4
    chrI 132280 132430 F53G12.5
    chrI 163220 163370 F56C11.2
    chrI 184080 184230 Y48G1BL.2a
    chrI 190720 190870 Y48G1BL.2a
    chrI 133600 133750 F53G12.5


Without all of the debugging, here is a simple awk statement. It splits each line on "." and keeps only the first character after the dot:

    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    173900  174050  F56C11.6
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2

    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt | sort | uniq
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2

To sort by column, pipe through `sort -nrk2,3`: `-n` sorts numerically, `-r` reverses the order, and `-k` selects the key columns, which in this case are 2 through 3.

    awk -F"." '{ trimmed=substr($2,1,1); print $1"."trimmed; }' test.txt | sort -nrk2,3 | uniq
    chrI    190720  190870  Y48G1BL.2
    chrI    184080  184230  Y48G1BL.2
    chrI    182240  182390  F56C11.3
    chrI    173900  174050  F56C11.6
    chrI    163220  163370  F56C11.2
    chrI    139100  139250  F53G12.3
    chrI    136240  136390  F53G12.4
    chrI    133600  133750  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    128980  129130  F53G12.5
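For anyone doing the post-processing in Python rather than in the shell, the de-duplicate and reverse numeric sort step might look roughly like the sketch below. The function name `sort_desc_by_coords` and the sample rows are hypothetical, and it sorts on start then end as integers, which approximates rather than exactly reproduces what `sort -nrk2,3` does.

```python
def sort_desc_by_coords(rows):
    """Drop duplicate rows, then sort by start and end as integers, largest first."""
    unique = set(tuple(r.split()) for r in rows)
    return sorted(unique, key=lambda f: (int(f[1]), int(f[2])), reverse=True)

# Hypothetical rows that have already been trimmed
rows = [
    "chrI 128980 129130 F53G12.5",
    "chrI 190720 190870 Y48G1BL.2",
    "chrI 132280 132430 F53G12.5",
    "chrI 132280 132430 F53G12.5",
]

for fields in sort_desc_by_coords(rows):
    print("\t".join(fields))
```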

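Updated to work on the columns, matching the name in the fourth column with a regular expression instead of splitting the whole line on ".":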

    awk '{ trimmed=$4; if (match($4, /[0-9a-zA-Z]+\.[0-9a-zA-Z]/)) { trimmed=substr($4,RSTART,RLENGTH); } print $1"\t"$2"\t"$3"\t"trimmed; }' test.txt | sort | uniq
    chrI    128980  129130  F53G12.5
    chrI    132280  132430  F53G12.5
    chrI    133600  133750  F53G12.5
    chrI    136240  136390  F53G12.4
    chrI    139100  139250  F53G12.3
    chrI    163220  163370  F56C11.2
    chrI    173900  174050  F56C11.6
    chrI    182240  182390  F56C11.3
    chrI    184080  184230  Y48G1BL.2
    chrI    190720  190870  Y48G1BL.2
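The same regular-expression trimming can be sketched in Python. The names `NAME_RE`, `trim_name` and `trim_rows` are invented for this example, and the pattern is copied from the awk command, so it shares the same limitation: it keeps only one character after the dot, which means a name such as `F53G12.10a` would be cut to `F53G12.1`.

```python
import re

# Same pattern as the awk version: letters/digits, a dot, then one letter/digit
NAME_RE = re.compile(r"[0-9a-zA-Z]+\.[0-9a-zA-Z]")

def trim_name(name):
    """Return the matched part of the name, or the name unchanged if no match."""
    m = NAME_RE.search(name)
    return m.group(0) if m else name

def trim_rows(lines):
    """Trim the fourth column of each row and drop duplicates, keeping input order."""
    seen = set()
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        row = (fields[0], fields[1], fields[2], trim_name(fields[3]))
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

# Hypothetical sample rows in the same layout as test.txt
sample = [
    "chrI 132280 132430 F53G12.5c.2",
    "chrI 132280 132430 F53G12.5a",
    "chrI 184080 184230 Y48G1BL.2a",
]

for row in trim_rows(sample):
    print("\t".join(row))
```

Unlike the common-prefix answers, this approach never compares rows with each other; it relies entirely on the naming pattern holding for every entry.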
