How to concatenate files that have the same beginning of a name?


I will assume that the logic behind the naming is that the species are the first three words separated by underscores. I will also assume that there are no blank spaces in the filenames.

A possible strategy is to get a list of all the species prefixes, and then concatenate all the files that share a given prefix into a single file:

for specie in $(ls *.fasta | cut -f1-3 -d_ | sort -u)
do
    cat "$specie"*.fasta > "$specie.fasta"
done

In this code, you list all the fasta files, cut out the species ID (the first three underscore-separated fields) and generate a unique list of species. Then you loop over this list and, for every species, concatenate all the files that start with that species ID into a single file named after the species.
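To make the prefix extraction concrete, here is a minimal sketch that builds a scratch directory with a few made-up file names (they are assumptions for illustration, not taken from the question) and prints the list the loop would iterate over:

```shell
# Scratch directory with hypothetical example files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>a\nAAAA\n' > Homo_sapiens_cc21_part1.fasta
printf '>b\nCCCC\n' > Homo_sapiens_cc21_part2.fasta
printf '>c\nGGGG\n' > Mus_musculus_x7_part1.fasta

# Keep the first three underscore-separated fields and de-duplicate them.
prefixes=$(ls *.fasta | cut -f1-3 -d_ | sort -u)
echo "$prefixes"
```

With these three files the list contains two entries, `Homo_sapiens_cc21` and `Mus_musculus_x7`, so the loop body would run once per species.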

More robust solutions can be written using find and avoiding ls, but they are more verbose and potentially less clear:

while IFS= read -r -d '' specie
do
    cat "$specie"*.fasta > "$specie.fasta"
done < <(find . -maxdepth 1 -name "*.fasta" -print0 | cut -z -f2 -d/ | cut -z -f1-3 -d_ | sort -zu)
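A possible middle ground, sketched here under some assumptions, is to let the shell glob do the listing and derive the prefix with parameter expansion, so neither ls nor find is needed. The `merged/` output directory and the example file names are hypothetical; writing into a separate directory keeps the results from being picked up as inputs if the loop is run again.

```shell
# Scratch directory with hypothetical example files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > Homo_sapiens_cc21_a.fasta
printf 'two\n' > Homo_sapiens_cc21_b.fasta
printf 'three\n' > Mus_musculus_x7_a.fasta

mkdir -p merged                  # assumed output directory, not part of the original answer
for f in *_*_*_*.fasta
do
    [ -e "$f" ] || continue      # skip the literal pattern if nothing matched
    rest=${f#*_*_*_}             # everything after the third underscore
    prefix=${f%"_$rest"}         # the first three underscore-separated fields
    cat "$f" >> "merged/$prefix.fasta"
done
```

Because the loop appends with `>>`, `merged/` should be emptied before running it a second time, otherwise each sequence ends up in the output twice.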


As stated in my comment above, if you know all your basenames and don't mind entering them explicitly, a simple solution would be

for f in Homo_sapiens_cc21_*.fasta; do cat "$f" >> Homo_sapiens_cc21.fasta; done

Since this is not the case, you need to find a common pattern by which to group the output. From your examples (EDIT: and your comment), it looks like this could be three words, each followed by an underscore.

Assuming this pattern is correct, this would probably do what you require:

for f in *.fasta; do cat "$f" >> "$(echo "$f" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')"; done

Explanation:

  1. Loop over all the *.fasta files
  2. Construct a file name from the prefix. We do this by piping through awk, telling it to split the input by _ (-F'_') and putting it back together ('{print $1"_"$2"_"$3".fasta"}')
  3. Finally we cat the current file and redirect the output to the newly constructed file name
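As a small illustration of step 2, the awk call can be tried on a single name before running the whole loop. The file name below is a made-up example:

```shell
f=Homo_sapiens_cc21_part1.fasta   # hypothetical input file name

# Split on "_" and glue the first three fields back together, adding ".fasta".
target=$(echo "$f" | awk -F'_' '{print $1"_"$2"_"$3".fasta"}')
echo "$target"   # prints Homo_sapiens_cc21.fasta
```

Every input that shares the same first three fields maps to the same target, which is why appending with `>>` collects them in one file. Since the targets also end in .fasta, running the loop a second time in the same directory would append duplicates, so it is safest to run it once on a clean set of inputs.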