How to match for multiple patterns in the specific column?


Use a regular expression:

awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/' file

This uses $1 ~ /^pattern$/ to choose the lines whose first field consists of exactly pattern (note ^ for the beginning and $ for the end).

The pattern is of the form chr(..|..|..), meaning: match chr followed by any one of the |-separated alternatives within ().

Each alternative can be one of:

  • a digit, optionally preceded by 1, so 0 through 19 (1?[0-9])
  • 2 followed by any of 0, 1, 2, so 20 through 22 (2[0-2])
  • X
  • Y

Demo with an automatic explanation: https://regex101.com/r/gH1kS4/2
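To try the filter without a real data file, here is a minimal sketch. The sample rows (chromosome name in column 1, an arbitrary number in column 2) are made up for illustration.

```shell
# Hypothetical sample rows; awk splits fields on whitespace by default.
matched=$(printf '%s\n' 'chr1 100' 'chr10 200' 'chr22 300' 'chr23 400' \
                        'chrX 500' 'chrUn_random 600' 'chr1_alt 700' |
    awk '$1 ~ /^chr(1?[0-9]|2[0-2]|X|Y)$/')

# Rows whose first field is not exactly one of the accepted names are dropped.
printf '%s\n' "$matched"
```

Here chr23, chrUn_random and chr1_alt are filtered out: the $ anchor rejects anything that continues after the chromosome name.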


If you want something easier to maintain (for example, when editing patterns or adding new ones) and easier to understand, especially if you are new to regular expressions, put the patterns in a file and use the grep -f match.list input.txt form:

Create a file with the patterns you want to match (match.list):

^chr[1-9][[:space:]]\|      # this matches chr1-chr9
^chr1[0-9][[:space:]]\|     # this matches chr10-chr19
^chr2[0-2][[:space:]]\|     # this matches chr20-chr22
^chr[XY][[:space:]]\|       # this matches chrX and chrY
new_string_or_pattern\|     # ... your new pattern ...

Then just call grep like this:

grep -f match.list input.txt
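For a quick end-to-end check, here is a minimal sketch that builds a comment-free pattern file for chr1-chr22, chrX and chrY plus a small input file in a temporary directory. The sample rows and the temporary directory are assumptions for illustration only.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)

# One basic regular expression per line.
cat > "$workdir/match.list" <<'EOF'
^chr[1-9][[:space:]]
^chr1[0-9][[:space:]]
^chr2[0-2][[:space:]]
^chr[XY][[:space:]]
EOF

# Hypothetical input: chromosome name in column 1, followed by whitespace.
printf '%s\n' 'chr1 100' 'chr10 200' 'chr22 300' 'chr23 400' \
              'chrX 500' 'chr1_alt 700' > "$workdir/input.txt"

# A line is printed if it matches at least one pattern from the file.
matched=$(grep -f "$workdir/match.list" "$workdir/input.txt")
printf '%s\n' "$matched"

rm -r "$workdir"
```

The trailing [[:space:]] in each pattern is what keeps chr1 from also matching chr1_alt or chr10.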

As you can see above, you can even add comments to the list of patterns, using the \| trick (ending each pattern with \|), so you can remember what you did yesterday or where you found the regex. This works because \| is alternation in GNU grep's basic regular expressions, so the comment text simply becomes a second alternative that is unlikely to occur in your input. You can add new fixed strings or patterns just by adding new lines. And if you find it difficult to write a complex regex, you can create a pattern file with only the fixed strings you want to match:

^chrX
^chrY
...
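If the \| trick feels fragile, one alternative (a different technique, sketched here under assumptions) is to keep ordinary # comments in the pattern file and strip them before grep reads it. This assumes no pattern itself contains a # character; the file name match.commented and the sample rows are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical pattern file with ordinary trailing comments and a blank line.
cat > "$workdir/match.commented" <<'EOF'
^chr[1-9][[:space:]]     # chr1-chr9
^chr1[0-9][[:space:]]    # chr10-chr19

^chr[XY][[:space:]]      # chrX and chrY
EOF

# Remove each comment together with the blanks in front of it, then drop
# blank lines: an empty pattern would match every input line.
sed -e 's/[[:space:]]*#.*$//' -e '/^[[:space:]]*$/d' \
    "$workdir/match.commented" > "$workdir/match.list"

printf '%s\n' 'chr2 100' 'chr15 200' 'chr30 300' 'chrY 400' > "$workdir/input.txt"
matched=$(grep -f "$workdir/match.list" "$workdir/input.txt")
printf '%s\n' "$matched"

rm -r "$workdir"
```

The cleaned match.list contains only the three patterns, so grep never sees the comment text at all.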

Another benefit of this approach is that you may maintain several pattern files, representing different sub-queries you may need to run daily. E.g.

grep -f chromosomes_n input.txt
grep -f chromosomes_xy input.txt
grep -f chromosomes_random input.txt
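Several pattern files can also be run in one loop, with each result kept in its own file. This is only a sketch: the contents of the two pattern files, the .out suffix and the sample rows are assumptions.

```shell
workdir=$(mktemp -d)

# Hypothetical pattern files, one sub-query each.
printf '%s\n' '^chr[0-9][0-9]*[[:space:]]' > "$workdir/chromosomes_n"
printf '%s\n' '^chr[XY][[:space:]]' > "$workdir/chromosomes_xy"
printf '%s\n' 'chr1 100' 'chr12 200' 'chrX 300' 'chrM 400' > "$workdir/input.txt"

# Run every sub-query in turn and write its matches to <name>.out.
for list in chromosomes_n chromosomes_xy; do
    grep -f "$workdir/$list" "$workdir/input.txt" > "$workdir/$list.out"
done

n_rows=$(cat "$workdir/chromosomes_n.out")
xy_rows=$(cat "$workdir/chromosomes_xy.out")
printf 'numbered:\n%s\nX/Y:\n%s\n' "$n_rows" "$xy_rows"

rm -r "$workdir"
```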

The only drawback of the approach is that grep will get slower if you add more than a dozen patterns in each file. But that will be a problem only if your input file has hundreds of thousands of lines.
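When the names to keep are exact strings rather than regular expressions, one possible alternative is a lookup table in awk. This swaps grep for a different technique: awk reads the list once into an array and then tests column 1 of each row with a single lookup, so the work per row should not grow with the length of the list. The file names wanted.txt and input.txt and the sample rows are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical list of exact names to keep, one per line.
printf '%s\n' chr1 chr2 chrX > "$workdir/wanted.txt"
printf '%s\n' 'chr1 100' 'chr11 200' 'chrX 300' 'chrM 400' > "$workdir/input.txt"

# NR == FNR holds only while the first file is being read: remember each name.
# For the second file, print rows whose first field is one of those names.
matched=$(awk 'NR == FNR { wanted[$1]; next } $1 in wanted' \
    "$workdir/wanted.txt" "$workdir/input.txt")
printf '%s\n' "$matched"

rm -r "$workdir"
```

Because the comparison is exact, chr11 is not mistaken for chr1 and no anchors are needed.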


You can use this simplified regex with grep:

grep "^chr\(1\?[0-9]\|2[012]\|[XY]\)[[:space:]]" filename

The logic is contained within the parentheses \(..\):

  • 1\?[0-9] - match 0-9 optionally preceded by 1
  • 2[012] - match 2 followed by 0, 1 or 2
  • [XY] - match X or Y
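The same expression can also be written as an extended regular expression with grep -E, which avoids the backslashes. A minimal sketch, with made-up sample rows standing in for the real file:

```shell
# Hypothetical sample rows: chromosome name, whitespace, then a value.
matched=$(printf '%s\n' 'chr2 100' 'chr19 200' 'chr21 300' 'chr25 400' \
                        'chrY 500' 'chrM 600' |
    grep -E '^chr(1?[0-9]|2[012]|[XY])[[:space:]]')

# chr25 and chrM match none of the three alternatives and are dropped.
printf '%s\n' "$matched"
```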