Python or awk/sed for cleaning data [closed]


Not to spoil your adventure, but I'd say no and here is why:

  • R is vectorised where sed/awk are not
  • R already supports both Perl-compatible and extended regular expressions
  • R gives you direct access to statistical routines (say, imputation) if you need them
  • R can visualize, summarize, ...

and most importantly: you already know R.

That said, of course sed/awk are great for small programs or even one-liners, and Python is a fine language. But I would consider sticking with R.


I use Python and Perl regularly. I know sed fairly well and once used awk a lot. I've used R in fits and starts. Of the bunch, Perl is the best for data transformation, in both functionality and speed.

  • Perl can do essentially everything sed and awk can do, and a lot more besides. (In fact, a2p and s2p, which shipped with Perl through 5.20 and have since moved to CPAN, convert awk and sed scripts to Perl.)
  • Perl is included with most Linux/Unix systems. When that wasn't the case, there was good reason to learn sed and awk. That reason is long dead.
  • Perl has a rich set of modules that provide much more power than one can get from awk or sed. For example, these modules enable one-liners that reverse complement DNA sequences, compute statistics, parse CSV files, or calculate MD5s. (see http://cpan.org/ for packages)
  • Perl is essentially as terse as sed and awk. For people like me (and, I suspect, you), quickly transforming data on the command line is a great boon. Python's too wordy for efficient command line use.
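
For comparison, here is roughly what three of the tasks named above (reverse complement, MD5, CSV parsing) look like in Python using only the standard library. This is a minimal sketch, not anyone's canonical implementation; the function names are made up for illustration, and the reverse complement only handles plain A/C/G/T bases.

```python
import csv
import hashlib
import io

# Translation table mapping each base to its complement (upper and lower case).
# Assumption: only plain A/C/G/T; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def parse_csv(text: str) -> list[list[str]]:
    """Parse CSV text into rows, honouring quoted fields."""
    return list(csv.reader(io.StringIO(text)))
```

None of this is hard, but each one needs an import and a few lines, which is the wordiness point: fine in a script, clumsy as a throwaway command typed at the prompt.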

I'm honestly at a loss to think why one would learn sed and awk over Perl.

For the record, I'm not "a Perl guy". I like it as a Swiss Army knife, not as a religion.


I would recommend sed/awk along with the wealth of other command-line tools available on Unix-like platforms: comm, tr, sort, cut, join, grep, and built-in shell capabilities such as looping. You really don't need to learn another programming language, as R can handle data manipulation as well as, if not better than, the other popular scripting languages.
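
To give a feel for how much those small tools do for you, here is a rough Python equivalent of one short pipeline, `grep -v '^#' | cut -d, -f2 | sort`. It is only a sketch under a few assumptions: the function name is hypothetical, the input is plain comma-separated text with no quoting, and the sort is by code point, so the order can differ from a locale-aware `sort`.

```python
def second_field_sorted(lines, delimiter=","):
    """Roughly: grep -v '^#' | cut -d, -f2 | sort."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        # grep -v '^#': drop comment lines
        if line.startswith("#"):
            continue
        parts = line.split(delimiter)
        # cut prints the whole line when the delimiter is absent
        fields.append(parts[1] if len(parts) > 1 else line)
    # sort: plain code-point ordering (assumption, see above)
    return sorted(fields)


rows = ["# id,fruit\n", "1,pear\n", "2,apple\n", "3,fig\n"]
print(second_field_sorted(rows))
```

The pipeline is one line at the prompt; the script version is a dozen, which is why these tools are worth knowing for quick jobs.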