Trouble formulating a regular expression for use with sed to extract column values


Use the right tool for the job. If you're processing columns, awk is a better solution:

ls -la | awk '{print $5}'

Given your ls -la output, that should generate:

Size264096
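
As a rough illustration, here is the same awk command run against a small made-up listing. The column layout and values below are assumptions for demonstration, not the OP's actual output:

```shell
# Hypothetical listing (assumed layout: size in column 5, uneven spacing on purpose).
listing=$(printf '%s\n' \
  'Permissions Links Owner Group Size Date Time Name' \
  'drwxr-xr-x 2 pax pax 4096 2013-05-01 10:00 docs' \
  '-rw-r--r-- 1 pax  pax   26 2013-05-02 11:30 notes.txt')

# awk splits each line on runs of blanks by default, so $5 is the size column
# no matter how many spaces pad the columns.
awk_sizes=$(printf '%s\n' "$listing" | awk '{print $5}')
printf '%s\n' "$awk_sizes"
```

With this sample, the header word and the two sizes come out one per line.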

If, for some bizarre reason, you cannot use the correct tool, the following sed command will work, but it's rather ugly:

sed 's/[ \t]*[0-9][0-9][0-9][0-9]-.*//;s/[ \t]*Date.*//;s/^.*[ \t]//'

It works by first removing everything from the year column (four digits followed by a hyphen), together with any preceding tabs/spaces, to the end of the line.

Then it does the same for the header, deleting from the Date heading onward.

Finally, it removes everything from the start of the line to the last remaining tab/space, which now sits just before the size column.
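
To make the three substitutions easier to follow, here is a sketch that applies them one at a time. The listing is hypothetical (space-separated, with an assumed header row and YYYY-MM-DD dates), so treat it as an illustration rather than the OP's real data:

```shell
# Hypothetical input; layout and values are assumptions for illustration only.
listing=$(printf '%s\n' \
  'Permissions Links Owner Group Size Date Time Name' \
  'drwxr-xr-x 2 pax pax 4096 2013-05-01 10:00 docs' \
  '-rw-r--r-- 1 pax pax 26 2013-05-02 11:30 notes.txt')

# Step 1: cut from the blanks before the year (four digits and a hyphen) to end of line.
step1=$(printf '%s\n' "$listing" | sed 's/[ \t]*[0-9][0-9][0-9][0-9]-.*//')

# Step 2: cut the header from the blanks before "Date" to end of line.
step2=$(printf '%s\n' "$step1" | sed 's/[ \t]*Date.*//')

# Step 3: drop everything up to the last remaining blank, leaving only the size column.
step3=$(printf '%s\n' "$step2" | sed 's/^.*[ \t]//')

printf '%s\n' "$step3"
```

After step 1 the data lines end at the size, after step 2 the header ends at Size, and step 3 leaves just that last column on every line.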

I know which one I'd prefer to write and maintain :-)


The general caveat applies: awk is the better tool for the job.

Here's a simpler sed solution:

ls -la | sed -E 's/^(([^[:space:]]+)[[:space:]]+){5}.*/\2/'
  • works with both spaces and tabs between columns
  • takes advantage of repeating capture groups only reporting the last captured instance - in this case, the 5th column
  • caveat: will not work correctly with filenames with embedded spaces
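
A small sketch of the "last captured instance" behaviour, using a made-up six-column line with mixed tabs and spaces (the single-letter columns are placeholders, not real ls output):

```shell
# Hypothetical line: columns a..f separated by a mix of tabs and spaces.
line=$(printf 'a\tb  c\td e f')

# The outer group repeats 5 times; \2 keeps only what the inner group
# captured on the final repetition, i.e. the 5th column.
fifth=$(printf '%s\n' "$line" | sed -E 's/^(([^[:space:]]+)[[:space:]]+){5}.*/\2/')

# Changing the repeat count selects a different column, here the 3rd.
third=$(printf '%s\n' "$line" | sed -E 's/^(([^[:space:]]+)[[:space:]]+){3}.*/\2/')

printf '%s %s\n' "$fifth" "$third"
```

So the repeat count in the braces is effectively the column number being extracted.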

If only spaces separate the columns, which is the case with ls output, the command simplifies to:

ls -la | sed -E 's/^(([^ ]+)[ ]+){5}.*/\2/' 

To skip the first input line you have several options, but the simplest is to prepend 1d to your sed program:

ls -la | sed -E '1d; s/^(([^ ]+)[ ]+){5}.*/\2/'

(Other options:

Use tail to skip the first line:

ls -la | tail -n +2 | sed -E 's/^(([^ ]+)[ ]+){5}.*/\2/'

More generically, use sed to ignore lines that do not have at least 5 columns:

ls -la | sed -E -n 's/^(([^ ]+)[ ]+){5}.*/\2/p'
  • -n suppresses default output
  • appending p to the substitution command only produces output if a substitution was made

)
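
As a quick check of the -n / p variant, here is a sketch with two made-up input lines, one too short to match and one with a full set of columns (both lines are assumptions for illustration):

```shell
# Hypothetical input: a short "total" line followed by one full entry.
input=$(printf '%s\n' \
  'total 8' \
  '-rw-r--r-- 1 pax pax 26 May  1 10:00 notes.txt')

# -n turns off automatic printing; the trailing p prints a line only when
# the substitution succeeded, so the short line is silently dropped.
only_matches=$(printf '%s\n' "$input" | sed -E -n 's/^(([^ ]+)[ ]+){5}.*/\2/p')

printf '%s\n' "$only_matches"
```

Without -n and p, the short line would pass through unchanged and end up mixed in with the sizes.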

To show only the 3 largest files (a requirement added later by the OP), courtesy of @JS웃:

ls -la | sed -E '2d; s/^(([^ ]+)[ ]+){5}.*/\2/' | sort -nr | head -3

The above will not output the header line, however. To include the header line, use (courtesy of this unix.stackexchange.com answer):

ls -la | sed -E '1d; s/^(([^ ]+)[ ]+){5}.*/\2/' | { IFS= read -r l; echo "$l"; sort -nr | head -3; }
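
The header-preserving trick can be tried in isolation. This is a minimal sketch on made-up values (a header word followed by four sizes); the variable names are only for illustration:

```shell
# Hypothetical column of values with a header on the first line.
sizes=$(printf '%s\n' 'Size' '26' '4096' '512' '100')

# read consumes just the first line (the header) from the pipe and prints it;
# sort and head then see only the remaining lines.
top3=$(printf '%s\n' "$sizes" |
  { IFS= read -r header; printf '%s\n' "$header"; sort -nr | head -n 3; })

printf '%s\n' "$top3"
```

The braces group the commands so they all share the same standard input, which is what lets sort pick up where read left off.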


Here is another way with GNU sed:

ls -la | sed -r '1d;s/([^ ]+ *){4}([^ ]+).*/\2/' 

If your version of sed does not support the -r option, use -E instead.