2 April 2019

Filter Lists of Files in bash Using sed

by Aidan Heerdegen

We had a recent query on the Slack support channel asking how, when using bash, to programatically generate a listing of files when the year was part the filename, given a begin year and end year.

What do I mean by that? Well the files looked a bit like this:

file.0197-03.nc

so in this case the file is the third month of year 197.

The question was: in a script where the first and last year are stored as variables, how to easily make a list of files that can be looped over in bash?

Fast answer

Use eval and brace expansion in combination with wildcard globbing, e.g.

$ eval ls file.*{$begin..$end}-*.nc

(NB: The $ sign at the beginning of a line signifies the unix prompt, to delineate between a command and it’s output)

If you want to know why, read on …

Making a test directory

First step, make a test directory to test solutions like so:

$ mkdir tmp
$ touch tmp/file.{0197..0202}-{01..12}.nc
$ ls tmp

This has created the following files:

$ ls tmp
file.0197-01.nc  file.0198-07.nc  file.0200-01.nc  file.0201-07.nc
file.0197-02.nc  file.0198-08.nc  file.0200-02.nc  file.0201-08.nc
file.0197-03.nc  file.0198-09.nc  file.0200-03.nc  file.0201-09.nc
file.0197-04.nc  file.0198-10.nc  file.0200-04.nc  file.0201-10.nc
file.0197-05.nc  file.0198-11.nc  file.0200-05.nc  file.0201-11.nc
file.0197-06.nc  file.0198-12.nc  file.0200-06.nc  file.0201-12.nc
file.0197-07.nc  file.0199-01.nc  file.0200-07.nc  file.0202-01.nc
file.0197-08.nc  file.0199-02.nc  file.0200-08.nc  file.0202-02.nc
file.0197-09.nc  file.0199-03.nc  file.0200-09.nc  file.0202-03.nc
file.0197-10.nc  file.0199-04.nc  file.0200-10.nc  file.0202-04.nc
file.0197-11.nc  file.0199-05.nc  file.0200-11.nc  file.0202-05.nc
file.0197-12.nc  file.0199-06.nc  file.0200-12.nc  file.0202-06.nc
file.0198-01.nc  file.0199-07.nc  file.0201-01.nc  file.0202-07.nc
file.0198-02.nc  file.0199-08.nc  file.0201-02.nc  file.0202-08.nc
file.0198-03.nc  file.0199-09.nc  file.0201-03.nc  file.0202-09.nc
file.0198-04.nc  file.0199-10.nc  file.0201-04.nc  file.0202-10.nc
file.0198-05.nc  file.0199-11.nc  file.0201-05.nc  file.0202-11.nc
file.0198-06.nc  file.0199-12.nc  file.0201-06.nc  file.0202-12.nc

The above command uses brace expansion, which I will explain later.

How to list files within a given year range?

So, the question was, given a begin year and end year, say

begin=199
end=201

how can you easily obtain an ordered list of all the months in all the years between, and including the two years?

Well firstly, the year is encoded in the filename as a left zero-padded 4 digit number, so the first step is to use printf to perform this transformation:

$ begin=$(printf "%04d" $begin)
$ end=$(printf "%04d" $end)
$ echo $begin $end
0199 0201

I haven’t simply pre-pended a zero at the beginning of the year otherwise it will not work for years > 999.

Globs

Shell globbing uses wildcards and a process called “filename expansion” which is very useful and can be used to easily match the required pattern:

$ ls tmp/file.0199-*.nc tmp/file.0200-*.nc tmp/file.0201-*.nc
tmp/file.0199-01.nc  tmp/file.0200-01.nc  tmp/file.0201-01.nc
tmp/file.0199-02.nc  tmp/file.0200-02.nc  tmp/file.0201-02.nc
tmp/file.0199-03.nc  tmp/file.0200-03.nc  tmp/file.0201-03.nc
tmp/file.0199-04.nc  tmp/file.0200-04.nc  tmp/file.0201-04.nc
tmp/file.0199-05.nc  tmp/file.0200-05.nc  tmp/file.0201-05.nc
tmp/file.0199-06.nc  tmp/file.0200-06.nc  tmp/file.0201-06.nc
tmp/file.0199-07.nc  tmp/file.0200-07.nc  tmp/file.0201-07.nc
tmp/file.0199-08.nc  tmp/file.0200-08.nc  tmp/file.0201-08.nc
tmp/file.0199-09.nc  tmp/file.0200-09.nc  tmp/file.0201-09.nc
tmp/file.0199-10.nc  tmp/file.0200-10.nc  tmp/file.0201-10.nc
tmp/file.0199-11.nc  tmp/file.0200-11.nc  tmp/file.0201-11.nc
tmp/file.0199-12.nc  tmp/file.0200-12.nc  tmp/file.0201-12.nc

but that is difficult to generate programmatically given arbitrary start and end years.

It can be done using seq:

$ for year in $(seq -f "%04.0f" $begin $end); do echo $year; ls tmp/*$year-*.nc; done
0199
tmp/file.0199-01.nc  tmp/file.0199-05.nc  tmp/file.0199-09.nc
tmp/file.0199-02.nc  tmp/file.0199-06.nc  tmp/file.0199-10.nc
tmp/file.0199-03.nc  tmp/file.0199-07.nc  tmp/file.0199-11.nc
tmp/file.0199-04.nc  tmp/file.0199-08.nc  tmp/file.0199-12.nc
0200
tmp/file.0200-01.nc  tmp/file.0200-05.nc  tmp/file.0200-09.nc
tmp/file.0200-02.nc  tmp/file.0200-06.nc  tmp/file.0200-10.nc
tmp/file.0200-03.nc  tmp/file.0200-07.nc  tmp/file.0200-11.nc
tmp/file.0200-04.nc  tmp/file.0200-08.nc  tmp/file.0200-12.nc
0201
tmp/file.0201-01.nc  tmp/file.0201-05.nc  tmp/file.0201-09.nc
tmp/file.0201-02.nc  tmp/file.0201-06.nc  tmp/file.0201-10.nc
tmp/file.0201-03.nc  tmp/file.0201-07.nc  tmp/file.0201-11.nc
tmp/file.0201-04.nc  tmp/file.0201-08.nc  tmp/file.0201-12.nc

In the command above I have used echo to print the value of the $year loop variable to highlight this is 3 successive invocations of ls.

Using this approach it might also be necessary to account for cases where there is no match for a specific year.

If you’re looping over the values of year explicitly it might just be simpler to generate the filenames directly rather than using ls.

Also, seq is a commonly installed GNU utility, not a builtin bash command, so it can’t be guaranteed to always be present.

Brace expansion

The most concise and attractive way is brace exapansion. This is what I used above to generate the test files initially.

To see how it works with a range, as above:

$ echo {0199..0201}
0199 0200 0201

so why not just use the variables defined above directly in a brace expansion?

$ echo {$begin..$end}
{0199..0201}

It doesn’t do the brace expansion, just variable substitution. So this won’t work.

To make it work you need to use eval

$ eval echo {$begin..$end}
0199 0200 0201

which evaluates the expression after the variable expansion has taken place. To use this to get a list of the matching files:

$ eval ls tmp/*{$begin..$end}-*.nc
tmp/file.0199-01.nc  tmp/file.0200-01.nc  tmp/file.0201-01.nc
tmp/file.0199-02.nc  tmp/file.0200-02.nc  tmp/file.0201-02.nc
tmp/file.0199-03.nc  tmp/file.0200-03.nc  tmp/file.0201-03.nc
tmp/file.0199-04.nc  tmp/file.0200-04.nc  tmp/file.0201-04.nc
tmp/file.0199-05.nc  tmp/file.0200-05.nc  tmp/file.0201-05.nc
tmp/file.0199-06.nc  tmp/file.0200-06.nc  tmp/file.0201-06.nc
tmp/file.0199-07.nc  tmp/file.0200-07.nc  tmp/file.0201-07.nc
tmp/file.0199-08.nc  tmp/file.0200-08.nc  tmp/file.0201-08.nc
tmp/file.0199-09.nc  tmp/file.0200-09.nc  tmp/file.0201-09.nc
tmp/file.0199-10.nc  tmp/file.0200-10.nc  tmp/file.0201-10.nc
tmp/file.0199-11.nc  tmp/file.0200-11.nc  tmp/file.0201-11.nc
tmp/file.0199-12.nc  tmp/file.0200-12.nc  tmp/file.0201-12.nc

Without eval:

$ ls tmp/*{$begin..$end}-*.nc
ls: cannot access tmp/*{0199..0201}-*.nc: No such file or directory

So you can get a list that you can iterate over using a subshell, like so:

$ for file in $(eval ls tmp/*{$begin..$end}-*.nc); do echo $file; done
tmp/file.0199-01.nc
tmp/file.0199-02.nc
tmp/file.0199-03.nc
tmp/file.0199-04.nc
tmp/file.0199-05.nc
tmp/file.0199-06.nc
tmp/file.0199-07.nc
tmp/file.0199-08.nc
tmp/file.0199-09.nc
tmp/file.0199-10.nc
tmp/file.0199-11.nc
tmp/file.0199-12.nc
tmp/file.0200-01.nc
tmp/file.0200-02.nc
tmp/file.0200-03.nc
tmp/file.0200-04.nc
tmp/file.0200-05.nc
tmp/file.0200-06.nc
tmp/file.0200-07.nc
tmp/file.0200-08.nc
tmp/file.0200-09.nc
tmp/file.0200-10.nc
tmp/file.0200-11.nc
tmp/file.0200-12.nc
tmp/file.0201-01.nc
tmp/file.0201-02.nc
tmp/file.0201-03.nc
tmp/file.0201-04.nc
tmp/file.0201-05.nc
tmp/file.0201-06.nc
tmp/file.0201-07.nc
tmp/file.0201-08.nc
tmp/file.0201-09.nc
tmp/file.0201-10.nc
tmp/file.0201-11.nc
tmp/file.0201-12.nc

Some people are wary of using eval. Anything that is in that statement will be executed, and they worry about malicious code being injected. In this case it isn’t likely to be a problem, but bear this in mind

sed

If eval worries you, or you just like the idea of doing yet another way, another option is to use the stream editor sed.

The command below is listing all the files, and then piping (|) the result into sed. The sed script matches, and starts printing the input, when it encounters the first pattern, and stops after it encounters the second

$ ls tmp/* | sed -n "/$begin-01/,/$end-12/p"
tmp/file.0199-01.nc
tmp/file.0199-02.nc
tmp/file.0199-03.nc
tmp/file.0199-04.nc
tmp/file.0199-05.nc
tmp/file.0199-06.nc
tmp/file.0199-07.nc
tmp/file.0199-08.nc
tmp/file.0199-09.nc
tmp/file.0199-10.nc
tmp/file.0199-11.nc
tmp/file.0199-12.nc
tmp/file.0200-01.nc
tmp/file.0200-02.nc
tmp/file.0200-03.nc
tmp/file.0200-04.nc
tmp/file.0200-05.nc
tmp/file.0200-06.nc
tmp/file.0200-07.nc
tmp/file.0200-08.nc
tmp/file.0200-09.nc
tmp/file.0200-10.nc
tmp/file.0200-11.nc
tmp/file.0200-12.nc
tmp/file.0201-01.nc
tmp/file.0201-02.nc
tmp/file.0201-03.nc
tmp/file.0201-04.nc
tmp/file.0201-05.nc
tmp/file.0201-06.nc
tmp/file.0201-07.nc
tmp/file.0201-08.nc
tmp/file.0201-09.nc
tmp/file.0201-10.nc
tmp/file.0201-11.nc
tmp/file.0201-12.nc

As above, this can be used in a bash loop like so:

$ for file in $(ls tmp/* | sed -n "/$begin-01/,/$end-12/p"); do echo $file; done
tmp/file.0199-01.nc
tmp/file.0199-02.nc
tmp/file.0199-03.nc
tmp/file.0199-04.nc
tmp/file.0199-05.nc
tmp/file.0199-06.nc
tmp/file.0199-07.nc
tmp/file.0199-08.nc
tmp/file.0199-09.nc
tmp/file.0199-10.nc
tmp/file.0199-11.nc
tmp/file.0199-12.nc
tmp/file.0200-01.nc
tmp/file.0200-02.nc
tmp/file.0200-03.nc
tmp/file.0200-04.nc
tmp/file.0200-05.nc
tmp/file.0200-06.nc
tmp/file.0200-07.nc
tmp/file.0200-08.nc
tmp/file.0200-09.nc
tmp/file.0200-10.nc
tmp/file.0200-11.nc
tmp/file.0200-12.nc
tmp/file.0201-01.nc
tmp/file.0201-02.nc
tmp/file.0201-03.nc
tmp/file.0201-04.nc
tmp/file.0201-05.nc
tmp/file.0201-06.nc
tmp/file.0201-07.nc
tmp/file.0201-08.nc
tmp/file.0201-09.nc
tmp/file.0201-10.nc
tmp/file.0201-11.nc
tmp/file.0201-12.nc

Conclusion

There are often many ways to accomplish even the simplest tasks, but sometimes the hardest thing to know is what not to do.

Note also that this is very specific to bash. Other shells will have their own limits and abilities.

tags: