How to get a list of available files using wget or curl?


You can't do the equivalent of an ls unless the server provides such listings itself. You could, however, retrieve index.html and check which files it includes, for example:

wget -O - http://www.example.com | grep "type=.\?text/javascript.\?"

Note that this relies on the HTML being formatted in a certain way: in this case, with each include on its own line. If you want to do this properly, I'd recommend parsing the HTML and extracting the JavaScript includes that way.
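As a middle ground, here is a rough sketch that at least does not depend on line breaks. It is still pattern matching, not a real HTML parser. The function name list_script_srcs is made up for illustration, and it assumes lowercase script tags with a quoted src attribute.

```shell
# list_script_srcs (hypothetical helper): read HTML on stdin and print the
# value of each src attribute found in a <script ...> tag.
# Assumptions: lowercase tag and attribute names, quoted attribute values.
list_script_srcs() {
    # Join all lines so a tag split across lines is still matched,
    # print each opening <script ...> tag on its own line,
    # then keep only the quoted value that follows src=.
    tr '\n' ' ' \
        | grep -o '<script[^>]*>' \
        | sed -n 's/.*src=["'\'']\([^"'\'']*\)["'\''].*/\1/p'
}
```

It could be used as: wget -q -O - http://www.example.com | list_script_srcs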


Let's use this open directory (http://tug.ctan.org/macros/latex2e/required/amscls/) for our experiment. It belongs to the Comprehensive TeX Archive Network, so you don't need to worry much about malicious files.

Now, let's suppose that we want to list all files whose extension is pdf. We can do so by executing the following command.

The command shown below saves the output of wget in the file main.log. Because wget sends a request for each file and prints some information about each request, we can then grep the output to get a list of the files that belong to the specified directory.

wget \
  --accept '*.pdf' \
  --reject-regex '/\?C=[A-Z];O=[A-Z]$' \
  --execute robots=off \
  --recursive \
  --level=0 \
  --no-parent \
  --spider \
  'http://tug.ctan.org/macros/latex2e/required/amscls/doc/' 2>&1 | tee main.log

Now, we can list the files whose extension is pdf by using grep.

grep '^--' main.log
--2020-11-23 10:39:46--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsbooka.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsclass.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsdtx.pdf
--2020-11-23 10:39:47--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsmidx.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/amsthdoc.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/thmtest.pdf
--2020-11-23 10:39:48--  http://tug.ctan.org/macros/latex2e/required/amscls/doc/upref.pdf
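The grep output still contains the timestamps and the directory URL itself. If you only want the PDF URLs, something like the following sketch might help. The function name list_pdf_urls is my own invention, and it assumes the request lines have the "--DATE TIME--  URL" shape seen in the log.

```shell
# list_pdf_urls (hypothetical helper): read a wget log on stdin and print
# the unique URLs that end in .pdf.
# Assumption: each request line starts with "--" and ends with the URL.
list_pdf_urls() {
    # $NF is the last whitespace-separated field, i.e. the URL.
    awk '/^--/ && $NF ~ /\.pdf$/ { print $NF }' | sort -u
}
```

It could be used as: list_pdf_urls < main.log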

Note that we could also get the list of all files in the directory and then run grep on that output. However, this would take more time, since apparently a request is sent for each file. By using --accept, we make wget send requests only for the files we are interested in.

Last but not least, wget also logs the size of each file, so you can look that information up in main.log.
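To pull the sizes out of the log, a sketch along these lines may work. The function name list_pdf_sizes is hypothetical, and it assumes wget's default English messages, where each response is followed by a line such as "Length: 186523 (182K) [application/pdf]". If the server does not report a size, the second column will not be a number.

```shell
# list_pdf_sizes (hypothetical helper): read a wget log on stdin and print
# "SIZE<TAB>URL" for every PDF request that has a Length line.
# Assumption: English wget messages ("Length: BYTES (HUMAN) [TYPE]").
list_pdf_sizes() {
    awk '
        # Remember the URL of the most recent request line.
        /^--/ { url = $NF }
        # On the matching Length line, print the byte count and the URL,
        # then clear the URL so a repeated Length line is not printed twice.
        /^Length:/ && url ~ /\.pdf$/ { print $2 "\t" url; url = "" }
    '
}
```

It could be used as: list_pdf_sizes < main.log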