Download lots of small files


It is the latency that will do you in. In a normal, sequential process, if each file carries a latency of 1-3 seconds, you pay those latencies one after the other and spend 1-3 million seconds downloading a million files.

The trick is to pay the latencies in parallel: put out, say, 64 parallel requests and wait 1-3 seconds for them all to return, instead of the roughly 64-192 seconds the same batch would take sequentially.
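
To make that arithmetic concrete, using the 3-second worst case from above for a million files:

sequential:       1,000,000 files x 3 s          = 3,000,000 s  (~35 days)
64-way parallel:  (1,000,000 / 64) batches x 3 s =   ~47,000 s  (~13 hours)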

I would commend GNU Parallel to you, which, although of Unix origin, runs under Cygwin. Please look up some tutorials.

The command to run 64 curls at a time will be something like this:

parallel -j 64 -a filelist.txt curl {}
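
If each file should keep its remote name, a minimal variant (assuming filelist.txt holds one URL per line) is to add curl's -O switch, which saves under the remote file name, and -s to silence the progress output:

parallel -j 64 -a filelist.txt curl -sO {}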


You can use the aria2 download utility with:

  • the -j [NUMBER] option for concurrent downloads
  • the -i [FILENAME] option to provide the URLs and output file names in a text file

For example, assume files.txt contains:

http://rakudo.org/downloads/star/rakudo-star-2017.01.tar.gz
    out=test1.file
http://rakudo.org/downloads/star/rakudo-star-2017.01.dmg
    out=test2.file
http://rakudo.org/downloads/star/rakudo-star-2017.01-x86_64%20(JIT).msi
    out=test3.file
http://rakudo.org/downloads/star/rakudo-star-2016.11.tar.gz
    out=test4.file

Then you would just run e.g. aria2c -j4 -i files.txt to download all those files in parallel. Not sure how this performs with millions of small files though - but I guess it's worth a shot.
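
Since -j is aria2's cap on the number of parallel downloads, raising it may help with a long queue of small files; a sketch to experiment with (the value 64 is just a guess, not a recommendation):

aria2c -j64 -i files.txt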


With curl, you only need a file with the format:

output = filename1.jpg
url = http://....
output = filename2.jpg
url = http://....

and then use the -K file switch to process it, or generate it dynamically and read the list from standard input with -K -.

So, from a URL list, you can try this code:

@echo off
setlocal enableextensions disabledelayedexpansion

rem Number the output files and pipe a generated curl config to curl's stdin
set "count=0"
(for /f "usebackq delims=" %%a in ("urlList.txt") do @(
    >nul set /a "count+=1"
    call echo(output = file%%^^count%%.jpg
    echo(url = %%a
)) | curl -K -

Or, for really big URL lists (for /f needs to load the full file into memory), you can use:

@echo off
setlocal enableextensions disabledelayedexpansion

rem Read the list with set /p so the full file is never loaded into memory
< urlList.txt (
    cmd /e /v /q /c"for /l %%a in (1 1 2147483647) do set /p.=&&(echo(output = file%%a.jpg&echo(url = !.!)||exit"
) | curl -K -

Notes:

  1. As arithmetic operations in batch files are limited to values below 2^31, these samples will fail if your list contains more than 2147483647 URLs.

  2. The first sample will fail with URLs longer than approximately 8180 characters.

  3. The second sample will fail with URLs longer than 1021 characters and will terminate on empty lines in the source file.
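
If those limits are a concern, and given that Cygwin was mentioned above, a minimal alternative sketch is to skip batch arithmetic entirely and generate the same curl config with awk (assuming urlList.txt holds one URL per line):

awk '{ printf "output = file%d.jpg\nurl = %s\n", NR, $0 }' urlList.txt | curl -K -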