WGET: download an HTTP or FTP directory recursively

Many science data sites run an Apache server with an HTTP index listing, perhaps recursively, a lot of files and directories. You can download them recursively to your PC as follows (use this responsibly; you can consume gigabytes of data at cost to the science host or to yourself!)


wget -r -np -nc -nH --cut-dirs=4 --random-wait --wait 1 -e robots=off http://mysite.com/aaa/bbb/ccc/ddd/

This plops the files into whatever directory you ran the command in.
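
If you would rather collect everything somewhere else, wget's -P (--directory-prefix) option names the destination directory; the path below is just a placeholder:

wget -r -np -nc -nH --cut-dirs=4 --random-wait --wait 1 -e robots=off -P ~/data/ddd http://mysite.com/aaa/bbb/ccc/ddd/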

To use this on an FTP site, just change the http:// to ftp:// and give the proper FTP site address.
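
For example, with a placeholder FTP address (the -e robots=off option can be dropped here, since robots.txt is an HTTP convention):

wget -r -np -nc -nH --cut-dirs=4 --random-wait --wait 1 ftp://ftp.example.com/aaa/bbb/ccc/ddd/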

Option explanation:
-r: download recursively (recreating the server's directory structure on your PC)
-np: Never get parent directories (sometimes a site will link back up and you don’t want that)
-nc: no clobber — don’t re-download files you already have
-nH: don’t create an obnoxious top-level directory named after the site (here mysite.com) on your PC
--cut-dirs=4: don’t recreate the obnoxious hierarchy of directories above the desired directory on your PC. Note you must set the number equal to the number of directories on the server (here aaa/bbb/ccc/ddd is four); see the example after this list.
-e robots=off: Many sites use a robots.txt file to block robots from mindlessly consuming huge amounts of data. Here we tell wget to ignore robots.txt and fetch the files anyway, since we’re (somewhat) human.
--random-wait: To avoid excessive download requests (which can get you auto-banned from downloading) we politely wait a random interval between files; better than trying to get yourself un-banned!
--wait 1: sets the base wait time, so with --random-wait the pause before each new file averages about 1 second.
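
As a rough sketch of what -nH and --cut-dirs=4 do, assuming the server holds a file at http://mysite.com/aaa/bbb/ccc/ddd/data.txt, the local path you end up with would be roughly:

neither option:         ./mysite.com/aaa/bbb/ccc/ddd/data.txt
-nH only:               ./aaa/bbb/ccc/ddd/data.txt
-nH and --cut-dirs=4:   ./data.txt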