Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is not installed on Mac OS X by default; this tutorial explains how to install a compiled version of Wget. The source of the compiled version of Wget that we will be using can be found on Status-Q. The necessary files (wget.zip) are also attached at the bottom of this page. Full documentation of wget is available (in several formats) at http://www.gnu.org/software/wget/manual/.
After downloading and unzipping the attached wget.zip, copy its contents into place from the directory where you unzipped it (the man directory may not exist yet, so create it first):

sudo mkdir -p /usr/local/man/man1
sudo cp wget /usr/local/bin
sudo cp wget.1 /usr/local/man/man1
sudo cp wgetrc /usr/local/etc
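To verify the installation, a quick sanity check (assuming /usr/local/bin is on your PATH and /usr/local/man is on your man path):

wget --version
man wget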
Below is a wget command, with an explanation of each option; a complete worked example follows the list of options. The general idea is to go to a URL like http://www.archive.org/download/{identifier} for each item to be downloaded (which redirects to the item's directory listing on a datanode) and follow all the links to individual files (and sub-directories, if any) from there.
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist -B 'http://www.archive.org/download/'
-r recursive download; required in order to move from the item identifier down into its individual files
-H enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on www.archive.org, and the individual file locations will be on a specific datanode)
-nc no clobber; if a local copy already exists of a file, don't download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)
-np no parent; ensures that the recursion doesn't climb back up the directory tree to other items (by, for instance, following the "../" link in the directory listing)
-nH no host directories; when using -r, wget will create a directory tree to stick the local copies in, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided
--cut-dirs=2 completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories
-e robots=off archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive
-i ../itemlist location of input file listing all the URLs to use; "../itemlist" means the list of items should appear one level up in the directory structure, in a file called "itemlist" (you can call the file anything you want, so long as you specify its actual name after -i)
-B 'http://www.archive.org/download/' base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)
--header "Cookie: logged-in-user={user}%40archive.org; logged-in-sig={private};" provides the login credentials for a privileged account; needed when downloading restricted content
-A -R accept-list and reject-list, either limiting the download to certain kinds of file, or excluding certain kinds of file; for instance, -R _orig_jp2.tar,_jpg.pdf would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf, and -A "*zelazny*" -R .ps would download all files containing zelazny in their names, except those ending with .ps. See http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html for a fuller explanation.
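As a worked example, here is one way to set up and run the download (a minimal sketch; the identifiers example-item-one and example-item-two are hypothetical placeholders, so substitute real archive.org identifiers, and the downloads directory name is arbitrary). The itemlist is created one level above the download directory so that -i ../itemlist resolves correctly:

echo 'example-item-one' > itemlist
echo 'example-item-two' >> itemlist
mkdir downloads
cd downloads
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist -B 'http://www.archive.org/download/'

When the run finishes, downloads/ should contain one sub-directory per identifier (example-item-one/ and example-item-two/), each holding that item's files. To limit what is fetched, add -A or -R to the same command (for instance -R _orig_jp2.tar,_jpg.pdf to skip files ending with those suffixes), and for restricted content append the --header "Cookie: ..." option described above.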
-- JakeJohnson - 12 Feb 2011
| Attachment | Action | Size | Date | Who | Comment |
|---|---|---|---|---|---|
| wget.zip | manage | 251.2 K | 2011-02-14 - 19:26 | JakeJohnson | Wget Binary |