Bulk Downloading Items from the Internet Archive Using Wget

Installing Wget on Mac OS X

Wget is a free software package for retrieving files using HTTP, HTTPS and FTP. It is not installed on Mac OS X by default; this tutorial explains how to install a pre-compiled version of Wget. The compiled version of Wget that we will be using comes from Status-Q, and the necessary files (wget.zip) are also attached at the bottom of this page. Full documentation of wget is available (in several formats) at http://www.gnu.org/software/wget/manual/.

  • Download wget.zip, located at the bottom of this page.
  • Unzip the file by double clicking it.
  • Open Terminal (located in Applications/Utilities).
  • Move to the "wget" directory you just unzipped (e.g. if you unzipped the file in your Downloads folder, enter "cd ~/Downloads/wget" into the terminal and hit return).
  • Copy wget into /usr/local/bin using the following command:

cp wget /usr/local/bin

  • Copy wget.1 into /usr/local/man/man1 using the following command:

cp wget.1 /usr/local/man/man1

  • Copy wgetrc into /usr/local/etc using the following command:

cp wgetrc /usr/local/etc

  • Some of these directories may not exist (most likely "/usr/local/man/man1"). If they do not, simply create the missing directory with the following command before you copy the files:

sudo mkdir /usr/local/man

  • You will need to enter your password and hit return. Repeat for any directories that need to be created.
  • In Terminal, enter the command "which wget". Terminal should print something like "/usr/local/bin/wget", which means you have correctly installed wget.
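As a further sanity check (this only exercises the binary you just installed; the manual URL is the one cited above, used here simply as a small test download), you can ask wget for its version and fetch a single page:

wget --version
wget 'http://www.gnu.org/software/wget/manual/'

If both commands run without a "command not found" error, wget is ready to use.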

Bulk Downloading Items from the Internet Archive

Below is a wget command, with an explanation of each of its options. The general idea is to go to a URL like http://www.archive.org/download/{identifier} for each item to be downloaded (which redirects to the item's directory listing on a datanode) and follow all the links to individual files (and sub-directories, if any) from there.

  • Create a directory on your local computer for the files to be downloaded into.
  • Make an "itemlist" file in the target directory's parent directory, listing the Archive items to be fetched, one per line, just like an auto_submit itemlist.
  • cd to the target directory, and issue the wget command:

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist -B 'http://www.archive.org/download/'

  • Explanation of each option in the wget command:

-r recursive download; required in order to move from the item identifier down into its individual files

-H enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on www.archive.org, and the individual file locations will be on a specific datanode)

-nc no clobber; if a local copy already exists of a file, don't download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)

-np no parent; ensures that the recursion doesn't climb back up the directory tree to other items (by, for instance, following the "../" link in the directory listing)

-nH no host directories; when using -r, wget will create a directory tree to hold the local copies, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided

--cut-dirs=2 completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories

-e robots=off archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive

-i ../itemlist location of input file listing all the URLs to use; "../itemlist" means the list of items should appear one level up in the directory structure, in a file called "itemlist" (you can call the file anything you want, so long as you specify its actual name after -i)

-B 'http://www.archive.org/download/' base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)
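To make the whole workflow concrete, here is a minimal sketch. The identifiers item1 and item2 and the "downloads" directory are placeholders for illustration only; substitute your own item identifiers and directory name:

mkdir downloads
printf 'item1\nitem2\n' > itemlist
cd downloads
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../itemlist -B 'http://www.archive.org/download/'

When the command finishes, the target directory should contain one sub-directory per identifier (./item1/ and ./item2/ in this sketch), each holding that item's files.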

  • Additional options that may be needed sometimes:

--header "Cookie: logged-in-user={user}%40archive.org; logged-in-sig={private};" provides the login credentials for a privileged account; needed when downloading restricted content

-A -R accept-list and reject-list, either limiting the download to certain kinds of file, or excluding certain kinds of file; for instance, -R _orig_jp2.tar,_jpg.pdf would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf, and -A "*zelazny*" -R .ps would download all files containing zelazny in their names, except those ending with .ps. See http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html for a fuller explanation.
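For example, here is a sketch of the full command with the cookie header added. The {user} and {private} placeholders stand for your own account name and login signature and are left unfilled here:

wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off --header "Cookie: logged-in-user={user}%40archive.org; logged-in-sig={private};" -i ../itemlist -B 'http://www.archive.org/download/'

The -A and -R options are added the same way, e.g. appending -R _orig_jp2.tar,_jpg.pdf to the end of the command.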

  • To stop the wget task, press "control" and "c" on your keyboard at the same time, while in the terminal window. To restart, simply rerun the same command; the -nc option keeps wget from re-downloading files that already exist locally.

  • Dealing with Google books!
    • If you use the command above on a Google book (identifiers end in *goog), you will get a bunch of junk files linked from the item's _desc.html file.
    • USING -R _desc.html DOES NOT WORK: wget always downloads HTML files so it can extract their links (deleting them afterward if they match the reject list), so the junk links still get followed.
    • To get around this, insert -l 1 into the command to limit the recursion depth to 1 (that's a lowercase L in -l).
    • EXAMPLE: wget -r -H -nc -np -nH --cut-dirs=2 -l 1 -e robots=off -i ../itemlist -B 'http://www.archive.org/download/'
      • NOTE: this option will make it impossible for you to grab subdirectories, so make sure your items don't have them.

To be sorted...

  • Make sure the computer never sleeps
  • Turn off automatic updates
  • Use a gigabit connection
  • Figure out a way to "audit" what we downloaded against the itemlist (see the sketch after this list)
  • Use a PC for downloads
  • Use Cygwin for wget/"terminal"
  • If downloading full book items with all their files, expect to get about 50 items per hour
  • You can't copy/paste into the Cygwin window; use mintty instead (it can be installed with Cygwin)
  • NOTE: this isn't a total solution to the "how do we QA disks?" question, but you can simply rerun the wget command to catch anything that was missed. Rerun it until nothing else is downloaded.
  • For China books, include an Excel file with metadata for the books on the disk. It might be nice to do this regardless.
  • Documentation should include a link to the DisksShipped page for tracking bulk download tasks and deliveries
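A rough sketch of the "audit" idea above, run from inside the target directory. It assumes the layout produced by the command on this page (one sub-directory per identifier, with the itemlist one level up) and only checks that each item's directory exists, not that every file inside it is complete:

# list identifiers from ../itemlist that have no matching local directory
ls -1 | sort > ../downloaded.txt
sort ../itemlist > ../itemlist.sorted
comm -23 ../itemlist.sorted ../downloaded.txt

Anything this prints is an item that was listed but never created locally; rerunning the wget command, as noted above, is the simplest way to fill in the gaps.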

-- JakeJohnson - 12 Feb 2011

Topic attachments
  • wget.zip - Wget binary (compressed zip archive, 251.2 K, uploaded 2011-02-14 19:26 by JakeJohnson)