Google Books Dataset

Data Access

On-campus users can download the dataset in full or in part. Authorized MSU faculty and staff can also access it from off campus by connecting to the campus VPN. For more information on how best to access the collection, visit the help page.

Description

The Google Dataset (GDS) is a collection of scanned books totaling approximately 3 million volumes of text, or 2.9 terabytes (2,970 gigabytes) of data. The books included in the dataset are public domain works digitized by Google and made available by the HathiTrust Digital Library. All volumes are stored as plain text files, not as scanned page images.

The dataset is not meant to be used as a source of reading material, but rather as a linguistic corpus for text mining or other "non-consumptive" research: research conducted by computational methods that does not reproduce significant portions of text for personal or public display. The terms of the contract with Google that make this corpus available strictly prohibit publishing the texts that make up the dataset.

Additionally, if you plan to publicly present work that uses data gathered through MSU's Google Dataset, please contact Devin Higgins beforehand for instructions on completing the required paperwork with HathiTrust and for information on how to cite the dataset.

Data Summary

Format

For each volume in the Google Books dataset, there is a zipped archive containing one text file for each page in the volume, along with an XML file containing technical and preservation metadata. Descriptive metadata for all items in the collection is located in a single compressed file named meta.tar.gz in the root directory. The subsetting tool, however, provides additional and more convenient options for downloading files in zipped or unzipped format and for accessing text, descriptive metadata, and technical information in user-created bundles.
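
As a rough illustration of working with the per-volume archives, the following Python sketch reads every page file from one zipped volume and returns the pages in order. It rests on assumptions not confirmed above: that page files end in ".txt", that sorting file names alphabetically gives page order, and that the text is UTF-8 encoded. The function names are hypothetical.

```python
import zipfile


def read_volume_pages(archive) -> list:
    """Return the text of each page in a volume archive, ordered by file name.

    `archive` may be a path or a binary file-like object.
    Assumption: page files end in ".txt" and sort into page order by name.
    """
    with zipfile.ZipFile(archive) as zf:
        page_names = sorted(
            name for name in zf.namelist() if name.lower().endswith(".txt")
        )
        # Assumption: UTF-8 text; undecodable bytes are replaced, not fatal.
        return [zf.read(name).decode("utf-8", errors="replace") for name in page_names]


def read_volume_text(archive) -> str:
    """Join all pages of a volume into one string, one page per block."""
    return "\n".join(read_volume_pages(archive))
```

The XML metadata file in the archive is skipped here because only ".txt" members are selected; it can be read separately with `zf.read()` if needed.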

File Naming Conventions

Volumes downloaded via the subsetting tool will be stored in text files named according to a name-title-identifier convention.

Files accessed directly via the directory structure will be stored in a folder named according to the identifier of the object, with a separate text file for each page in the volume. Additionally, the path in the directory structure leading to individual volumes is generated according to the pairtree system, where the path is derived in a specific, systematic way from the item's unique identifier.
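
To make the pairtree idea concrete, here is a minimal Python sketch of the identifier-to-path mapping described in the general Pairtree specification: the identifier is first "cleaned" so that it is safe to use in file names, then split into two-character directory names. Whether this collection adds a prefix directory, or how the final volume folder is named beyond what is stated above, is not covered; the function names are hypothetical, and the example identifier is the one used in the Pairtree specification, not one from this dataset.

```python
# Characters the Pairtree spec hex-encodes as "^hh" (plus anything outside
# the visible ASCII range 0x21-0x7e).
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
_SINGLE_CHAR = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Apply Pairtree identifier cleaning to make a file-system-safe string."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SINGLE_CHAR)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character path segments."""
    cleaned = clean_identifier(identifier)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(segments)


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3
```

Because the mapping is deterministic, the location of a volume can be computed from its identifier alone, without searching the directory tree.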

Size

metadata - 500 MB compressed, approximately 11 GB uncompressed
plain-text - 4.6 TB compressed

Data Quality

The quality of the scanned text varies widely across the collection; in general, more recently scanned works tend to be of higher quality.

Full bibliographic metadata for all works in the collection is available in MARCXML format. Technical and preservation metadata describing the provenance for all digital files is also available for download in a METS XML wrapper.
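
As one possible way to use the MARCXML metadata, the sketch below pulls the control number (field 001) and title (field 245, subfield a) from each record using Python's standard library. The field tags and namespace come from the MARC 21 and MARCXML standards; how records are arranged inside the metadata download is not described above, so the input here is simply assumed to be a single MARCXML file or file-like object, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Library of Congress MARCXML schema.
MARC_URI = "http://www.loc.gov/MARC21/slim"
MARC_NS = {"marc": MARC_URI}


def iter_titles(marcxml_source):
    """Yield (control_number, title) for each MARC record in the source.

    `marcxml_source` is a file name or file object (assumed to hold MARCXML).
    Missing fields are reported as None rather than raising an error.
    """
    tree = ET.parse(marcxml_source)
    for record in tree.getroot().iter(f"{{{MARC_URI}}}record"):
        control = record.find("marc:controlfield[@tag='001']", MARC_NS)
        title = record.find(
            "marc:datafield[@tag='245']/marc:subfield[@code='a']", MARC_NS
        )
        yield (
            control.text if control is not None else None,
            title.text if title is not None else None,
        )
```

This version loads the whole document into memory, which may be impractical for very large metadata files; `xml.etree.ElementTree.iterparse` is a streaming alternative in that case.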