Digitised Books. c. 1510 - c. 1900. JSON (OCR derived text) - British Library Research Repository

2014

Abstract

The dataset comprises OCR-derived text from 49,455 digitised books, equating to 65,227 volumes (more than 25 million pages), published between c. 1510 and c. 1900. The books cover a wide range of subject areas, including philosophy, history, poetry and literature. The dataset is in JavaScript Object Notation (JSON) text format. Links metadata, PDFs, Flickr images, digital versions

Files

There is 1 file associated with this work, which is available for download.

Metadata

  • Resource type

    Dataset

  • Contributors
    • Edwards, Adrian
  • Institution
    • British Library

  • Publisher
    • British Library

  • Place of publication
    • London, UK

  • DOI
    • doi.org/10.21250/db14

  • Alternate identifier
    • DAR00147 (type: Digital Asset Register ID)
  • Additional information
    • The 10.5 GB .bz2 file contains page-level, JSON-formatted, OCR-derived text. Each JSON file corresponds to one volume in the collection and is named after the identifier and volume number of the print work it is taken from. Each JSON file is simply a hash (a key-value mapping) from page number to the text found on that page.
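
The page-number-to-text structure described above could be read along the following lines. This is only a minimal sketch, not an official loader: the function names `load_pages` and `full_text` and the sample string are hypothetical, and the code assumes that page numbers can be parsed as integers. Because the record does not show the exact serialisation, the sketch accepts either a JSON object or a list of [page, text] pairs.

```python
import json


def load_pages(source):
    """Parse one volume's JSON text into a {page_number: text} dict.

    Assumption: the volume is stored either as a JSON object keyed by
    page number or as a list of [page_number, text] pairs.
    """
    data = json.loads(source)
    items = data.items() if isinstance(data, dict) else data
    # Assumption: page numbers are integers (possibly stored as strings).
    return {int(page): text for page, text in items}


def full_text(pages):
    """Join the page texts in ascending page order."""
    return "\n".join(pages[number] for number in sorted(pages))


# Hypothetical sample standing in for the contents of one volume file.
sample = '{"1": "Title page", "2": "Chapter I", "10": "The end"}'
pages = load_pages(sample)
print(len(pages), "pages")
print(full_text(pages))
```

Converting the keys to integers keeps page 10 after page 2 when sorting, which would not hold if the keys were compared as strings.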