Digitised Books. c. 1510 - c. 1900. JSON (OCR derived text) - British Library Research Repository


The dataset comprises OCR-derived text from 49,455 digitised books, equating to 65,227 volumes (more than 25 million pages), published between c. 1510 and c. 1900. The books cover a wide range of subject areas, including philosophy, history, poetry and literature. The dataset is in JavaScript Object Notation (JSON) text format. Links: metadata, PDFs, Flickr images, digital versions.


There is 1 file associated with this work, which is available for download.


  • Resource type


  • Collections
  • Contributors
    • Edwards, Adrian
  • Institution
    • British Library

  • Publisher
    • British Library

  • Place of publication
    • London, UK

  • Official URL
  • Licence
  • DOI
    • doi.org/10.21250/db14

  • Alternate identifier
    • DAR00147 (type: Digital Asset Register ID)
  • Keywords
  • Additional information
    • The 10.5 GB .bz2 file contains page-level, JSON-formatted, OCR-derived text. Each JSON file corresponds to one volume in the collection and is named to reflect the identifier and volume of the print work it is taken from. Each JSON file is simply a hash (a key-value mapping) from page number to the text found on that page.
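
The page-number-to-text layout described above could be read along the following lines once the archive has been decompressed. This is a minimal sketch, not the repository's own tooling: the function names `load_pages` and `full_text` and the sample content are hypothetical, and it assumes page numbers are integers (or numeric strings). It also accepts a list of `[page, text]` pairs in case the files are laid out that way rather than as a JSON object.

```python
import json
from pathlib import Path


def load_pages(source):
    """Return {page_number: text} for one volume, from a Path or a JSON string."""
    raw = source.read_text(encoding="utf-8") if isinstance(source, Path) else source
    data = json.loads(raw)
    # Assumption: the file is a JSON object keyed by page number. A list of
    # [page, text] pairs is tolerated too, in case the real layout differs.
    items = data.items() if isinstance(data, dict) else data
    # JSON object keys are always strings, so convert them to integers.
    return {int(page): text for page, text in items}


def full_text(pages):
    """Join the page texts in ascending numeric page order."""
    return "\n".join(pages[n] for n in sorted(pages))


# Hypothetical sample standing in for the content of one volume file.
sample = '{"1": "", "2": "THE HISTORY OF", "10": "CHAPTER I."}'
pages = load_pages(sample)
non_empty = sum(1 for text in pages.values() if text.strip())
print(len(pages), "pages,", non_empty, "with text")
```

Converting the keys to integers matters when rebuilding a volume's text: sorted as strings, page "10" would come before page "2". To read a real file, pass a `Path` to `load_pages` instead of a string.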