Index Catalog // British Library

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines
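The character confusions named in the description above ("h" read as "b", "c" read as "o") can be illustrated with a small sketch. It uses Python's standard-library `difflib`, not DeezyMatch itself. The `CONFUSIONS` table and both helper functions are hypothetical names introduced here for illustration, and the every-character substitution is a deliberately worst-case assumption.

```python
from difflib import SequenceMatcher

# The two confusions mentioned in the dataset description.
# This table is an illustrative assumption, not part of the dataset.
CONFUSIONS = {"h": "b", "c": "o"}


def simulate_ocr_errors(word, confusions=CONFUSIONS):
    """Replace every character that has a listed confusion (worst case)."""
    return "".join(confusions.get(ch, ch) for ch in word)


def similarity(a, b):
    """Return difflib's similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()


clean = "church"
noisy = simulate_ocr_errors(clean)  # every "c" and "h" is swapped
score = similarity(clean, noisy)
print(clean, noisy, round(score, 3))
```

A fuzzy matcher is useful precisely because a score like this stays above zero for the corrupted form, so the noisy string can still be linked back to its intended word.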

2021

Dataset

Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)

The dataset comprises metadata and OCR generated text from 49,455 digitised books published between c. 1510 - c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature. The dataset is in JSON Lines (JSONL) text format.

British Library Labs ; British Library

OCR and monographs
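Since the record above is distributed as JSON Lines, a minimal reading sketch may help. JSONL stores one JSON object per line, so a file can be streamed record by record without loading it whole. The field names in the sample (`title`, `date`, `text`) are assumptions made for illustration and are not taken from the dataset's actual schema.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


# Hypothetical sample standing in for a real file; field names are assumed.
sample = io.StringIO(
    '{"title": "Example book", "date": "1850", "text": "Some OCR text"}\n'
    "\n"
    '{"title": "Another book", "date": "1799", "text": "More OCR text"}\n'
)

records = list(read_jsonl(sample))
for record in records:
    print(record["title"], record["date"])
```

For a file on disk, the same function accepts any open text handle, for example `open(path, encoding="utf-8")` with a path of your choosing.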

2020

Dataset

al-Durr al-naqī fī fann al-mūsīqī (Add MS 23494)

This dataset is a PDF file containing the images and transcription of the manuscript titled al-Durr al-naqī fī fann al-mūsīqī الدرّ النقيّ في فنّ الموسيقي by Aḥmad ibn 'Abd al-Raḥmān al-Mawṣilī أحمد بن عبد الرحمن الموصلي. The manuscript was digitised through the British Library Qatar Foundation Partnership, and made available through...

British Library ; Keinan-Schoonbaert, Adi

transcription, Arabic, and OCR

2020

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, Indian, and transcription

2019

Dataset

Ground Truth transcriptions for training OCR of historical Arabic handwritten texts

This dataset comprises 120 digitised images (TIFF files) drawn from a selection of historical Arabic scientific manuscripts (10th-19th century) digitised through the British Library Qatar Foundation Partnership. Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition (OCR) or handwritten text...

British Library ; Keinan-Schoonbaert, Adi

Arabic, transcription, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Recognition of Early Indian Printed Documents competition

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

Indian, transcription, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Transkribus

This dataset comprises 74 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, transcription, and Indian

2015

Dataset

Volumes of performances connecting Sir Henry Irving. 1879 - 1905.

Sir Henry Irving's American and Provincial Tours 1883 - 1905; miscellaneous performances, including some given by Royal Command, 1883 - 1903; Lyceum Theatre 1879 – 1902; and Drury Lane Theatre, 1903 and 1905. The collection was formed by Bram Stoker.

British Library

Henry, text, Irving, OCR, and plays

2015

Dataset

Theatrical playbills from Britain and Ireland (OCR text only)

The dataset comprises 264 volumes of digitised theatrical playbills published between 1660 and 1902 (mostly 19th century) from England, Scotland, Wales and Ireland, digitised from the British Library's physical collection of over 500 volumes of playbills. The dataset contains plain-text files (.TXT) produced by optical character recognition (OCR). The playbills...

British Library Labs

singlesheet, text, playbill, OCR, and playbills

2015

Dataset

Portraits of actors, views of theatres and playbills (covering 1750 - 1821 in a single volume)

166-page PDF of collated portraits and views (with OCR-derived text). The dataset comprises one digitised volume (166 pages) of a collection of portraits of celebrated actors and actresses, views of theatres and playbills, dating 1750 - 1821. The dataset is in Portable Document Format (PDF).

British Library

text, theatres, views, portraits, actors, OCR, and playbills

