Index Catalog // British Library

2020

Conference paper (unpublished)

Assessing the Impact of OCR Quality on Downstream NLP Tasks

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of...

van Strien, Daniel ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Hosseini, Kasra ; McGillivray, Barbara …

Natural Language Processing, OCR, Optical Character Recognition, information retrieval, NLP, and digital humanities

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines

2020

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, Indian, and transcription

2021

Dataset

Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)

The dataset comprises metadata and OCR generated text from 49,455 digitised books published between c. 1510 - c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature. The dataset is in JSON Lines (JSONL) text format.

British Library Labs ; British Library

OCR and monographs

2020

Dataset

al-Durr al-naqī fī fann al-mūsīqī (Add MS 23494)

This dataset is a PDF file containing the images and transcription of the manuscript titled al-Durr al-naqī fī fann al-mūsīqī الدرّ النقيّ في فنّ الموسيقي by Aḥmad ibn 'Abd al-Raḥmān al-Mawṣilī أحمد بن عبد الرحمن الموصلي. The manuscript was digitised through the British Library Qatar Foundation Partnership, and made available through...

British Library ; Keinan-Schoonbaert, Adi

transcription, Arabic, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Arabic handwritten texts

This dataset comprises 120 digitised images (TIFF files) drawn from a selection of historical Arabic scientific manuscripts (10th-19th century) digitised through the British Library Qatar Foundation Partnership. Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition (OCR) or handwritten text...

British Library ; Keinan-Schoonbaert, Adi

Arabic, transcription, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Transkribus

This dataset comprises 74 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, transcription, and Indian

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Recognition of Early Indian Printed Documents competition

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

Indian, transcription, and OCR

2015

Dataset

Volumes of Lysons Collectanea (Amusements), comprising broadsides, cuttings, advertisements on amusements 1660-1840

The dataset comprises nine digitised volumes of a collection of broadsides, cuttings and advertisements relating to public exhibitions and places of amusement from 1660 - 1840 (with OCR-derived text). Part of the Lysons Collectanea collection.

British Library

amusements, text, newspapers, broadsides, OCR, and adverts

2015

Dataset

Volumes of Lysons Collectanea (Trades), comprising advertisements, cuttings, and illustrations relating to trades, professions, medical cures. 1660-1825.

The dataset comprises the OCR text derived from four digitised volumes of a collection of advertisements, cuttings and illustrations relating to trades, professions and medical cures from 1660 - 1825.

British Library

text, newspapers, OCR, trades, and adverts
