Index Catalog // British Library

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines
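As a rough illustration of the character confusions described in this record (for example "h" read as "b"), the sketch below scores how similar a correct spelling is to an OCR variant using Python's standard-library `difflib`. This is not DeezyMatch and not the dataset authors' method. The `similarity` helper and the sample words are assumptions made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings.

    Uses difflib's ratio: 2 * matched characters / total characters.
    """
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical word pair: "h" misrecognised as "b".
clean = "house"
ocr_variant = "bouse"
score = similarity(clean, ocr_variant)
```

A high score for a pair like this suggests the two strings may be variants of the same word. A trained matcher would normally replace this simple character-overlap measure.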

2021

Dataset

Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)

The dataset comprises metadata and OCR-generated text from 49,455 digitised books published between c. 1510 and c. 1900. The books cover a wide range of subject areas, including philosophy, history, poetry and literature. The dataset is in JSON Lines (JSONL) text format.

British Library Labs ; British Library

OCR and monographs
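Since this record is distributed as JSON Lines, where each line is one standalone JSON object, a minimal reading sketch may help. The field names `title` and `text` and the two sample records are hypothetical. The actual schema of the dataset is not described in this listing.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample; the real field names may differ.
sample = io.StringIO(
    '{"title": "Example A", "text": "first page"}\n'
    '{"title": "Example B", "text": "second page"}\n'
)
records = list(read_jsonl(sample))
```

With a real file, the same generator could be fed an open file handle instead of the in-memory sample, so that records are processed one at a time without loading everything into memory.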

2020

Dataset

al-Durr al-naqī fī fann al-mūsīqī (Add MS 23494)

This dataset is a PDF file containing the images and transcription of the manuscript titled al-Durr al-naqī fī fann al-mūsīqī الدرّ النقيّ في فنّ الموسيقي by Aḥmad ibn 'Abd al-Raḥmān al-Mawṣilī أحمد بن عبد الرحمن الموصلي. The manuscript was digitised through the British Library Qatar Foundation Partnership, and made available through...

British Library ; Keinan-Schoonbaert, Adi

transcription, Arabic, and OCR

2020

Conference paper (unpublished)

Assessing the Impact of OCR Quality on Downstream NLP Tasks

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of...

van Strien, Daniel ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Hosseini, Kasra ; McGillivray, Barbara …

Natural Language Processing, OCR, Optical Character Recognition, information retrieval, NLP, and digital humanities

2020

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, Indian, and transcription

2019

Conference paper (published)

Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector

The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich and unique historical content on an ever increasing scale. However, particularly for historical material...

Derrick, Tom ; McGregor, Nora

HTR, page analysis, layout analysis, recognition, Bangla script, Arabic script, OCR, and datasets

2019

Dataset

Ground Truth transcriptions for training OCR of historical Arabic handwritten texts

This dataset comprises 120 digitised images (TIFF files) drawn from a selection of historical Arabic scientific manuscripts (10th-19th century) digitised through the British Library Qatar Foundation Partnership. Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition (OCR) or handwritten text...

British Library ; Keinan-Schoonbaert, Adi

Arabic, transcription, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Transkribus

This dataset comprises 74 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, transcription, and Indian

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Recognition of Early Indian Printed Documents competition

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

Indian, transcription, and OCR

Ground Truth transcriptions for training OCR of historical Bengali printed texts

User Collection

transcription, OCR, and text recognition

Research Repository

