Index Catalog // British Library

Ground Truth transcriptions for training OCR of historical Bengali printed texts

User Collection

transcription, OCR, and text recognition

2019

Conference paper (published)

Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector

The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich and unique historical content on an ever increasing scale. However, particularly for historical material...

Derrick, Tom ; McGregor, Nora

HTR, page analysis, layout analysis, recognition, Bangla script, Arabic script, OCR, and datasets

2015

Dataset

Theatrical playbills from Britain and Ireland (OCR text only)

The dataset comprises 264 volumes of digitised theatrical playbills published between 1660 – 1902 (mostly 19th century) from England, Scotland, Wales and Ireland, digitised from the British Library's physical collection of over 500 volumes of playbills. The dataset contains text files (.TXT) of text derived by Optical Character Recognition (OCR). The playbills...

British Library Labs

singlesheet, text, playbill, OCR, and playbills

2015

Dataset

Portraits of actors, views of theatres and playbills (covering 1750 - 1821 in a single volume)

166-page PDF of collated portraits and views (with OCR-derived text). The dataset comprises one digitised volume (166 pages) of a collection of portraits of celebrated actors and actresses, views of theatres and playbills, dating 1750 - 1821. The dataset is in Portable Document Format (PDF).

British Library

text, theatres, views, portraits, actors, OCR, and playbills

2015

Dataset

Volumes of signs of taverns in England and Wales. 1628 - 1858

The dataset comprises 14 digitised volumes (as PDFs) of a collection of tavern signs in England and Wales dating 1628 – 1858 (with OCR-derived text).

British Library

text, Wales, taverns, pubs, signs, England, and OCR

2015

Dataset

Volumes of portraits and biographies of officers in the South African wars collected by John Malcolm Bulloch. 1900 - 1902.

The dataset comprises six digitised volumes (in PDF) of a collection of portraits and biographical details of some officers distinguished in the South African War (1900 - 1902) (with OCR-derived text). The collection was formed by John Malcolm Bulloch.

British Library

South Africa, text, portraits, war, army, OCR, biographies, and biography

2015

Dataset

Volumes of performances connecting Sir Henry Irving. 1879 - 1905.

Sir Henry Irving's American and Provincial Tours 1883 - 1905; miscellaneous performances, including some given by Royal Command, 1883 - 1903; Lyceum Theatre 1879 – 1902; and Drury Lane Theatre, 1903 and 1905. The collection was formed by Bram Stoker.

British Library

Henry, text, Irving, OCR, and plays

2015

Dataset

Volume of Christmas ballads and broadsides. 1750 - 1840

110-page PDF of miscellaneous Christmas ballads and prose broadsides (with OCR-derived text). The dataset comprises one digitised volume (110 pages) of a collection of Christmas ballads and prose broadsides chiefly printed in London by J. Pitts between 1750 - 1840. The dataset is in Portable Document Format (PDF).

British Library

text, ballads, broadsides, OCR, prose, and Christmas

2015

Dataset

Volumes of Madden's cuttings, views, and pamphlets about the British Museum. 1755-1870.

The dataset comprises four digitised volumes of a collection of cuttings, views and pamphlets made by Sir Frederic Madden about the British Museum, dating 1755 - 1870 (with OCR-derived text).

British Library

British Museum, text, and OCR

2015

Dataset

Volumes of Lysons Collectanea (Trades), comprising advertisements, cuttings, and illustrations relating to trades, professions, medical cures. 1660-1825.

The dataset comprises the OCR text derived from four digitised volumes of a collection of advertisements, cuttings and illustrations relating to trades, professions and medical cures from 1660 - 1825.

British Library

text, newspapers, OCR, trades, and adverts

2015

Dataset

Volumes of Lysons Collectanea (Amusements), comprising broadsides, cuttings, advertisements on amusements 1660-1840

The dataset comprises nine digitised volumes of a collection of broadsides, cuttings and advertisements relating to public exhibitions and places of amusement from 1660 - 1840 (with OCR-derived text). Part of the Lysons Collectanea collection.

British Library

amusements, text, newspapers, broadsides, OCR, and adverts

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Recognition of Early Indian Printed Documents competition

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

Indian, transcription, and OCR

2019

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts - Transkribus

This dataset comprises 74 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, transcription, and Indian

2019

Dataset

Ground Truth transcriptions for training OCR of historical Arabic handwritten texts

This dataset comprises 120 digitised images (TIFF files) drawn from a selection of historical Arabic scientific manuscripts (10th-19th century) digitised through the British Library Qatar Foundation Partnership. Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition (OCR) or handwritten text...

British Library ; Keinan-Schoonbaert, Adi

Arabic, transcription, and OCR

2020

Dataset

al-Durr al-naqī fī fann al-mūsīqī (Add MS 23494)

This dataset is a PDF file containing the images and transcription of the manuscript titled al-Durr al-naqī fī fann al-mūsīqī الدرّ النقيّ في فنّ الموسيقي by Aḥmad ibn 'Abd al-Raḥmān al-Mawṣilī أحمد بن عبد الرحمن الموصلي. The manuscript was digitised through the British Library Qatar Foundation Partnership, and made available through...

British Library ; Keinan-Schoonbaert, Adi

transcription, Arabic, and OCR

2021

Dataset

Digitised Books. c. 1510 - c. 1900. JSONL (OCR derived text + metadata)

The dataset comprises metadata and OCR generated text from 49,455 digitised books published between c. 1510 - c. 1900. The books cover a wide range of subject areas including philosophy, history, poetry and literature. The dataset is in JSON Lines (JSONL) text format.

British Library Labs ; British Library

OCR and monographs

2020

Dataset

Ground Truth transcriptions for training OCR of historical Bengali printed texts – Recognition of Early Indian Printed Documents competition - updated with improved XML coordinates

This dataset comprises 81 digitised images (TIFF files) drawn from a selection of early printed Bengali books (1713-1914) digitised through the Two Centuries of Indian Print project (https://www.bl.uk/projects/two-centuries-of-indian-print). Also contained are ground truth transcriptions (XML) for each page that can be used for training optical character recognition software on historical...

British Library ; Derrick, Tom

OCR, Indian, and transcription

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines

2020

Conference paper (unpublished)

Assessing the Impact of OCR Quality on Downstream NLP Tasks

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of...

van Strien, Daniel ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Hosseini, Kasra ; McGillivray, Barbara …

Natural Language Processing, OCR, Optical Character Recognition, information retrieval, NLP, and digital humanities

Research Repository
