We present two datasets, one for the task of toponym recognition and one for the task of toponym disambiguation. The datasets are derived from the "Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.23636/r7d4-kw08). The toponym recognition dataset consists of two JSON files (ner_fine_train.json and ner_fine_dev.json), whereas the toponym...
Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....