Index Catalog // British Library

Research Repository

Filtered by: Creator: Pedrazzini, Nilo; Keyword: newspapers; Language: English

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo
OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines

2022

Dataset

Diachronic word embeddings from 19th-century newspapers digitised by the British Library (1800-1919)

Word vectors related to the paper "Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers" by Nilo Pedrazzini and Barbara McGillivray (2022). The embeddings were trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and specific parameters. The embeddings are divided into...

Pedrazzini, Nilo ; McGillivray, Barbara
historical semantics, word-vectors, late-modern-english, newspapers, diachronic-embeddings, and word2vec