Index Catalog // British Library

2023

Dataset

Language of Mechanisation: annotated historical newspaper articles

Datasets produced through crowdsourcing tasks set up on the Zooniverse platform by the Living with Machines ‘language of mechanisation’ project team. Building on earlier work classifying machines by function, we asked volunteers on Zooniverse 'how did the word x change over time and place?' and presented them with options for...

British Library ; Ridge, Mia ; Pedrazzini, Nilo ; McGillivray, Barbara

crowdsourcing, 19th century British English, annotation, historical newspapers, mechanisation, data visualisation, historical semantics, and transport history

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines
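
The kind of character misrecognition described in the entry above ("h" read as "b", "c" read as "o") can be illustrated with a minimal sketch. This is not DeezyMatch and does not use its API: the confusion table holds only the two pairs named in the description, the helper names `corrupt` and `similarity` are hypothetical, and the similarity score comes from Python's standard `difflib` module rather than a trained model.

```python
from difflib import SequenceMatcher

# Assumption: only the two confusions mentioned in the dataset description.
CONFUSIONS = {"h": "b", "c": "o"}


def corrupt(word, confusions=CONFUSIONS):
    """Replace each character that has a known confusion with its misread form."""
    return "".join(confusions.get(ch, ch) for ch in word)


def similarity(a, b):
    """Return a similarity ratio in [0, 1] using difflib (not a trained matcher)."""
    return SequenceMatcher(None, a, b).ratio()


clean = "machine"
noisy = corrupt(clean)  # both "c" and "h" are replaced
score = similarity(clean, noisy)
print(clean, noisy, round(score, 3))
```

A pair such as the clean and corrupted strings above is the sort of string variation a fuzzy matcher has to treat as the same word; a real training set would draw its pairs from actual OCR output rather than a fixed substitution table.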

2022

Dataset

Diachronic word embeddings from 19th-century newspapers digitised by the British Library (1800-1919)

Word vectors related to the paper "Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers" by Nilo Pedrazzini and Barbara McGillivray (2022). The embeddings were trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and specific parameters. The embeddings are divided into...

Pedrazzini, Nilo ; McGillivray, Barbara

historical semantics, word-vectors, late-modern-english, newspapers, diachronic-embeddings, and word2vec

2023

Dataset

Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)

Word embeddings trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and specific parameters. The embeddings are divided into periods of ten years each. Unlike those in this repository, these were not aligned, and OCR errors were not skimmed from the vocabulary. See related GitHub repository for the full documentation:...

Pedrazzini, Nilo

historical semantics, British newspapers, word embeddings, word vectors, word2vec, and Late Modern English
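
Because the decade-level models in the entry above are described as not aligned, vectors from different decades sit in unrelated coordinate systems and cannot be compared directly. One common workaround, sketched here under stated assumptions, is to compare a word's nearest neighbours within each decade instead. This is not necessarily the authors' method: the two-dimensional vectors are invented toy data, and `cosine`, `nearest` and `neighbour_overlap` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(space, word, k=2):
    """Return the k words closest to `word` within a single embedding space."""
    target = space[word]
    scored = [(cosine(target, vec), w) for w, vec in space.items() if w != word]
    return {w for _, w in sorted(scored, reverse=True)[:k]}


def neighbour_overlap(space_a, space_b, word, k=2):
    """Jaccard overlap of a word's neighbour sets in two unaligned spaces."""
    a, b = nearest(space_a, word, k), nearest(space_b, word, k)
    return len(a & b) / len(a | b)


# Toy data (invented): the second space uses swapped axes, as if trained independently.
decade_a = {"engine": [1.0, 0.0], "steam": [0.9, 0.1], "boiler": [0.8, 0.2],
            "motor": [0.1, 0.9], "horse": [0.0, 1.0]}
decade_b = {"engine": [0.0, 1.0], "steam": [0.1, 0.9], "motor": [0.2, 0.8],
            "boiler": [0.9, 0.1], "horse": [1.0, 0.0]}

# The raw cross-space cosine is uninformative because the axes do not correspond.
raw = cosine(decade_a["engine"], decade_b["engine"])
overlap = neighbour_overlap(decade_a, decade_b, "engine")
print(raw, overlap)
```

The neighbour sets are computed separately inside each space, so the comparison does not depend on the two models sharing axes.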

2023

Dataset

Diachronic and diatopic word embeddings from newspapers digitised by the British Library (1830-1889): North and South England

Diachronic word embeddings (decade-level) trained with Word2Vec (via Gensim) on different geographic subcorpora of the Heritage Made Digital British and the Living with Machines historical newspaper collections:
- North England (north.zip)
- South England (south.zip)
At the moment, for each subcorpus, Word2Vec models are available for each decade in the...

Pedrazzini, Nilo ; McGillivray, Barbara

historical semantics, diachronic embeddings, late modern English, word embeddings, word vectors, word2vec, and diatopic embeddings
