Search results
-
Dataset
OCR and crowdsourced annotations, Language of Mechanisation, JSON files
Datasets created through crowdsourcing tasks on the Zooniverse platform by the Living with Machines ‘language of mechanisation’ project team. Building on earlier work classifying machines by function, we asked volunteers on Zooniverse 'how did the word x change over time and place?' and presented them with options for...
-
Dataset
Language of Mechanisation: annotated historical newspaper articles
Datasets created through crowdsourcing tasks on the Zooniverse platform by the Living with Machines ‘language of mechanisation’ project team. Building on earlier work classifying machines by function, we asked volunteers on Zooniverse 'how did the word x change over time and place?' and presented them with options for...
-
Dataset
UK Doctoral Thesis Metadata from EThOS
The data in this collection comprises the bibliographic metadata for all UK doctoral theses listed in EThOS, the UK's national thesis service. We estimate the data covers around 98% of all PhDs ever awarded by UK Higher Education institutions, dating back to 1787. Thesis metadata from every PhD-awarding university in...
British Library; Rosie, Heather
higher education, student, UK, dissertations, PhD, theses, doctoral, ethos, thesis, and research
-
Dataset
EAP031 Catalogue Metadata
This Excel spreadsheet contains the metadata that describes the archival collection digitised in Mongolia by the EAP031 "The Treasures of Danzan Ravjaa" project team. The metadata was originally created by the EAP031 project team that digitised the archive in 2005. The project team was led by Professor Caroline Humphrey. This...
EAP031 Project Team
metadata, manuscripts, and Tibetan
-
Dataset
EAP696 Catalogue Metadata
This Excel spreadsheet contains the metadata that describes the archival collection digitised in Bulgaria by the EAP696 "Minority press in Ottoman Turkish in Bulgaria" project team. The metadata was originally created by the EAP696 project team that digitised the archive in 2014. The project team was led by Mr Stoyan...
EAP696 Project Team
-
Dataset
Datasets for toponym recognition and disambiguation for nineteenth-century English newspapers
We present two datasets, one for the task of toponym recognition and one for the task of toponym disambiguation. The datasets are derived from the "Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.23636/r7d4-kw08). The toponym recognition dataset consists of two JSON files (ner_fine_train.json and ner_fine_dev.json), whereas the toponym...
Coll Ardanuy, Mariona; Nanni, Federico
toponym disambiguation, nineteenth-century newspapers, named entity recognition, entity linking, toponym resolution, toponym recognition, and dataset
-
Dataset
DeezyMatch training set for OCR
Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....
-
Dataset
Diachronic word embeddings from 19th-century newspapers digitised by the British Library (1800-1919)
Word vectors related to the paper "Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers" by Nilo Pedrazzini and Barbara McGillivray (2022). The embeddings were trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and specific parameters. The embeddings are divided into...
Pedrazzini, Nilo; McGillivray, Barbara
historical semantics, word-vectors, late-modern-english, newspapers, diachronic-embeddings, and word2vec
-
Dataset
Decade-level Word2Vec models from automatically transcribed 19th-century newspapers digitised by the British Library (1800-1919)
Word embeddings trained on a 4.2-billion-word corpus of 19th-century British newspapers using Word2Vec and specific parameters. The embeddings are divided into periods of ten years each. Unlike those in this repository, these were not aligned, nor were OCR errors skimmed from the vocabulary. See related GitHub repository for the full documentation:...
Pedrazzini, Nilo
historical semantics, British newspapers, word embeddings, word vectors, word2vec, and Late Modern English