Index Catalog // British Library

2023

Dataset

DeezyMatch training set for OCR

Optical character recognition (OCR) is the process of automatically transcribing text from images. The presence of OCR-induced errors in digitised text is a common problem in the digital humanities. OCR errors are usually due to the misrecognition of characters, such as "h" recognised as "b", or "c" recognised as "o"....

Coll Ardanuy, Mariona ; Nanni, Federico ; Pedrazzini, Nilo

OCR, fuzzy string matching, string variation, newspapers, digital humanities, natural language processing, DeezyMatch, and Living with Machines

2023

Dataset

Datasets for toponym recognition and disambiguation for nineteenth-century English newspapers

We present two datasets, one for the task of toponym recognition and one for the task of toponym disambiguation. The datasets are derived from the "Dataset for Toponym Resolution in Nineteenth-Century English Newspapers" (DOI: https://doi.org/10.23636/r7d4-kw08). The toponym recognition dataset consists of two JSON files (ner_fine_train.json and ner_fine_dev.json), whereas the toponym...

Coll Ardanuy, Mariona ; Nanni, Federico

toponym disambiguation, nineteenth-century newspapers, named entity recognition, entity linking, toponym resolution, toponym recognition, and dataset

2021

Conference paper (published)

When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation

As languages evolve historically, making computational approaches sensitive to time can improve performance on specific tasks. In this work, we assess whether applying historical language models and time-aware methods helps with determining the correct sense of polysemous words. We outline the task of time-sensitive Targeted Sense Disambiguation (TSD), which aims...

Beelen, Kaspar ; Nanni, Federico ; Coll Ardanuy, Mariona ; Hosseini, Kasra ; Tolfo, Giorgia …

2020

Research report

Data Study Group Final Report: Smart monitoring for conservation areas

WWF (World Wide Fund for Nature) monitors over 250,000 protected areas (e.g. national parks and nature reserves) and thousands of other sites and critical habitats. These sites are the foundation of global natural assets and are central to the preservation of biodiversity and human well-being. Unfortunately, they face increasing pressures...

Hosseini, Kasra ; Coll Ardanuy, Mariona ; Patterson, David ; Garcia-Velez, Laura ; Castro-Gonzalez, Leonardo …

neural networks, supervised learning, natural language processing, WWF, conservation, habitats, Data Study Groups, and Alan Turing Institute

2020

Abstract

Using smart annotations to map the geography of newspapers

Geographic information is a key component in the description of collection objects, and yet its format is often unsuited for use with methods of geographic analysis. Catalogue entries are often inconsistent, in plain text, and without geographic coordinates (much less coordinates linked to authority records). Georesolution of the relevant fields...

Ryan, Yann ; Coll Ardanuy, Mariona ; van Strien, Daniel ; Hosseini, Kasra ; Beelen, Kaspar …

2019

Poster (published)

Living with Machines - British Newspaper Titles vs Newspaper Press Directory titles

Ahnert, Ruth ; Beavan, David ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Griffin, Emma …

2019

Poster (published)

Living with Machines - Agency of Machines

Ahnert, Ruth ; Beavan, David ; Beelen, Kaspar ; Colavizza, Giovanni ; Coll Ardanuy, Mariona …

lexicon expansion and Living with Machines

2019

Poster (published)

Living with Machines - Computer-detected text in historical maps

Ahnert, Ruth ; Beavan, David ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Griffin, Emma …

maps and Living with Machines

2020

Conference paper (unpublished)

Assessing the Impact of OCR Quality on Downstream NLP Tasks

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of...

van Strien, Daniel ; Beelen, Kaspar ; Coll Ardanuy, Mariona ; Hosseini, Kasra ; McGillivray, Barbara …

Natural Language Processing, OCR, Optical Character Recognition, information retrieval, NLP, and digital humanities

2019

Conference paper (unpublished)

Defoe: A Spark-Based Toolbox for Analysing Digital Historical Textual Data

This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows for running text mining queries across large datasets, such as historical newspapers and books, in parallel via Apache Spark. It handles queries against collections that comprise several XML schemas and physical representations....

Filgueira, Rosa ; Jackson, Michel ; Terras, Melissa ; Roubickova, Anna ; Beavan, David …

High-Performance Computing, digital tools, Apache Spark, distributed queries, digitised primary historical sources, XML schemas, text mining, and humanities research

