Conference Item
Abstract: This paper introduces an open-source speech dataset for Yoruba, one of the largest low-resource West African languages, spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, but it can also serve as adaptation data in automatic speech recognition and speech-to-speech translation, and provide insights into West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario, evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.
Gutkin, Alexander; Demirşahin, Işın; Kjartansson, Oddur; Rivera, Clara; Túbọ̀sún, Kọ́lá
Conference Item
Abstract: Building our discipline has been an ongoing discussion since the early days of ICIDS. From earlier international joint efforts to integrate research from multiple fields of study to today’s endeavours by researchers to provide scholarly works of reference, the discussion on how to continue building Interactive Digital Narratives as a discipline with its own vocabulary, scope, evaluation and methods is far from over. This year, we have chosen to continue this discussion through a panel that explores the epistemological implications of the multiple disciplinary roots of our field and the next steps we should take as a community.
Bernstein, Mark; Palosaari Eladhari, Mirjam; Koenitz, Hartmut; Louchart, Sandy; Nack, Frank; Martens, Chris; Rossi, Giulia Carla; Bosser, Anne-Gwenn; Millard, David E.
Conference Item
Abstract: We present DeezyMatch, a free, open-source software library written in Python for fuzzy string matching and candidate ranking. Its pair classifier supports various deep neural network architectures for training new classifiers and for fine-tuning a pretrained model, which paves the way for transfer learning in fuzzy string matching. This approach is especially useful where only limited training examples are available. The learned DeezyMatch models can be used to generate rich vector representations from string inputs. The candidate ranker component in DeezyMatch uses these vector representations to find, for a given query, the best matching candidates in a knowledge base. It uses an adaptive searching algorithm applicable to large knowledge bases and query sets. We describe DeezyMatch’s functionality, design and implementation, accompanied by a use case in toponym matching and candidate ranking in realistic noisy datasets.
Hosseini, Kasra; Nanni, Federico; Coll Ardanuy, Mariona
Conference Item
Abstract: This paper describes the creation of the Interactive Narratives collection in the UK Web Archive, as part of the UK Legal Deposit Libraries Emerging Formats Project. The aim of the project is to identify, collect and preserve complex digital publications that are in scope for collection under UK Non-Print Legal Deposit Regulations. This paper traces the process of building the Interactive Narratives collection, analysing the different tools and methods used and placing the collection within the wider context of Emerging Formats work and engagement activities at the British Library.
Clark, Lynda; Rossi, Giulia Carla; Wisdom, Stella
Conference Item
Abstract: The MicroPasts project is a novel experiment in the use of crowd-based methodologies to enable participatory archaeological research. Building on a long tradition of offline community archaeology in the UK, this initiative aims to integrate crowd-sourcing, crowd-funding and forum-based discussion to encourage groups of academics and volunteers to collaborate on the web. This paper will introduce MicroPasts, its aims, methods and initial results, with a particular emphasis on project evaluation. The evaluative work conducted over the first few months of the project already demonstrates the potential for crowd-sourced archaeological 3D modelling, especially amongst younger audiences, next to more traditional kinds of crowd-sourcing such as transcription. It has also allowed a comparative assessment of different methods for sustaining contributor participation through time and a discussion of their implications for the sustainability of the MicroPasts project and (potentially) other archaeological crowd-sourcing endeavours.
Bonacchi, Chiara; Bevan, Andrew; Pett, Daniel; Keinan-Schoonbaert, Adi
Conference Item
Abstract: This panel will present and discuss different eBook workflows and challenges from four national libraries, considering a range of issues from technical complexities to the evolution of the content type and changes in the publishing/collecting landscape.
Owens, Trevor; Pennock, Maureen; Smyth, Tom; Steinke, Tobias
Conference Item
Abstract: The dawn of Trustworthy Digital Repository Certification under the ISO 16363:2012 standard is on the horizon. Across the digital preservation community, institutions are eager to learn more about the processes of preparing for and undergoing an ISO 16363 audit from an accredited third-party organization. Now that the first ISO 16363 audits in the world have been performed, repositories want to learn what value and benefit certification provides. This panel features representatives from three different repositories representing three countries with distinct collections, designated communities, organizational infrastructures, and unique challenges. Institutions represented on the panel have either recently achieved certification or are currently undergoing an ISO 16363 audit. This panel will explore each repository’s experience leading up to, during, and following certification. The panel will include a representative from the accredited external auditing body who has performed these audits to respond to audience questions about the audit process. Panelists from repositories will present varying perspectives on the future of digital repository certification, the role of digital preservation standards, and approaches to implementation. All panelists will present arguments, concerns, and criticisms regarding the ISO 16363 standard and existing methods of repository assessment.
Giaretta, David; LaPlant, Lisa; Shiers, Jamie; Tieman, Jessica; Pennock, Maureen; Zuberi, Ifan
Resolving places, past and present: toponym resolution in historical British newspapers using multiple resources
Abstract: Newspapers and their metadata are richly geographical, not only in their distribution but also in their content. Attending to these spatial features is a prerequisite in newspaper research. Following other projects that have geoparsed place names in newspapers, we describe our approach to linking historical geospatial information in text to real-world locations, which 1) adopts an expansive definition of what counts as a locatable entity; 2) uses knowledge bases derived from contemporaneous sources; and 3) leverages contextual information to disambiguate hard-to-locate places. This method depends on combining historical and non-historical resources, and the paper discusses the potential benefits for humanities research.
Coll Ardanuy, Mariona; McDonough, Katherine; Krause, Amrey; Wilson, Daniel C S; Hosseini, Kasra; van Strien, Daniel
Conference Item
Abstract: In this paper, we describe a new collaborative approach to the collection of representation information (RI) to ensure long-term access to digital content. Representation information is essential for successful rendering of digital content in the future. Manual collection and maintenance of RI has so far proven to be highly resource intensive, a problem compounded by the massive scale of the challenge, especially for repositories with no format limitations. This solution combats these challenges by drawing upon the wisdom and knowledge of the crowd to identify online sources of representation information, which are then collected, classified, and managed using existing tools. We suggest that nominations can be harvested and preserved by participating established web archives, which themselves could benefit from such extensive collections. This is a low-cost, low-resource approach to collecting essential representation information of widespread relevance.
Pennock, Maureen; Jackson, Andrew N.; Wheatley, Paul