Research Repository

Working paper

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Public, Deposited

Creator

McMillan-Major, Angelina
Alyafeai, Zaid
Biderman, Stella
Chen, Kimbo
De Toni, Francesco
Dupont, Gerard
Elsahar, Hady
Emezue, Chris
Fikri Aji, Alham
Ilic, Suzana
Khamis, Nurulaqilla
Leong, Colin
Masoud, Maraim
Soroa, Aitor
Suarez, Pedro Ortiz
Talat, Zeerak
van Strien, Daniel
Jernite, Yacine

Abstract

In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.

Items:

Thumbnail	File name	Date uploaded	Visibility	File size	Actions
	2201.10066.pdf	2022-04-13	Public	330 KB	Download

Metadata

Resource Type: Working paper
Creator: McMillan-Major, Angelina
Alyafeai, Zaid
Biderman, Stella
Chen, Kimbo
De Toni, Francesco
Dupont, Gerard
Elsahar, Hady
Emezue, Chris
Fikri Aji, Alham
Ilic, Suzana
Khamis, Nurulaqilla
Leong, Colin
Masoud, Maraim
Soroa, Aitor
Suarez, Pedro Ortiz
Talat, Zeerak
van Strien, Daniel
Jernite, Yacine
Institution: British Library
Organisational unit: Digital Scholarship
Pagination: 1-11
Official URL: https://arxiv.org/abs/2201.10066
Licence: CC BY 4.0 Attribution
DOI: 10.48550/arXiv.2201.10066
Alternate identifier: hal-03550289, version 1 (type: HAL Id)
2201.10066 (type: ARXIV)
Keywords: Applications
Collaborative Resource Construction & Crowdsourcing
LR Infrastructures and Architectures
Systems
Tools