Conference Item
Abstract: This paper presents an objective comparative evaluation of page analysis and recognition methods for historical documents with text mainly in Bengali language and script. It describes the competition rules, dataset, and evaluation methodology. Results are presented for five methods: three submitted, one re-run, and one open-source state-of-the-art system. The focus is on optical character recognition (OCR) performance. Different evaluation metrics were used to gain insight into the algorithms, including new character accuracy metrics to better reflect the difficult circumstances presented by the documents. The results indicate that deep learning approaches are promising, but there are still significant challenges for historic material of this nature.
Clausner, Christian; Antonacopoulos, Apostolos; Derrick, Tom; Pletschacher, Stefan
Conference Item
Abstract: The British Library is the national library of the United Kingdom and holds over 150 million items, with an additional three million new items added each year. The 625 km of shelving contains manuscripts, maps, newspapers, magazines, prints, drawings, music scores and patents. The fundamental purpose of the Library is to make intellectual heritage accessible to everyone, for research, inspiration and enjoyment. It is therefore imperative that these texts are legible. Three instances of text illegibility were identified as requiring technological help beyond traditional digitization: (i) fire damage rendering text and illuminations charred and sometimes shrunken or skewed, (ii) indistinct areas of text which have either degraded naturally or been purposefully erased, and (iii) chemical damage, usually as a consequence of historical treatments to recover faded text, evidenced by reams of discolored and stained folios. The sheer variety of materials and implements used for writing and drawing, combined with the varying condition of items, presents difficulties in optimally imaging these items for legibility. In 2013, a new role of Conservation Research Imaging Scientist was created to develop strategies and solutions to overcome these challenges. Multi-spectral imaging was subsequently adopted by the Library and has proven to be an invaluable aid to scholars and researchers. Combined with post-processing imaging techniques such as Principal Component Analysis and Color Space Analysis, multi-spectral imaging has delivered spectacular results. Future work aims to integrate techniques such as Reflectance Transformation Imaging while ensuring improved access to scientific datasets to complement existing digital collections. © 2018 IEEE.
Implementing Digital Preservation Strategy: Developing content collection profiles at the British Library
Abstract: The British Library is increasingly a digital library. Through both digitization and acquisition, it has built up significant collections of digital content covering a very wide range of content types. Most recently, the extension of legal deposit provisions to non-print works in 2013 has meant that the Library, working in conjunction with the other UK legal deposit libraries, has begun to collect new categories of digital content, including periodic harvests of the UK Web domain. To support this, the Library has also invested heavily in developing scalable infrastructures for the acquisition, storage and management of large amounts of digital content. The British Library Digital Preservation Strategy, 2013-2016 focuses on embedding digital sustainability as an organizational principle across the Library and on helping to manage preservation risks and challenges across all digital collection content lifecycles. This practice paper describes work being undertaken by the Digital Preservation Team at the British Library to develop content profiles of high-level digital collections that will support the implementation of the strategy, in particular the capture of long-term preservation requirements.
Day, Michael; McDonald, Ann; Pennock, Maureen; Kimura, Akiko