UK Selective Web Archive Classification Dataset. 1996 - 2010. TSV.


The dataset comprises a manually curated selective archive produced by UKWA, including the classification of each site into a two-tiered subject hierarchy. UKWA has made this manually generated classification information available as an open dataset in Tab Separated Values (TSV) format.

In partnership with the Internet Archive and JISC, UKWA obtained access to the subset of the Internet Archive’s web collection that relates to the UK. The resulting JISC UK Web Domain Dataset (1996 - 2013) contains all of the resources from the Internet Archive that were hosted on domains ending in ‘.uk’, or that are required in order to render those UK pages.

UKWA is particularly interested in whether high-level metadata like this can be used to train an automatic classification system, so that the manually generated dataset can partially automate the categorisation of UKWA’s larger archives. UKWA expects that such a classifier might require more information about each site in order to produce reliable results, and a future goal is to augment this dataset with further information. Options include, for each site, making the titles of every page on that site available, and extracting a set of keywords that summarise the site via the full-text index.
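As a rough illustration of the classification idea described above, the following Python sketch trains a small naive Bayes classifier on site titles and predicts the top-level subject. It is only a sketch under stated assumptions: the column layout (`primary_category`, `secondary_category`, `title`, `url`), the function names, and the sample rows are all invented for the example and are not taken from the actual TSV file, whose layout should be checked before adapting this.

```python
import csv
import io
import math
import re
from collections import Counter, defaultdict

# ASSUMPTION: hypothetical column order; the real TSV may differ.
FIELDS = ["primary_category", "secondary_category", "title", "url"]

# ASSUMPTION: invented sample rows, not records from the dataset.
SAMPLE = (
    "Arts & Humanities\tLiterature\tPoetry Society readings\thttp://example.org/a\n"
    "Arts & Humanities\tMusic\tFolk music festival\thttp://example.org/b\n"
    "Medicine & Health\tPublic Health\tCommunity health clinic\thttp://example.org/c\n"
    "Medicine & Health\tNursing\tNursing care guidance\thttp://example.org/d\n"
)


def load_rows(tsv_text):
    """Parse tab-separated text into dicts keyed by the assumed FIELDS."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows that do not have exactly the assumed number of columns.
    return [dict(zip(FIELDS, row)) for row in reader if len(row) == len(FIELDS)]


def tokens(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(rows):
    """Count title words per top-level category (multinomial naive Bayes)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for row in rows:
        label = row["primary_category"]
        class_counts[label] += 1
        word_counts[label].update(tokens(row["title"]))
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def predict(model, title):
    """Return the category with the highest log-probability for a title."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for word in tokens(title):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With the invented sample, `predict(train(load_rows(SAMPLE)), "poetry festival")` selects the arts category. A classifier trained on titles alone is likely to be weak, which is consistent with the expectation above that per-page titles or extracted keywords would be needed for reliable results.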


