Digitised resources, whether newly scanned or already available online, are attracting growing attention, especially from “Digital Humanities” researchers. At the same time, mass digitisation creates new demands from researchers and collection curators. This project therefore seeks to identify ways of augmenting existing cataloguing and description practices with recent advances in machine learning. It explores techniques that could automatically structure text data from e-periodica.ch (powered by ETH Library) into thematic clusters.
Clustering groups the articles in the collection according to their semantic content: articles on similar topics are allocated to the same cluster. The project investigates whether transfer learning works as well for textual data from e-periodica.ch as it does for image data (see, e.g., PixPlot).
The goal is to train a feature extractor, in the form of a neural network, on newspaper articles that have been labelled with fine-grained categories. Journal articles from e-periodica.ch are then fed to this classifier, but instead of using its predicted classes, the features computed by the final network layer are stored for each article. A clustering algorithm then uses these extracted representations to group articles with similar features into clusters. This is where the actual transfer happens: a model trained on existing, already-labelled data helps organise and cluster new, unseen and unlabelled texts.
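The pipeline described above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the network shape, the bag-of-words toy data, and the choice of k-means are all assumptions made for the example; the real system would use preprocessed article text and its own trained model.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Toy stand-in for labelled newspaper articles: random bag-of-words
# vectors (vocabulary size 100) with 5 fine-grained category labels.
torch.manual_seed(0)
X_train = torch.rand(200, 100)
y_train = torch.randint(0, 5, (200,))

class Classifier(nn.Module):
    """Classifier = feature extractor plus a linear classification head."""
    def __init__(self, vocab=100, hidden=32, classes=5):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(vocab, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, classes)

    def forward(self, x):
        return self.head(self.features(x))

# Train the classifier on the labelled data.
model = Classifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(50):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

# Transfer step: for new, unlabelled articles, keep the features from
# the layer before the head and discard the predicted classes.
X_new = torch.rand(50, 100)
with torch.no_grad():
    feats = model.features(X_new).numpy()

# Cluster the extracted representations (number of clusters is assumed).
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(feats)
```

Each entry of `clusters` is the cluster id assigned to one unlabelled article; the class labels the network was trained on are never used at this stage.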
The structuring of a collection by clustering brings many advantages and can speed up the process of finding relevant source material. It enables researchers to identify articles similar to the ones returned by a particular query, but where the similar articles do not necessarily contain words from the original query. This is especially important in diachronic text collections, where the vocabulary changes over time.
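Similarity search of this kind can be sketched over the stored per-article features. The random feature matrix and the function name below are placeholders for illustration; in practice the vectors would come from the trained extractor.

```python
import numpy as np

# Placeholder article representations (rows = articles, columns = features).
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 32))

def similar_articles(query_idx, features, k=5):
    """Return indices of the k articles closest to a query article by
    cosine similarity -- no shared query vocabulary is required."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]          # cosine similarity to query
    order = np.argsort(-sims)                  # most similar first
    return order[order != query_idx][:k]       # drop the query itself

neighbours = similar_articles(0, features)
```

Because the comparison is made in feature space rather than on surface word forms, an article can be retrieved even when its vocabulary differs from the query, which is what makes this useful for diachronic collections.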
1 January 2020 – 30 June 2020