Transfer Learning and Text Data

Transfer learning for clustering articles from

Ongoing retro-digitisation of printed resources gives researchers from all disciplines access to data collections of a size that was unimaginable a few decades ago. The size of these datasets, however, makes processing and exploring them a challenge. This applies in particular to retro-digitised periodicals, journals and newspapers. New approaches for ordering these collections and facilitating information retrieval are needed.

Resources that are being digitised or are already available online attract more and more attention, especially from “Digital Humanities” researchers. This mass digitisation also creates new demands from researchers and collection curators. The project therefore seeks new ways of augmenting existing cataloguing and description practices with recent advances in machine learning. It explores techniques that could automatically structure text data from the collection (powered by ETH Library) into thematic clusters.

Clustering groups the articles in the collection according to their semantic content, so that articles about similar topics are assigned to the same cluster. The project investigates whether transfer learning works as well for clustering textual data from the collection as it does for image data (see, e.g., PixPlot).


Figure: clustering process (image credit: P. Ströbel)


The goal is to train a feature extractor, in the form of a neural network classifier, on newspaper articles which have been labelled with fine-grained categories. Journal articles from the collection are then fed to this classifier, but instead of using its classifications, the features calculated by the final network layer are stored for each article. A clustering algorithm then uses these extracted representations to group articles with similar features into clusters. This is where the actual transfer happens: a model trained on existing data, for which labels are already available, helps to organise and cluster new, unseen and unlabelled texts.
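The three steps (train on labelled articles, store final-layer features for unlabelled articles, cluster those features) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the tiny texts, the two categories, the bag-of-words input, the single-layer softmax network and the hand-written k-means are all stand-ins chosen to keep the example self-contained. A real setup would use a deeper network, many fine-grained categories and an established clustering library.

```python
import math
import random

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def scores(x, weights):
    """Final-layer activations: one score per training category."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def train_classifier(X, y, n_classes, epochs=200, lr=0.5):
    """Fit a single-layer softmax network with stochastic gradient descent."""
    weights = [[0.0] * len(X[0]) for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in zip(X, y):
            probs = softmax(scores(x, weights))
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for i, xi in enumerate(x):
                    weights[c][i] -= lr * err * xi
    return weights

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Step 1: labelled source data (stand-in for categorised newspaper articles).
train_texts = [
    "match goal team win", "team coach match season",
    "bank market shares profit", "market bank interest loan",
]
train_labels = [0, 0, 1, 1]  # hypothetical categories: 0 = sport, 1 = economy

vocab = sorted({w for t in train_texts for w in t.split()})
X_train = [bag_of_words(t, vocab) for t in train_texts]
weights = train_classifier(X_train, train_labels, n_classes=2)

# Step 2: unlabelled target data (stand-in for the journal articles).
journal_texts = [
    "the team lost the match", "coach praises team after goal",
    "bank raises interest on every loan", "profit lifts shares on the market",
]
# Store the final-layer scores instead of the predicted category.
features = [scores(bag_of_words(t, vocab), weights) for t in journal_texts]

# Step 3: cluster the stored features.
clusters = kmeans(features, k=2)
for text, cluster in zip(journal_texts, clusters):
    print(cluster, text)
```

The classifier's predicted category is never used for the journal articles. Only the score vector is kept, so the categories learned from the newspaper data act as a coordinate system in which the new texts are grouped.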

Structuring a collection by clustering brings many advantages and can speed up the process of finding relevant source material. It enables researchers to find articles that are similar to those returned by a particular query, even when they do not contain any words from the original query. This is especially important in diachronic text collections, where the vocabulary changes over time.
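How such a lookup might work can be shown with a small sketch: once every article has a feature vector, similar articles can be ranked by cosine similarity in feature space, without comparing their words at all. The article identifiers and vector values below are invented for illustration, and cosine similarity is an assumed choice of measure, not one named by the project.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(query_id, features, top_n=2):
    """Rank all other articles by cosine similarity to the query article."""
    query = features[query_id]
    ranked = sorted(
        ((cosine(query, vec), art_id)
         for art_id, vec in features.items() if art_id != query_id),
        reverse=True,
    )
    return [art_id for _, art_id in ranked[:top_n]]

# Hypothetical feature vectors, as a trained extractor might produce them.
features = {
    "railway_1880": [0.9, 0.1, 0.0],
    "train_1950": [0.8, 0.2, 0.1],
    "cheese_1900": [0.0, 0.1, 0.9],
    "dairy_1970": [0.1, 0.0, 0.8],
}

neighbours = most_similar("railway_1880", features, top_n=1)
print(neighbours)
```

In this toy data, two articles written decades apart can end up as nearest neighbours because their feature vectors point in a similar direction, regardless of whether they share any vocabulary.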

Project Duration

1 January 2020 – 30 June 2020


Project Owner

Phillip Ströbel

PhD student Computational Linguistics, University of Zurich
