Many scientists and researchers have found that finding, accessing and pre-processing the data needed for their research took far more time than anticipated. This challenge is not unique to scientific research environments; data science experts have been describing it for almost a decade (Press 2016).
Looking at this issue from a more individual perspective, the experience of Phillip Ströbel, a PhD student in Computational Linguistics at the University of Zurich and former Innovator Fellow at ETH Library Lab, can serve as a representative example. He refers to the frictions and barriers he encountered in accessing data as a “data dilemma”.
The Data Dilemma – A User Story
In 2020, Phillip Ströbel was working on the development of new access methods to the journals of e-periodica – a platform of the ETH Library that provides access to digitised Swiss journals in several languages. The aim of Phillip’s fellowship was to perform thematic clustering to enrich the search facets in the e-periodica collection. He says:
I was very excited to be able to work with the data. The only question that remained difficult to answer was: how will I access it? Of course, I could browse the e-periodica online, and I could even download articles or even all the issues. But crawling the whole collection, downloading the PDFs, and extracting the text was out of the question, since – and of that I felt sure – the data that is available online must be hosted somewhere in some sensible format.
However, I received a physical copy of all the data on a hard disk about 1.5 weeks after I had started the fellowship. On there, each publication is stored in a folder, where for each year (or a longer period), an XML file holds all the data about issues, a separate JSON file provides the metadata about a specific journal title and several text files contain the text along with word coordinates. It seemed overly complicated, and I was wondering why that might be? The digitisation of e-periodica had progressed a lot in the past years; more and more data was acquired and finally made public on the e-periodica platform. But why is there no easy way to access the sources?
With libraries being more and more occupied with the digitisation of their material, we would assume that accessing the digitised data would get easier as well. Evidence for this is the growing number of freely available datasets on opendata.swiss. There, you can find 82 organisations which provide access to their data, including text data, images, collections of objects, and numerical datasets (e.g., data about the mortality rate of people who smoke). The government is what I would call a “top data provider” (the Federal Statistical Office alone had published 4,171 datasets as of 13 July 2020), while for libraries, we probably only see the tip of the iceberg (the ETH Library, e.g., lists 15 (!) datasets on opendata.swiss). But should libraries as knowledge management institutions not be the main data providers for researchers, at least in the Digital Humanities?
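To make the hard-disk layout from the user story more concrete, here is a minimal Python sketch of how such a collection might be indexed before any text is extracted. It assumes only what the account above states: one folder per publication, containing XML files with issue data, a JSON file with journal metadata and text files with the text. The function name `index_collection` and the reliance on file extensions alone are illustrative assumptions, not the actual e-periodica export format.

```python
from pathlib import Path
import json


def index_collection(root):
    """Map each publication folder to its metadata and data files.

    Assumed (hypothetical) layout: <root>/<publication>/ holding *.xml
    issue files, one *.json metadata file and *.txt text files.
    """
    index = {}
    for pub_dir in sorted(Path(root).iterdir()):
        if not pub_dir.is_dir():
            continue  # skip stray files at the top level
        json_files = sorted(pub_dir.glob("*.json"))
        # Treat the first JSON file as journal-level metadata, if present.
        metadata = {}
        if json_files:
            metadata = json.loads(json_files[0].read_text(encoding="utf-8"))
        index[pub_dir.name] = {
            "metadata": metadata,
            "issue_files": [p.name for p in sorted(pub_dir.glob("*.xml"))],
            "text_files": [p.name for p in sorted(pub_dir.glob("*.txt"))],
        }
    return index
```

Calling `index_collection("/path/to/disk")` would return one entry per publication folder. Even this small step has to happen before any clustering can start, which illustrates the kind of preparatory work the “data dilemma” refers to.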
Building Information Infrastructure – The Story of e-periodica.ch
At this point, let us shift perspective and take a look at the reasons for building e-periodica.ch and the path dependencies on the way to today’s platform. The platform was created as an initiative to unite the interests of libraries and archives: digitising their collections and providing them online. At ETH Library, the group responsible for the digitisation of its material is the DigiCenter. It was founded in 2008 and began digitising the ETH Library’s material with two scanners. Today, the DigiCenter operates on a much larger infrastructure, with 40 employees (most of them student assistants) and 16 scanners, all working towards 10 digitisation projects.
The evolution of the DigiCenter mirrors the technological development of digitisation as well as the increasing need for it. Back in 2012, Prof. Dr. Stefan Gradmann, former manager of the University Library at KU Leuven, already argued that the libraries of the future have to reorient themselves if they want to avoid being displaced by corporate information providers (Umlauf & Gradmann 2012). Gradmann also stated that libraries should establish themselves as partners for researchers.
This is very easily said but poses great challenges for libraries. Regina Wanger, Head of DigiCenter at ETH Library, says: “When we founded the DigiCenter and started to digitise our material, the conditions were very different from today. And so were users’ needs and, therefore, our goals.” In the beginning, as users sought to just read or view the information, the goal was “simply” to make the material available online – so that even rare or valuable pieces can be easily accessed from home.
Over time, users increasingly wanted to work with the data. “We receive an increasing number of inquiries concerning the use of images – which we provide, but it is a little more complicated than just making documents available because we have to clarify the copyrights,” says Regina Wanger. Moreover, in the past few years, some researchers, Phillip Ströbel among them, have requested access to data corpora. “For now, we address these requests individually since there is no defined procedure for this – yet.” The increasing involvement of users has therefore become a strategic goal for the ETH Library Lab. A project is currently being launched to simplify data access with the involvement of users.
Disruption requires time and infrastructure. (Schäfer 2021)
This change in user needs creates a scaling problem. As described by Borgman et al. (2015): “Managing research data is difficult, and making research data useful to unknown others, for unanticipated purposes, is far harder.” While researchers are approaching the limits of available tools and resources, libraries are struggling to provide the necessary infrastructure. At this point, it is important to take into account that infrastructure develops a lot more slowly than the technological innovations themselves.
Very much in line with the Silicon Valley approach “fail fast, fail often, fail forward”, the development of digitisation and its tools has accelerated enormously in the past decades (Schäfer 2021). However, sustainable innovation requires time and infrastructure. As information processors, libraries are expected to provide at least a part of this infrastructure. But – as Bowker et al. (2010, 103) put it – “when dealing with information infrastructures, we need to look to the whole array of organisational forms, practices, and institutions that accompany, make possible, and inflect the development of new technology, their related practices, and their distributions.” Providing the necessary infrastructure therefore cannot be seen as a task for one specific institution alone – namely libraries – but is part of a much broader societal change that develops slowly over time. Nevertheless, it is important for libraries to assume their role in this process in order to ensure meaningful, enduring and equitable access to information for all.
Transformative Innovation – Making Things Work, Together
At the same time, despite those challenges, the digitisation of data and its sources offers great opportunities for libraries. As they are most likely the institutions holding the largest data collections, their potential to make data digitally available is huge. Seeking to position itself more strongly on this topic, the ETH Library is currently working on a new project called “e-periodica – next level access”, which uses automated text enrichment and named entity recognition. “This offers new opportunities to shift focus and exploit the data to the fullest,” says Regina Wanger.
Getting back to Phillip Ströbel’s question about whether libraries should be the main data providers for researchers, Regina Wanger answers: “Yes, I completely agree. It certainly is the task of libraries and archives to make data available and to prepare it in the right format. But it is important to take into account the many hurdles that we still encounter.” Besides the fast pace of development of the technological tools and the related change in user needs, these hurdles include the lack of a legal framework as well as resource constraints. “These are factors that often are forgotten. Users tend to assume that everything that is online is free,” says Regina Wanger.
However, these hurdles are increasingly being identified, and the focus is shifting towards users and their needs. “The time has come for libraries to change their role from gatekeepers of knowledge to co-creators of knowledge,” says Phillip Ströbel. But this is not a task for libraries alone – researchers also have to contribute to this process. Most importantly, communication between libraries and researchers has to intensify so that libraries can understand and address the problems encountered by researchers more effectively. “Researchers should take projects like ‘e-periodica – next level access’ as opportunities to make their voices heard,” says Phillip Ströbel. Such projects offer the ideal platform to bring librarians and researchers together in order to create a product from which everyone can benefit. Examples like “e-periodica – next level access” show that libraries are open to change. Researchers should embrace this openness to help reshape the “knowledge environment” in libraries, collections and archives alike.
Borgman, CL, Darch, PT, Sands, AE, Pasquetto, IV, Golshan, MS, Wallis, JC, & Traweek, S. (2015). Knowledge infrastructures in science: data, diversity, and digital libraries. International Journal on Digital Libraries, 16(3-4), 207-227. http://dx.doi.org/10.1007/s00799-015-0157-z Retrieved from https://escholarship.org/uc/item/32q2z1c9
Bowker G.C., Baker K., Millerand F., Ribes D. (2010) Toward Information Infrastructure Studies: Ways of Knowing in a Networked Environment. In: Hunsinger J., Klastrup L., Allen M. (eds) International Handbook of Internet Research. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9789-8_5
Press, Gil. (2016) Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says. Forbes. https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/?sh=3fcc2d3f6f63
Schäfer, Rene. (2021) Das Paradox der Innovation — Warum Innovation längst nicht mehr das ist, was sie einmal war. Medium. https://medium.com/competence-center-for-transformative-innovation/das-paradox-der-innovation-warum-innovation-nicht-mehr-das-ist-was-sie-einmal-war-fe96322a0e22
Umlauf, K & Gradmann, S. (2012) Die Bibliothek der Zukunft. In: Handbuch Bibliothek, 387-397, J.B. Metzler, Stuttgart. https://link.springer.com/book/10.1007/978-3-476-05185-1