As a passionate data scientist, I aim to show people in a playful way what data is and what we can do with it. This blog post offers insight into how I pursued this goal during my Innovator Fellowship at the ETH Library Lab by studying the status quo of data science in the public sector and developing a tool kit design concept for elementary school students.
The Data in Data Science
Data is everywhere. Every swipe on a touch screen, every push of a button, every message sent, and every image taken results in data being saved somewhere. This data, together with data from other sources such as electronic sensors, forms the basis for machine learning and artificial intelligence. In May 2018, Forbes reported that we generate about 2.5 quintillion bytes of data every day. In US usage, a quintillion is a one followed by 18 zeros, so this amounts to roughly 2.5 exabytes per day. Data at this scale is what we refer to as big data.
This vast amount of data in combination with increasing computational power has enabled us to compute things previously only seen in science fiction. But in order to take full advantage of this winning combination of data and computing, we must understand what it is that we are computing. We need to understand the data’s stories and know which questions we are attempting to answer in order to evaluate the usefulness of calculations, visualisations, and more. This is why, in 2020, I joined the ETH Library Lab with the goal of developing a design concept for a tool kit that leverages storytelling approaches to make data science communication more intuitively understandable.
First Things First: Bursting the Target Group Bubble
Initially, the only constraint I had placed on the potential target group was for it to be from the public sector. Hence, my first task was to narrow down the search and identify a distinguishable user group. The public sector presents a great opportunity because whilst the business sector prolifically develops, employs, and sells products based on machine learning, the public sector is falling behind (see Margetts and Dunleavy, 2013). Moreover, it is the public sector that is responsible for regulating the use of technologies and educating the general public.
The public sector in Switzerland is the second largest employer in the country and spans administration, healthcare, public utilities, education, the court system, the military, the police, politics, archives, and everything in between. Of course, every one of these areas generates and uses data in different ways. This presented a challenge when trying to narrow down the user research. Even within one department, I noticed a lack of standardisation and continuity regarding data collection and utilisation. This would have made it almost impossible to come up with a scalable and sustainable tool kit if I were to focus my design on only one specific department.
Then I came across an unexpected user group: teachers. In 2020, the new curriculum, Lehrplan21, came into full effect. Taking into account current developments in research, it introduced the new subject ‘media and informatics’, which also sets out data science learning objectives. Lehrplan21 thus continues a development that can already be observed in university curricula. For example, machine learning methods, such as regression models, are already taught at bachelor level across various disciplines.
But how can teachers at lower grade levels, who have not had any specific schooling in data science, cover such learning objectives? The educational programs for elementary school teachers include mandatory and valuable courses on teaching media and informatics classes. Yet, because the job is demanding and the working hours are long, these courses are limited in duration and therefore focus merely on the basics of computer science and new media. Consequently, when I approached the program providers as well as the teachers themselves, my idea of developing a tool kit to help them learn and teach data science was met with clear interest and strong enthusiasm. Given this apparent need and the potential for scalability, I ultimately decided to narrow down my user group to primary school teachers and their students instead of a specific department in the public sector.
To create the tool kit, I first needed to find a way to consolidate and synthesise my findings. To that end, I wrote a report on the many things I had learned about the status quo of data science understanding and applications in the public sector. In it, I expressed the vision for libraries to take their rightful place as data centres in the fast-moving, data-driven digital world. Throughout the writing process, it became increasingly clear to me that data quality and accessibility are still two of the biggest issues concerning the utilisation of big data in practice. Since collecting information, storing it, and making it accessible is precisely the expertise and responsibility of libraries, I have come to believe that libraries should adopt a more proactive approach to becoming ‘data hubs.’
Having completed this report in which I have – among other things – tried to depict the vast potential for scientific libraries in the data science landscape, my next step was to design a concept for a tool kit.
The Tool Kit: Design Concept
Have you ever noticed that – besides a few exceptions – almost all visual information entities are made up of rectangles? The examples are numerous: books, sheets of paper, screens, folders, and much more. Tables are rectangles, and digital visualisations are also saved in rectangular formats. Even network graphs are most often stored as a list of edges in a two-column table, which is – again – a rectangle divided into multiple smaller rectangles. The same holds for matrices: they are written as one big entity containing multiple elements. Interestingly, digital images share the same property. This commonality serves as the foundation of my design concept.
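To illustrate the point about network graphs, here is a minimal Python sketch of an edge list stored as a two-column table. The names and the helper function `neighbours` are invented for illustration and are not part of the tool kit.

```python
# Hypothetical example data: a tiny friendship network stored as an edge list.
# Each row is one edge; the two columns hold the nodes that the edge connects.
edges = [
    ("Anna", "Ben"),
    ("Anna", "Cem"),
    ("Ben", "Dora"),
]


def neighbours(edge_table):
    """Rebuild the network (who is connected to whom) from the two-column table."""
    adjacency = {}
    for source, target in edge_table:
        # The graph is treated as undirected, so each edge is stored both ways.
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


adjacency = neighbours(edges)
print(sorted(adjacency["Anna"]))
```

The whole network is recoverable from a plain rectangle of rows and columns, which is the property the design concept builds on.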
My idea was to start with a physical prototype, so that elementary school students develop an intuition for the relationship between data and reality (see image on the left) as well as for statistical concepts such as frequency and correlation (see image on the right). Moreover, I wanted to provide them with a rudimentary understanding of data visualisation. Therefore, in keeping with the rectangular shapes common in data science, my physical prototype consists of a rectangular box, which is further divided into five-by-five smaller rectangular partitions. Some of the partitions are movable hatches that can be set up or down. When the hatches are up, the data-cubes can be placed on top of them. This allows for an intuitive understanding of scatter plots and the relationship between the x- and y-axes. Conversely, when the hatches are down, the data-cubes can be slid down the columns, as shown in the image on the right. This feature provides intuition for histograms and frequencies. Moreover, the box can be flipped so that the empty lines are horizontal. In this case, entire rows of data-cubes can be slid into the box, providing students with an intuition of a single sample, or entry, in a data table. Such a row entry is shown in the image on the left.
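The two modes of the box can be mimicked in a few lines of Python. This is only a sketch under my own assumptions: the function names, the text rendering, and the example values are hypothetical and do not describe the actual prototype.

```python
SIZE = 5  # the box is divided into five-by-five partitions


def empty_box():
    """Return an empty 5x5 grid; row 0 is the bottom of the box."""
    return [[" " for _ in range(SIZE)] for _ in range(SIZE)]


def place_cube(box, x, y):
    """Hatches up: a data-cube rests exactly where it is placed (scatter plot)."""
    box[y][x] = "#"


def drop_cube(box, x):
    """Hatches down: a data-cube slides to the lowest free cell of column x (histogram)."""
    for y in range(SIZE):
        if box[y][x] == " ":
            box[y][x] = "#"
            return y
    raise ValueError(f"column {x} is full")


def column_heights(box):
    """Count the data-cubes in each column, i.e. the frequencies."""
    return [sum(1 for y in range(SIZE) if box[y][x] == "#") for x in range(SIZE)]


def render(box):
    """Draw the box as text, top row first."""
    return "\n".join("|" + "".join(row) + "|" for row in reversed(box))


# Scatter plot: each (x, y) pair is one invented observation.
scatter = empty_box()
for x, y in [(0, 1), (1, 2), (3, 4)]:
    place_cube(scatter, x, y)

# Histogram: each invented value selects the column the cube is dropped into.
histogram = empty_box()
for value in [0, 2, 2, 2, 4, 4]:
    drop_cube(histogram, value)

print(render(scatter))
print(render(histogram))
print(column_heights(histogram))
```

In the scatter mode the position of a cube encodes two values at once, whereas in the histogram mode only the column matters and the height of each stack shows how often a value occurs.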
Primary school students can solve the first tasks with the physical prototype in small groups. In a second step, the prototype will be mimicked digitally, so that the same tasks can be solved with either the physical prototype or the digital version. More complex problems can then be approached with the digital version of the tool kit, which will have three windows or tabs. The basic concept is shown below.
The left window shows the code, the middle window displays the data table or data set, and the right window visualises the data. The information provided by these three windows is connected and synchronised such that whenever a change is made in one of the windows, the other windows change as well. These changes are then highlighted. The idea behind this mechanism is that students understand how a modification of the code causes a concrete change in the data set and its visualisation – and the other way around. However, they are not actually required to adjust the code, as every task is achievable using clicks and manual changes to the data set and the visualisation.
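One way such synchronisation could work is to let all three windows observe a single shared data set. The following Python sketch shows that idea only; the class names, the render functions, and the text-based chart are assumptions made for illustration, not the design of the actual tool kit.

```python
class SharedDataSet:
    """Single source of truth that all three windows observe."""

    def __init__(self, values):
        self.values = list(values)
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def set_value(self, index, new_value):
        self.values[index] = new_value
        for listener in self._listeners:
            listener(self, index)  # tell every window which entry changed


class Window:
    """A window re-renders on every change and remembers what to highlight."""

    def __init__(self, name, render):
        self.name = name
        self.render = render
        self.text = ""
        self.highlighted = None

    def refresh(self, data, changed_index=None):
        self.text = self.render(data.values)
        self.highlighted = changed_index


def render_code(values):
    return f"data = {values}"


def render_table(values):
    return "\n".join(f"{i}\t{v}" for i, v in enumerate(values))


def render_chart(values):
    return "\n".join("#" * v for v in values)  # one text bar per value


data = SharedDataSet([3, 1, 2])
windows = [
    Window("code", render_code),
    Window("table", render_table),
    Window("chart", render_chart),
]
for window in windows:
    window.refresh(data)            # initial rendering, nothing highlighted
    data.subscribe(window.refresh)  # keep the window in sync from now on

# A manual edit in the table window: entry 1 changes from 1 to 4.
data.set_value(1, 4)
for window in windows:
    print(window.name, "highlights entry", window.highlighted)
    print(window.text)
```

Because every window is derived from the same data set, it does not matter whether the change originates in the code, the table, or the visualisation; the other two views follow and can highlight the affected entry.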
The tool kit provides various difficulty levels. Starting with a small amount of data and a simple data visualisation, the size of the data set grows, and the complexity of the data visualisation increases with every level. The bigger the data set and the more complex the visualisation gets, the more inconvenient the manual method becomes.
A New Approach to Data Science
In developing this tool kit, I have not pursued a purely technological approach, but have aimed to combine the theoretical and practical aspects of data science by including physical and digital as well as social learning components. In my view, such a practical approach is crucial for us as a society to effectively understand and apply data science in our everyday lives.
In my current position as a Ph.D. fast track student at the People and Computing Lab at the University of Zurich, I aim to further develop the idea that I initiated at the ETH Library Lab. I am currently thinking about how to turn the digital tasks into interactive tasks that students solve together. To this end, I very much look forward to collaborating with teachers and students to develop a working prototype. For me as a Ph.D. student, it is exciting to see how today’s research influences tomorrow’s school curricula. Being part of this process myself, and given the rapid acceleration of knowledge growth, I would even say that teaching tools co-developed by domain experts will be indispensable for high-quality education in the near future. I therefore want to underline that there is a clear need for targeted support of teachers facing the challenge of transferring contemporary knowledge to the next generation.
Looking back on my experience of developing a data literacy tool kit, I learned some important lessons. While my previous studies focused on the theoretical aspects of data science, my fellowship at the ETH Library Lab allowed me to gain a deeper understanding of data science in the “real world” and to realise what the actual needs of people – and society as a whole – are when it comes to data science. I learned that reality is messier than theory and that decisions cannot always be explained in numbers. Having started my journey at the ETH Library Lab as a data scientist and economist, I left it as a humble qualitative researcher with big dreams of making data science accessible to everyone.