As a group working on diverse machine learning (ML) and data science related projects, we are in the thick of developing and collaborating on projects in this field. At the same time, we keep an eye on the latest developments in the tech industry with the goal of understanding their relevance for public sector institutions like scientific libraries, collections and archives.
In this two-part article we list our top 6 ML and data trends that we have observed in the last year (after this be sure to check out trends 4-6 in part 2!). While some of the trends have been around for some time and are showing a resurgence, others are surfacing for the first time.
Note that these trends are not the result of an extensive survey or focus group, but rather our thoughts on the indicators we have observed, ranging from perspectives gained from new work practices to technological advancements. The latter oftentimes also raise ethical questions, as with the rapid rise of computer-generated media.
Trend 1: It’s Still All About the Data
The proclamation that “data is the new oil” was a bold statement at the time (by Clive Humby back in 2006).* The follow-up caveat “if unrefined, it cannot really be used” probably offers more practical guidance on a day-to-day basis; nevertheless, the reality is that collecting data (refined, good, bad, clean, dirty or otherwise) is the first unavoidable step in the long journey to implementing more data-driven processes. Unfortunately, we have found that data collection is a step that still gets overlooked at times. Once collected, data can be moved and stored where it’s needed, cleaned and transformed, and used to add value.
You might be thinking: What’s new here? Generating insights from customer data is of course already well established, particularly in retail, but with the rise of the tech industry and the wider adoption of ML this practice was kicked into overdrive. For several years it seemed (and probably still seems) that the objective was set firmly on vacuuming up as much user-generated data as possible, feeding it into the best available algorithm and retraining the system as soon as more data had been collected.
Gathering data is still important, but recently there has been a shift to focusing on data quality over quantity, which heralds good things for smaller companies and research groups.
*Now this metaphor could be pushed even further, with new parallels to the negative effects of exploitative data collection.
Increasing Focus On Data-Centric Machine Learning
There is growing attention placed on the quality of data used to train ML models as opposed to the formulation of the model or algorithm itself. This revised focus is evidenced by several developments in the past year:
- Stanford HAI and the ETH AI Center hosted the first edition of the “Data-Centric AI Workshop” (free to view online at Data Centric AI)
- Andrew Ng, possibly the most well-known AI educator and practitioner, launched the Data-Centric AI campaign and a new training course (Andrew Ng Launches A Campaign For Data-Centric AI)
- Google released a paper earlier this year detailing empirical evidence of “data cascades”, their term for the downstream negative effects of poor data quality in ML applications (particularly in critical domains like healthcare and the environment)
Typically, data scientists would be given access to a dataset that they want to use to make inferences or predictions on new data in the future. Quite often the original data is too large, so they would take a sample, label it (or, more commonly, pay to have it labelled), and divide the labelled data into sets for training, testing and validation.
From this point, data scientists can train and test many different types of ML models. Once the most promising model type is determined, they would often try to eke out a few more percentage points of performance by testing different ranges and combinations of hyperparameters (hyperparameter tuning), slightly altering the neural network’s architecture, testing different preprocessing parameters, and so on. This is referred to as a model-centric approach.
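As a minimal sketch of this model-centric loop, the example below uses a hypothetical one-hyperparameter “model” (predict class 1 when the input exceeds a threshold) and toy data; a real project would use a library such as scikit-learn and tune many hyperparameters at once.

```python
# A toy "model" with a single hyperparameter: predict class 1 when x > threshold.
def accuracy(threshold, examples):
    """Fraction of examples the threshold model classifies correctly."""
    return sum((x > threshold) == bool(y) for x, y in examples) / len(examples)

# Hypothetical labelled data: inputs in [0, 1], label 1 when x > 0.5.
data = [(i / 20, int(i / 20 > 0.5)) for i in range(20)]

# Split the labelled data: even indices for training, odd for testing
# (a separate validation set would normally be held out as well).
train, test = data[0::2], data[1::2]

# Model-centric loop: grid-search the hyperparameter on the training set...
candidates = [i / 10 for i in range(1, 10)]
best = max(candidates, key=lambda t: accuracy(t, train))

# ...then report held-out performance once, on the test set.
print(f"best threshold: {best}, test accuracy: {accuracy(best, test):.2f}")
# → best threshold: 0.5, test accuracy: 1.00
```

The data-centric argument in this section is that, for most real projects, effort spent cleaning and relabelling `train` moves the final number more than widening the `candidates` grid does.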
“Instead of just iterating on the ML model, spend more time iterating on the training data quality.”
While large tech companies typically try to compensate for noisy data by adding millions of additional training examples, the majority of machine learning use cases actually have a relatively low number of examples available. Here the quality and consistency of the labels have a major influence. Compared to optimizing the parameters of a model, optimizing the quality of the training data can often have a bigger influence on the final accuracy. This is particularly important for crowdsourcing and citizen science initiatives, where volunteers may label and annotate datasets in different ways.
Labelling plant fruits: different people may have different approaches. Unless there is a specific need, the most important thing is to label pictures in a consistent way. (Original image: Z-000105884 by United Herbaria Z+ZT / CC BY 4.0)
While potentially costly for large scale usage, services which automate the ML process are increasingly accessible and offer most individuals the opportunity to train a model on their own data. As more people use these tools, there will be a growing need to guide them through the process of screening their data and assisting them with common errors such as incorrectly labelled training examples, or inconsistency between different labels. As Andrew Ng puts it, data is food for machine learning models. Feeding an ML model with higher quality data is one of the important ways to ensure its performance.
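A first-pass screen for the labelling problems mentioned above can be as simple as grouping annotations per item and flagging disagreements. The sketch below uses hypothetical crowd-sourced annotations; real projects might go further with inter-annotator agreement metrics such as Cohen’s kappa.

```python
from collections import Counter, defaultdict

# Hypothetical crowd-sourced annotations: (item_id, annotator, label).
annotations = [
    ("img1", "ann_a", "fruit"), ("img1", "ann_b", "fruit"),
    ("img2", "ann_a", "fruit"), ("img2", "ann_b", "flower"),
    ("img3", "ann_a", "leaf"),  ("img3", "ann_b", "leaf"),
]

# Group all labels assigned to each item.
by_item = defaultdict(list)
for item, _, label in annotations:
    by_item[item].append(label)

# Flag items whose annotators disagree - candidates for review or relabelling.
conflicts = {item: Counter(labels) for item, labels in by_item.items()
             if len(set(labels)) > 1}
print(conflicts)  # only img2 is flagged for review
```

Surfacing even this simple list of conflicts to volunteers, alongside clear labelling guidelines, goes a long way towards the label consistency that data-centric ML depends on.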
In the context of systems that need to be reliable and fair, quality is not just a measure of how many values are missing or how consistent the labels are, but also a question of how appropriate the dataset is for the selected domain.
Organisations Are Giving Away Their Code… But Not Their Data
While it has become much more common for companies to release parts of their code and even ML models as open source, data remains the closely guarded resource that companies keep to themselves. Naturally, in most industries there are barriers to releasing data publicly without infringing on GDPR and privacy laws, but fundamentally there is just very little incentive to do so. Companies collect ever-increasing amounts of data on their users’ behaviors and interactions with their products because understanding users’ wants, likes and frustrations gives them the power to fine-tune their product in the direction that facilitates the most growth.
Even in the public domain where public institutions are tasked with storing and providing open access to data, there is trepidation around sharing data completely openly. This is understandable as ‘giving away’ all of their carefully curated records is counterintuitive for many. Certainly it will take some time for cultures and mindsets to adapt and yet, it is a subject which we will increasingly need to address as a society.
Data as far as the eye can see. Some of the archives at ETH Library.
Perhaps improving the visibility of provenance and properly crediting data sources will make data providers more comfortable with the thought (look out for more details in our upcoming description of our ongoing Data-DJ project). In any case, what is clear is that by providing fast and convenient access to data, new and previously unimagined research and tools can be realized within increasingly shorter timeframes.
The same is true in science, research and education where it is hard to find validated datasets that are both publicly accessible and useful for new research. Perhaps if there was more of an incentive for researchers to release their data earlier, before definitive conclusions have been drawn, we could see an increase in the pace of new discoveries and more collaborations. At the very least a reduction in the duplication of work seems highly likely.
Trend 2: ML Feature Management and Reuse (Adoption of the Feature Store)
Data scientists and ML engineers spend a lot of time transforming data from raw sources into usable inputs for their models. This process, called feature engineering, turns raw records into informative signals for training ML models. It is a vital part of the ML process and can require substantial domain expertise and mathematical understanding.
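As a toy illustration of feature engineering (the raw events and feature names below are hypothetical), the sketch aggregates raw interaction logs into per-user features a model could train on:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events: (user_id, action, seconds_on_page).
raw_events = [
    ("u1", "view", 12.0), ("u1", "click", 3.5), ("u1", "view", 40.0),
    ("u2", "view", 8.0),  ("u2", "view", 9.5),
]

# Group the raw records by user.
grouped = defaultdict(list)
for user, action, secs in raw_events:
    grouped[user].append((action, secs))

# Feature engineering: turn each user's raw history into model-ready signals.
features = {
    user: {
        "n_events": len(evts),
        "click_rate": sum(a == "click" for a, _ in evts) / len(evts),
        "avg_seconds": mean(s for _, s in evts),
    }
    for user, evts in grouped.items()
}
print(features["u1"])
```

Every team that rebuilds transformations like these from the unaltered data is duplicating work, which is exactly the problem the feature store addresses.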
An issue naturally arises as more groups in an organisation use ML in their work: each team begins by creating its own data extraction and feature engineering pipelines, starting each time from the unaltered data. In recent years, organisations have been finding that they can gain efficiency by sharing these ‘ML ready’ features and making them easily accessible across different functional teams. This reduces repeated work and errors due to inconsistencies, enabling faster experimentation cycles and reuse of the most insightful features in the ML models of multiple teams.
Feature Store in context: built on top of existing data storage solutions, feature stores allow transformed ‘ML ready’ features to be stored for model training (’offline’, higher latency) or live model serving (’online’, low latency). ML Engineer Guide: Feature Store vs Data Warehouse
Known as the feature store, it is one of the latest additions to the long list of tooling used to manage the ML engineering process more effectively. Its main characteristic is the presence of two databases, optimised respectively for model training and model serving/inference. The “offline” database stores large quantities of transformed, engineered features, delivering the high throughput needed for training along with the ability to ‘time travel’, allowing previous states of the data to be replicated. Meanwhile, the “online” database stores the smaller quantity of current data needed to produce predictions on live requests. This database is optimised to return values extremely quickly, minimising the overall time needed for the model to make a prediction.
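The two-database idea can be sketched in a few lines of pure Python (a toy stand-in for real systems such as Hopsworks): the offline side keeps the full timestamped history for training and ‘time travel’, while the online side keeps only the latest value per entity for low-latency serving.

```python
import bisect
from collections import defaultdict

class ToyFeatureStore:
    """Toy feature store: timestamped history ('offline') plus latest values ('online')."""

    def __init__(self):
        self.offline = defaultdict(list)   # entity -> sorted [(ts, value), ...]
        self.online = {}                   # entity -> latest value

    def write(self, entity, ts, value):
        bisect.insort(self.offline[entity], (ts, value))
        self.online[entity] = value  # assumes writes arrive in timestamp order

    def get_online(self, entity):
        # Low-latency lookup used when serving a live prediction request.
        return self.online[entity]

    def get_as_of(self, entity, ts):
        # 'Time travel': the value that was current at time ts (for building training sets).
        history = self.offline[entity]
        i = bisect.bisect_right(history, (ts, float("inf")))
        return history[i - 1][1] if i else None

store = ToyFeatureStore()
store.write("user_1", ts=1, value=0.2)
store.write("user_1", ts=5, value=0.9)
print(store.get_online("user_1"))    # 0.9
print(store.get_as_of("user_1", 3))  # 0.2
```

The `get_as_of` lookup is what prevents training-time data leakage: a training example built for time 3 sees 0.2, not the later value 0.9.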
The feature store part of ML infrastructure is becoming common at digital-native companies (think Uber or Airbnb) that have invested significant amounts of time in developing their own bespoke AI tools (Michelangelo and Bighead respectively). Open source solutions also exist and are growing in popularity (Hopsworks, for example). It may be several years before these systems are adopted in academic circles (if at all), but they certainly have the potential to create a competitive advantage at a departmental or institutional level.
While reproducible research, with the ability to access the raw data and repeat the entire training and inference process, should remain an initial goal, feature stores are an exciting prospect for facilitating more sharing of prepared, ML-ready data. This could allow additional time for more experimental hypotheses, faster iteration or new combinations of inputs, all contributing to delivering new insights.
Trend 3: The Importance of Being Able to Deploy Quickly
Being able to quickly deploy new projects and application features is key for organisations looking to succeed in a data-driven context. We’ll discuss how this ability relates to starting projects, growing projects and data science projects specifically.
In the early stages, it is frequently difficult to communicate the heart of a project effectively. What is it really about? Why must it exist? Which problem does it solve, which need does it concretely address? Verbalizing clear and concise answers can be tricky, even more so after 18+ months of two-dimensional meetings and presentations, where a project’s key message can sometimes fail to translate from the slides. From personal experience, it is most refreshing when people can get hands-on with an application, or at least be guided through a live demonstration. Being able to quickly deploy prototypes for this purpose leads to more engagement and surfaces misunderstandings early on, so they can be openly addressed and solved.
Testing out the features of our open image search application in the Graphische Sammlung at ETH Zürich (Maya&Daniele Fototeam, Zurich)
Some early decisions can make this goal of faster deployment easier. For instance, if you use a popular framework**, it will be easier to find developers to work with, and the learning curve will be gentler for your existing developers. Choosing the latest framework with a lot of media hype might seem exciting, but it’s usually not needed. The early stages are about turning an idea into something functional; if very few people have used your chosen framework, you risk having very little support and may end up spending unnecessary time figuring out issues for yourself, distracting you from your original goal.
** For example, the most commonly used frameworks for frontend web development would be React, Vue and Angular, while for backend development they could be Django, Flask, Express, Gin etc. It’s like using Microsoft Word for writing a document: whatever you are trying to add to your text in terms of layout or format, it’s highly likely that someone else has already done it and written a guide to explain the steps.
“I looked at how we were thinking about developer productivity and our environment. What are the things that can help our team move really fast and ship really fast? Because I think that is the name of the game when you’re talking about a startup. It just comes down to how you can get your code out the door as quickly as possible.”
Jill Wetzler, VP of Engineering at Pilot (from How do you select the right tech stack? | TechCrunch)
Over-Focus on Optimisation and Scalability
The proliferation of open source software has done wonders for accelerating the time it takes to develop projects. However, many of the open source libraries and tools have been developed by large companies to solve their own issues. In our experience, there is often too much focus placed on making an application fast or scalable, and not enough emphasis on finding the right users who will really benefit from and want to use the application.
We cannot stress enough just how important it is to first spend time consciously deciding what you want to develop. Ask yourself: Is this just a prototype for demonstrating a concept and gauging interest or is this already something that needs to be built to be production ready from the start?
Acknowledging that the application would be significantly rebuilt if it were adopted on a larger scale opens up many opportunities for reducing complexity and leaves more time for creative development and user feedback.
Data Science Projects
Projects involving data science and ML methods inherently involve more uncertainty than traditional web/software development projects. In most organisations, newly created data science teams are under pressure to quickly show enough benefit to justify the investment in creating them. Making new ML-backed tools available to people early helps deliver that value. Once people see what is possible with one tool, it often inspires ideas for other tools and improvements.
Making the application accessible in the early stages of a project also makes it possible to gather real user input. More often than not, users will use the application in ways you didn’t expect, or the real-world data will differ in some way from the training data that was originally available. Encountering these issues and solving them sooner usually provides more value than spending additional time making incremental improvements to a model’s accuracy.
These aspects are somewhat within the control of the data scientist. Another way to help data science teams deploy sooner is to remove barriers to accessing quality data in the first place. This is particularly important for publicly funded projects, where budgets and time can be more limiting factors. Solving data access and formatting issues can quickly eat up a lot of resources, in both time and finances. Faster access to data helps set up new data science projects for success, freeing up time for experimentation, more thorough testing and exploration of potential future product ideas.
In countless interviews with startup founders, CEOs and tech leaders, a product’s success is attributed to the original team’s ability to listen to potential users and react quickly by releasing improvements. The now-clichéd startup/disrupter mantra that you should “move fast and break things” sounds immature at this point. Of course, teams need the ability to develop things quickly, but they need to be able to fix things quickly too. Fixing things quickly (or catching mistakes with solid automated testing) can give the whole team the confidence to contribute to a new feature or fix an existing issue, and bottlenecks caused by relying on a single person can be avoided. It also gives people the opportunity to be much more experimental in developing new features, feeling somewhat safe that any issues they create will be caught before the development is released.
Shipping things quickly is also a key driver that Instagram founder Kevin Systrom credits for the platform’s success. Investing in good test coverage allowed everyone on the early team to make changes in every part of the codebase. Public sector projects could benefit enormously from this kind of approach, as they are particularly vulnerable to reliance on one individual. If the lead developers leave the project, having good tests makes it easier for someone new to take over and gives them the confidence to start developing new features quickly.