In case you missed it, be sure to check out part 1 of our series, in which we discussed the importance of data quality, the growing usage of machine learning feature stores, and why it is essential to be able to deploy applications quickly. In this article we will discuss the remaining three trends from our top six machine learning and data trends of 2021 and explore the opportunities and challenges they present to scientific libraries.
Trend 4: Managing (and Automating) the Machine Learning Pipeline Is Improving, but It Still Needs Some Work
After the initial explosion of interest in data science, the field is starting to mature. In particular, the tooling used in the profession has improved, and over the last year we have also observed the emergence of more standardised workflows and best practices. Nonetheless, reproducing the machine learning process from end to end continues to present challenges in practice.
In comparison to traditional software development, reproducing the ML process from end to end is challenging because the output depends not only on the written code, but also on the state of the data at the time the code was run. This additional layer of complexity means that a change to either part, the code or the data, is almost certain to change the results.
Machine Learning = CODE + DATA
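One practical consequence of this equation is that a training run can only be reproduced if the exact version of both ingredients is known. A minimal sketch of this idea in Python (the helper names and file layout are ours, for illustration only) fingerprints the code and the data together, so any change to either one yields a new run identifier:

```python
import hashlib
from pathlib import Path

def fingerprint(path: str) -> str:
    """Return a short SHA-256 digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def run_id(code_file: str, data_file: str) -> str:
    """Identify a training run by hashing both code and data.

    If either the training script or the dataset changes, the run id
    changes, so results can always be traced back to the exact
    code + data combination that produced them.
    """
    return f"{fingerprint(code_file)}-{fingerprint(data_file)}"
```

Tools such as DVC and MLflow build on essentially this idea, versioning data alongside code rather than code alone.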
Managing the cycle of updating the data, the code, and the models takes up a large portion of the time in ML projects. This is evidenced by the large difference in effort between training a model once and creating a system that can ingest new data, retrain and deploy new models, and maintain reproducibility throughout the entire process.
Elements of the ML engineering process for an established or well-supported team's application. Note that model training is just one step! (image credit: A. Burkov, fig. 1.4, 'Machine Learning Engineering', 2020)
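To make the scale of the automation task concrete, the cycle described above can be caricatured as a handful of pipeline stages with a quality gate before deployment. This is only an illustrative sketch with stubbed data and a toy one-parameter model, not a production pipeline:

```python
def ingest() -> list[tuple[float, float]]:
    """Pull the latest labelled examples (stubbed here)."""
    return [(float(x), 2.0 * x) for x in range(10)]

def train(data: list[tuple[float, float]]) -> float:
    """Fit a trivial one-parameter model y = w * x by least squares."""
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, _ in data)
    return num / den

def evaluate(w: float, data: list[tuple[float, float]]) -> float:
    """Mean absolute error of the fitted model on the data."""
    return sum(abs(y - w * x) for x, y in data) / len(data)

def pipeline(max_error: float = 0.1) -> dict:
    """Ingest -> train -> evaluate -> deploy, with a quality gate:
    a new model is only 'deployed' if its error is acceptable."""
    data = ingest()
    w = train(data)
    error = evaluate(w, data)
    if error <= max_error:
        return {"deployed": True, "weight": w, "error": error}
    return {"deployed": False, "error": error}
```

Even in this toy form, the training call is one line out of the whole pipeline; the surrounding ingestion, evaluation and gating logic is where most of the engineering effort goes.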
Fortunately, many different enterprise tools are emerging to help with this process, and they are getting easier to use for non-experts and also offer faster prototyping for experts. The downside, however, is the potential expense, particularly for those who are new to cloud computing services.
One of the Library Lab's projects, A.I.D, has been developing tools to make it easier for developers and non-experts to find and implement machine learning models in their applications, and for data scientists to easily package and share their trained models. Facilitating this process is an exciting prospect for scientific libraries, as it could allow researchers to access and quickly apply existing machine learning models in their field without getting bogged down in the less interesting implementation details.
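As an illustration of what "packaging and sharing a trained model" can mean at its simplest, the sketch below bundles a model object with a JSON metadata sidecar describing how it was produced. The function names are ours, not A.I.D's API, and real projects would typically reach for dedicated formats and tools such as ONNX or MLflow:

```python
import json
import pickle
from pathlib import Path

def package_model(model, metadata: dict, out_dir: str) -> Path:
    """Bundle a trained model with the metadata needed to reuse it.

    Writes two files: the pickled model itself, and a JSON sidecar
    recording provenance (training data, version, etc.).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "model.pkl").write_bytes(pickle.dumps(model))
    (out / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return out

def load_model(model_dir: str):
    """Load a packaged model together with its metadata."""
    d = Path(model_dir)
    model = pickle.loads((d / "model.pkl").read_bytes())
    metadata = json.loads((d / "metadata.json").read_text())
    return model, metadata
```

Keeping the metadata next to the model artefact is the key habit here: whoever downloads the package can see what data and code produced it without asking the original author.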
Trend 5: Interest in AI Continues to Grow
It’s no surprise that interest in and adoption of AI technologies continue to grow. As AI and ML become easier to implement, and people become more comfortable with the topic, many institutions and companies are already using AI in some way or planning to do so within the next year.
While there is still a large skills gap in the employment market, the education and research sector is growing to catch up with demand. This can be seen at ETH Zurich, for one, where the number of students enrolling in machine learning related topics has increased sharply over the past few years. At the same time, at least one professor from each of the university’s 16 academic departments is affiliated with the recently founded ETH AI Center. This reflects the fact that AI is now a core part of research across practically every discipline.
Uptake in Artificial Intelligence courses at ETH Zurich (source: ETH AI Center)
In such a young and rapidly developing field, it is only in the last couple of years that the combination of achievability and reliability has reached a point where small teams, or even individuals, can develop ML-backed applications by themselves. This is a sea change from the past, when the sheer workload required to bring the technology online placed such projects out of reach for many organisations.
Hinting at the maturing nature of the field, this year for the first time several image classification models achieved a top-1 accuracy score of over 90% on the benchmark ImageNet dataset (ImageNet Benchmark (Image Classification) | Papers With Code). When it comes to applying machine learning systems in practice, there is a tipping point that must be reached before an ML system is sufficiently reliable and consistent to be used in a practical way. The recent improvements in accuracy signal that ML systems have now passed this point in many industries: it is no longer only early adopters and experimenters incorporating ML into their business processes, and the technology is ready for more widespread adoption.
Collaboration between academia and industry is particularly important in the field of AI, where the pace of change is fast and potential research topics often arise from practical application. In Switzerland in 2019, AI publications involving academic-corporate partnerships had a relatively high impact, being cited on average more than twice as often as is typical in the field. Given the efforts in the last year to strengthen industry relationships at ETH Zurich and elsewhere, the importance of these collaborations will likely continue to grow. Indeed, given the number of professors and PhD students moving from academia to industry, the line between the two will become increasingly blurred.
“In 2019, 65% of graduating North American PhDs in AI went into industry—up from 44.4% in 2010, highlighting the greater role industry has begun to play in AI development.”
Peer-reviewed AI publications’ field-weighted citation impact (FWCI)** and number of academic-corporate peer-reviewed AI publications (source: Elsevier/Scopus, 2020, chart: 2021 AI Index Report)
** Field-Weighted Citation Impact is the ratio of the total citations actually received by the denominator’s output, and the total citations that would be expected based on the average of the subject field.
What is Field-weighted Citation Impact (FWCI)? – Scopus: Access and use Support Center (elsevier.com)
Trend 6: Everything is Generative
One of the key takeaways from this year’s AI Index report from the Stanford Institute for Human-Centered AI is that the generation of synthetic media, also referred to as deepfakes, is on the rise. Moreover, in many sectors it has now reached a performance level where most humans have difficulty differentiating between computer-generated synthetic media and media created by humans [AI Index].
When asked to comment on the topic of “the rise of synthetic media in online content”, one expert with first-hand experience of the topic provided this view:
Today, synthetic media are used to support a wide range of applications, from mass corporate communications presentations to one-to-one interactions with artificial intelligence-powered virtual assistants. Companies use a combination of artificial intelligence and machine learning techniques to create synthetic voices and videos, including generative adversary networks (GANs). Using the same GANs to detect fake images based on synthetic video image training data generated with existing tools (such as FaceForensics database or others). [Sources: InterDigital & Plug and Play, Kyle Wiggers, Falk Rehkopf]
The most efficient GANs can create lifelike portraits of non-existent people or even imaginary apartment buildings. For example, if you have an iPhone and use its portrait mode regularly, it basically creates synthetic images that mimic how your real photos would look if they were taken with a more powerful camera. [Sources: Kyle Wiggers, Falk Rehkopf]
To illustrate the point, note that the “expert” cited above was actually a GAN, which automatically generated the preceding text when we simply provided the topic as an input prompt. With the quality of synthetic media improving so rapidly, it seems very likely that we will be exposed to this kind of automatically generated content more and more frequently. This is likely to compound the difficulties of navigating an online landscape already strewn with disinformation and campaigns that intentionally seek to mislead people. To combat this trend, a growing field of research aims to develop capabilities for detecting synthetic media, so that such content can be flagged and consumers made aware of its origin.
Not everything related to generative technologies is negative, however. The recent improvements can also be applied in many promising areas, such as facilitating faster drug discovery and even generating synthetic data that is cleaner and fairer, which can be used to train models more effectively.
Capabilities and expectations around Generative AI are on the rise (Gartner 2021, Gartner Identifies Key Emerging Technologies Spurring Innovation Through Trust, Growth and Change)
What Could These Trends Mean for Scientific Libraries?
Our trend observations show that there is an increasing need to (1) iterate more quickly (both shorter publication cycles and product updates), (2) have access to high quality data, and (3) audit the entire software development and machine learning processes. So what does that mean for scientific libraries and their offerings?
The focus on collecting high quality information and data has been present in libraries, collections and archives for centuries. To this day, these institutions hold the collected objects and related information, and thanks to extensive digitisation efforts much of this material has by now been turned into data corpora. Moreover, the process of ‘pivoting’ from maintaining physical artefacts to also creating and maintaining digital infrastructure for data access has been underway for some years.
To capitalize on these past accomplishments and the growing demand for novel data, libraries could aim to become ‘first-class’ providers of quality data and align their interests with those of ML researchers. In particular, there is still a lot of work to be done to address common issues around findability, explorability, accessibility (being able to get the data in a practical format) and reusability (federation, machine-readable formats, etc.). By engaging with potential users and directing efforts into this area, libraries could carve out a key role for themselves in the age of data, machine learning and AI.
In the last decade many public institutions have done a lot of work to improve access to data, particularly numerical and text-based metadata. If libraries are to assume a stronger role in the data, ML and AI space going forward, they will need to figure out how to cater to a new customer base that demands not only text-based query results, but access to the files themselves. This can be attributed to the growing popularity of data science methods, whose workflows often start with sets of files (CSVs, JPGs, etc.), especially in the early phases of a project.
“Data science workflows are typically built to consume files, they are not usually built on SQL queries.”
“Welcome to Lakehouse” Databrew Podcast Season 1 episode 2, Databricks
Libraries should take note of the fact that while data cleaning and preparation is often regarded by data and machine learning scientists as the least desirable activity, with much more interest and glamour attributed to ML model design and training, it is in many ways the most important. This is a gap well suited to the existing skillsets and motivations of librarians. With their attention to detail, their focus on the accuracy of information, and their ability to spot and correct errors, it seems to make perfect sense for them to hone in on data cleaning and preparation.
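The kind of cleaning pass being described is mundane but consequential. As a minimal illustration (using only the Python standard library; real pipelines would typically use pandas or similar), the sketch below trims stray whitespace, drops incomplete records and removes exact duplicates from CSV data:

```python
import csv
import io

def clean_records(raw_csv: str) -> list[dict]:
    """A minimal cleaning pass over CSV text: trim whitespace in
    headers and values, drop rows with missing fields, and remove
    exact duplicate records."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    seen = set()
    cleaned = []
    for row in reader:
        # Normalise keys and values; treat absent fields as empty.
        record = {k.strip(): (v or "").strip() for k, v in row.items()}
        if any(v == "" for v in record.values()):
            continue  # incomplete record: drop (or flag for review)
        key = tuple(sorted(record.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

Each of these steps mirrors a judgement call a cataloguer already makes: what counts as the same record, and when is an entry too incomplete to publish.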
The expectations around access to data have changed completely in the past couple of decades: from physical access to digital access online, and from manually viewing individual records to parsing large corpora of cleaned data that is ready to deliver insights. Moreover, faster deployments, easier data access and more ML experimentation are the ingredients for the recipe of “More Show, Less Tell”. In our experience, and judging by the founding stories of successful projects, this recipe leads to much more engagement: you can build a prototype, put it in people’s hands and find out what they think within weeks rather than months or years. So in this spirit, we hope librarians, collection holders and managers will move quickly to improve access to their accurately digitised and curated datasets. This will be one of the keys to unlocking new innovations in the library of the future.