Identifying new scientific datasets can be a time-consuming but necessary step before tackling complex machine learning tasks. Why not apply machine learning methods to automate the indexing of datasets itself?

For many people, the term “machine learning” doesn’t convey much other than some mystical black box that will magically solve whatever problem you give it. To clarify, machine learning is a massive field, but what the media calls machine learning more often than not refers to artificial neural networks: computing systems inspired by biological learning mechanisms.

Computer simulation of the branching architecture of the dendrites of pyramidal Neurons (image: Hermann Cuntz)


At its core, machine learning is basically solving thousands or millions of simple equations, but these equations require data to solve. Collecting this data is both time-consuming and incredibly tedious, but one can get around this by looking for data that has already been collected by someone else. Now, this may seem trivial since Google is readily available, but it is surprisingly hard. In the research community, datasets gain popularity mostly through word of mouth, with several small, specific indexes available online (see YACVID or OpenCV). These indexes, while incredibly helpful, are maintained and updated by hand, so they still depend on someone hearing about a dataset in the first place. There has to be an easier way.

The Problem Is Part of the Solution

Having recently joined the ETH Library Lab team, I thought it would be the perfect opportunity to solve this problem that’s been bothering me for a while. I am currently enrolled in the ETH Robotics Master’s program, where I’ve been experimenting with computer vision research. One thing I have learned is that no matter how good your idea is, if you’re on a deadline you more than likely won’t have the time to go out and collect the data required, so you had better already know a dataset you can use. But I’ve been wondering: can you use machine learning to automatically find datasets online?

This is the project I have been working on for the past month: using natural language processing to understand whether an academic paper points towards a new dataset. The fundamentals of the algorithm are relatively simple: we train a fairly small neural network to understand the context of language (training this network will require its own dataset; the irony is not lost on me). To accomplish this, I have been meeting with people in the field, both to gain insight into the intricacies of the method and to get a better understanding of the field as a whole, so that I can be sure I am using the right tool for the job. And what better place to meet these people than at a conference?


Robotics: Science and Systems conference in Freiburg (images: Parker Ewen)

The Robotics: Science and Systems (RSS) conference, which took place at the end of June 2019 in Freiburg, was pretty diverse, although it focused more on actual robotics than on pure machine learning. That’s not to say machine learning wasn’t present, just that application-based approaches took center stage over theory. I want to discuss the two talks I found most interesting.

When Robots Are Listening

The first talk was given by Cynthia Matuszek, a professor at the University of Maryland, Baltimore County. Her research focuses on semantic understanding of human language for robotic applications. Her presentation was particularly interesting to me because of its similarities with this project, and she also focused on unsupervised learning for language understanding, an approach I had not previously thought about.

A robot picks up objects combining natural language processing and computer vision

Professor Matuszek’s work focused on learning a language without prior assumptions. In much the same way babies learn a language, an operator would say a phrase describing something (e.g., “Yellow, rectangular block”) and the robot would learn what these descriptors meant. With enough training, these robots could then locate objects based on spoken queries from an operator. Similarly, we want our network to understand what language authors use to indicate that they have created a new dataset.
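To make that last idea a bit more concrete, here is a minimal sketch of learning which words signal a new dataset. It is not the project's actual model: it is a single-layer network (logistic regression over a bag of words) written in plain Python, and the training sentences, labels, and function names are all made up for illustration.

```python
import math
import re

# Hypothetical toy sentences (not real data); label 1 = announces a new dataset.
TRAIN = [
    ("we introduce a new dataset of annotated street images", 1),
    ("we release a novel benchmark dataset for grasping", 1),
    ("we present a new corpus collected from library records", 1),
    ("we evaluate our method on the existing benchmark", 0),
    ("our approach improves accuracy over prior work", 0),
    ("we propose a faster algorithm for path planning", 0),
]

def tokenize(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Fit one weight per word plus a bias with stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            z = bias + sum(weights.get(t, 0.0) for t in tokens)
            err = label - sigmoid(z)  # gradient of the log-likelihood
            bias += lr * err
            for t in tokens:
                weights[t] = weights.get(t, 0.0) + lr * err
    return weights, bias

def predict(weights, bias, text):
    """Estimated probability that the sentence announces a new dataset."""
    z = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return sigmoid(z)

weights, bias = train(TRAIN)
for text, label in TRAIN:
    print(label, round(predict(weights, bias, text), 3), text)
```

With only six sentences the model mostly memorizes its examples, which is exactly why a real version needs a properly labeled training set of its own.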

Building on a Solid Foundation

Angela Schoellig, a robotics professor at the University of Toronto and someone whose research I am very interested in, also gave an inspiring talk on the reliability of machine learning and on safety guarantees when using these methods. While machine learning methods may be an efficient way to achieve a result, they are by no means reliable, and there are currently very few ways to guarantee reliability when using them. Her talk focused on combining proven, mathematically sound methods with machine learning in smart and interesting ways to help achieve that reliability: basically, not using machine learning as a crutch but rather as an extension of classical approaches.

In traditional control theory, it is normally assumed that our model of the agent interacting with the environment is known (even though we know it may not be perfectly correct). Professor Schoellig discussed how we can start from this initial model and use machine learning in parallel with these traditional methods to learn a more accurate one as the system progresses through time. In this way, the mathematically proven methods can offer some guarantees of reliability and stability, while machine learning is used to increase their accuracy. Drawing from this, it may be beneficial for our project to use more traditional natural language processing for our models, with machine learning helping to refine them.
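One way this combination could look for text is sketched below. It is only an analogy to the talk and not a method presented there: a fixed, hand-written rule score plays the role of the known model, and a learned per-word correction is added on top and trained only to fix the rules' mistakes. The cue patterns, example sentences, and the +2/-2 rule scores are all assumptions chosen for illustration.

```python
import math
import re

# Hand-written cue patterns: the "traditional" part (hypothetical examples).
CUE_PATTERNS = [
    r"\b(introduce|release|present|collect(ed)?)\b.*\b(dataset|corpus|benchmark)\b",
    r"\bnew (dataset|corpus|benchmark)\b",
]

def rule_logit(text):
    """Fixed prior score: +2 if any cue pattern matches, otherwise -2."""
    hit = any(re.search(p, text.lower()) for p in CUE_PATTERNS)
    return 2.0 if hit else -2.0

def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_residual(examples, epochs=200, lr=0.5):
    """Learn per-word corrections; the rule score itself is never changed."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            z = rule_logit(text) + sum(weights.get(t, 0.0) for t in tokens)
            err = label - sigmoid(z)
            for t in tokens:
                weights[t] = weights.get(t, 0.0) + lr * err
    return weights

def predict(weights, text):
    """Rule score plus learned correction, squashed to a probability."""
    z = rule_logit(text) + sum(weights.get(t, 0.0) for t in tokenize(text))
    return sigmoid(z)

# Hypothetical toy sentences; label 1 = announces a new dataset.
EXAMPLES = [
    ("we introduce a new dataset of street images", 1),
    ("we evaluate on a new dataset released by others", 0),  # rules fire wrongly
    ("images were gathered and annotated by our team", 1),   # rules miss this
    ("our approach improves accuracy over prior work", 0),
]

weights = train_residual(EXAMPLES)
for text, label in EXAMPLES:
    print(label, rule_logit(text), round(predict(weights, text), 3), text)
```

The appeal of this design is that with no training data the system still behaves exactly like the hand-written rules, so the learned part can only add to a baseline whose behavior is already understood.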

Among many other talks at RSS, these two inspired me the most to think outside the box. With the support of ETH Library Lab I hope to have this index up in the next couple of months, and using what I learned at RSS I believe I can make it more reliable and more useful than my original vision.


Parker Ewen

Master Student in Robotics, Systems & Control, ETH Zurich
