It is estimated that humans produce approximately 2.5 quintillion bytes of data each day. With this insane amount of new data, surely some of it must be redundant, right? For data science, analytics, and machine learning, this growth in available data opens previously unthinkable avenues for research. But while more and more data is being harvested for a variety of reasons, could better curation of the data we have already collected lead to better outcomes for research?

The Inflation of Irrelevance

With the amount of data available for machine learning purposes exploding over the last few years, it can seem like a daunting task to sort through it all for exactly what your project requires. While some big-name datasets are frequently the go-to (KITTI and OpenImages come to mind), hundreds of new, specialized datasets are published each year without much advertisement. For many researchers, advertising the datasets they create takes a back seat to publishing research results. This creates a cycle in which more and more data becomes available while a smaller and smaller percentage of it actually sees repeated use.

The increased availability of data may seem like a boon, but because so little of it is ever rediscovered, it can actually become a hindrance. For a specific machine learning application, only a small subset of the available data may be essential for training. With that in mind, researchers may create a dataset tailored to their own research goals. At the end of their research they may publish the dataset, but in a sea of existing datasets it may never be seen. If another researcher comes along with a similar goal, the difficulty of finding the existing dataset may lead that researcher to simply recreate a similar one. Not only could the new data be redundant, but the researcher also spends time on data collection rather than novel research, a costly loss in a competitive academic environment.

That said, this problem can be avoided, and the field of machine learning itself may help provide the solution.

Less Is More

Federated learning, where training takes place across multiple decentralized devices, has become a useful tool for the AI community; the distributed nature of the process lends itself well to resource sharing. In much the same fashion, Federated Data could leverage datasets hosted across multiple servers, allowing samples from this sea of data to be combined into a more specific, project-relevant set and eliminating the need to create entirely new datasets.

Using inDexDa to index online scientific datasets published in academic papers.

This method would lend itself well to transfer learning, where pretraining networks on only relevant examples before performing downstream training on task-specific examples can yield better outcomes than simply training on a massive, diverse dataset [1]. Indexing as many available datasets as possible would enable the next step of the pipeline: compiling project-relevant datasets from pieces of existing datasets available online, possibly using search engines such as the Neural Data Server [2]. This is the motivation behind our project, inDexDa (“indexing datasets”). Our program automatically indexes datasets currently available online without the need for human involvement.

Automated Indexing of Datasets

While researchers may not advertise their datasets, many publish them online, typically alongside a paper detailing the collection process and how different networks perform on them. The main hypothesis behind the collection phase of our project is that a paper’s abstract alone should be enough to determine whether or not the paper publishes a new dataset. Based on this hypothesis, we perform the following steps:

  1. Scrape arXiv and ScienceDirect for all papers relating to a search query.
  2. Run each abstract through a trained natural language processing network to obtain a classification (yes or no to a new dataset).
  3. If we are confident that a new dataset was published, scan the full paper for more information.
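
The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not inDexDa's actual code: it targets the public arXiv API alone (ScienceDirect is omitted), the function names are hypothetical, and a simple keyword heuristic stands in for the trained language model so that the example runs without any model weights. Step 3, scanning the full paper, is left out.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def build_arxiv_url(query, max_results=50):
    """Step 1: build an arXiv API request for papers matching a search query."""
    params = {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"

def parse_abstracts(atom_xml):
    """Extract (title, abstract) pairs from an arXiv Atom response."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        # Collapse the line breaks and indentation found in feed text.
        papers.append((" ".join(title.split()), " ".join(summary.split())))
    return papers

# ASSUMPTION: a keyword heuristic used as a placeholder for the trained
# classifier; it is not the network described in this article.
DATASET_CUE = re.compile(
    r"\b(introduce|present|release|publish|collect)\w*\b[^.]{0,80}\b(dataset|corpus|benchmark)\b",
    re.IGNORECASE,
)

def announces_dataset(abstract):
    """Step 2 (stand-in): guess whether an abstract announces a new dataset."""
    return DATASET_CUE.search(abstract) is not None

def find_dataset_papers(atom_xml, classify=announces_dataset):
    """Return the titles of papers whose abstract is classified as 'yes'."""
    return [title for title, abstract in parse_abstracts(atom_xml) if classify(abstract)]
```

In practice the response body fetched from `build_arxiv_url(...)` would be passed to `find_dataset_papers`, and the `classify` argument would be swapped for a call to the trained model.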

First Results and Next Steps

After several months of research and testing at the ETH Library Lab, we finally have a working prototype. For those interested, the project is located in the following GitHub repository. For the classification network we use Google’s pretrained BERT network, fine-tuned on our own dataset for the downstream classification task. During testing the classifier achieved 98% accuracy, so the preliminary results are promising!

Some may wonder why we put in the effort to index these datasets when other platforms already do something similar. Google Dataset Search, for example, has already indexed thousands of datasets. The issue with Google and the other services currently available is that they index only particular data. Google will only register a dataset if the website hosting it meets specific criteria, and other, smaller dataset collections need to be indexed by hand, which is a lengthy process. We believe inDexDa will overcome these pitfalls by automating data collection based on how researchers already publish their data, rather than relying on brute-force human effort or webpage specifications.

The work is not finished, however. We hope to implement a database to track the datasets we have found, along with meta-information about them and the papers they are associated with. Lastly, we are making the project open-source in the hope that other researchers, and those simply interested in our work, can help push it further, so that one day it may become a standard tool in the research community and help alleviate some of the stress associated with data collection for machine learning.
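
As a rough idea of what such a tracking database could look like, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table layout, the column names, and the helper functions are hypothetical and do not describe the schema inDexDa will eventually use.

```python
import sqlite3

# ASSUMPTION: a hypothetical two-table layout, one row per paper and one
# row per dataset found in that paper.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id     INTEGER PRIMARY KEY,
    title  TEXT NOT NULL,
    source TEXT NOT NULL,          -- e.g. 'arXiv' or 'ScienceDirect'
    url    TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS datasets (
    id          INTEGER PRIMARY KEY,
    paper_id    INTEGER NOT NULL REFERENCES papers(id),
    name        TEXT NOT NULL,
    description TEXT,
    confidence  REAL               -- classifier confidence, if available
);
"""

def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def add_dataset(conn, paper_title, source, url, name, description=None, confidence=None):
    """Record a dataset and its paper; a paper URL is stored only once."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO papers (title, source, url) VALUES (?, ?, ?)",
            (paper_title, source, url),
        )
        paper_id = conn.execute(
            "SELECT id FROM papers WHERE url = ?", (url,)
        ).fetchone()[0]
        cur = conn.execute(
            "INSERT INTO datasets (paper_id, name, description, confidence) "
            "VALUES (?, ?, ?, ?)",
            (paper_id, name, description, confidence),
        )
    return cur.lastrowid
```

Keeping papers and datasets in separate tables lets one paper be linked to several datasets without duplicating its metadata.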

For all those interested in assisting with the project, you can find it at the following link. We will keep the platform updated and look forward to receiving your input to further improve it!

[1] Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q., & Pang, R. (n.d.). Domain Adaptive Transfer Learning with Specialist Models.

[2] Yan, X., Acuna, D., & Fidler, S. (n.d.). Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data.


Parker Ewen

Master Student in Robotics, Systems & Control, ETH Zurich
