Many open-source datasets are available, but they are not collected in one place. Moreover, the datasets are often poorly described, not categorized by the particular sub-field of, for example, Proteomics to which they belong, and they frequently require extensive cleaning before use.
In a further step, the dataset collections could be containerized per Proteomics sub-field, together with their descriptive metadata. In this way, research scientists and fellow students in the field of biochemistry could simply load the respective containers / packages via a Python or R command and use them directly as training and test datasets for machine learning, as sketched below.
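As a rough illustration only, the following snippet sketches how such a load-and-use workflow could look in Python. The directory layout, file names, and the function name load_subfield_collection are assumptions made for this sketch and do not refer to any existing package.

```python
# Minimal sketch of the envisioned "load and use directly" workflow.
# The directory layout and field names are illustrative assumptions only.
import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split


def load_subfield_collection(root: Path, subfield: str):
    """Load the pre-cleaned dataset and its descriptive metadata for one
    Proteomics sub-field from a container-like directory layout."""
    base = root / subfield                       # e.g. proteomics_datasets/mass_spectrometry/
    metadata = json.loads((base / "metadata.json").read_text())
    data = pd.read_csv(base / "data.csv")        # already cleaned and documented
    return data, metadata


# Intended usage once such a collection is packaged (hypothetical paths):
# data, meta = load_subfield_collection(Path("proteomics_datasets"), "mass_spectrometry")
# train, test = train_test_split(data, test_size=0.2, random_state=42)
# print(meta["description"], meta["source"], meta["license"])
```

The point of the sketch is the single-command access: one call resolves the container for a sub-field and returns data that is already clean enough to be split into training and test sets.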
The publication by Mann et al. (2021) describes well the increasing importance of transfer learning and of more transparent open-source architectures that allow different data sources to be combined. This is especially needed for OMICS data, which are nowadays used extensively in precision medicine. To date, the search for appropriate open-source datasets and their cleaning is still done manually and from scratch for every project (unless an internal database has been set up at the respective research institute). This approach is very time-consuming and also introduces bias between research groups and institutions, since not all of them use the same data for training and testing their AI/ML models. This problem can be solved by establishing a standard collection of open-source datasets and making them easily available via containers or packages, which is the main goal of this project.
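To make the idea of descriptive metadata for such a standard collection more concrete, the following is one possible, purely illustrative shape for a metadata record accompanying each dataset; all field names and values are assumptions, not a defined schema.

```python
# One possible shape for a per-dataset metadata record in the standard
# collection; every field name and value here is an illustrative assumption.
example_metadata = {
    "name": "example_proteomics_dataset",
    "subfield": "mass_spectrometry",
    "description": "Short human-readable summary of what the dataset contains.",
    "source": "https://example.org/original-download-location",
    "license": "CC-BY-4.0",
    "n_samples": 1200,
    "cleaning_steps": ["removed duplicate entries", "harmonised column names"],
    "suggested_split": {"train": 0.8, "test": 0.2},
}
```

Recording the original source, license, and applied cleaning steps in this way would also help different research groups verify that they are training and testing on the same data.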