Image classification is one of the most widely tackled tasks in computer vision. It involves predicting one or more relevant labels for a given image. Generally, when performing classification, the labels are assumed to be independent of each other; this is often not the case, and it can mean that classification models miss out on significant improvements in performance.

We show how taking label interactions into consideration helps to improve arbitrary image classification models. We turn the research findings into something that can be applied directly by biodiversity collections and libraries. The general nature of the proposed approach makes it a good fit for creating and improving a vast variety of data-centred applications.

Leveraging not just image-label but also label-label interactions

After introducing the project in our last blog post “Enabling Biodiversity Research with Automated Species Identification”, this post focuses on the importance of label-hierarchy and label interactions for image classification. Let us start with a familiar example: if you take an image of a “truck” and an image of a “car”, semantically they both also belong to the more abstract category of “vehicle”. There are also visual similarities between a “truck” and a “car”. For instance, they both have wheels and headlamps. The key observation is that the two concepts are linked both semantically and visually. Convolutional neural networks are able to extract visual features – but is there a way we can also make the semantic information available to the image classifier? This is the question we attempt to answer. In doing so, we observe how different ways of informing the classifier can impact the classifier’s performance.

Providing models with visual information from images and semantic information via label-hierarchy

Although semantic labels such as species names often arise from visual similarity, this is not always the case: similarity can also stem from non-visual elements, such as similarity at the genetic level. Based on a purely visual inspection of Figure 1, you might quickly conclude that the butterfly specimens in the upper row belong to one category while those in the lower row belong to another. However, it is not so straightforward: (a) and (b) belong to two separate genera and species yet have a very low inter-category variance. On the other hand, (b), (c) and (d) all share the genus Parnassius but exhibit a larger intra-category visual variance than (a) and (b). Visual similarity does not necessarily imply semantic similarity, and vice versa.

Figure 1: Four different butterfly specimens from the ETHEC [4, 5, 8] dataset.

Reducing the black-box nature of neural networks

Another benefit of utilising hierarchical groups in a classification model is reducing the black-box nature of recent image classification techniques. If a human is asked to assign a label to an image, the natural way to proceed is to assign an abstract label first and then reason towards finer-grained labels. Even though it might be hard for an untrained eye to distinguish between an Alaskan Malamute and a Siberian Husky, one can begin by concluding that the image shows an “animal”, a more abstract label, and more specifically a “dog”, a relatively more fine-grained label. By using the hierarchy that stems from the labels to guide our classification models, we bridge one gap between the human way and the machine way of performing label assignment. In doing so, we strive to make the models more interpretable and their predictions more explainable.

Tackling imbalanced, real-world data

Datasets widely used in the literature are generally well-balanced, in contrast to the imbalance that is a common characteristic of real-world data. Category labels arranged as a hierarchy can be viewed as a tree [7]: abstract labels reside closer to the root and become increasingly fine-grained as one moves towards the leaves. An abstract label or concept is a conglomeration of multiple fine-grained concepts. This leads to an imbalance where abstract labels end up with more images than their fine-grained counterparts. For instance, imagine a hypothetical dataset with “car”, “bus”, “truck”, “human”, “dog” and “cat” as categories. Here, “vehicle”, a more abstract concept, has “car”, “truck” and “bus” as its finer-grained sub-concepts. Even if “car”, “truck” and “bus” have only 20 images each, the “vehicle” category has 60 (=20+20+20) images. And because a tree-shaped label-hierarchy has exponentially more fine-grained labels than abstract ones, this eventually leads to a long-tailed distribution where most labels have very few images while a handful of labels hold a major portion of the images. This is shown in Figure 2.
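The counting argument above can be sketched in a few lines; the labels and parent mapping are the hypothetical example from the text, not part of any real dataset:

```python
from collections import Counter

# Hypothetical flat dataset: one fine-grained label per image, 20 images each.
fine_labels = (["car"] * 20 + ["truck"] * 20 + ["bus"] * 20
               + ["human"] * 20 + ["dog"] * 20 + ["cat"] * 20)

# Parent (abstract) concept for each fine-grained label.
parent = {"car": "vehicle", "truck": "vehicle", "bus": "vehicle",
          "human": "living being", "dog": "living being", "cat": "living being"}

counts = Counter(fine_labels)                          # 20 images per fine label
abstract_counts = Counter(parent[l] for l in fine_labels)
print(abstract_counts["vehicle"])                      # 60 (= 20 + 20 + 20)
```

The dataset is perfectly balanced at the fine-grained level, yet the abstract level is already three times larger per label; with deeper hierarchies, this imbalance compounds towards the root.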

Figure 2: Data distribution for each of the four label categories: family, sub-family, genus and species in the real-world ETHEC dataset. The y-axis of the histograms represents the number of image-label pairs for a particular label. The x-axis is the label-id for each label. One can clearly observe the skewed data distributions.

For our experiments, we use the ETHEC dataset [8], consisting of images of butterfly specimens from the order Lepidoptera housed in the ETH Entomological Collection. The dataset pairs images with their taxonomic metadata; we use the family, subfamily, genus and species to help classify images. The varying number of specimens per organism leads to an unequal number of images per organism, which makes it challenging to apply widely used classification techniques that rely on a large set of labelled images. Our goal is to use the taxonomic hierarchy as a guide for our models to perform image classification in the light of imbalanced data, as seen in Figure 2. The dataset consists of images of 47,978 butterfly specimens with 723 labels spread across 4 hierarchical levels: 6 families, 21 subfamilies, 135 genera and 550 species.

We propose two kinds of models for classifying these specimens: (1) Convolutional Neural Network (CNN)-based classifiers and (2) embedding-based models. Instead of creating specifically designed components for CNNs, we focus on something more generalisable that can be used across any task modality, not just image classification. By altering the loss functions to inject information about the hierarchy, this technique could be implemented with any generic CNN feature extractor. Figure 3 provides a brief overview of the different ways in which we inject information about the label hierarchy into an arbitrary CNN feature extractor and use it for classifying images. The figure also depicts various levels of abstraction in which the classifier is made aware of the hierarchy induced by the labels.

Figure 3: A brief summary of the different ways in which different models exploit the label-hierarchy.

Injecting CNN classifiers with label-hierarchy

We begin with a hierarchy-agnostic baseline classifier which is unaware of the presence of a hierarchy between the image labels. Moving forward incrementally with every model, we include more information about the hierarchy in various forms of abstraction. We begin with a classifier where we include information about the number of levels in the hierarchy (yellow in figure 3) and limit the number of label predictions made by the model by design such that it predicts a label for each level in the hierarchy; i.e. one each for the family, subfamily, genus and species. With the next classifier, we inject information such as the edge relations between different labels (brown in figure 3) and finally go to the extent of having models that are capable of exploiting how complete subtrees are arranged in the label-hierarchy (red in figure 3). All this is in addition to the visual features that are extracted by the CNN feature extractor from the image.
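The first hierarchy-aware step, predicting exactly one label per level, can be sketched as follows. This is an illustrative NumPy mock-up, not the original implementation: the random linear heads stand in for classification layers on top of a CNN feature extractor, and only the level sizes (6, 21, 135, 550) come from the ETHEC hierarchy:

```python
import numpy as np

# One classification head per level of the label-hierarchy; by design the
# model emits exactly one family, subfamily, genus and species prediction.
LEVEL_SIZES = {"family": 6, "subfamily": 21, "genus": 135, "species": 550}
FEAT_DIM = 128  # illustrative feature dimension of the CNN backbone

rng = np.random.default_rng(0)
heads = {level: rng.standard_normal((FEAT_DIM, n))
         for level, n in LEVEL_SIZES.items()}

def predict_per_level(features: np.ndarray) -> dict:
    """Return one predicted label index per hierarchy level."""
    return {level: int(np.argmax(features @ W)) for level, W in heads.items()}

features = rng.standard_normal(FEAT_DIM)   # stand-in for CNN image features
preds = predict_per_level(features)        # e.g. {"family": 3, "subfamily": ...}
```

During training, one would sum a per-level loss (e.g. cross-entropy) over the four heads, so the hierarchy's depth is baked into the architecture even before any edge or subtree information is used.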

Performing image classification using joint image-label order-preserving embeddings

For the second class of models that we investigate – the embedding-based models – we learn to embed high-dimensional objects, in our case images, into a meaningful low-dimensional embedding space. When one thinks of meaningful embeddings, the first notion that usually comes to mind is that similar objects should be embedded close together, with closeness defined by a distance function such as the Euclidean or cosine distance. Our models, on the other hand, use order-preserving embeddings. In this method, each embedding ‘owns’ a region of the embedding space. Ideally, everything that is a sub-concept (child) of the more abstract (parent) embedding should fall within the parent’s ‘owned’ area. More formally, the embeddings we use carve out regions that are convex cones in the embedding space. Figure 4 compares, in Euclidean space, a convex cone formed by entailment cone embeddings [2] with order-embeddings [1].
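The ‘owned region’ idea can be made concrete with the order-embedding penalty in the spirit of [1]. In this sketch, abstract concepts sit near the origin and a parent owns the positive orthant anchored at its own embedding, so a valid child must dominate its parent coordinate-wise; the toy vectors and dimensionality are illustrative, not values from the original work:

```python
import numpy as np

def order_violation(parent: np.ndarray, child: np.ndarray) -> float:
    """Zero iff the child lies inside the parent's orthant (parent <= child
    in every coordinate); positive otherwise."""
    return float(np.sum(np.maximum(0.0, parent - child) ** 2))

vehicle = np.array([0.2, 0.3])   # abstract label, near the origin
car_ok  = np.array([0.5, 0.9])   # inside vehicle's orthant -> no violation
car_bad = np.array([0.1, 0.9])   # first coordinate below the parent -> penalised
```

Note that `car_ok` need not be close to `vehicle` in Euclidean distance; membership in the owned region, not proximity, is what the training loss enforces.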

Figure 4: Comparing the form of the order-embeddings represented by nested positive orthants and entailment cones represented by nested convex cones in 2D Euclidean space.

Using order-preserving embeddings instead of the more commonly used distance-preserving embeddings allows capturing asymmetric relations, such as concept abstractness, as well as transitivity, thanks to the embeddings’ ability to arrange themselves into nested regions. In this order-preserving space, abstract regions encompass their finer-grained sub-concepts, akin to a nested Matryoshka doll. Such an embedding does not care about physical proximity; it cares whether or not a concept lies within the area owned by its parent. A concept may thus be physically far from its parent, but as long as it is within the area ‘owned’ by the parent, it is a valid embedding. This form of embedding helps to arrange the space in a hierarchical fashion and to capture the properties of a discrete tree graph in continuous space – a continuous analogue of the tree structure. Images represent leaf nodes at the bottom of the joint image-label hierarchy, as they are its most fine-grained objects. Once images and labels are embedded in a joint space, one can query all labels of an unseen image and retrieve them at different levels of abstraction by virtue of the arrangement of the embedding space.
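For the cone-shaped regions, a sketch of the entailment-cone energy of [2] in its Euclidean variant looks as follows; the constant `K` and the toy vectors are illustrative choices, not values from the paper:

```python
import numpy as np

K = 0.1  # illustrative aperture constant

def half_aperture(x: np.ndarray) -> float:
    # The cone's half-aperture shrinks as the apex moves away from the origin,
    # so fine-grained concepts far from the origin own narrower regions.
    return float(np.arcsin(K / np.linalg.norm(x)))

def cone_angle(x: np.ndarray, y: np.ndarray) -> float:
    # Angle at x between the ray from the origin through x and the segment x->y.
    nx, nxy = np.linalg.norm(x), np.linalg.norm(y - x)
    cos = (np.linalg.norm(y) ** 2 - nx ** 2 - nxy ** 2) / (2 * nx * nxy)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def cone_violation(parent: np.ndarray, child: np.ndarray) -> float:
    # Zero iff the child falls inside the convex cone 'owned' by the parent.
    return max(0.0, cone_angle(parent, child) - half_aperture(parent))

parent = np.array([0.5, 0.0])
inside = np.array([1.0, 0.0])    # further out along the same ray: valid child
outside = np.array([0.0, 1.0])   # well outside the parent's cone: penalised
```

Minimising this violation over parent-child pairs (and maximising it over negative pairs) is what nests the cones into the Matryoshka-like arrangement described above.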

Figure 5: Two-dimensional Euclidean cones for the ETHEC dataset in ℝ^2 on the left, with the actual label-hierarchy as a tree graph on the right.

Lifting order-preserving embeddings to hyperbolic space

All this occurs in Euclidean space, governed by Euclidean geometry. However, there are non-Euclidean geometries that behave very differently from the Euclidean geometry we are familiar with; examples are the less frequently encountered hyperbolic and spherical geometries. One such difference, the sum of angles in a triangle, is shown in Figure 6. We lift the concepts discussed previously from Euclidean space to hyperbolic space by making some adjustments to account for the change in geometry. The embedding space can now take advantage of hyperbolic geometry, which empowers the model to embed tree-like structures with very low distortion [6]. In a tree – a discrete data structure – the number of nodes grows exponentially with the height of the tree. Hyperbolic space, whose volume also grows exponentially, can be thought of as a continuous analogue of a tree, making it a natural choice to model the label-hierarchy in our data. The hyperbolic model outperforms its counterpart in Euclidean space: the learned representations are more meaningful and also perform better on the task of hierarchical image classification.
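The exponential growth of hyperbolic space can be felt directly in the standard closed-form distance of the Poincaré ball model, shown here for illustration (this is the textbook formula, not code from the original work): two segments of identical Euclidean length cover wildly different hyperbolic distances depending on how close they are to the boundary of the ball.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between points inside the unit Poincaré ball."""
    su, sv = np.dot(u, u), np.dot(v, v)      # squared norms, must be < 1
    diff = np.dot(u - v, u - v)
    return float(np.arccosh(1 + 2 * diff / ((1 - su) * (1 - sv))))

# Two segments of identical Euclidean length 0.5:
d_center = poincare_distance(np.zeros(2), np.array([0.5, 0.0]))      # ~1.10
d_boundary = poincare_distance(np.array([0.49, 0.0]),
                               np.array([0.99, 0.0]))                # ~4.22
```

Near the boundary, ever more ‘room’ opens up per Euclidean step, which is exactly the property that lets exponentially branching trees embed with low distortion [6].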

Figure 6: A common difference in Euclidean and non-Euclidean geometry: spherical and hyperbolic, is the sum of angles in a triangle.

Applying the research and scaling to real-world systems

Real-world data such as those from biodiversity and natural history collections are often imbalanced. However, they usually have valuable label data that can be used. Our methods “regularise” the models using the hierarchy that arises among the labels. This facilitates the use of machine-learning methods in the biodiversity and natural history areas where the data can be scarce and unbalanced but has a hierarchical structure to it.

Figure 7: Evolution of 2D Euclidean cone embeddings for the ETHEC label-hierarchy.

Summary and Outlook

We propose to use embedding-based models for image classification in both Euclidean and hyperbolic space and show that hyperbolic geometry provides an empirical advantage. The tree-like nature of the hyperbolic space is similar to the tree-like nature of the label-hierarchy, making it a great medium to pass on label-hierarchy information to the image classifier. We evaluate our methods on the real-world ETHEC dataset and show that exploiting hierarchical information always leads to an improvement over a shallow CNN classifier. We design the models so that they can digest semantic information on top of the visual information they receive from the image.

This procedure can pave the way for machine-learning models that are better informed about the label-hierarchy, which can help improve downstream tasks such as image captioning and image retrieval, among others.

In addition to researchers, this could also be useful for collections and libraries looking to improve the classification of biodiversity specimens, categorise and label book collections, sort e-resources by relevance, provide visual search tools for archives (like arXiv), and study and understand large image databases, ranging from abstract all the way to fine-grained patterns.


[1] Order-Embeddings; I Vendrov, R Kiros, S Fidler, R Urtasun

[2] Hyperbolic Entailment Cones; OE Ganea, G Bécigneul, T Hofmann

[3] “Nested doll image”, source: giphy/Dots

[4] Learning Representations For Images With Hierarchical Labels; A Dhall

[5] Hierarchical Image Classification using Entailment Cone Embeddings; A Dhall, A Makarova, OE Ganea, D Pavllo, M Greeff, A Krause

[6] Low Distortion Delaunay Embedding of Trees in Hyperbolic Plane; R Sarkar

[7] Tree; https://en.wikipedia.org/wiki/Tree_(graph_theory)

[8] ETH Entomological Collection (ETHEC) Dataset https://www.research-collection.ethz.ch/handle/20.500.11850/365379

All other images are from the author’s work [4, 5], please cite when reusing.


Ankit Dhall

Computer Vision & Machine Learning Specialist, Alumnus Innovator Fellowship Program
