Cross-modality retrieval is the task of retrieving relevant items of a different modality than the search query (e.g., retrieving an image given a text query). One approach to tackle this problem is to define transformations which embed samples from different modalities into a common vector space. We can then project a query into this embedding space, and retrieve, using nearest-neighbor search, a corresponding candidate projected from another modality.
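To make the retrieval step concrete, the following is a minimal sketch of nearest-neighbor search in a shared embedding space. It assumes the embeddings have already been computed by some pair of projections; the function name `cosine_retrieve` and the toy vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_retrieve(query_emb, candidate_embs):
    """Return candidate indices ordered from nearest to farthest
    from the query, measured by cosine distance."""
    # Normalize to unit length so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    distances = 1.0 - c @ q
    return np.argsort(distances)

# Toy example: one embedded text query, three embedded image candidates.
query = np.array([1.0, 0.0])
candidates = np.array([[0.0, 1.0],
                       [2.0, 0.0],
                       [1.0, 1.0]])
ranking = cosine_retrieve(query, candidates)
```

Here the second candidate points in the same direction as the query and is therefore ranked first, regardless of its larger norm.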
A particularly successful class of models uses parametric nonlinear transformations (e.g., neural networks) for the embedding projections, optimized via a retrieval-specific objective such as a pairwise ranking loss [15, 27]. This loss aims at decreasing the distance (a differentiable function such as Euclidean or cosine distance) between matching items, while increasing it between mismatching ones. Specialized extensions of this loss achieved state-of-the-art results in various domains such as natural language processing [10], image captioning [12], and text-to-image retrieval [29].
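One common form of such a loss is a margin-based hinge over all mismatching pairs in a batch. The sketch below assumes this hinge form, cosine distance, and a margin of 0.2; these are illustrative choices, not necessarily the exact formulation used in [15, 27], and the function name is hypothetical.

```python
import numpy as np

def pairwise_ranking_loss(x, y, margin=0.2):
    """Hinge-based pairwise ranking loss over a batch.

    x, y: arrays of shape (n, d); row i of x matches row i of y.
    The margin value is an assumed hyperparameter.
    """
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    dist = 1.0 - x @ y.T              # dist[i, j]: cosine distance of x_i and y_j
    pos = np.diag(dist)[:, None]      # distance of each x_i to its matching y_i
    # Penalize mismatching items that are not at least `margin` farther
    # away than the matching item.
    hinge = np.maximum(0.0, margin + pos - dist)
    np.fill_diagonal(hinge, 0.0)      # the matching pair is not its own contrast
    return hinge.sum() / x.shape[0]
```

The loss is zero once every matching pair is closer than all mismatching pairs by at least the margin, and grows as mismatching items move closer than the matching one.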
In a different approach, Yan and Mikolajczyk [31] propose to learn a joint embedding of text and images using deep canonical correlation analysis (DCCA) [2]. Instead of a pairwise ranking loss, DCCA directly optimizes the correlation of the learned latent representations of the two views. Given the correlated embedding representations of the two views, retrieval can be performed via cosine distance. The promising performance of their approach is also in line with the findings of Costa et al. [23], who state two hypotheses regarding the properties of efficient cross-modal retrieval spaces: first, the embedding spaces should account for low-level cross-modal correlations, and second, they should enable semantic abstraction. In [31], both properties are met: a deep neural network learns abstract representations, and optimizing it with DCCA ensures highly correlated latent representations.
In summary, the optimization of pairwise ranking losses yields embedding spaces that are useful for retrieval and allows incorporating domain knowledge into the loss function. DCCA, on the other hand, is designed to maximize correlation, which has already proven useful for cross-modality retrieval [31], but does not allow the use of loss formulations specialized for the task at hand.
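As a rough illustration of the quantity that a correlation objective of this kind maximizes, the sketch below computes the total canonical correlation between two batches of latent representations using classical CCA. The regularization constant `reg` and the function names are assumptions made for this example; this is not the implementation of [2] or [31].

```python
import numpy as np

def _inv_sqrt(m):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * w ** -0.5) @ v.T

def total_correlation(x, y, reg=1e-4):
    """Sum of canonical correlations between views x (n, dx) and y (n, dy)."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    # Regularized within-view covariances and the cross-covariance.
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    t = _inv_sqrt(sxx) @ sxy @ _inv_sqrt(syy)
    return np.linalg.svd(t, compute_uv=False).sum()
```

For two views that are invertible linear transformations of each other, the result approaches the embedding dimensionality; for statistically independent views, it stays close to zero.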
In this paper, we propose a method to combine both approaches in a way that retains their advantages. We develop a Canonical Correlation Analysis Layer (CCAL) that can be inserted into a dual-view neural network to produce a maximally correlated embedding space for its latent representations. We can then apply task-specific loss functions, in particular the pairwise ranking loss, to the output of this layer. To train a network using the CCA layer, we describe how to backpropagate the gradient of this loss function to the dual-view neural network while relying on automatic differentiation tools such as Theano [28] or TensorFlow [1]. In our experiments, we show that our proposed method outperforms both DCCA and models trained with a pairwise ranking loss alone, especially when little training data is available.
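To give an intuition for what projecting into a maximally correlated space involves, the following sketch shows the forward computation of a classical CCA projection on a batch of paired latent representations. It is not the authors' layer: gradient computation is omitted, and the regularizer `reg` and all names are assumptions made for illustration.

```python
import numpy as np

def _inv_sqrt(m):
    """Inverse matrix square root of a symmetric positive definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * w ** -0.5) @ v.T

def cca_project(x, y, reg=1e-4):
    """Project paired views x (n, dx) and y (n, dy) into a shared space
    of dimension min(dx, dy) using classical CCA (forward pass only)."""
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    sxx_is, syy_is = _inv_sqrt(sxx), _inv_sqrt(syy)
    u, s, vh = np.linalg.svd(sxx_is @ sxy @ syy_is)
    k = min(x.shape[1], y.shape[1])
    a = sxx_is @ u[:, :k]        # projection matrix for view x
    b = syy_is @ vh.T[:, :k]     # projection matrix for view y
    return xc @ a, yc @ b, s
```

A task-specific loss, such as a pairwise ranking loss on cosine distance, could then be evaluated on the two projected batches instead of on the raw network outputs.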
Figure 1 compares our proposed approach to the alternatives discussed above. DCCA defines an objective that optimizes a dual-view neural network such that its two views are maximally correlated (Fig. 1a). Pairwise ranking losses optimize a dual-view neural network such that its two views are well suited for nearest-neighbor retrieval in the embedding space (Fig. 1b). In our approach, we boost the optimization of a pairwise ranking loss based on cosine distance by placing a special-purpose layer, the CCA projection layer, between a dual-view neural network and the optimization target (Fig. 1c). Our experiments in Sect. 5 will show the effectiveness of this proposal.