
http://graphics.cs.cmu.edu/projects/deepContext/

Unsupervised Visual Representation Learning by Context Prediction

Presented at ICCV 2015

  • Carl Doersch
  • Abhinav Gupta
  • Alexei A. Efros

This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
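To make the pretext task concrete, the following is a minimal sketch (in Python/NumPy, not the paper's released code) of how a patch pair and its relative-position label could be sampled; the patch size, gap, and eight-neighbor layout are illustrative assumptions.

```python
import numpy as np

# Relative positions of the second patch w.r.t. the first (a 3x3 grid minus the center),
# so the pretext task becomes an 8-way classification problem.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def sample_patch_pair(image, patch=96, gap=48, rng=np.random):
    """Sample a center patch and one of its eight neighbors from an HxWxC image.

    Returns (patch1, patch2, label), where label in 0..7 indexes OFFSETS.
    Assumes the image is large enough to contain the full 3x3 patch grid.
    """
    step = patch + gap
    h, w = image.shape[:2]
    y = rng.randint(step, h - step - patch)   # center location, away from the border
    x = rng.randint(step, w - step - patch)
    label = rng.randint(len(OFFSETS))
    dy, dx = OFFSETS[label]
    patch1 = image[y:y + patch, x:x + patch]
    patch2 = image[y + dy * step:y + dy * step + patch,
                   x + dx * step:x + dx * step + patch]
    return patch1, patch2, label
```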

Paper & Presentation

Code and Network

Additional Materials

Alexander Mordvintsev decided to visualize the contents of our VGG-style network by applying Deep Dream separately to each filter in our network, and has kindly shared his results with us. Below are 8 of the filters in conv5_3 (the second-to-last layer before the representations are fused). Below each is my interpretation. Mouse over them to see it (I don't want to bias your interpretation!)

  • Full Results: Paris Elements. An example of the results obtained entirely automatically, using our algorithm to find the unique visual elements of Paris. Compare to our previous results on the same data. WARNING: This page contains about 10000 patches; don't click the link unless your browser can handle it!
  • Visual Elements Learned For Indoor 67. The elements used for our state-of-the-art Indoor 67 classifier.
  • Heatmaps generated for both confident correctly-classified images on Indoor 67 as well as confident errors.
  • Learned element templates and full image feature vectors for indoor67 images (.mat files in .tar.gz): all (3.3GB); elements only (117MB); README describing archive contents.

Related Papers

We first proposed context as supervision in: C. Doersch, A. Gupta, and A. A. Efros. Context as Supervisory Signal: Discovering Objects with Predictable Context. European Conference on Computer Vision, September 2014.

  • Google Graduate Fellowship to Carl Doersch
  • ONR MURI N000141010934
  • Intel research grant
  • NVidia hardware grant
  • Amazon Web Services grant

Comments, questions to Carl Doersch


Author's implementation of 'Unsupervised Visual Representation Learning by Context Prediction'

cdoersch/deepcontext

Unsupervised Representation Learning by Context Prediction

Created by Carl Doersch (Carnegie Mellon / UC Berkeley)

Introduction

This code is designed to train a visual representation from a raw, unlabeled image collection. The resulting representation seems to be useful for standard vision tasks like object detection, surface normal estimation, and visual data mining.

This algorithm was originally described in Unsupervised Visual Representation Learning by Context Prediction , which was presented at ICCV 2015.

This code is significantly refactored from what was used to produce the results in the paper, and minor modifications have been made. While I do not expect these modifications to significantly impact results, I have not yet fully tested the new codebase, and will need a few more weeks to do so. Qualitative behavior early in training appears to be equivalent, but you should still use this code with caution.

Citing this codebase

If you find this code useful, please consider citing:

Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised Visual Representation Learning by Context Prediction. In IEEE International Conference on Computer Vision (ICCV), 2015.

Installation

  • Clone the deepcontext repository

Build Caffe and pycaffe

External Caffe installations should work as well, but they must be a version downloaded from GitHub after November 22, 2015 in order to support all required features.

Copy deepcontext_config.py.example to deepcontext_config.py and edit it to supply your path to ImageNet, and provide an output directory that the code can use for temporary files, including snapshots.

Execute train.py inside python. This will begin an infinite training loop, which snapshots every 2000 iterations. The results in the paper used a model that trained for about 1M iterations.

By default, the code will run on GPU 0; you can use the environment variable CUDA_VISIBLE_DEVICES to change the GPU.

All testing was done with python 2.7. It is recommended that you run inside ipython using execfile('train.py') .

To stop the train.py script, create the file train_quit in the directory where you ran the code. This roundabout approach is required because the code starts background processes to load data, and it's difficult to guarantee that these background threads will be terminated if the code is interrupted via Ctrl+C .

If train.py is re-started after it has quit, it will examine the output directory and attempt to continue from the snapshot with the highest iteration number.

You can pause the training at any time by creating the file train_pause in the directory where you ran the code. This will let you use pycaffe to examine the network state. Re-run train.py to continue.
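The control-file pattern described above can be summarized with a short sketch (hypothetical code, not the repository's actual train.py): the training loop simply polls for the sentinel files between iterations.

```python
import os
import time

def training_loop(step_fn, snapshot_fn, snapshot_every=2000):
    """Hypothetical sketch of the sentinel-file control flow described above;
    step_fn and snapshot_fn stand in for one solver iteration and a snapshot."""
    it = 0
    while not os.path.exists('train_quit'):      # create this file to stop cleanly
        if os.path.exists('train_pause'):
            time.sleep(5)                        # paused: inspect the net, then delete the file
            continue
        step_fn(it)                              # one training iteration
        it += 1
        if it % snapshot_every == 0:
            snapshot_fn(it)                      # snapshot every 2000 iterations
```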

For our experiments, we ran for 1.7M iterations. After this point, you can run debatchnorm.py on the output (you'll need your own copy of a caffenet with the groups removed). Once you've run it, you have a model that can be fine-tuned. I recommend using our data-dependent initialization and calibration procedure [ Krähenbühl et al. ] before fine-tuning, as debatchnorm.py will lead to badly-scaled weights. The network trained using this procedure and fine-tuned with fast-rcnn on VOC2007 achieves 51.4% mAP.


Unsupervised Representation Learning by Predicting Image Rotations

Over the last few years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that they get as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit state-of-the-art performance in all of them. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, in the PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4%, which is only 2.4 points lower than the supervised case. We get similarly striking results when we transfer our unsupervised learned features to various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published on: https://github.com/gidariss/FeatureLearningRotNet .

1 Introduction

In recent years, the widespread adoption of deep convolutional neural networks (LeCun et al., 1998) (ConvNets) in computer vision has led to tremendous progress in the field. Specifically, by training ConvNets on the object recognition (Russakovsky et al., 2015) or the scene classification (Zhou et al., 2014) tasks with a massive amount of manually labeled data, one can learn powerful visual representations suitable for image understanding tasks. For instance, the image features learned by ConvNets in this supervised manner have achieved excellent results when they are transferred to other vision tasks, such as object detection (Girshick, 2015), semantic segmentation (Long et al., 2015), or image captioning (Karpathy & Fei-Fei, 2015). However, supervised feature learning has the main limitation of requiring intensive manual labeling effort, which is both expensive and infeasible to scale to the vast amount of visual data that are available today.

As a result, there has recently been increased interest in learning high level ConvNet based representations in an unsupervised manner that avoids manual annotation of visual data. A prominent paradigm is so-called self-supervised learning, which defines an annotation-free pretext task, using only the visual information present in the images or videos, in order to provide a surrogate supervision signal for feature learning. For example, in order to learn features, Zhang et al. (2016a) and Larsson et al. (2016) train ConvNets to colorize gray scale images, Doersch et al. (2015) and Noroozi & Favaro (2016) predict the relative position of image patches, and Agrawal et al. (2015) predict the egomotion (i.e., self-motion) of a moving vehicle between two consecutive frames. The rationale behind such self-supervised tasks is that solving them forces the ConvNet to learn semantic image features that can be useful for other vision tasks. In fact, image representations learned with the above self-supervised tasks, although they have not matched the performance of supervised-learned representations, have proved to be good alternatives for transfer to other vision tasks, such as object recognition, object detection, and semantic segmentation (Zhang et al., 2016a; Larsson et al., 2016; Zhang et al., 2016b; Larsson et al., 2017; Doersch et al., 2015; Noroozi & Favaro, 2016; Noroozi et al., 2017; Pathak et al., 2016a; Doersch & Zisserman, 2017). Other successful cases of unsupervised feature learning are clustering based methods (Dosovitskiy et al., 2014; Liao et al., 2016; Yang et al., 2016), reconstruction based methods (Bengio et al., 2007; Huang et al., 2007; Masci et al., 2011), and methods that involve learning generative probabilistic models (Goodfellow et al., 2014; Donahue et al., 2016; Radford et al., 2015).

Our work follows the self-supervised paradigm and proposes to learn image representations by training ConvNets to recognize the geometric transformation that is applied to the image that they get as input. More specifically, we first define a small set of discrete geometric transformations; then each of those geometric transformations is applied to each image in the dataset, and the produced transformed images are fed to the ConvNet model, which is trained to recognize the transformation of each image. In this formulation, it is the set of geometric transformations that actually defines the classification pretext task that the ConvNet model has to learn. Therefore, in order to achieve unsupervised semantic feature learning, it is of crucial importance to properly choose those geometric transformations (we further discuss this aspect of our methodology in section 2.2). What we propose is to define the geometric transformations as the image rotations by 0, 90, 180, and 270 degrees. Thus, the ConvNet model is trained on the 4-way image classification task of recognizing one of the four image rotations (see Figure 3). We argue that in order for a ConvNet model to be able to recognize the rotation transformation that was applied to an image, it must understand the concept of the objects depicted in the image (see Figure 2), such as their location in the image, their type, and their pose. Throughout the paper we support that argument both qualitatively and quantitatively. Furthermore, we demonstrate in the experimental section of the paper that despite the simplicity of our self-supervised approach, the task of predicting rotation transformations provides a powerful surrogate supervision signal for feature learning and leads to dramatic improvements on the relevant benchmarks.

Note that our self-supervised task is different from the work of Dosovitskiy et al. (2014) and Agrawal et al. (2015), which also involves geometric transformations. Dosovitskiy et al. (2014) train a ConvNet model to yield representations that are discriminative between images and at the same time invariant to geometric and chromatic transformations. In contrast, we train a ConvNet model to recognize the geometric transformation applied to an image. It is also fundamentally different from the egomotion method of Agrawal et al. (2015), which employs a ConvNet model with a siamese-like architecture that takes as input two consecutive video frames and is trained to predict (through regression) their camera transformation. Instead, in our approach, the ConvNet takes as input a single image to which we have applied a random geometric transformation (i.e., rotation) and is trained to recognize (through classification) this geometric transformation without having access to the initial image.

Our contributions are:

We propose a new self-supervised task that is very simple and at the same time, as we demonstrate throughout the paper, offers a powerful supervisory signal for semantic feature learning.

We exhaustively evaluate our self-supervised method under various settings (e.g. semi-supervised or transfer learning settings) and in various vision tasks (i.e., CIFAR-10, ImageNet, Places, and PASCAL classification, detection, or segmentation tasks).

In all of them, our novel self-supervised formulation demonstrates state-of-the-art results with dramatic improvements w.r.t. prior unsupervised approaches.

As a consequence we show that for several important vision tasks, our self-supervised learning approach significantly narrows the gap between unsupervised and supervised feature learning.

In the following sections, we describe our self-supervised methodology in §2, we provide experimental results in §3, and finally we conclude in §4.

Figure: example images rotated by 0, 90, 180, and 270 degrees, as used as inputs for the rotation prediction task.

2 Methodology

2.1 Overview

The goal of our work is to learn ConvNet based semantic features in an unsupervised manner. To achieve that goal we propose to train a ConvNet model $F(\cdot)$ to estimate the geometric transformation applied to an image that is given to it as input. Specifically, we define a set of $K$ discrete geometric transformations $G=\{g(\cdot|y)\}_{y=1}^{K}$, where $g(\cdot|y)$ is the operator that applies to image $X$ the geometric transformation with label $y$, yielding the transformed image $X^{y}=g(X|y)$. The ConvNet model $F(\cdot)$ gets as input an image $X^{y^{*}}$ (where the label $y^{*}$ is unknown to model $F(\cdot)$) and yields as output a probability distribution over all possible geometric transformations:

$$F(X^{y^{*}}|\theta)=\{F^{y}(X^{y^{*}}|\theta)\}_{y=1}^{K},$$

where $F^{y}(X^{y^{*}}|\theta)$ is the predicted probability for the geometric transformation with label $y$ and $\theta$ are the learnable parameters of model $F(\cdot)$.

Therefore, given a set of $N$ training images $D=\{X_{i}\}_{i=0}^{N}$, the self-supervised training objective that the ConvNet model must learn to solve is:

$$\min_{\theta}\frac{1}{N}\sum_{i=1}^{N} loss(X_{i},\theta),$$

where the loss function $loss(\cdot)$ is defined as:

$$loss(X_{i},\theta)=-\frac{1}{K}\sum_{y=1}^{K}\log\big(F^{y}(g(X_{i}|y)\,|\,\theta)\big).$$

In the following subsection we describe the type of geometric transformations that we propose in our work.

2.2 Choosing geometric transformations: image rotations

In the above formulation, the geometric transformations $G$ must define a classification task that should force the ConvNet model to learn semantic features useful for visual perception tasks (e.g., object detection or image classification). In our work we propose to define the set of geometric transformations $G$ as all the image rotations by multiples of 90 degrees, i.e., 2d image rotations by 0, 90, 180, and 270 degrees (see Figure 3). More formally, if $Rot(X,\phi)$ is an operator that rotates image $X$ by $\phi$ degrees, then our set of geometric transformations consists of the $K=4$ image rotations $G=\{g(X|y)\}_{y=1}^{4}$, where $g(X|y)=Rot(X,(y-1)\cdot 90)$.

Forcing the learning of semantic features: The core intuition behind using these image rotations as the set of geometric transformations relates to the simple fact that it is essentially impossible for a ConvNet model to effectively perform the above rotation recognition task unless it has first learnt to recognize and detect classes of objects as well as their semantic parts in images. More specifically, to successfully predict the rotation of an image the ConvNet model must necessarily learn to localize salient objects in the image, recognize their orientation and object type, and then relate the object orientation to the dominant orientation in which each type of object tends to be depicted in the available images. In Figure 5(b) we visualize some attention maps generated by a model trained on the rotation recognition task. These attention maps are computed based on the magnitude of activations at each spatial cell of a convolutional layer and essentially reflect where the network puts most of its focus in order to classify an input image. We observe, indeed, that in order to accomplish the rotation prediction task the model learns to focus on high level object parts in the image, such as eyes, noses, tails, and heads. By comparing them with the attention maps generated by a model trained on the object recognition task in a supervised way (see Figure 5(a)) we observe that both models seem to focus on roughly the same image regions. Furthermore, in Figure 7 we visualize the first layer filters that were learnt by an AlexNet model trained on the proposed rotation recognition task. As can be seen, they comprise a large variety of edge filters at multiple orientations and multiple frequencies. Remarkably, these filters appear to be even more varied than the filters learnt on the supervised object recognition task.
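The attention maps mentioned above can be computed in a few lines; the following is a sketch (not the authors' exact code) that collapses a convolutional feature map into a spatial map by aggregating activation magnitudes over channels, as described in the text.

```python
import numpy as np

def attention_map(feature_map):
    """Turn a conv feature map of shape (C, H, W) into an (H, W) attention map
    by summing activation magnitudes over channels and normalizing to [0, 1]."""
    amap = np.abs(feature_map).sum(axis=0)
    amap = amap - amap.min()
    return amap / (amap.max() + 1e-8)
```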

Figure: input images and the corresponding attention maps computed from the Conv1 (27 × 27), Conv3 (13 × 13), and Conv5 (6 × 6) feature maps.

Absence of low-level visual artifacts: An additional important advantage of using image rotations by multiples of 90 degrees over other geometric transformations is that they can be implemented by flip and transpose operations (as we will see below) that do not leave any easily detectable low-level visual artifacts that would lead the ConvNet to learn trivial features with no practical value for visual perception tasks. In contrast, had we chosen other geometric transformations, e.g., scale and aspect-ratio image transformations, implementing them would require image resizing routines that leave easily detectable image artifacts.

Well-posedness: Furthermore, human-captured images tend to depict objects in an “up-standing” position, making the rotation recognition task well defined, i.e., given an image rotated by 0, 90, 180, or 270 degrees, there is usually no ambiguity about which rotation transformation was applied (with the exception of images that depict only round objects). In contrast, that is not the case for object scale, which varies significantly across human-captured images.

Implementing image rotations: In order to implement the image rotations by 90, 180, and 270 degrees (the 0 degrees case is the image itself), we use flip and transpose operations. Specifically, for the 90 degrees rotation we first transpose the image and then flip it vertically (upside-down flip), for the 180 degrees rotation we flip the image first vertically and then horizontally (left-right flip), and finally for the 270 degrees rotation we first flip the image vertically and then transpose it.
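The recipe above translates directly into a few lines of NumPy; this is a sketch following the stated flip/transpose steps, not the paper's released code.

```python
import numpy as np

def rotate_image(x, y):
    """Rotate an HxWxC image by y * 90 degrees using only transpose and flip operations."""
    if y == 0:
        return x
    if y == 1:                                       # 90 degrees: transpose, then upside-down flip
        return np.flipud(np.transpose(x, (1, 0, 2)))
    if y == 2:                                       # 180 degrees: upside-down flip, then left-right flip
        return np.fliplr(np.flipud(x))
    if y == 3:                                       # 270 degrees: upside-down flip, then transpose
        return np.transpose(np.flipud(x), (1, 0, 2))
    raise ValueError("y must be in {0, 1, 2, 3}")
```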

2.3 Discussion

The simple formulation of our self-supervised task has several advantages. It has the same computational cost as supervised learning, similar training convergence speed (that is significantly faster than image reconstruction based approaches; our AlexNet model trains in around 2 days using a single Titan X GPU), and can trivially adopt the efficient parallelization schemes devised for supervised learning  (Goyal et al., 2017 ) , making it an ideal candidate for unsupervised learning on internet-scale data (i.e., billions of images). Furthermore, our approach does not require any special image pre-processing routine in order to avoid learning trivial features, as many other unsupervised or self-supervised approaches do. Despite the simplicity of our self-supervised formulation, as we will see in the experimental section of the paper, the features learned by our approach achieve dramatic improvements on the unsupervised feature learning benchmarks.

3 Experimental Results

In this section we conduct an extensive evaluation of our approach on the most commonly used image datasets, such as CIFAR-10  (Krizhevsky & Hinton, 2009 ) , ImageNet  (Russakovsky et al., 2015 ) , PASCAL  (Everingham et al., 2010 ) , and Places205  (Zhou et al., 2014 ) , as well as on various vision tasks, such as object detection, object segmentation, and image classification. We also consider several learning scenarios, including transfer learning and semi-supervised learning. In all cases, we compare our approach with corresponding state-of-the-art methods.

3.1 CIFAR experiments

We start by evaluating, on the CIFAR-10 object recognition task, the ConvNet based features learned by the proposed self-supervised task of rotation recognition. We will hereafter call a ConvNet model trained on the self-supervised task of rotation recognition a RotNet model.

Implementation details: In our CIFAR-10 experiments we implement the RotNet models with Network-In-Network (NIN) architectures (Lin et al., 2013). In order to train them on the rotation prediction task, we use SGD with batch size 128, momentum 0.9, weight decay 5e-4, and a learning rate of 0.1. We drop the learning rate by a factor of 5 after epochs 30, 60, and 80, and we train for 100 epochs in total. In our preliminary experiments we found that we get a significant improvement when, during training, we feed the network all four rotated copies of an image simultaneously instead of randomly sampling a single rotation transformation each time. Therefore, at each training batch the network sees 4 times more images than the batch size.
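A minimal sketch of such a training step is shown below, assuming a PyTorch classifier with a 4-way output (this is illustrative, not the authors' implementation); torch.rot90 is used for brevity and is equivalent to the flip/transpose recipe of Section 2.2.

```python
import torch
import torch.nn.functional as F

def rotation_pretext_step(model, optimizer, images):
    """One pretext-task step: feed all four rotated copies of each image in the same batch."""
    rotated, labels = [], []
    for y in range(4):
        rotated.append(torch.rot90(images, k=y, dims=(2, 3)))            # rotate by y * 90 degrees
        labels.append(torch.full((images.size(0),), y, dtype=torch.long))
    x = torch.cat(rotated)                       # effective batch is 4x the nominal batch size
    t = torch.cat(labels).to(images.device)
    loss = F.cross_entropy(model(x), t)          # 4-way rotation classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```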

Evaluation of the learned feature hierarchies: First, we explore how the quality of the learned features depends on their depth (i.e., the depth of the layer that they come from) as well as on the total depth of the RotNet model. For that purpose, we first train, using the CIFAR-10 training images, three RotNet models which have 3, 4, and 5 convolutional blocks respectively (note that each conv. block in the NIN architectures that implement our RotNet models has 3 conv. layers; therefore, the total number of conv. layers of the examined RotNet models is 9, 12, and 15 for 3, 4, and 5 conv. blocks respectively). Afterwards, we learn classifiers on top of the feature maps generated by each conv. block of each RotNet model. Those classifiers are trained in a supervised way on the object recognition task of CIFAR-10. They consist of 3 fully connected layers; the 2 hidden layers have 200 feature channels each and are followed by batch-norm and relu units. We report the accuracy results on the CIFAR-10 test set in Table 1. We observe that in all cases the feature maps generated by the 2nd conv. block (which actually has depth 6 in terms of the total number of conv. layers up to that point) achieve the highest accuracy, i.e., between 88.26% and 89.06%. The features of the conv. blocks that follow the 2nd one gradually degrade the object recognition accuracy, which we assume is because they start becoming more and more specific to the self-supervised task of rotation prediction. Also, we observe that increasing the total depth of the RotNet models leads to increased object recognition performance by the feature maps generated by earlier layers (after the 1st conv. block). We assume that this is because increasing the depth of the model, and thus the complexity of its head (i.e., top ConvNet layers), allows the features of earlier layers to be less specific to the rotation prediction task.
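The non-linear classifiers described above can be sketched as follows (assumed PyTorch, for illustration only): three fully connected layers, where the two hidden layers have 200 feature channels and are followed by batch-norm and ReLU, trained with standard supervised cross-entropy on top of the frozen RotNet feature maps.

```python
import torch.nn as nn

def make_probe(in_features, num_classes=10, hidden=200):
    """Non-linear probe trained on top of frozen feature maps (flattened to a vector)."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_features, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(inplace=True),
        nn.Linear(hidden, num_classes),
    )
```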

Exploring the quality of the learned features w.r.t. the number of recognized rotations: In Table 2 we explore how the quality of the self-supervised features depends on the number of discrete rotations used in the rotation prediction task. For that purpose we defined three extra rotation recognition tasks: (a) one with 8 rotations that includes all the multiples of 45 degrees, (b) one with only the 0 and 180 degree rotations, and (c) one with only the 90 and 270 degree rotations. In order to implement the rotation transformations of 45, 135, 225, and 315 degrees (in the 8 discrete rotations case), we used an image warping routine and then took care to crop only the central square image regions that do not include any of the empty image areas introduced by the rotation transformations (and which could easily indicate the image rotation). We observe that with 4 discrete rotations (as we proposed) we indeed achieve better object recognition performance than with 8 or 2 rotations. We believe that this is because the 2 orientations case offers too few classes for recognition (i.e., less supervisory information is provided), while in the 8 orientations case the geometric transformations are not distinguishable enough, and furthermore the 4 extra rotations may introduce visual artifacts on the rotated images. Moreover, we observe that among the RotNet models trained with 2 discrete rotations, the model trained with the 90 and 270 degree rotations achieves worse object recognition performance than the model trained with the 0 and 180 degree rotations, which is probably because the former model does not “see” during the unsupervised phase the 0 degree rotation that is typically used during the object recognition training phase.

Comparison against supervised and other unsupervised methods: In Table 3 we compare our unsupervised learned features against other unsupervised (or hand-crafted) features on CIFAR-10. For our entries we use the feature maps generated by the 2nd conv. block of a RotNet model with 4 conv. blocks in total. On top of those RotNet features we train 2 different classifiers: (a) a non-linear classifier with 3 fully connected layers as before (entry (Ours) RotNet + non-linear), and (b) three conv. layers plus a linear prediction layer (entry (Ours) RotNet + conv.; note that this entry is basically a 3-block NIN model with the first 2 blocks coming from a RotNet model and the 3rd being randomly initialized and trained on the recognition task). We observe that we improve over the prior unsupervised approaches and achieve state-of-the-art results on CIFAR-10 (note that each of the prior approaches has a different ConvNet architecture, so the comparison with them is only indicative). More notably, the accuracy gap between the RotNet based model and the fully supervised NIN model is very small, only 1.64 percentage points (92.80% vs 91.16%). We provide a per-class breakdown of the classification accuracy of our unsupervised model as well as the supervised one in Table 9 (in appendix B). In Table 3 we also report the performance of the RotNet features when, instead of being kept frozen, they are fine-tuned during the object recognition training phase. We observe that fine-tuning the unsupervised learned features further improves the classification performance, thus reducing even more the gap with the supervised case.


Correlation between the object classification task and the rotation prediction task: In Figure 9(a), we plot the object classification accuracy as a function of the number of training epochs used for solving the self-supervised task of recognizing rotations, which learns the features used by the object classifier. More specifically, in order to create the object recognition accuracy curve, at each training snapshot of RotNet (i.e., every 20 epochs), we pause its training procedure and train from scratch (until convergence) a non-linear object classifier on top of the RotNet features learnt so far. Therefore, the object recognition accuracy curve depicts the accuracy of those non-linear object classifiers after the end of their training, while the rotation prediction accuracy curve depicts the accuracy of the RotNet at those snapshots. We observe that, as the ability of the RotNet features to solve the rotation prediction task improves (i.e., as the rotation prediction accuracy increases), their ability to help solve the object recognition task improves as well (i.e., the object recognition accuracy also increases). Furthermore, we observe that the object recognition accuracy converges fast w.r.t. the number of training epochs used for solving the pretext task of rotation prediction.

Semi-supervised setting: Motivated by the very high performance of our unsupervised feature learning method, we also evaluate it in a semi-supervised setting. More specifically, we first train a 4-block RotNet model on the rotation prediction task using the entire CIFAR-10 image dataset, and then we train object classifiers on top of its feature maps using only a subset of the available images and their corresponding labels. As feature maps we use those generated by the 2nd conv. block of the RotNet model. As a classifier we use a set of convolutional layers that has the same architecture as the 3rd conv. block of a NIN model plus a linear classifier, all randomly initialized. For training the object classifier we use 20, 100, 400, 1000, or 5000 image examples per category. Note that 5000 image examples per category is the extreme case of using the entire CIFAR-10 training dataset. We also compare our method with a supervised model that is trained only on the available examples in each case. In Figure 9(b) we plot the accuracy of the examined models as a function of the available training examples. We observe that in this semi-supervised setting our unsupervised-trained model exceeds the supervised model when the number of examples per category drops below 1000. Furthermore, as the number of examples decreases, the performance gap in favor of our method increases. This empirical evidence demonstrates the usefulness of our method in semi-supervised settings.

3.2 Evaluation of self-supervised features trained in ImageNet

Here we evaluate the performance of our self-supervised ConvNet models on the ImageNet, Places, and PASCAL VOC datasets. Specifically, we first train a RotNet model on the training images of the ImageNet dataset and then we evaluate the performance of the self-supervised features on the image classification tasks of ImageNet, Places, and PASCAL VOC datasets and on the object detection and object segmentation tasks of PASCAL VOC.

Implementation details: For these experiments we implemented our RotNet model with an AlexNet architecture. Our implementation of the AlexNet model does not have local response normalization units, dropout units, or groups in the convolutional layers, while it includes batch normalization units after each linear layer (either convolutional or fully connected). In order to train the AlexNet based RotNet model, we use SGD with batch size 192, momentum 0.9, weight decay 5e-4, and a learning rate of 0.01. We drop the learning rate by a factor of 10 after epochs 10 and 20, and we train for 30 epochs in total. As in the CIFAR experiments, during training we feed the RotNet model all four rotated copies of an image simultaneously (in the same mini-batch).

ImageNet classification task: We evaluate the task generalization of our self-supervised learned features by training on top of them non-linear object classifiers for the ImageNet classification task (following the evaluation scheme of (Noroozi & Favaro, 2016)). In Table 4 we report the classification performance of our self-supervised features and compare it with the other unsupervised approaches. We observe that our approach surpasses all the other methods by a significant margin. For the feature maps generated by the Conv4 layer, our improvement is more than 4 percentage points, and for the feature maps generated by the Conv5 layer, our improvement is even bigger, around 8 percentage points. Furthermore, our approach significantly narrows the performance gap between unsupervised features and supervised features. In Table 5 we report similar results for linear (logistic regression) classifiers (following the evaluation scheme of Zhang et al. (2016a)). Again, our unsupervised method demonstrates significant improvements over prior unsupervised methods.

Transfer learning evaluation on PASCAL VOC: In Table 7 we evaluate the task and dataset generalization of our unsupervised learned features by fine-tuning them on the PASCAL VOC classification, detection, and segmentation tasks. As with the ImageNet classification task, we outperform all competing unsupervised methods by a significant margin in all tested tasks, significantly narrowing the gap with the supervised case. Notably, the PASCAL VOC 2007 object detection performance that our self-supervised model achieves is 54.4% mAP, which is only 2.4 points lower than the supervised case. We provide the per-class detection performance of our method in Table 8 (in appendix B).

Places classification task: In Table  6 we evaluate the task and dataset generalization of our approach by training linear (logistic regression) classifiers on top of the learned features in order to perform the 205-way Places classification task. Note that in this case the learnt features are evaluated w.r.t. their generalization on classes that were “unseen” during the unsupervised training phase. As can be seen, even in this case our method manages to either surpass or achieve comparable results w.r.t. prior state-of-the-art unsupervised learning approaches.

4 Conclusions

In our work we propose a novel formulation for self-supervised feature learning that trains a ConvNet model to be able to recognize the image rotation that has been applied to its input images. Despite the simplicity of our self-supervised task, we demonstrate that it successfully forces the ConvNet model trained on it to learn semantic features that are useful for a variety of visual perception tasks, such as object recognition, object detection, and object segmentation. We exhaustively evaluate our method in various unsupervised and semi-supervised benchmarks and we achieve in all of them state-of-the-art performance. Specifically, our self-supervised approach manages to drastically improve the state-of-the-art results on unsupervised feature learning for ImageNet classification, PASCAL classification, PASCAL detection, PASCAL segmentation, and CIFAR-10 classification, surpassing prior approaches by a significant margin and thus drastically reducing the gap between unsupervised and supervised feature learning.

5 Acknowledgements

This work was supported by the ANR SEMAPOLIS project, an INTEL gift, and hardware donation by NVIDIA.

References

  • Agrawal et al. (2015) Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision , pp.  37–45, 2015.
  • Bengio et al. (2007) Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems , pp. 153–160, 2007.
  • Bojanowski & Joulin (2017) Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. arXiv preprint arXiv:1704.05310 , 2017.
  • Doersch & Zisserman (2017) Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. CoRR , abs/1708.07860, 2017.
  • Doersch et al. (2015) Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision , pp.  1422–1430, 2015.
  • Donahue et al. (2016) Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782 , 2016.
  • Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems , pp. 766–774, 2014.
  • Everingham et al. (2010) M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision , 88(2):303–338, June 2010.
  • Girshick (2015) Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision , pp.  1440–1448, 2015.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems , pp. 2672–2680, 2014.
  • Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 , 2017.
  • Huang et al. (2007) Fu Jie Huang, Y-Lan Boureau, Yann LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on , pp.  1–8. IEEE, 2007.
  • Karpathy & Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp.  3128–3137, 2015.
  • Krähenbühl et al. (2015) Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856 , 2015.
  • Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • Larsson et al. (2016) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European Conference on Computer Vision , pp.  577–593. Springer, 2016.
  • Larsson et al. (2017) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. arXiv preprint arXiv:1703.04044 , 2017.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE , 86(11):2278–2324, 1998.
  • Liao et al. (2016) Renjie Liao, Alex Schwing, Richard Zemel, and Raquel Urtasun. Learning deep parsimonious representations. In Advances in Neural Information Processing Systems , pp. 5076–5084, 2016.
  • Lin et al. (2013) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400 , 2013.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , June 2015.
  • Masci et al. (2011) Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning–ICANN 2011 , pp.  52–59, 2011.
  • Noroozi & Favaro (2016) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision , pp.  69–84. Springer, 2016.
  • Noroozi et al. (2017) Mehdi Noroozi, Hamed Pirsiavash, and Paolo Favaro. Representation learning by learning to count. arXiv preprint arXiv:1708.06734 , 2017.
  • Oyallon & Mallat (2015) Edouard Oyallon and Stéphane Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp.  2865–2873, 2015.
  • Oyallon et al. (2017) Edouard Oyallon, Eugene Belilovsky, and Sergey Zagoruyko. Scaling the scattering transform: Deep hybrid networks. arXiv preprint arXiv:1703.08961 , 2017.
  • Pathak et al. (2016a) Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, and Bharath Hariharan. Learning features by watching objects move. arXiv preprint arXiv:1612.06370 , 2016a.
  • Pathak et al. (2016b) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp.  2536–2544, 2016b.
  • Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 , 2015.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision , 115(3):211–252, 2015.
  • Wang & Gupta (2015) Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision , pp.  2794–2802, 2015.
  • Yang et al. (2016) Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep representations and image clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp.  5147–5156, 2016.
  • Zhang et al. (2016a) Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision , pp.  649–666. Springer, 2016a.
  • Zhang et al. (2016b) Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. arXiv preprint arXiv:1611.09842 , 2016b.
  • Zhou et al. (2014) Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems 27 , pp.  487–495. Curran Associates, Inc., 2014.

Appendix A Visualizing attention maps of rotated images

Here we visualize the attention maps generated by an AlexNet model trained on the self-supervised task of rotation recognition for all the rotated copies of a few images. We observe that the attention maps of all the rotated copies of an image are roughly the same, i.e., the attention maps are equivariant w.r.t. the image rotations. This practically means that in order to accomplish the rotation prediction task the network focuses on the same object parts regardless of the image rotation.

Attention maps of Conv3 feature maps (size: 13 × 13)

Attention maps of Conv5 feature maps (size: 6 × 6)

Appendix B Per class breakdown of detection and classification performance

In Tables  8 and  9 we report the per class performance of our unsupervised learning method on the PASCAL detection and CIFAR-10 classification tasks respectively.


The world is getting “smarter” every day, and to keep up with consumer expectations, companies are increasingly using machine learning algorithms to make things easier. You can see them in use in end-user devices (through face recognition for unlocking smartphones) or for detecting credit card fraud (like triggering alerts for unusual purchases).

Within  artificial intelligence  (AI) and  machine learning , there are two basic approaches: supervised learning and unsupervised learning. The main difference is that one uses labeled data to help predict outcomes, while the other does not. However, there are some nuances between the two approaches, and key areas in which one outperforms the other. This post clarifies the differences so you can choose the best approach for your situation.

Supervised learning  is a machine learning approach that’s defined by its use of labeled data sets. These data sets are designed to train or “supervise” algorithms into classifying data or predicting outcomes accurately. Using labeled inputs and outputs, the model can measure its accuracy and learn over time.

Supervised learning can be separated into two types of problems when data mining: classification and regression.

  • Classification  problems use an algorithm to accurately assign test data into specific categories, such as separating apples from oranges. Or, in the real world, supervised learning algorithms can be used to classify spam in a separate folder from your inbox. Linear classifiers, support vector machines, decision trees and  random forest  are all common types of classification algorithms.
  • Regression  is another type of supervised learning method that uses an algorithm to understand the relationship between dependent and independent variables. Regression models are helpful for predicting numerical values based on different data points, such as sales revenue projections for a given business. Some popular regression algorithms are linear regression, logistic regression, and polynomial regression.

Unsupervised learning  uses machine learning algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden patterns in data without the need for human intervention (hence, they are “unsupervised”).

Unsupervised learning models are used for three main tasks: clustering, association and dimensionality reduction:

  • Clustering is a data mining technique for grouping unlabeled data based on their similarities or differences. For example, K-means clustering algorithms assign similar data points into groups, where the K value represents the number of groups and thus the granularity of the clustering. This technique is helpful for market segmentation, image compression, and so on.
  • Association  is another type of unsupervised learning method that uses different rules to find relationships between variables in a given data set. These methods are frequently used for market basket analysis and recommendation engines, along the lines of “Customers Who Bought This Item Also Bought” recommendations.
  • Dimensionality reduction is a learning technique that is used when the number of features (or dimensions) in a given data set is too high. It reduces the number of data inputs to a manageable size while also preserving the data integrity. Often, this technique is used in the data preprocessing stage, such as when autoencoders remove noise from visual data to improve picture quality.

The main distinction between the two approaches is the use of labeled data sets. To put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.

In supervised learning, the algorithm “learns” from the training data set by iteratively making predictions on the data and adjusting for the correct answer. While supervised learning models tend to be more accurate than unsupervised learning models, they require upfront human intervention to label the data appropriately. For example, a supervised learning model can predict how long your commute will be based on the time of day, weather conditions and so on. But first, you must train it to know that rainy weather extends the driving time.

Unsupervised learning models, in contrast, work on their own to discover the inherent structure of unlabeled data. Note that they still require some human intervention for validating output variables. For example, an unsupervised learning model can identify that online shoppers often purchase groups of products at the same time. However, a data analyst would need to validate that it makes sense for a recommendation engine to group baby clothes with an order of diapers, applesauce, and sippy cups.
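To make the contrast concrete, here is a minimal scikit-learn sketch (illustrative code, not part of the original post): the supervised model fits labeled pairs, while K-means groups the same inputs without ever seeing the labels.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: 300 points in 3 groups; y plays the role of human-provided labels.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: learn from (input, label) pairs, then predict labels for new inputs.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: K-means sees only the inputs and discovers 3 clusters on its own;
# the cluster ids it assigns need not match the original label names.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])
```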

  • Goals:  In supervised learning, the goal is to predict outcomes for new data. You know up front the type of results to expect. With an unsupervised learning algorithm, the goal is to get insights from large volumes of new data. The machine learning itself determines what is different or interesting from the data set.
  • Applications: Supervised learning models are ideal for spam detection, sentiment analysis, weather forecasting and pricing predictions, among other things. In contrast, unsupervised learning is a great fit for anomaly detection, recommendation engines, customer personas and medical imaging.
  • Complexity:  Supervised learning is a simple method for machine learning, typically carried out using tools such as R or Python. In unsupervised learning, you need powerful tools for working with large amounts of unclassified data. Unsupervised learning models are computationally complex because they need a large training set to produce the intended outcomes.
  • Drawbacks: Supervised learning models can be time-consuming to train, and the labels for input and output variables require expertise. Meanwhile, unsupervised learning methods can have wildly inaccurate results unless you have human intervention to validate the output variables.

Choosing the right approach for your situation depends on how your data scientists assess the structure and volume of your data, as well as the use case. To make your decision, be sure to do the following:

  • Evaluate your input data:  Is it labeled or unlabeled data? Do you have experts that can support extra labeling?
  • Define your goals:  Do you have a recurring, well-defined problem to solve? Or will the algorithm need to predict new problems?
  • Review your options for algorithms:  Are there algorithms with the same dimensionality that you need (number of features, attributes, or characteristics)? Can they support your data volume and structure?

Classifying big data can be a real challenge in supervised learning, but the results are highly accurate and trustworthy. In contrast, unsupervised learning can handle large volumes of data in real time. But, there’s a lack of transparency into how data is clustered and a higher risk of inaccurate results. This is where semi-supervised learning comes in.

Can’t decide on whether to use supervised or unsupervised learning?  Semi-supervised learning  is a happy medium, where you use a training data set with both labeled and unlabeled data. It’s particularly useful when it’s difficult to extract relevant features from data—and when you have a high volume of data.

Semi-supervised learning is ideal for medical images, where a small amount of training data can lead to a significant improvement in accuracy. For example, a radiologist can label a small subset of CT scans for tumors or diseases so the machine can more accurately predict which patients might require more medical attention.

Machine learning models are a powerful way to gain the data insights that improve our world. To learn more about the specific algorithms that are used with supervised and unsupervised learning, we encourage you to delve into the Learn Hub articles on these techniques. We also recommend checking out the blog post that goes a step further, with a detailed look at deep learning and neural networks.


Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition

  • Open access
  • Published: 08 May 2024
  • Volume 56, article number 168 (2024)


  • Jingyu Zhao,
  • Ruwei Li,
  • Maocun Tian &
  • Weidong An


To address the challenges of the poor representation capability and low data utilization rate of end-to-end speech recognition models in deep learning, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR). It adopts a multi-task learning paradigm for training. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model’s characterization capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to effectively exploit data information. Our approach is rigorously evaluated on the Aishell-1 dataset, and we further validate its effectiveness on the English corpus WSJ. The experimental results demonstrate a noteworthy 4.6% reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of our proposed MM-ASR model for end-to-end speech recognition tasks.


1 Introduction

Automatic Speech Recognition (ASR) technology plays a pivotal role in facilitating human-computer interaction by converting speech signals into text [ 1 ]. Indeed, ASR technology built on deep learning has made significant strides in recent years [ 2 ]. However, as people’s demands for accuracy and robustness in ASR models continue to grow, there are challenges in meeting these requirements. While the development of hybrid deep neural network models (DNNs) [ 3 ], encompassing acoustic, linguistic, and lexical models, has led to improved accuracy in automatic voice recognition, these models involve multiple modules and a tedious training procedure. Each module requires independent tuning, which can result in cumulative errors in the overall model. In response to these challenges, the field of voice recognition has undergone a noteworthy shift from hybrid models towards end-to-end (E2E) models [ 4 , 5 ]. The E2E speech recognition model employs a single network to directly transform input speech sequences into output token sequences. By merging the acoustic model and linguistic model from traditional speech recognition into a unified network, the E2E model effectively simplifies the structure of the speech recognition process. This transition to end-to-end models brings the advantage of streamlining the ASR model, reducing complexity, and potentially improving overall performance and robustness. As research continues in this direction, we can anticipate further advancements in ASR technology, ultimately catering to the increasing demands of diverse applications and enhancing the quality of life for users.

Currently, there are several research directions in the field of end-to-end speech recognition: the connectionist temporal classification (CTC) method [ 6 , 7 , 8 ], recurrent neural network transducers (RNN-T) [ 9 ], and attention-based encoder-decoder models (AED) [ 10 ]. These end-to-end (E2E) models treat automatic speech recognition (ASR) as a sequence-to-sequence problem, where a neural network is directly employed to learn the mapping from speech to text. The CTC method has been extensively researched due to its straightforward modeling process, which involves only an encoder and outputs each token independently; its decoding speed is fast, but its recognition accuracy is often subpar because it assumes conditional independence between output tokens. The RNN-T model comprises two networks: an encoder that maps input acoustic frames to a higher-level representation, and a prediction-and-joint network that forms the decoder [ 11 ]. This decoder network is autoregressive, relying on past predictions. However, RNN-T training can be unstable and memory-hungry, which limits training speed; the resulting models may also be less accurate and more challenging to deploy in practice. Many advanced ASR systems [ 12 ] are based on the AED model, which incorporates an encoder for encoding acoustic data and a decoder for generating the most likely word sequence or sentence. While this model considers both previously generated tokens and the acoustic context when producing tokens, it can introduce recognition delays. Moreover, the alignment estimated by the attention mechanism is vulnerable to noise corruption in real-world speech recognition tasks, resulting in subpar recognition performance. As ASR technology continues to evolve, researchers are actively exploring ways to enhance end-to-end speech recognition models, aiming to strike a balance between accuracy, efficiency, and robustness for practical applications.

The combination of the CTC and attention models has emerged as the dominant approach for end-to-end speech recognition systems [ 13 , 14 ]. This model utilizes a multi-task learning framework and is trained with both CTC and attention objectives. The architecture consists of a shared encoder, a CTC layer, and an attention decoder. The shared encoder employs transformer [ 15 ] or conformer [ 16 ] blocks to effectively learn local and global properties of the input speech sequences, enhancing the model’s ability to capture relevant information. The CTC linear and log-softmax layers use the CTC loss function during training to optimize the softmax output. The CTC layer operates in streaming mode for the first pass, allowing for real-time streaming results. The attention decoder, consisting of transformer blocks, generates improved contextual representations and is utilized for the second pass during decoding. The attention-based decoder re-scores the N-best candidate hypotheses in a teacher-forcing manner, enabling more precise results during decoding; the recognized phrases are then re-ranked based on these scores, further improving recognition accuracy. Researchers have found that combining the CTC loss with AED leads to faster training convergence and superior recognition results. As a result, this approach has become the standard reference scheme for training end-to-end speech recognition models. However, existing end-to-end speech recognition models face limitations in mining supervised information from vast amounts of unsupervised data. They primarily focus on the output features of the last encoder layer and overlook inter-layer information. This leaves room for improvement in model characterization, data utilization, and model resilience. Continued research in these areas presents opportunities for advancing end-to-end speech recognition systems, ultimately leading to more powerful and efficient models that better utilize unsupervised data and improve recognition performance in various applications.

Based on the latest research developments, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model is trained using a hybrid strategy, incorporating both supervised and self-supervised training approaches. The primary focus of the model is on leveraging the inter-layer information of the shared encoder to enhance its characterization capability. By utilizing the diversity of this information, the model becomes more adept at representing speech data accurately. Additionally, the model incorporates multi-view self-supervised learning, which maximizes the utilization of data information and improves the model’s resilience. This is achieved by creating various shared encoder sub-models, each excluding some information, and then using multi-view self-supervised learning to effectively exploit the data. The shared encoder consists of multiple conformer blocks, allowing it to learn both local and global features of the input speech sequence. The multi-scale feature fusion module (MFF) plays a crucial role in the model, providing different weights for the output of various conformer blocks and combining these weights to generate the final output representation. The outputs of each conformer block are then stitched together to form the overall representation. The model’s decoding process involves using both the CTC and Attention decoders on the output representation. To validate the performance of the proposed model, we use WeNet [ 17 , 18 ], a speech recognition tool, as the benchmark, and the Aishell-1 [ 19 ] dataset for training and testing. Subsequently, it was further tested on the English corpus WSJ. The experimental results demonstrate the significant reduction in character error rate and improved speech recognition performance when compared to the baseline, employing four different decoding techniques. This confirms the effectiveness and potential of the proposed end-to-end speech recognition model, showcasing its capability to enhance voice recognition accuracy and performance.

2 Related Work

Based on different training objectives, SSL methods can be categorized into generative learning, discriminative learning, and multi-task learning. The research line of generative learning can be traced back to the auto-encoding model, which reconstructs the entire speech from continuous [ 20 , 21 , 22 ] or discrete [ 23 ] latent variables. Recent works propose to predict future frames from the history with an autoregressive model [ 24 , 25 , 26 , 27 ], or to recover masked frames from corrupted speech with a non-autoregressive model [ 28 , 29 , 30 , 31 , 32 ]. Apart from generative learning, discriminative learning has also gathered interest recently. Well-known examples include CPC [ 33 ], wav2vec [ 34 ], vq-wav2vec [ 35 ], wav2vec 2.0 [ 36 ], DiscreteBERT [ 37 ], and HuBERT [ 38 ]. However, self-supervised paradigms require careful design, and such representations can be difficult to interpret. There is no guarantee that the model will learn a "good" speech representation in terms of identifying the most valuable information.

Convolutional neural networks (CNNs) have proven to be useful models for various visual tasks [ 39 , 40 , 41 , 42 ]. Despite their great success, CNNs still have limitations: they mainly focus on local spatial modeling, lack global context fusion, and cannot handle long-range dependencies well. Recently, in the field of speech processing, ECAPA-TDNN [ 43 ] and its follow-up efforts [ 44 , 45 ] achieved a significant breakthrough by combining TDNN blocks and the squeeze-and-excitation (SE) [ 46 ] layer with Res2Block [ 47 ], reaching an equal error rate below 1\(\%\) on the VoxCeleb1-O benchmark. Among related approaches, MFA-Conformer [ 48 ], which is based on multi-scale feature fusion, has achieved remarkable results in speaker recognition. However, the application of multi-scale feature fusion to speech recognition tasks is still rare.

Inspired by these recent advancements, we propose an innovative end-to-end speech recognition model that combines multi-scale feature fusion with multi-view self-supervised learning. The model uses a mixed training strategy that encompasses both supervised and self-supervised learning methods.

3 The Overall Architecture of MM-ASR

Figure  1 depicts the overall layout of the multi-view self-supervised learning and multi-scale feature fusion end-to-end speech recognition model developed in this research. The model is built on a common joint CTC-Attention model with conformer blocks for the shared encoder and self-supervised loss construction by contrastive learning. It also includes a self-attentive mechanism for the multi-scale feature fusion module, a CTC layer, and an attention decoder made up of transformer blocks for the decoder.

Fig. 1 MM-ASR model architecture

3.1 Conformer Structure

The architecture of the network proposed in this study integrates both Convolutional Neural Networks (CNN) and the Transformer model to extract vocal representations. While CNNs are known for their effectiveness in extracting local properties, they often fall short in capturing global properties. The self-attention module, on the other hand, is proficient in capturing long-range global context dependencies, thereby compensating for the CNN’s inability to capture global features. Hence, the Transformer network is incorporated to tackle this shortcoming. The network configuration of the encoder used in this study is shown in Fig.  2 , composed of N layers of identical Conformer blocks [ 16 ].

The network is organized as a stack of four modules, each employing a residual connection structure [ 49 ]. These modules include the feedforward module, the multi-head self-attention (MHSA) module, the convolutional module, and a second feedforward module. The MHSA and the convolution module represent the core components of the Conformer block. The MHSA utilizes the relative position encoding scheme as proposed in the Transformer-XL model [ 50 ], which encodes the input considering the relative position deviation. It takes into account both the global content offset and the global position offset.

Following the MHSA is the convolutional module, which comprises pointwise convolution, depthwise convolution, and GLU and Swish activation layers. To assist in learning local features and facilitate the training of deep models, a BatchNorm layer is placed after the convolutional layer. Mathematically, for the input \(x_i\) of Conformer block i, the output \(y_i\) of the block can be expressed, following [ 16 ], as:

\(\tilde{x}_i = x_i + \tfrac{1}{2}\,\mathrm{FFN}(x_i)\)

\(x_i' = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)\)

\(x_i'' = x_i' + \mathrm{Conv}(x_i')\)

\(y_i = \mathrm{LN}\big(x_i'' + \tfrac{1}{2}\,\mathrm{FFN}(x_i'')\big)\)

where FFN refers to the feed-forward module, MHSA to the multi-headed self-attention module, Conv to the convolution module, and LN to the layer normalization module.
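To make the block structure concrete, here is a minimal PyTorch-style sketch of a Conformer block following the description above. All class and variable names are illustrative rather than taken from the paper's code, standard absolute-position attention is used in place of the relative positional encoding, and hyperparameters such as the convolution kernel size are assumptions.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Position-wise feed-forward module used at both ends of the block."""
    def __init__(self, dim: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * expansion),
            nn.SiLU(),                      # Swish activation
            nn.Dropout(dropout),
            nn.Linear(dim * expansion, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    """Pointwise conv -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise conv."""
    def __init__(self, dim: int, kernel_size: int = 15, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pointwise2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                   # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)    # -> (batch, dim, time) for Conv1d
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y)).transpose(1, 2)
        return y


class ConformerBlock(nn.Module):
    """Half-step FFN -> MHSA -> Conv -> half-step FFN, each with a residual, then LayerNorm."""
    def __init__(self, dim: int = 256, heads: int = 4, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = FeedForward(dim, dropout=dropout)
        self.attn_norm = nn.LayerNorm(dim)
        # Plain MHSA; the paper's encoder uses relative positional encoding instead.
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.conv = ConvModule(dim, dropout=dropout)
        self.ffn2 = FeedForward(dim, dropout=dropout)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                   # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)
        return self.final_norm(x + 0.5 * self.ffn2(x))
```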

Fig. 2 Conformer model structure diagram

3.2 Shared Encoder Based on Multi-view Self-supervised Learning

Supervised learning is a deep learning approach that identifies a functional relationship between input and output by categorizing or regressing labeled data. However, it cannot fully exploit the data as it only learns from labeled data. In contrast, self-supervised learning is a potent technique for extracting applicable and generalizable latent representations from large volumes of unlabeled data. This approach is commonly employed in sequence-to-sequence (seq2seq) model pre-training and in facilitating downstream tasks [ 51 , 52 , 53 ]. Through auxiliary or pretext tasks, the network is trained to acquire representations that are beneficial for downstream tasks, mining its supervised knowledge from large-scale unsupervised data.

Based on the above analysis, this study designs a shared encoder leveraging multi-view self-supervised learning. Figure  3 illustrates the network structure of this encoder. The green section in Fig.  3 denotes the encoder employing N layers of identical Conformer blocks to more efficiently capture speech features. The units that are randomly dropped during the training phase are depicted in the blue portion of the multi-view self-supervised learning module. This module employs the dropout regularization technique [ 54 ] to construct two distinct encoder views, thereby reducing the model’s generalization error. Specifically, dropout randomly discards some units in each layer of the neural network to prevent co-adaptation and overfitting. This study uses a self-supervised approach to regularize the output predictions of the sub-models, leveraging the structural randomness introduced by the dropout process. The outputs of the encoder views are compared to extract more reliable characterization information. To better exploit the data and enhance the robustness of the model, the supervised loss is coupled with the self-supervised contrastive loss.

Given the shared encoder input \(x_i\), \(x_i\) is fed through the network’s forward pass twice during each training step. This yields two distributions over the shared encoder output, denoted \(P_1\left( y_i \mid x_i\right)\) and \(P_2\left( y_i \mid x_i\right)\), and \(D_{KL}\left( P_1\left( y_i \mid x_i\right) \Vert P_2\left( y_i \mid x_i\right) \right)\) denotes the Kullback-Leibler (KL) divergence between them. Because dropout randomly discards units in the shared encoder, the two forward passes go through two different views of the same encoder, as indicated previously. The self-supervised method used in this study then regularizes the model predictions during training by minimizing the bidirectional KL divergence between \(P_1\left( y_i \mid x_i\right)\) and \(P_2\left( y_i \mid x_i\right)\) computed on the same batch.
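As a concrete illustration, the following PyTorch-style sketch computes a bidirectional KL regularizer from two dropout-perturbed forward passes. The function name, the 0.5 averaging factor, and the assumption that the encoder returns per-frame logits are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def multiview_kl_loss(encoder, x):
    """Bidirectional KL between two forward passes of the same batch.

    Because dropout is active in training mode, the two passes go through two
    randomly different sub-views of the shared encoder.
    """
    logits1 = encoder(x)            # first view, dropout mask A
    logits2 = encoder(x)            # second view, dropout mask B
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    # D_KL(P1 || P2) + D_KL(P2 || P1), averaged over the batch
    kl_12 = F.kl_div(p2, p1, log_target=True, reduction="batchmean")
    kl_21 = F.kl_div(p1, p2, log_target=True, reduction="batchmean")
    return 0.5 * (kl_12 + kl_21)


# Example with a dummy encoder: dropout is active because the module is in training mode.
encoder = torch.nn.Sequential(torch.nn.Linear(80, 128), torch.nn.Dropout(0.1),
                              torch.nn.Linear(128, 42))
loss_ssl = multiview_kl_loss(encoder, torch.randn(8, 80))
```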

Fig. 3 Structure diagram of a shared encoder based on multi-view self-supervised learning

3.3 Multi-scale Feature Fusion Module

In existing speech recognition models, the diversity of information across different layers is often overlooked, limiting their ability to represent the data: when the encoder extracts the final speech representation, only the features output by the last layer are passed to the decoder. This study proposes an attention-based multi-scale feature fusion module (MFF) to address this issue by maximizing the utilization of inter-layer information and thereby enhancing the model’s representational capability.

Based on this analysis, scale information is extracted by the conformer block of each layer in the shared encoder, and the scale information of different layers is mutually dependent. In this work, we explicitly model the dependencies between conformer blocks using the proposed multi-scale feature fusion module. After learning these dependencies, we sum the outputs of the conformer blocks and use the scale information extracted from each layer to form N-dimensional features, yielding acoustic features with stronger characterization ability. The structure of this module is depicted in Fig.  4 .

Fig. 4 MFF structure diagram

The implementation of the multi-scale feature fusion module involves the following steps. The outputs of the conformer blocks are first stacked into \(X \in {\mathbb {R}}^{C \times H \times W}\). X is then reshaped into the matrix \(A \in {\mathbb {R}}^{C \times N}\), and transposition, matrix multiplication, and softmax operations produce the attention map \(V \in {\mathbb {R}}^{C \times C}\):

\(v_{ji} = \dfrac{\exp \left( A_i \cdot A_j\right) }{\sum _{i=1}^{C} \exp \left( A_i \cdot A_j\right) }\)

where \(v_{ji}\) indicates the impact of the ith conformer block on the jth conformer block. The attention map V is multiplied with the matrix A, and the resulting (C \(\times \) N) output is reshaped back to (C \(\times \) H \(\times \) W). After learning the dependency relationship, the result is multiplied by the scale factor \(\beta \) and summed element-wise with X to generate the output of each conformer block \(Y \in {\mathbb {R}}^{C \times H \times W}\):

\(Y_j = \beta \sum _{i=1}^{C}\left( v_{ji} A_i\right) + X_j\)

where \(\beta \) is initialized to 0 and gradually learns to assign larger weights. Eq. ( 7 ) thus describes the process of the multi-scale feature fusion module: the weighted sum of all conformer block output features, plus the original output features of the block itself, gives the resultant features of each conformer block after learning the dependencies. This module models the dependencies between different conformer blocks, which helps to obtain more robust speech representations. The final acoustic representation provided to the decoder, \(y_c\), is generated by aggregating the outputs of all conformer blocks after learning their dependencies; \(y_c\) is the final output of the multi-scale feature fusion module. Through this weighted summation and integration of information from multiple conformer blocks, the end-to-end speech recognition model can effectively represent and comprehend complex speech patterns, enhancing its overall capability to achieve accurate and robust speech recognition.
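The following PyTorch-style sketch shows one way to realize the module described above. The class and variable names are illustrative, and the exact stacking and reshaping conventions are assumptions based on the text rather than the authors' code.

```python
import torch
import torch.nn as nn


class MultiScaleFeatureFusion(nn.Module):
    """Attention-based fusion of the outputs of all conformer blocks (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))   # scale factor, initialised to 0 and learned

    def forward(self, block_outputs):
        # block_outputs: list of C tensors, each (batch, time, dim)
        x = torch.stack(block_outputs, dim=1)      # X: (batch, C, time, dim)
        b, c, t, d = x.shape
        a = x.reshape(b, c, t * d)                 # matrix A: (batch, C, N) with N = time * dim
        attn = torch.softmax(a @ a.transpose(1, 2), dim=-1)   # attention map V: (batch, C, C)
        y = (attn @ a).reshape(b, c, t, d)         # re-weighted block features, reshaped back
        y = self.beta * y + x                      # scale by beta, element-wise sum with X
        return y.sum(dim=1)                        # aggregate over blocks -> (batch, time, dim)


# Usage with placeholder shapes: 12 conformer-block outputs of size (batch, frames, dim).
fusion = MultiScaleFeatureFusion()
blocks = [torch.randn(8, 120, 256) for _ in range(12)]
fused = fusion(blocks)                             # (8, 120, 256)
```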

3.4 Decoder

The Connectionist Temporal Classification (CTC) method, developed by Graves et al. [ 6 ], is a technique primarily used to address the problem of output alignment between labels and neural network predictions.

To determine the likelihood of the CTC target sequence, the CTC model takes into account all feasible alignment routes between the target sequence y and the input sequence x. This likelihood is specified as:

\(P(y \mid x)=\sum _{q \in \beta ^{-1}(y)} P(q \mid x)\)

where q is one of the paths and \(\beta ^{-1}(y)\) is the set of all paths that map the input sequence to the output label. Equation ( 10 ) defines the CTC loss as the sum over the training data of the negative log probability of the correct label:

\(L_{CTC}=-\sum _{(x, y)} \ln P(y \mid x)\)

Therefore, the CTC method significantly simplifies the training and modeling processes for speech recognition models. In this study, we use the CTC model as one of the decoders. Its architecture comprises linear and log-softmax layers. During training, the output of the shared encoder after the MFF module is fed into the CTC branch, and the CTC loss function is applied to the resulting softmax output.
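As a hedged illustration of this branch, the snippet below applies PyTorch's built-in nn.CTCLoss to a linear projection of a randomly generated encoder output. The vocabulary size, blank id, and tensor shapes are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn

vocab_size, blank_id = 4233, 0            # placeholder character-set size and blank index
ctc_head = nn.Linear(256, vocab_size)     # the "CTC linear" layer on top of the shared encoder
ctc_criterion = nn.CTCLoss(blank=blank_id, zero_infinity=True)

enc_out = torch.randn(8, 120, 256)        # (batch, frames, dim): encoder output after the MFF
log_probs = ctc_head(enc_out).log_softmax(-1).transpose(0, 1)   # (frames, batch, vocab)
targets = torch.randint(1, vocab_size, (8, 30))                 # padded label ids (placeholder)
input_lengths = torch.full((8,), 120, dtype=torch.long)
target_lengths = torch.full((8,), 30, dtype=torch.long)

loss_ctc = ctc_criterion(log_probs, targets, input_lengths, target_lengths)
```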

The attention decoder in this paper is made up of several identical Transformer blocks. In each block, a multi-head cross-attention (MHCA) module is added alongside the feedforward and self-attention modules to perform multi-head attention over the shared-encoder output after the MFF module. The attention decoder uses relative position encoding to stay consistent with the shared encoder. Mathematically, the output \(y_i\) for input \(x_i\) of transformer block i in the attention decoder can be written as follows:

where \({\widetilde{y}}\) denotes the shared encoder output after the MFF module, MHSA the multi-headed self-attention module, MHCA the multi-headed cross-attention module, FFN the feedforward module, and LN the layer normalization module.
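For reference, PyTorch's nn.TransformerDecoderLayer already implements the self-attention, cross-attention, and feedforward structure described here (with absolute rather than relative position handling). The dimensions and tensor shapes below are illustrative placeholders.

```python
import torch
import torch.nn as nn

# Each decoder layer contains MHSA, cross-attention over the encoder memory (MHCA),
# and a feedforward module, mirroring the block structure described in the text.
decoder_layer = nn.TransformerDecoderLayer(d_model=256, nhead=4, dim_feedforward=2048,
                                           batch_first=True)
attention_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(8, 120, 256)   # shared-encoder output after the MFF module (placeholder)
tgt = torch.randn(8, 30, 256)       # embedded target tokens under teacher forcing (placeholder)
causal_mask = torch.triu(torch.full((30, 30), float("-inf")), diagonal=1)
out = attention_decoder(tgt, memory, tgt_mask=causal_mask)   # (8, 30, 256)
```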

3.5 Multi-task Learning Paradigm

The model proposed in this study employs two supervised losses, namely the Connectionist Temporal Classification (CTC) loss and the Attention-based Encoder-Decoder (AED) loss, in addition to a self-supervised contrastive loss. The training process follows a hybrid end-to-end approach that combines supervised and self-supervised training. By integrating the CTC and AED losses into a single supervised loss, the model benefits from improved convergence while fully capturing token dependencies within the data. Equations ( 14 ) and ( 15 ) define the joint supervised and self-supervised losses, where x is the acoustic feature and y is the corresponding label. The CTC decoder and attention decoder losses are denoted \(L_{C T C}(x, y)\) and \(L_{A E D}(x, y)\), \(\lambda \in (0,1)\) is the hyperparameter that balances the weights of these two supervised losses, and \(\mu\) is the hyperparameter that weighs the relative significance of the supervised and self-supervised losses.
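A minimal sketch of how such a joint objective can be combined in code is shown below. The exact form of the paper's Eqs. (14) and (15) is not reproduced in the text, so the linear combination used here, and the default weights taken from the experimental section, should be read as an assumption.

```python
def joint_loss(loss_ctc, loss_aed, loss_ssl, lam=0.3, mu=0.05):
    """Combine the supervised CTC/AED losses with the self-supervised KL loss.

    lam is the CTC weight used during joint training (0.3 in the experiments);
    mu weighs the self-supervised term (0.05 gave the best CER in the ablation).
    The linear form below is an assumed reading of Eqs. (14)-(15).
    """
    loss_sup = lam * loss_ctc + (1.0 - lam) * loss_aed
    return loss_sup + mu * loss_ssl
```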

3.6 Analysis

Compared with supervised learning, self-supervised learning methods attempt to learn powerful contextual representations from audio data only, and then fine-tune the model on paired data. Currently, there are some pre-trained models that achieve excellent performance, but these require a large amount of external data and model parameters for training. Moreover, these models mainly address general representations for speech tasks. Specifically, models such as CPC and the wav2vec series use contrastive InfoNCE loss to distinguish between related positive samples and negative samples. Inspired by masked language model loss in NLP, DiscreteBERT and HuBERT predict discrete targets in masked regions. However, our method focuses on an end-to-end ASR model that requires only a small amount of labeled data for training and achieves excellent performance through the proposed multi-view contrastive self-supervised approach.

The multi-scale feature fusion network structure is relatively flexible and there is no clear boundary. The receptive field of the high-level network is relatively large, and the semantic information representation ability is strong, but the resolution of the feature map is low, and the geometric information representation ability is weak. The receptive field of the low-level network is relatively small, and the geometric detail information representation ability is strong. Although the resolution is high, the semantic information representation ability is weak. The multi-scale feature fusion network makes the model easier to achieve significant results on complex tasks by fusing deep and shallow layer features. The latest research has demonstrated the potential of voice models on full-stack voice tasks by using the weighted sum of embeddings from different layers. They found that different layers contain useful information for different tasks. For example, the top hidden states are useful for ASR, while the bottom layers are more effective for speaker verification. Therefore, this study proposes an attention-based multi-scale feature fusion module (MFF) to enhance the model’s ability to represent information by maximizing inter-layer information utilization.

4 Performance Testing and Analysis

We first demonstrate our results on the Aishell-1 test dataset to gain a deeper understanding of our method. Subsequently, we further validate the effectiveness of the method on the English corpus WSJ (80-h). To evaluate the effectiveness of the multi-scale feature fusion method and the multi-view self-supervised learning module, we conducted ablation experiments to compare the differences. The performance of the model is evaluated based on the character error rate (CER).

4.1 Dataset

The Aishell company provides the Aishell-1 dataset, an open-source speech dataset that resamples high-fidelity microphone audio data to 16 kHz, 16-bit WAV format. The dataset consists of speech data from 400 speakers, representing diverse dialect regions in China, and covers a wide range of topics such as technology, sports, entertainment, current news, finance, and economics. The Aishell-1 dataset is divided into three sets: a training set with 340 speakers, containing 150 h of speech data, a validation set with 40 speakers, comprising 10 h of speech data, and a test set with 20 speakers, containing 5 h of speech data. In total, the dataset contains 165 h of speech data. The composition of the Aishell-1 dataset is detailed in Table  1 . The test set consists of 7176 speech samples. For this project, the Aishell-1 dataset was utilized for both training and testing the proposed speech recognition model.

4.2 Experimental Setup

The test configuration for this experiment includes an AMD R9-3090X processor, 32 GB of RAM, and an NVIDIA RTX-3090 GPU graphics card. The software environment is a 64-bit Ubuntu 20.04 operating system running the Pytorch deep learning framework.

The input features consist of an 80-dimensional log-Mel filter bank (Fbank) with a 25-ms window and a 10-ms shift. We perform speed perturbation on the entire data at 0.9, 1.0, and 1.1 speeds to generate a 3x speed variation. SpecAugment is applied with 2 frequency masks with maximum frequency mask (F = 10) and 2 time masks with maximum time mask (T = 50).
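As a rough illustration of this front-end (not the authors' exact pipeline), the snippet below computes 80-dimensional log-Mel filter bank features with a 25 ms window and 10 ms shift using torchaudio, and applies SpecAugment-style frequency and time masks with the stated parameters. The synthetic input signal is a placeholder for a real utterance.

```python
import torch
import torchaudio

# One second of 16 kHz audio as a stand-in for a real utterance (placeholder signal).
sample_rate = 16000
waveform = torch.sin(2 * torch.pi * 440.0 * torch.arange(sample_rate) / sample_rate).unsqueeze(0)

fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
    sample_frequency=sample_rate)                  # (frames, 80) log-Mel features

# SpecAugment-style masking: 2 frequency masks (F = 10) and 2 time masks (T = 50)
spec = fbank.t().unsqueeze(0)                      # (1, freq, time) as expected by the transforms
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=10)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=50)
for _ in range(2):
    spec = time_mask(freq_mask(spec))
features = spec.squeeze(0).t()                     # back to (frames, 80)
```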

To reduce the computational burden, two-dimensional convolutional down-sampling is employed at the front end of the shared encoder, with a 3×3 kernel and a stride of 2, for a total subsampling factor of 4. The shared encoder comprises 12 conformer blocks with four attention heads, each using 256 attention dimensions and 2048 feedforward dimensions, consistent with the baseline model. The attention decoder includes six transformer blocks with four attention heads. During joint training and decoding, the weights of the CTC branch are set to 0.3 and 0.5, respectively. Gradient accumulation is used during training to stabilize the process, with gradients updated every 4 batches [ 55 ]. To prevent overfitting, dropout and label smoothing regularization are applied to each conformer and transformer block. The Adam optimizer is used for training, with 25,000 warm-up steps and an initial learning rate of 0.002. Additionally, we conducted experiments with the hyperparameter \(\mu \) set to 0, 0.01, 0.05, 0.1, 1, and 10, and with the number of MFF fusion layers set to 2, 3, 4, and 12.

4.3 Evaluation Metrics

In automatic speech recognition, the results are usually presented as a sequence of words and phrases. During recognition, three types of errors can occur: insertions, deletions, and substitutions. An insertion error adds an extra word to the recognition result; a deletion error omits a correct word from the recognition result; and a substitution error replaces a correct word with an incorrect one. In English and other languages with clear word boundaries, recognition accuracy is typically measured at the word level, and the error rate is referred to as the Word Error Rate (WER). In languages like Chinese, however, word segmentation is ambiguous, making it difficult to measure errors directly in words; the Character Error Rate (CER) is therefore commonly used as the evaluation metric for Chinese speech recognition, and languages such as Japanese also employ CER. Since the Chinese speech dataset Aishell-1 is used in this experiment, CER is adopted as the evaluation metric, and its formula is as follows:

\(CER=\dfrac{N_{Sub}+N_{Del}+N_{Ins}}{N_{Ref}} \times 100\%\)

where \(N_{Sub}\), \(N_{Del}\), and \(N_{Ins}\) are the numbers of substitution, deletion, and insertion errors, respectively, and \(N_{Ref}\) is the total number of tokens (characters for CER) in the reference transcriptions of the test set. Because of insertion errors, CER can exceed 100\(\%\); its minimum value is 0.
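The metric can be computed with a standard Levenshtein edit distance over character sequences, as in the illustrative helper below (not tied to any particular toolkit).

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / len(reference).

    Standard Levenshtein distance between character sequences; the value can
    exceed 1.0 when the hypothesis contains many insertions.
    """
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(character_error_rate("今天天气很好", "今天天气好"))  # one deletion -> ~0.167
```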

4.4 Performance Testing and Analysis

The experiments for the multi-scale feature fusion module investigate how fusing the outputs of different numbers of conformer blocks affects the model’s recognition performance. The results are summarized in Table  2 : B6+B12 denotes fusing the outputs of the sixth and twelfth conformer blocks, B4+B8+B12 the outputs of the fourth, eighth, and twelfth blocks, and B3+B6+B9+B12 the outputs of the third, sixth, ninth, and twelfth blocks. All blocks, as proposed in this work, denotes fusing the outputs of every conformer block in the shared encoder. This ablation focuses only on the MFF module, without SSL. The results clearly show that recognition performance improves as the number of fused blocks increases: models that fuse two, three, or four blocks perform worse than the model that fuses all blocks, confirming the importance of incorporating the outputs of all conformer blocks.

In this study, experiments are carried out for the multi-view self-supervised learning module to examine the impact of the hyperparameter \(\mu \) on the model recognition performance. The experimental results are displayed in Table  3 . When \(\mu =0.05\) , the self-supervised loss and supervised loss are balanced to obtain the best performance, which implies that it is crucial to balance the self-supervised loss and supervised loss in joint training.

In this study, ablation experiments are conducted to demonstrate the effectiveness of the MM-ASR model’s multi-scale feature fusion module and multi-view self-supervised learning method. The experimental results are displayed in Table  4 . The baseline model is the original WeNet model, with the decoder trained in supervised learning mode using features from the network’s final layer. The MM-ASR model, proposed in this paper, incorporates both the multi-scale feature fusion module and multi-view self-supervised learning method. Two additional variants are also evaluated: -SSL, which is the MM-ASR model with the multi-view self-supervised learning method eliminated, and -MFF, which is the MM-ASR model with the multi-scale feature fusion module removed. The experimental results demonstrate the efficacy of both multi-scale feature fusion and multi-view self-supervised learning. The MM-ASR model, which combines supervised and self-supervised losses for training and focuses on interlayer information, exhibits improved model resilience and achieves a lower character error rate (CER) compared to the original WeNet model. The proposed approach leads to a significant enhancement in voice recognition ability, reducing the character error rate by approximately 4.6 \(\%\) when compared to the baseline. This demonstrates the effectiveness of the multi-scale feature fusion and multi-view self-supervised learning techniques in improving the performance of the end-to-end speech recognition model.

Table  5 presents a comparison of the Character Error Rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the Aishell-1 test dataset. The models used for comparison include CTC/Attention [ 56 ], CAT [ 57 ], ESPnet [ 58 ], BAT [ 59 ], Paraformer [ 60 ], UMA [ 61 ] and WeNet [ 17 , 18 ]. All assessment results in the paper are rounded to two decimal places for consistency. The findings in Table  5 demonstrate that the MM-ASR model outperforms the other models, indicating its superior performance in terms of speech recognition accuracy. This clearly demonstrates the effectiveness of multi-scale feature fusion and self-supervised learning within a single neural network. The experimental outcomes provide strong evidence supporting the effectiveness and usefulness of the proposed MM-ASR model for end-to-end speech recognition tasks, confirming its superiority compared to publicly available models like CTC/Attention, CAT, ESPnet, BAT, Paraformer, UMA and WeNet.

Table  6 shows a comparison of character error rate (CER) results between the MM-ASR model proposed in this study and several widely available models on the English corpus WSJ (80-h). The models used for comparison include CTC/attention, CAT, ESPnet, LF-MMI [ 62 ], CTC-CRF ST-NAS [ 63 ], Wav2letter++ [ 64 ], and WeNet. The results in Table  6 demonstrate that on the English corpus WSJ, the MM-ASR model still outperforms other models.

5 Conclusion

In this paper, a combination of supervised and self-supervised training techniques is leveraged to construct and train an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning. The proposed method emphasizes the use of inter-layer information in a shared encoder to improve the model’s ability to represent and process speech data. A self-supervised contrastive loss is introduced in the shared encoder to increase the model’s robustness, and the model is trained by combining supervised and self-supervised losses. Ablation experiments on the multi-view self-supervised learning component and the multi-scale feature fusion module demonstrate their respective contributions to recognition performance. Further experiments examine how fusing different numbers of conformer blocks and balancing the hyperparameter \(\mu \) between the self-supervised and supervised losses affect recognition performance. The Aishell-1 dataset is used to assess the proposed technique, and we further validate its effectiveness on the English corpus WSJ. The experimental findings demonstrate that the strategy improves the recognition performance of the speech recognition model.

Availability of data and materials

The data that support the findings of this study are available from the corresponding author, upon reasonable request.

Seltzer ML, Ju Y-C, Tashev I, Wang Y-Y, Yu D (2011) In-car media search. IEEE Signal Process Mag 28(4):50–60. https://doi.org/10.1109/MSP.2011.941065


Graves A, Mohamed A-R, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, vancouver, BC, Canada, pp 6645-6649, https://doi.org/10.1109/ICASSP.2013.6638947

Hinton G et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97. https://doi.org/10.1109/MSP.2012.2205597

Wang D, Wang X, Lv S (2019) An overview of end-to-end automatic speech recognition. Symmetry 11(8):1018

Li J (2022) Recent advances in end-to-end automatic speech recognition. APSIPA Trans Signal Inf Process 11(1)

Graves A, Fernández S, Gomez F, et al (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on Machine learning. pp 369-376

Deng K, et al (2022) Improving CTC-Based Speech Recognition Via Knowledge Transferring from Pre-Trained Language Models. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), Singapore, Singapore, pp 8517-8521, https://doi.org/10.1109/ICASSP43922.2022.9747887

Nakagome Y, Komatsu T, Fujita Y, et al (2022) InterAug: augmenting noisy intermediate predictions for CTC-based ASR. arXiv preprint arXiv:2204.00174

Graves A (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711

Kim S, Hori T, Watanabe S (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, USA, pp 4835-4839, https://doi.org/10.1109/ICASSP.2017.7953075

Rao K, Sak H, Prabhavalkar R (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In: (2017) IEEE automatic speech recognition and understanding workshop (ASRU). Okinawa, Japan, pp 193–199. https://doi.org/10.1109/ASRU.2017.8268935

Karita S, Soplin NEY, Watanabe S et al (2019) Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp 1408–1412


Zhang B, Wu D, Yao Z, et al (2020) Unified streaming and non-streaming two-pass end-to-end model for speech recognition. arXiv preprint arXiv:2012.05481

Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30

Gulati A, Qin J, Chiu CC, et al (2020) Conformer: convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100

Yao Z, Wu D, Wang X, et al (2021) Wenet: production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv preprint arXiv:2102.01547

Zhang B, Wu D, Peng Z, et al (2022) Wenet 2.0: more productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455

Bu H, Du J, Na X, Wu B, Zheng H (2017) AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In: 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA), Seoul. Korea (South) 2017, pp 1–5. https://doi.org/10.1109/ICSDA.2017.8384449

Chen Y-C, Huang S-F, Lee H-y, Wang Y-H, Shen C-H (2019) Audio word2vec: sequence-to-sequence autoencoding for unsupervised learning of audio segmentation and representation. IEEE/ACM Trans Audio Speech Lang Process (TASLP) 27(9):1481–1493

Hsu W-N, Zhang Y, Glass J (2017) Learning latent representations for speech generation and transformation. In: Interspeech, pp 1273–1277

Hsu W N, Zhang Y, Glass J (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. Adv Neural Inf Process Syst 30

Chorowski J, Weiss RJ, Bengio S et al (2019) Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Trans Audio Speech Lang Process 27(12):2041–2053

Chung Y A, Tang H, Glass J (2020) Vector-quantized autoregressive predictive coding. arXiv preprint arXiv:2005.08392

Chung Y A, Hsu W N, Tang H, et al (2019) An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240

Chung Y A, Glass J (2020) Generative pre-training for speech with autoregressive predictive coding. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE pp 3497-3501

Chung Y A, Glass J (2020) Improved speech representations with multi-target autoregressive predictive coding. arXiv preprint arXiv:2004.05274

Liu A H, Chung Y A, Glass J (2020) Non-autoregressive predictive coding for learning speech representations from local dependencies. arXiv preprint arXiv:2011.00406

Liu AT, Li SW, Lee H (2021) Tera: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Trans Audio Speech Lang Process 29:2351–2366

Liu A T, Yang S, Chi P H, et al (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6419–6423

Ling S, Liu Y, Salazar J, et al (2020) Deep contextualized acoustic representations for semi-supervised speech recognition. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6429–6433

Ling S, Liu Y (2020) Decoar 2.0: deep contextualized acoustic representations with vector quantization. arXiv preprint arXiv:2012.06659

Oord A, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748

Schneider S, Baevski A, Collobert R, et al (2019) wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862

Baevski A, Schneider S, Auli M (2019) vq-wav2vec: self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453

Baevski A, Zhou Y, Mohamed A et al (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv Neural Inf Process Syst 33:12449–12460

Baevski A, Mohamed A (2020) Effectiveness of self-supervised pre-training for ASR. In: ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7694–7698

Hsu WN, Bolte B, Tsai YHH et al (2021) Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Process 29:3451–3460

Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. Adv Neural Inf Process Syst 25

Toshev A, Szegedy C (2014) Deeppose: human pose estimation via deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1653–1660

Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

Ren S, He K, Girshick R et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst 28:91–99

Desplanques B, Thienpondt J, Demuynck K (2020) Ecapatdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143

Thienpondt J, Desplanques B, Demuynck K (2021) Integrating frequency translational invariance in tdnns and frequency positional information in 2d resnets to enhance speaker verification. arXiv preprint arXiv:2104.02370

Liu T, Das R K, Lee K A, et al (2022) MFA: TDNN with multi-scale frequency-channel attention for text-independent speaker verification with short utterances. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7517–7521

Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp7132–7141

Gao SH, Cheng MM, Zhao K et al (2019) Res2net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell 43(2):652–662

Zhang Y, Lv Z, Wu H, et al (2022) Mfa-conformer: multi-scale feature aggregation conformer for automatic speaker verification. arXiv preprint arXiv:2203.15249

He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

Dai Z, Yang Z, Yang Y, et al (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860

Devlin J, Chang M W, Lee K, et al (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Chen Z, Zhang Y, Rosenberg A, Ramabhadran B, Wang G, Moreno P (2021) Injecting text in self-supervised speech pretraining. In: IEEE automatic speech recognition and understanding workshop (ASRU). Cartagena, Colombia pp 251–258. https://doi.org/10.1109/ASRU51503.2021.9688018

Srivastava N, Hinton G, Krizhevsky A et al (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958


Hermans JR, Spanakis G, Möckel R (2017) Accumulated gradient normalization. In: Asian conference on machine learning. PMLR, pp 439–454

Karita S, et al (2019) A comparative study on transformer vs rnn in speech applications. In: IEEE automatic speech recognition and understanding workshop (ASRU), Singapore, pp 449–456, https://doi.org/10.1109/ASRU46091.2019.9003750

An K, Xiang H, Ou Z (2020) CAT: a CTC-CRF based ASR toolkit bridging the hybrid and the end-to-end approaches towards data efficiency and low latency. arXiv preprint arXiv:2005.13326

Watanabe S, Hori T, Karita S, et al (2018) Espnet: end-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015

An K, Shi X, Zhang S (2023) BAT: boundary aware transducer for memory-efficient and low-latency ASR. arXiv preprint arXiv:2305.11571

Gao Z, Li Z, Wang J, et al (2023) FunASR: a fundamental end-to-end speech recognition toolkit. arXiv preprint arXiv:2305.11013

Fang Y, Li X (2023) Unimodal aggregation for CTC-based speech recognition. arXiv preprint arXiv:2309.08150

Hadian H, Sameti H, Povey D, Khudanpur S (2018) Flatstart single-stage discriminatively trained HMM-based models for ASR. IEEE/ACM Trans Audio Speech Lang Process 26(11):1949–1961

Zheng H, An K, Ou Z (2021) Efficient neural architecture search for end-to-end speech recognition via straight-through gradients. In: 2021 IEEE spoken language technology workshop (SLT). IEEE, pp 60–67

Zeghidour N, Xu Q, Liptchinsky V, Usunier N, Synnaeve G, Collobert R (2018) Fully convolutional speech recognition. arXiv preprint arXiv:1812.06864


Author information

Authors and Affiliations

Faculty of Information Technology, Beijing University of Technology, Beijing, China

Jingyu Zhao, Ruwei Li, Maocun Tian & Weidong An


Contributions

The contributions of the authors are as follows: the corresponding author RL designed the algorithm together with JZ, provided the experimental equipment, and improved the writing and logic of the paper. JZ verified the algorithm experimentally and wrote the first draft of the paper. MT and WA organized the experimental data, visualized the experimental results, and assisted JZ in completing the experiments. All authors co-authored, read, and approved the final manuscript.

Corresponding author

Correspondence to Ruwei Li .

Ethics declarations

Conflict of interest.

We the authors of this manuscript entitled “Multi-view self-supervised learning and multi-scale feature fusion for automatic speech recognition” declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Informed consent

The submission of this article has been approved by all authors, and use of the data in this article has been agreed to by the relevant authorities and does not raise privacy or information security concerns.


About this article

Zhao, J., Li, R., Tian, M. et al. Multi-view Self-supervised Learning and Multi-scale Feature Fusion for Automatic Speech Recognition. Neural Process Lett 56 , 168 (2024). https://doi.org/10.1007/s11063-024-11614-z


Accepted : 06 April 2024

Published : 08 May 2024

DOI : https://doi.org/10.1007/s11063-024-11614-z


  • End-to-end speech recognition
  • Multi-scale feature fusion
  • Multi-view self-supervised learning
  • Multi-task learning paradigm
