[Paper Review] MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning

Paper Overview

CVPR'22

https://openaccess.thecvf.com/content/CVPR2022/html/Chen_MSDN_Mutually_Semantic_Distillation_Network_for_Zero-Shot_Learning_CVPR_2022_paper.html

MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning. Shiming Chen, Ziming Hong, Guo-Sen Xie, Wenhan Yang, Qinmu Peng, Kai Wang, Jian Zhao, Xinge You. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Abstract

Existing methods either simply align an image's global features with its semantic vector, or use unidirectional attention to learn limited latent semantic representations.

These approaches are not effective at discovering the intrinsic semantic knowledge shared between visual and attribute features.

The authors propose the Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.

Specifically, MSDN consists of an attribute $\rightarrow$ visual attention sub-net and a visual $\rightarrow$ attribute attention sub-net.

By introducing a semantic distillation loss, the two mutual attention sub-nets can be trained together and teach each other throughout training.

Keywords

Zero-Shot Learning, Knowledge Distillation

Mutually Semantic Distillation Network

Motivation

An unseen sample can share spatial information with seen samples.

This spatial information is expressed as rich knowledge in the semantic attributes.

The authors therefore propose MSDN.

Overview

The authors' architecture is as follows.

Notation

1. Attribute $\rightarrow$ Visual Attention Sub-net

The authors first propose an attribute $\rightarrow$ visual attention sub-net, which locates the image regions most relevant to each attribute and extracts attribute-based visual features from a given image.

It takes two inputs:

1. the set of visual features of the image, $V = \left\{ v_{1}, ..., v_{R} \right\}$

2. the set of semantic attribute vectors, $A = \left\{ a_{1}, ..., a_{K} \right\}$

The sub-net attends to the image regions related to each attribute, and compares each attribute with its corresponding attended visual region features to determine the importance of that attribute.

The authors first define the attention weight of the $r$-th region of an image for the $k$-th attribute.

$W_{1}$ is a learnable matrix that is applied to the visual feature of each region to measure its similarity with each semantic attribute vector.

This yields the attention weights $\left\{ \beta _{k}^{r} \right\}_{r=1}^{R}$.
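
The attention-weight formula itself is not reproduced in this post, so the following is only a minimal NumPy sketch of one common way to compute such weights: a bilinear score $a_k^{T} W_1 v_r$ followed by a softmax over the $R$ regions. The function name, the array shapes, and the choice of a bilinear score are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def attr_to_visual_attention(V, A, W1):
    """Return beta with shape (K, R); beta[k, r] weights region r for attribute k.

    V  : (R, C) region features, one row per region v_r
    A  : (K, D) attribute vectors, one row per attribute a_k
    W1 : (D, C) learnable matrix (shape is an assumption)
    """
    scores = A @ W1 @ V.T                         # (K, R): a_k^T W1 v_r
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)   # normalise over regions

# Toy sizes, chosen only for the demo.
rng = np.random.default_rng(0)
R, C, K, D = 4, 6, 3, 5
V = rng.normal(size=(R, C))
A = rng.normal(size=(K, D))
W1 = rng.normal(size=(D, C))
beta = attr_to_visual_attention(V, A, W1)
```

Each row of `beta` is a distribution over regions, so every attribute gets its own spatial attention map.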

Next, the attribute-based visual feature for each attribute is extracted using these attention weights.

Intuitively, $F_k$ captures the visual evidence in the image region that corresponds to the $k$-th semantic attribute.

If the image clearly exhibits attribute $a_k$, the model assigns a high positive score to the $k$-th attribute.

Otherwise, it assigns a negative score.

The authors thus obtain a set of attribute-based visual features.

They then introduce a mapping function $M_1$ that maps the attribute-based visual features into the semantic embedding space.

To make this mapping more accurate, the attribute vectors $A = \left\{ a_{1}, ..., a_{K} \right\}$ are used as support.

$W_2$ is an embedding matrix.

$\psi_k$ is the attribute score.

MSDN thus obtains the mapped semantic embedding.
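
As a rough illustration of the two steps above, the sketch below forms each $F_k$ as an attention-weighted sum of region features and then scores it against its attribute vector with a bilinear form $a_k^{T} W_2 F_k$. The equations are not shown in this post, so the exact form of $M_1$, the shapes, and the function name are assumptions.

```python
import numpy as np

def attribute_scores(V, A, beta, W2):
    """Return (F, psi): attribute-based visual features and attribute scores.

    V    : (R, C) region features
    A    : (K, D) attribute vectors
    beta : (K, R) attention weights, each row summing to 1
    W2   : (D, C) embedding matrix (shape is an assumption)
    """
    F = beta @ V                        # (K, C): F_k = sum_r beta[k, r] * v_r
    psi = ((A @ W2) * F).sum(axis=1)    # (K,):  psi_k = a_k^T W2 F_k
    return F, psi

rng = np.random.default_rng(0)
R, C, K, D = 4, 6, 3, 5
V = rng.normal(size=(R, C))
A = rng.normal(size=(K, D))
W2 = rng.normal(size=(D, C))
beta = np.full((K, R), 1.0 / R)         # uniform attention, just for the demo
F, psi = attribute_scores(V, A, beta, W2)
```

With uniform attention every $F_k$ reduces to the mean region feature; learned attention is what makes the $K$ features differ.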

2. Visual $\rightarrow$ Attribute Attention Sub-net

The authors also design a visual $\rightarrow$ attribute attention sub-net.

It is complementary to the attribute-based visual features.

As before, the authors first find the semantic attributes related to each image region.

$W_{3}$ is a learnable matrix that measures the similarity between each visual region feature and each semantic attribute vector.

This yields the attention weights $\left\{ \tau _{r}^{k} \right\}_{k=1}^{K}$.

Next, the visual-based attribute features are obtained.

$S_r$ is the visual semantic representation.

These are then mapped in the same way as before.

$W_4$ is an embedding matrix.
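
The mirrored sub-net can be sketched in the same way: a softmax over the $K$ attributes for each region, an attention-weighted sum of attribute vectors $S_r$, and a bilinear score per region. As above, the formulas are not reproduced in this post, so the bilinear scores, the shapes, and the names `W3` and `visual_to_attribute` are assumptions for illustration.

```python
import numpy as np

def visual_to_attribute(V, A, W3, W4):
    """Return (tau, S, psi_hat) for one image.

    V      : (R, C) region features
    A      : (K, D) attribute vectors
    W3, W4 : (C, D) learnable matrices (shapes are assumptions)
    """
    scores = V @ W3 @ A.T                         # (R, K): v_r^T W3 a_k
    scores -= scores.max(axis=1, keepdims=True)   # stabilise the softmax
    tau = np.exp(scores)
    tau /= tau.sum(axis=1, keepdims=True)         # normalise over attributes
    S = tau @ A                                   # (R, D): S_r = sum_k tau[r, k] * a_k
    psi_hat = ((V @ W4) * S).sum(axis=1)          # (R,):  v_r^T W4 S_r
    return tau, S, psi_hat

rng = np.random.default_rng(0)
R, C, K, D = 4, 6, 3, 5
V = rng.normal(size=(R, C))
A = rng.normal(size=(K, D))
W3 = rng.normal(size=(C, D))
W4 = rng.normal(size=(C, D))
tau, S, psi_hat = visual_to_attribute(V, A, W3, W4)
```

Note that the result has one entry per region rather than one per attribute.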

The semantic embedding $\hat{\Psi}$ is $R$-dimensional, so to match it with the $K$-dimensional class semantic vectors, it is mapped into the semantic attribute space.

$\Psi (x_{i}) = \hat{\Psi} (x_{i}) \times Att = \hat{\Psi} (x_{i}) \times (V^{T}W_{att}A)$

$W_{att}$ is a learnable matrix.
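
A quick shape check of this mapping, as a minimal sketch: with $R$ regions and $K$ attributes, $Att = V^{T} W_{att} A$ is an $R \times K$ matrix, so multiplying the $R$-dimensional $\hat{\Psi}(x_i)$ by it gives a $K$-dimensional vector. The row-wise storage of `V` and `A` and the shape of `W_att` are assumptions.

```python
import numpy as np

def map_to_attribute_space(psi_hat, V, A, W_att):
    """Map an R-dimensional embedding to K dimensions.

    psi_hat : (R,)   region-wise semantic embedding
    V       : (R, C) region features stored as rows
    A       : (K, D) attribute vectors stored as rows
    W_att   : (C, D) learnable matrix (shape is an assumption)
    """
    # With vectors stored as rows, V @ W_att @ A.T plays the role of V^T W_att A.
    att = V @ W_att @ A.T       # (R, K)
    return psi_hat @ att        # (K,)

rng = np.random.default_rng(0)
R, C, K, D = 4, 6, 3, 5
V = rng.normal(size=(R, C))
A = rng.normal(size=(K, D))
W_att = rng.normal(size=(C, D))
psi_hat = rng.normal(size=R)
Psi = map_to_attribute_space(psi_hat, V, A, W_att)
```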

3. Model Optimization

Attribute-Based Cross-Entropy Loss

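Because associated image and attribute embeddings are projected near their class semantic vector $z^{c}$, the authors apply an attribute-based cross-entropy loss with self-calibration.

This encourages an image to have a high score with its associated class semantic vector.

$f(x_{i})$ is $\psi(x_{i})$ for the attribute $\rightarrow$ visual sub-net and $\Psi(x_{i})$ for the visual $\rightarrow$ attribute sub-net.

$\mathbb{I}_{\left[c \in C^{u}\right]}$ is an indicator function that is 1 when the condition holds and -1 otherwise.

$\lambda_{cal}$ is the weight of the self-calibration term.

By assigning non-zero probability to unseen classes during training, the model can output a large probability for the correct class when an unseen-class sample is given as input.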
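
The loss equation is not reproduced in this post, so the sketch below shows only one plausible reading of the description: a cross-entropy over seen classes, plus $\lambda_{cal}$ times a calibration term that rewards probability mass on unseen classes after shifting the logits by the $\pm 1$ indicator. The exact form should be checked against the paper; the function names, the default `lambda_cal=0.1`, and the way the calibration term is aggregated are all assumptions.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def ace_loss_with_calibration(f_x, Z, label, unseen_mask, lambda_cal=0.1):
    """f_x: (K,) embedding f(x_i); Z: (num_classes, K) class semantic vectors z^c;
    label: index of the ground-truth seen class; unseen_mask: (num_classes,) bool."""
    scores = Z @ f_x                                     # f(x_i) . z^c for every class

    # Cross-entropy over seen classes only (unseen logits masked out).
    log_p_seen = log_softmax(np.where(unseen_mask, -np.inf, scores))
    ce = -log_p_seen[label]

    # Self-calibration (assumed form): shift logits by +1 for unseen and -1 for
    # seen classes, then penalise low total probability on the unseen classes.
    shifted = scores + np.where(unseen_mask, 1.0, -1.0)
    p_all = np.exp(log_softmax(shifted))
    cal = -np.log(p_all[unseen_mask].sum())

    return float(ce + lambda_cal * cal)

rng = np.random.default_rng(0)
K, num_classes = 5, 6
Z = rng.normal(size=(num_classes, K))
f_x = rng.normal(size=K)
unseen_mask = np.array([False, False, False, False, True, True])
loss = ace_loss_with_calibration(f_x, Z, label=0, unseen_mask=unseen_mask)
```

Setting `lambda_cal=0` recovers a plain cross-entropy over seen classes, which makes the effect of the calibration term easy to isolate.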

Semantic Distillation Loss

To let the two mutual attention sub-nets learn together and teach each other, the authors introduce a semantic distillation loss.

It consists of a Jensen-Shannon divergence (JSD) term and an L2 term.
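
A minimal sketch of such a loss between the class posteriors `p1` and `p2` of the two sub-nets is given below. It uses the textbook JSD (the average KL divergence to the mixture distribution) and a squared L2 distance. The post only says "JSD and L2", so the paper's exact definitions and weighting may differ, and the function names are hypothetical.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def semantic_distillation_loss(p1, p2):
    """Textbook JSD plus squared L2 distance between two class posteriors."""
    m = 0.5 * (p1 + p2)                         # mixture distribution
    jsd = 0.5 * kl(p1, m) + 0.5 * kl(p2, m)
    l2 = float(np.sum((p1 - p2) ** 2))
    return jsd + l2

p1 = np.array([0.7, 0.2, 0.1])   # posterior from the attribute -> visual sub-net
p2 = np.array([0.1, 0.3, 0.6])   # posterior from the visual -> attribute sub-net
loss = semantic_distillation_loss(p1, p2)
```

The loss is symmetric in its two arguments and vanishes when the posteriors agree, which fits the idea that neither sub-net is a fixed teacher.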

Overall Loss

4. Zero-Shot Prediction

Experiments

Dataset

Evaluation Protocols

Implementation Details

1. Comparison with State-of-the-Arts

2. Ablation Studies

3. Qualitative Results

Visualization of Attention Maps

t-SNE Visualizations

4. Hyperparameter Analysis

Effects of Combination Coefficients

Effects of Loss Weights

Conclusion and Discussion

In this paper, we propose a novel mutually semantic distillation network (MSDN) for ZSL. MSDN consists of two mutual attention sub-nets, i.e., attribute→visual and visual→semantic attention sub-nets, which learns attribute-based visual features and visual-based attribute features for semantic embedding representations, respectively. To encourage mutual learning between the two attention sub-nets, we introduce a semantic distillation loss that aligns each other’s class posterior probabilities. Thus, MSDN distills the intrinsic semantic representations between visual and attribute features for effective knowledge transfer of ZSL. Extensive experiments on three popular benchmarks show the superiority of MSDN. We believe that our work will also facilitate the development of other visual-and-language learning systems, e.g., visual question answering.
