[Paper Review] See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data

This paper tackles zero-shot segmentation

with a multi-modal method that takes both 2D images and point clouds as input.

It works in the transductive generalized zero-shot learning (GZSL) setting.

ICCV'23


Abstract

Zero-shot methods typically obtain semantic features from word embeddings

and align the visual features to them.

Point clouds, however, carry only limited information.

The authors therefore propose a multi-modal zero-shot learning method.

It exploits the complementary information in point clouds and images

to achieve a more accurate visual-semantic alignment.

3. Methods

3.1. Problem Formulation

The authors first split all classes into seen and unseen classes.

They perform generalized transductive zero-shot point cloud segmentation.

$P \in \mathbb{R}^{T \times 3}$ denotes one frame of the point cloud.

$T$ is the number of points, and each point stores its $(x, y, z)$ coordinates.

$X \in \mathbb{R}^{3 \times H \times W}$ denotes the corresponding image.

The seen and unseen classes are $C^{s} = \left\{ c_{i}^{s} \right\}_{i=1}^{N^{s}}$ and $C^{u} = \left\{ c_{i}^{u} \right\}_{i=1}^{N^{u}}$.

The seen and unseen classes do not overlap.

$W^{s} = \left\{ w_{i}^{s} \right\}_{i=1}^{N^{s}}$ and $W^{u} = \left\{ w_{i}^{u} \right\}_{i=1}^{N^{u}}$ are

the word embeddings of the seen and unseen classes.

Because the setting is transductive zero-shot learning, the training set is

$D_{train}= \left\{ (P_{i}^{s}, X_{i}^{s}, W_{i}^{s}, Y_{i})_{i=1}^{N^{s}}, (P_{j}^{u}, X_{j}^{u}, W_{j}^{u})_{j=1}^{N^{u}}\right\}$, that is, labeled seen-class samples together with unlabeled unseen-class samples.
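The structure of this transductive training set can be sketched as a small data container. This is only an illustration: the names `Frame` and `TransductiveTrainSet`, the `validate` helper, and the class names `"car"` and `"bicycle"` are hypothetical and do not come from the paper. The sketch only encodes what the formulation states: seen frames carry labels $Y$, unseen frames do not, and the two class sets are disjoint.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) coordinates of one point


@dataclass
class Frame:
    """One point cloud frame with its image (hypothetical container)."""
    points: List[Point]                 # P, shape T x 3
    image: List[List[List[float]]]      # X, shape 3 x H x W
    labels: Optional[List[int]] = None  # Y, one label per point; None if unlabeled


@dataclass
class TransductiveTrainSet:
    seen: List[Frame]                          # labeled seen-class frames
    unseen: List[Frame]                        # unlabeled unseen-class frames
    seen_embeddings: Dict[str, List[float]]    # W^s, one vector per seen class
    unseen_embeddings: Dict[str, List[float]]  # W^u, one vector per unseen class

    def validate(self) -> None:
        # Seen and unseen class sets must not overlap.
        overlap = set(self.seen_embeddings) & set(self.unseen_embeddings)
        if overlap:
            raise ValueError(f"classes are both seen and unseen: {sorted(overlap)}")
        for frame in self.seen:
            if frame.labels is None or len(frame.labels) != len(frame.points):
                raise ValueError("seen frames need one label per point")
        for frame in self.unseen:
            if frame.labels is not None:
                raise ValueError("unseen frames must not carry labels")


# Toy data: a 3 x 1 x 1 image and made-up 2-d word embeddings.
train = TransductiveTrainSet(
    seen=[Frame(points=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)],
                image=[[[0.0]]] * 3, labels=[0, 0])],
    unseen=[Frame(points=[(2.0, 1.0, 0.0)], image=[[[0.0]]] * 3)],
    seen_embeddings={"car": [0.1, 0.9]},
    unseen_embeddings={"bicycle": [0.8, 0.2]},
)
train.validate()
```

The point of the transductive setting is visible in the container: the unseen frames and $W^{u}$ are available at training time, only their labels are missing.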

3.2. Overview

The authors' framework consists of four modules:

Feature Extraction, Semantic-Visual Feature Enhancement (SVFE),

Semantic-Guided Visual Feature Fusion (SGVF), and Semantic-Visual Alignment.

3.3. Semantic-Guided Visual Feature Fusion

Point clouds contain accurate location and geometry information,

while images contain rich texture and color information.

The authors therefore propose using multi-modal visual data

for semantic-visual alignment.

They design an adaptive selection mechanism driven by the semantic features,

so that, under semantic guidance, the network

automatically learns diverse information from the two visual modalities

and merges it into a richer visual feature.

A weight matrix $w$ is computed for each of the 2D and 3D modalities.

These weights are obtained with multi-head attention.

Applying element-wise multiplication between the weight matrices and the visual features

yields the fused visual feature.
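The final fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the attention module that produces the weights is omitted and the weights are simply passed in, the image features are assumed to be already gathered per point so that every matrix has shape $T \times C$, and summing the two weighted features is an assumption, since the text only specifies the element-wise multiplication. The function name `fuse` is hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # one row per point, one column per channel


def fuse(point_feats: Matrix, image_feats: Matrix,
         w_point: Matrix, w_image: Matrix) -> Matrix:
    """Weight each modality element-wise, then sum (the sum is an assumption)."""
    shapes = {(len(m), len(m[0]))
              for m in (point_feats, image_feats, w_point, w_image)}
    if len(shapes) != 1:
        raise ValueError("all inputs must share the same T x C shape")
    return [
        [wl * fl + wi * fi for fl, fi, wl, wi in zip(f_l, f_i, w_l, w_i)]
        for f_l, f_i, w_l, w_i in zip(point_feats, image_feats, w_point, w_image)
    ]
```

With weights of 1 for one modality and 0 for the other, the result reduces to that single modality, which is the sense in which the weights act as an adaptive selection between 2D and 3D features.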

3.4. Semantic-Visual Feature Enhancement

During SGVF,

the huge domain gap between the visual and semantic features

prevents the visual features from being fused effectively.

The authors therefore look for a way to reduce this gap.

They perform knowledge interaction through a cross-attention mechanism.

Semantic Feature Enhancement

To enhance the semantic feature $F_{s}$,

the authors use $F_{s}$ as the query $q$ and the visual feature as the key $k$ and value $v$,

and feed them into a Transformer decoder.

$F_{s}$ is enhanced first with the point feature $F_{l}$

and then with the image feature $F_{i}$.
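The two-step enhancement order can be sketched with a stripped-down cross-attention. This is a simplification under assumptions, not the paper's module: a real Transformer decoder also has learned projections, multiple heads, layer normalization, and a feed-forward block, all of which are left out here. The residual addition and the function names `cross_attention` and `enhance_semantic` are assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per channel


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def enhance_semantic(f_s: Matrix, f_l: Matrix, f_i: Matrix) -> Matrix:
    """Enhance F_s with point features F_l first, then image features F_i."""
    def step(sem: Matrix, visual: Matrix) -> Matrix:
        att = cross_attention(sem, visual, visual)  # q = semantic, k = v = visual
        return [[s + a for s, a in zip(s_row, a_row)]  # residual add (assumed)
                for s_row, a_row in zip(sem, att)]

    return step(step(f_s, f_l), f_i)
```

The output keeps the shape of $F_{s}$ (one row per class embedding), so the enhanced semantic feature can replace the original one in the later alignment step.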

Visual Feature Enhancement

The visual features are likewise enhanced

using the semantic feature.

In this way the authors reduce the distance between the visual and semantic features.

3.5. Semantic-Visual Alignment

Loss function

Following TGP, the authors use a cross-entropy loss

and an unknown-aware InfoNCE loss.

$f_{i}^{t}$ is the visual feature of the $t$-th point in the $i$-th sample.

$e_{y_{i}^{t}}$ is the corresponding ground-truth semantic representation.

$\tau$ is the inverse temperature term.

$D(\cdot )$ is the similarity function, here the dot product.
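To make the symbols concrete, here is a sketch of a plain InfoNCE loss for a single point, using the dot product as $D$ and $\tau$ as a multiplicative inverse temperature. This is the standard InfoNCE form only; the unknown-aware modification used in the paper is not reproduced here, and the function names `dot` and `info_nce` are hypothetical.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity D: plain dot product."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(feature: Sequence[float],
             class_embeddings: List[Sequence[float]],
             target: int,
             tau: float) -> float:
    """-log( exp(tau * D(f, e_y)) / sum_c exp(tau * D(f, e_c)) ) for one point."""
    logits = [tau * dot(feature, e) for e in class_embeddings]
    peak = max(logits)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]
```

A larger $\tau$ sharpens the distribution over classes: for a point whose feature is already closest to its ground-truth embedding, the loss shrinks as $\tau$ grows.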

To keep the model from becoming biased toward the seen classes,

the authors add a loss that pushes seen and unseen classes apart.

The overall loss combines these terms.

Inference

4. Experiments

The experiments use the SemanticKITTI and nuScenes datasets.

4.4. Comparison Results

Comparison with 3D methods

Comparison with extensions of 2D methods

Comparison with popular multi-modal fusion methods

4.5. Ablation Studies

KHS Computer Vision
