[Paper Review] OpenScene: 3D Scene Understanding with Open Vocabularies

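3D open-vocabulary papers are pouring out at CVPR 2023.

Amusingly, both the early papers and their follow-ups are CVPR'23 papers.

AI moves much faster than other fields,

and it now seems that even the conferences cannot keep up with the pace.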

CVPR'23

https://arxiv.org/abs/2211.15654


Abstract

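The authors propose OpenScene,

an approach that predicts, for each 3D scene point,

a dense feature co-embedded with text and image pixel features in the CLIP space.

As a result, a user can enter an arbitrary text query

and get the corresponding heat map over the scene.

This is one of the early papers on open-vocabulary scene understanding.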

3. Method

3.1. Image Feature Fusion

Image Feature Extraction

Given RGB images of resolution $H \times W$,

the authors simply run a 2D segmentation model ${\large \varepsilon}^{2D}$

to obtain per-pixel embeddings $I_{i} \in \mathbb{R}^{H \times W \times C}$ for each frame $i$.

For ${\large \varepsilon}^{2D}$ they use models such as OpenSeg or LSeg,

whose per-pixel features are aligned with CLIP text embeddings.

2D-3D Pairing

Given a 3D point $p \in \mathbb{R}^{3}$

from the point cloud $P \in \mathbb{R}^{M \times 3}$,

the authors find its corresponding image pixel $\mathbf{u} = (u, v)$ by projecting the point into the image.

This pairs each 3D point with image pixels.
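
The pairing step can be sketched as a pinhole projection. This is a minimal illustration, not the authors' code: the function name `project_points`, the 3x3 `intrinsic` matrix, and the 4x4 `world_to_cam` matrix are assumptions made here, and the sketch does not check for occlusion.

```python
import numpy as np

def project_points(points, intrinsic, world_to_cam, height, width):
    """Project (M, 3) world points to integer pixels (u, v).

    Returns an (M, 2) array of pixel coordinates and an (M,) bool mask
    that is True where the point lies in front of the camera and
    inside the image bounds.
    """
    m = points.shape[0]
    homo = np.concatenate([points, np.ones((m, 1))], axis=1)  # (M, 4) homogeneous
    cam = (world_to_cam @ homo.T).T[:, :3]                    # (M, 3) camera coords
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)                       # avoid division by zero
    pix = (intrinsic @ cam.T).T                               # (M, 3)
    u = np.round(pix[:, 0] / safe_z)
    v = np.round(pix[:, 1] / safe_z)
    uv = np.stack([u, v], axis=1).astype(np.int64)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return uv, in_front & inside

# Toy camera (assumed values): focal length 100 px, principal point (50, 40).
K_demo = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]])
pts_demo = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 1.0], [0.0, 0.0, -1.0]])
uv_demo, valid_demo = project_points(pts_demo, K_demo, np.eye(4), height=80, width=100)
```

Only the first toy point is valid: the second falls just outside the image width and the third is behind the camera.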

Fusing Per-Pixel Features

Let the 2D feature of point $p$ in frame $i$ be

$f_{i} = I_{i}(\mathbf{u}) \in \mathbb{R}^{C}$.

When $K$ views are associated with point $p$,

the per-view features $f^{2D} = \phi(f_{1}, ... , f_{K})$ are fused

with $\phi : \mathbb{R}^{K \times C}\mapsto \mathbb{R}^{C}$

into a single vector.

Here $\phi$ is average pooling.

Doing this for all $M$ points gives a feature for every point in the cloud.

$F^{2D} = \left\{f_{1}^{2D}, ... , f_{M}^{2D}\right\} \in \mathbb{R}^{M \times C}$
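
The average pooling above can be sketched as a masked mean over views. This is a rough sketch under assumptions of my own: the name `fuse_view_features`, the `(K, M, C)` layout, and the `valid` mask (which views actually see each point) are not from the paper.

```python
import numpy as np

def fuse_view_features(view_feats, valid):
    """Average per-view features into one feature per point.

    view_feats: (K, M, C) float array, feature sampled for each point in each view.
    valid:      (K, M) bool array, True where the point is visible in that view.
    Returns (M, C) fused features and an (M,) bool mask of points seen at least once.
    """
    weights = valid.astype(view_feats.dtype)[..., None]  # (K, M, 1)
    summed = (view_feats * weights).sum(axis=0)          # (M, C)
    counts = weights.sum(axis=0)                         # (M, 1)
    fused = summed / np.maximum(counts, 1.0)             # unseen points stay zero
    return fused, counts[:, 0] > 0

# Two views, two points, C = 2. The second point is never visible.
feats_demo = np.array([[[1.0, 2.0], [3.0, 4.0]],
                       [[3.0, 4.0], [5.0, 6.0]]])
valid_demo = np.array([[True, False],
                       [True, False]])
fused_demo, seen_demo = fuse_view_features(feats_demo, valid_demo)
```

Points that no view sees get a zero feature here, so the returned mask should be used to skip them downstream.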

3.2. 3D Distillation

Now that every point in the cloud has a fused 2D feature,

these features can be distilled into a 3D model.

Given the point cloud $P$, the 3D network outputs per-point 3D features:

$F^{3D} = \left\{f_{1}^{3D}, ... , f_{M}^{3D}\right\}$

The network is trained with a cosine similarity loss between $F^{2D}$ and $F^{3D}$.

The authors use MinkowskiNet18A as the 3D backbone.
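
As a small illustration of the cosine similarity loss, the sketch below assumes the form $1 - \cos(f^{2D}, f^{3D})$ averaged over points. The function name `cosine_distillation_loss` is made up for this post, and the NumPy version only evaluates the loss value; real training would use an autograd framework.

```python
import numpy as np

def cosine_distillation_loss(feat_3d, feat_2d, eps=1e-8):
    """Mean of (1 - cosine similarity) between (M, C) 3D and 2D features."""
    a = feat_3d / (np.linalg.norm(feat_3d, axis=1, keepdims=True) + eps)
    b = feat_2d / (np.linalg.norm(feat_2d, axis=1, keepdims=True) + eps)
    cos = (a * b).sum(axis=1)  # (M,) per-point cosine similarity
    return float(np.mean(1.0 - cos))

# Same directions (different magnitudes) give a loss near 0.
loss_same = cosine_distillation_loss(np.array([[1.0, 0.0], [0.0, 2.0]]),
                                     np.array([[2.0, 0.0], [0.0, 1.0]]))
# Orthogonal features give a loss of 1.
loss_orth = cosine_distillation_loss(np.array([[1.0, 0.0]]),
                                     np.array([[0.0, 1.0]]))
```

Because only the direction of the features matters, the 3D network does not need to match the magnitude of the 2D features.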

3.3. 2D-3D Feature Ensemble

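To get better performance,

the authors use a 2D-3D ensemble method.

They report that the 2D features are better at small objects and at cases such as a picture on a wall,

while the 3D features are better at objects with distinctive shapes, such as walls and floors.

The CLIP text encoder ${\large \varepsilon}^{text}$ produces embeddings for $N$ text prompts,

$T = \left\{ t_{1}, ... , t_{N} \right\} \in \mathbb{R}^{N \times C}$

and the similarity to these embeddings is computed separately for the 2D and 3D features.

These similarities are then used to compute the ensemble score.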
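
One way to picture the ensemble is the sketch below. It assumes cosine similarity and assumes the selection rule is "per point, keep whichever of the 2D or 3D feature has the higher maximum similarity over the $N$ text embeddings". That rule is my reading, not something spelled out in this post, and the names `l2_normalize` and `ensemble_labels` are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def ensemble_labels(feat_2d, feat_3d, text_emb):
    """Pick, per point, the 2D or 3D feature that matches some text best.

    feat_2d, feat_3d: (M, C) per-point features.
    text_emb:         (N, C) text embeddings.
    Returns (M,) predicted text indices and an (M,) bool mask (True = 2D chosen).
    """
    t = l2_normalize(text_emb)
    s2d = l2_normalize(feat_2d) @ t.T                 # (M, N) cosine similarities
    s3d = l2_normalize(feat_3d) @ t.T                 # (M, N)
    use_2d = s2d.max(axis=1) >= s3d.max(axis=1)       # assumed selection rule
    scores = np.where(use_2d[:, None], s2d, s3d)      # (M, N) ensemble scores
    return scores.argmax(axis=1), use_2d

text_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
f2d_demo = np.array([[1.0, 0.0], [1.0, 1.0]])
f3d_demo = np.array([[1.0, 1.0], [0.0, 1.0]])
labels_demo, use_2d_demo = ensemble_labels(f2d_demo, f3d_demo, text_demo)
```

In the toy example the first point takes its label from the 2D feature and the second from the 3D feature, since each is the more confident one for that point.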

3.4. Inference

Because of this ensemble step,

the fused 2D features of the point cloud must be computed beforehand.

The method therefore runs inference offline (not in real time).

4. Experiments

The authors evaluate on ScanNet, Matterport3D, and nuScenes.

4.1. Comparisons

Comparison on zero-shot 3D semantic segmentation

Comparison on 3D semantic segmentation benchmarks.

Impact of increasing the number of object classes.

4.2. Ablation Studies & Analysis

Is our 2D-3D ensemble method effective?

What features does our 2D-3D ensemble method use most?

5. Applications

Open-vocabulary 3D object search

Image-based 3D object detection

Open-vocabulary 3D scene understanding and exploration
