Problem Definition

CLIP is trained to strengthen cross-modal alignment between the global embeddings of images and text.
- Anomaly segmentation, by contrast, requires pixel-level classification.
- Extracting dense visual features that are aligned with language is not straightforward.
→ Window-based CLIP (WinCLIP) extracts such features by aggregating multi-scale features.
Performing zero-shot classification the way CLIP originally does is not effective when the prompts are naive.
- Zero-shot classification uses text prompts that define normal and anomalous as the classes.
→ Combining the naive baseline with state-level words describes the normal and anomalous states better.
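
A minimal sketch of how state-level words could be combined with prompt templates. The state words and the object name `bottle` below are hypothetical illustrations, not the list used by WinCLIP; `compose_prompts` is a helper introduced only for this example.

```python
from itertools import product

# Hypothetical state-level words (illustrative only, not an official list).
NORMAL_STATES = ["flawless {}", "perfect {}"]
ANOMALOUS_STATES = ["damaged {}", "{} with a defect"]

# Prompt templates of the kind CLIP uses for zero-shot classification.
TEMPLATES = ["a photo of a {}.", "a cropped photo of a {}."]


def compose_prompts(object_name, states, templates):
    """Fill every template with every state-level description of the object."""
    return [
        template.format(state.format(object_name))
        for state, template in product(states, templates)
    ]


normal_prompts = compose_prompts("bottle", NORMAL_STATES, TEMPLATES)
anomalous_prompts = compose_prompts("bottle", ANOMALOUS_STATES, TEMPLATES)
```

Each class (normal, anomalous) ends up with `len(states) * len(templates)` prompts, which can then be encoded by the text encoder.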

Method

Given an image $x \in X$, both Anomaly Classification (AC) and Anomaly Segmentation (AS) aim to predict the abnormality in $x$.
- AC is treated as a binary classification $X \rightarrow \left\{-, + \right\}$.
  - $+$ means that an anomaly is present at the image level.
- AS extends this to the pixel level: for an image of size $(h \times w)$, it marks the anomaly locations as $X \rightarrow \left\{-, + \right\}^{h \times w}$.
In practice, the task is often posed as predicting an anomaly score.
- For AC, the model is typically a mapping $ascore : X \rightarrow [0,1]$.
  - Thresholding this score yields the binary classification.
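
A minimal sketch of turning anomaly scores into binary decisions. The threshold value `0.5` is an arbitrary assumption, and applying the same rule per pixel for AS is an illustrative extension rather than something the notes specify.

```python
def classify(ascore, threshold=0.5):
    """Map an image-level anomaly score in [0, 1] to '+' (anomalous) or '-' (normal)."""
    return "+" if ascore >= threshold else "-"


def segment(score_map, threshold=0.5):
    """Apply the same rule to every pixel of an h x w score map (a list of rows)."""
    return [[classify(score, threshold) for score in row] for row in score_map]


image_label = classify(0.8)
pixel_labels = segment([[0.1, 0.9], [0.6, 0.3]])
```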

Contrastive Language-Image Pre-training (CLIP) is a large-scale pre-training method that provides a joint vision-language representation.
Given millions of image-text pairs $\left\{(x_t, s_t) \right\}^T_{t=1}$ collected from the web, CLIP trains an image encoder $f$ and a text encoder $g$ via contrastive learning.
- It maximizes the correlation between $f(x_t)$ and $g(s_t)$, measured by the cosine similarity $\langle f(x), g(s) \rangle$.
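Given an input image $x$ and a closed set of free-form texts $S=\left\{s_1, \ldots , s_k\right\}$, CLIP can perform zero-shot classification through a $k$-way categorical distribution:

$$p(y = i \mid x) = \frac{\exp\left(\langle f(x), g(s_i) \rangle / \tau\right)}{\sum_{j=1}^{k} \exp\left(\langle f(x), g(s_j) \rangle / \tau\right)}$$

- $\tau$ : temperature hyperparameter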
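
The $k$-way categorical distribution is a temperature-scaled softmax over the cosine similarities between the image embedding and each text embedding. A minimal sketch in plain Python, assuming toy 2-dimensional embeddings in place of real encoder outputs and an illustrative `tau=0.01`:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_probs(image_emb, text_embs, tau=0.01):
    """Softmax over cosine similarities divided by the temperature tau."""
    logits = [cosine(image_emb, t) / tau for t in text_embs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings (assumed values): f(x) and g(s_1), g(s_2).
image_emb = [1.0, 0.0]
text_embs = [[0.9, 0.1], [0.0, 1.0]]
probs = zero_shot_probs(image_emb, text_embs)
```

A smaller `tau` sharpens the distribution toward the text with the highest similarity.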

Combining each label in the class words $C=\left\{c_1, \ldots , c_k \right\}$ with a prompt template improves accuracy.
- Prompt templates such as “a photo of a [c]”
Aggregating the prompt embeddings of several templates is known to improve performance further.
- Additional prompt templates such as “a cropped photo of a [c]” are combined.
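
A minimal sketch of aggregating prompt embeddings over several templates. Averaging L2-normalized embeddings and renormalizing is one common choice and is assumed here; the notes only say the embeddings are aggregated. `toy_text_encoder` is a hypothetical stand-in for CLIP's text encoder $g$, and the class words `cat` and `dog` are illustrative.

```python
import math

TEMPLATES = ["a photo of a {}", "a cropped photo of a {}"]


def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for the text encoder g: a character-count embedding."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    return vec


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def ensemble_embedding(class_word, templates, encoder):
    """Average the normalized embeddings of all filled templates, then renormalize."""
    embs = [l2_normalize(encoder(t.format(class_word))) for t in templates]
    mean = [sum(column) / len(embs) for column in zip(*embs)]
    return l2_normalize(mean)


class_embeddings = {
    c: ensemble_embedding(c, TEMPLATES, toy_text_encoder) for c in ["cat", "dog"]
}
```

The resulting per-class embedding replaces the single-template text embedding when computing similarities with the image embedding.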