0. Abstract

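Large-scale vision-language models (VLMs) such as CLIP have recently shown promise on the Zero-Shot Anomaly Segmentation (ZSAS) task
- A unified model can detect anomalies on unseen objects through manually defined text prompts
- However, existing methods set up the text prompts assuming the category of the object to be inspected is known
  - Hard to apply in data-privacy scenarios
- Even within one category, changes in specific components and production processes cause substantial variation, which makes text prompt design very difficult
This paper proposes the Visual Context Prompting Model (VCP-CLIP), which performs the ZSAS task on top of CLIP
- Uses visual context prompting to activate CLIP's ability to perceive anomalous semantics
- Designs a Pre-VCP module that embeds global visual information into the text prompt
  - No need to design product-specific prompts
- Proposes a Post-VCP module that adjusts the text embedding using fine-grained image features
Achieves SOTA ZSAS performance on 10 real-world industrial anomaly segmentation datasets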

1. Introduction

Existing CLIP-based ZSAS methods map the image and the corresponding two-class text into a joint space and compute cosine similarity
- Image regions with high similarity to the defect-related text are treated as anomalies
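
The two-class scoring described above can be sketched roughly as follows. This is a minimal illustration, not the code of any specific method: the function names, the toy 2-D embeddings, and the temperature value of 0.07 are all assumptions made for the example, and real embeddings would come from CLIP's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def anomaly_scores(patch_embeddings, normal_text, abnormal_text, temperature=0.07):
    """Per-patch probability of the abnormal class.

    A softmax over the (normal, abnormal) similarities is taken for each
    patch. The temperature is an illustrative assumption.
    """
    scores = []
    for patch in patch_embeddings:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        peak = max(s_normal, s_abnormal)  # subtract the max for numerical stability
        e_normal = math.exp(s_normal - peak)
        e_abnormal = math.exp(s_abnormal - peak)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores

# Toy embeddings (hypothetical): axis 0 ~ "normal", axis 1 ~ "abnormal".
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]
patches = [[0.9, 0.1], [0.2, 0.8]]
scores = anomaly_scores(patches, normal_text, abnormal_text)
```

Reshaping `scores` back to the patch grid and upsampling it to the image size would give the anomaly map; thresholding that map gives the segmentation.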
Existing methods design product-specific textual prompts, assuming the product category of the inspected image is known
- ex) “a photo of a normal wood”
- The product category may be unobtainable in data-privacy settings, or unpredictable, so these methods cannot be used
Ran an experiment on WinCLIP that replaces the product category in the text prompt with semantically similar terms
- ex) bottle → container, vessel, etc.
- Segmentation performance changes by up to $\pm$8% in the Average Precision (AP) metric
- Shows that the product name in the text prompt matters, yet some product names are ambiguous
  - pcb1, pcb2, pcb3, etc. in the VisA dataset
AnomalyCLIP replaces every product name with the same word “object” to design an object-agnostic text prompt
- Hard to apply in complex industrial scenarios
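
To make the difference between the prompt styles concrete, here is a small sketch of how such prompts could be built. The normal template follows the “a photo of a normal wood” example; the helper name `two_class_prompts` and the abnormal state word "damaged" are assumptions for illustration, not the exact wording used by WinCLIP or AnomalyCLIP.

```python
def two_class_prompts(name, normal_state="normal", abnormal_state="damaged"):
    """Return the (normal, abnormal) prompt pair for one product name.

    The state words are illustrative assumptions.
    """
    return (
        f"a photo of a {normal_state} {name}",
        f"a photo of a {abnormal_state} {name}",
    )

# Product-specific prompts, including the synonym swap tried on WinCLIP.
specific = {name: two_class_prompts(name) for name in ["bottle", "container", "vessel"]}

# Object-agnostic prompts: every product name collapses to the same word.
agnostic = two_class_prompts("object")
```

Each synonym yields a different text embedding, which is one way to see why segmentation quality can shift when only the product name changes.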
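WinCLIP, April-GAN, and AnomalyCLIP map the image and the text into the joint space separately, without any interaction
- Mutual understanding between the modalities is hindered, and overfitting to specific text prompts is likely
- Aligning image and text directly, as in Fig.1(a), yields only a limited understanding across the modalities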

To address the problems above, a CLIP-based Visual Context Prompting (VCP) model is used to perform the ZSAS task
- After training on a limited set of seen products from an auxiliary dataset, performs anomaly segmentation on unseen products
- Existing methods rely on manually defined text prompts (Fig.2(b))
- A unified text prompt is used as the baseline (Fig.2(c))
- This paper sets the product category as continuous learnable tokens (Fig.2(d))
Deep Text Prompting (DTP) is introduced to redefine the text space so that the understanding of global image features can be used
- Compared with the baseline, Pre-VCP enables the shift from a uniform prompt to image-specific prompts
  - Can markedly reduce the cost of prompt design
- Post-VCP adjusts the output text embedding based on fine-grained visual features
  - Strengthens the mutual understanding of features across the two modalities
  - Strengthens CLIP's ability to segment anomalous regions accurately
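
One way to picture the shift from a uniform prompt to image-specific prompts is the sketch below. It assumes that the global image feature is projected to one offset per learnable token and added to that token; this additive form, the per-token linear projections, and all names and toy values are assumptions for illustration and may differ from the actual Pre-VCP design.

```python
def linear(weight, vec):
    """Matrix-vector product for a weight stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def image_specific_tokens(learnable_tokens, global_feature, projections):
    """Shift each learnable category token by a projection of the global feature.

    The additive conditioning is an assumption made for this sketch.
    """
    tokens = []
    for token, projection in zip(learnable_tokens, projections):
        offset = linear(projection, global_feature)
        tokens.append([t + o for t, o in zip(token, offset)])
    return tokens

# Two learnable tokens of width 2, conditioned on a 3-D global feature (toy values).
learnable_tokens = [[0.1, 0.2], [0.3, 0.4]]
projections = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
]
image_a = [1.0, 0.0, 0.0]
image_b = [0.0, 0.0, 2.0]
tokens_a = image_specific_tokens(learnable_tokens, image_a, projections)
tokens_b = image_specific_tokens(learnable_tokens, image_b, projections)
```

The same learnable tokens produce different prompt tokens for `image_a` and `image_b`, so no product name has to be written by hand.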
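In summary, this paper proposes VCP-CLIP, which performs the ZSAS task on top of CLIP
- Extracts global / dense image embeddings from the image encoder
  - global feature : passed through the Pre-VCP module, then merged into the input text prompt
  - dense feature : used as the fine-grained image features for anomaly segmentation
- The Post-VCP module is placed afterwards to update the text embedding based on fine-grained image features
  - Makes effective use of the mutual understanding between the two modalities and strengthens generalization to new products
- The final anomaly map integrates the segmentation results obtained by aligning the text embeddings, both the original and the Post-VCP-updated ones, with the dense image embeddings
  - Helps improve segmentation performance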
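
The last two steps can be sketched as follows. The sketch assumes that the text embedding is updated by a single-head residual attention over the patch embeddings (text as query, patches as keys and values, no learned projections) and that the two similarity maps are fused by a plain average. Both choices, and every name below, are assumptions for illustration; the real Post-VCP module and fusion rule may differ.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def update_text_embedding(text_embedding, patch_embeddings):
    """Residual attention: the text embedding attends over all patches."""
    scale = math.sqrt(len(text_embedding))
    weights = softmax([dot(text_embedding, p) / scale for p in patch_embeddings])
    attended = [
        sum(w * p[i] for w, p in zip(weights, patch_embeddings))
        for i in range(len(text_embedding))
    ]
    return [t + a for t, a in zip(text_embedding, attended)]

def similarity_map(text_embedding, patch_embeddings):
    """Cosine similarity between one text embedding and every patch."""
    text_norm = math.sqrt(dot(text_embedding, text_embedding))
    return [
        dot(text_embedding, p) / (text_norm * math.sqrt(dot(p, p)))
        for p in patch_embeddings
    ]

def fused_anomaly_map(text_embedding, patch_embeddings):
    """Average the maps from the original and the updated text embedding."""
    updated = update_text_embedding(text_embedding, patch_embeddings)
    original_map = similarity_map(text_embedding, patch_embeddings)
    updated_map = similarity_map(updated, patch_embeddings)
    return [(a + b) / 2 for a, b in zip(original_map, updated_map)]

# Toy abnormal-text embedding and three patches (hypothetical values).
abnormal_text = [0.0, 1.0]
patches = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
fused = fused_anomaly_map(abnormal_text, patches)
```

In this toy setup the fused score rises from the first patch to the last, following how closely each patch aligns with the abnormal text direction.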

2. Related Work