ZERO-SHOT PSEUDO LABELS GENERATION USING SAM AND CLIP FOR SEMI-SUPERVISED SEMANTIC SEGMENTATION

Nagito Saito, Shintaro Ito, Koichi Ito, Takafumi Aoki
Graduate School of Information Sciences, Tohoku University, Japan
ICIP 2025

Abstract

Semantic segmentation is a fundamental task in medical image analysis and autonomous driving, but annotating the labels required for training is costly. To address this problem, semi-supervised semantic segmentation methods that learn from a small amount of labeled data have been proposed. One common approach is to train a semantic segmentation model using both images with annotated labels and images with pseudo labels. In this approach, the accuracy of the segmentation model depends on the quality of the pseudo labels, which in turn depends on the performance of the model being trained and on the amount of annotated data. In this paper, we generate pseudo labels by zero-shot annotation using the Segment Anything Model (SAM) and Contrastive Language-Image Pretraining (CLIP), refine them into enhanced labels using the Unified Dual-Stream Perturbations Approach (UniMatch), and use these enhanced labels to train a semantic segmentation model. The effectiveness of the proposed method is demonstrated through experiments on the public PASCAL and MS COCO datasets.

Zero-shot annotation

Our method extracts segment-wise embeddings by applying SAM-based pooling to the feature map generated by the image encoder of CLIP. Specifically, given a feature map and the corresponding binary masks produced by SAM, we compute the embedding of each segment by averaging the features within its masked region. Class labels, such as object names, are embedded using CLIP’s text encoder. We then calculate the cosine similarity between each segment embedding and the class label embeddings. The segment is assigned the class label with the highest similarity score. If no class exceeds a predefined similarity threshold, the segment is left unassigned (e.g., background regions).
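A minimal sketch of this assignment step is given below, assuming the CLIP feature map and the SAM masks have already been computed and brought to the same spatial resolution. The function name assign_labels, the variable names, and the default similarity threshold are illustrative assumptions rather than the paper's released implementation.

import torch
import torch.nn.functional as F

def assign_labels(clip_feature_map: torch.Tensor,
                  sam_masks: torch.Tensor,
                  text_embeddings: torch.Tensor,
                  sim_threshold: float = 0.5):
    """Assign a class label to each SAM segment via CLIP similarity (sketch).

    clip_feature_map: (C, H, W) feature map from CLIP's image encoder.
    sam_masks:        (N, H, W) binary masks produced by SAM, resized to (H, W).
    text_embeddings:  (K, C) CLIP text embeddings of the K class names.
    Returns a list of N entries, each a class index or None (unassigned).
    """
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    labels = []
    for mask in sam_masks:
        mask = mask.bool()
        if not mask.any():
            labels.append(None)
            continue
        # SAM-based pooling: average the CLIP features inside the masked region.
        seg_embedding = clip_feature_map[:, mask].mean(dim=1)
        seg_embedding = F.normalize(seg_embedding, dim=0)
        # Cosine similarity between the segment embedding and every class embedding.
        sims = text_embeddings @ seg_embedding            # shape (K,)
        best_sim, best_cls = sims.max(dim=0)
        # Below the threshold the segment stays unassigned (e.g., background).
        labels.append(best_cls.item() if best_sim.item() >= sim_threshold else None)
    return labels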

Zero-shot annotation Diagram

Training flow

We refine pseudo labels into enhanced labels for training. The model predicts the per-pixel class probability distribution p from the input image. For unlabeled data, we apply one weak perturbation, a feature-level perturbation, and two strong perturbations to generate predictions pw, pfp, ps1, and ps2, respectively. A smooth loss Lsmooth is computed between the model’s predictions and the enhanced pseudo labels for the unlabeled images, while a supervised loss Ls is computed between predictions pl and ground-truth labels for the labeled images. The model is trained by minimizing the sum of these losses over a mini-batch that includes both labeled and unlabeled samples. Label smoothing is applied to Lsmooth to prevent overfitting to enhanced labels.
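A sketch of the corresponding loss computation for one mini-batch is given below. It follows the description above under several assumptions: the perturbation functions weak_aug, strong_aug1, and strong_aug2, the model's feature_perturbation flag, the label-smoothing factor eps, and the equal weighting of the four unlabeled-branch terms are illustrative placeholders, not the authors' implementation.

import torch.nn.functional as F

def training_step(model, labeled_img, gt_label, unlabeled_img, enhanced_label,
                  weak_aug, strong_aug1, strong_aug2, eps=0.1):
    """One semi-supervised training step (illustrative sketch)."""
    # Supervised loss Ls on labeled images with ground-truth labels.
    p_l = model(labeled_img)                                   # (B, K, H, W) logits
    loss_s = F.cross_entropy(p_l, gt_label, ignore_index=255)

    # Predictions under weak, feature-level, and two strong perturbations.
    x_w = weak_aug(unlabeled_img)
    p_w  = model(x_w)
    p_fp = model(x_w, feature_perturbation=True)               # assumed model flag
    p_s1 = model(strong_aug1(unlabeled_img))
    p_s2 = model(strong_aug2(unlabeled_img))

    # Lsmooth: loss against the enhanced labels with label smoothing,
    # which prevents overfitting to imperfect pseudo labels.
    def smooth_ce(logits):
        return F.cross_entropy(logits, enhanced_label,
                               ignore_index=255, label_smoothing=eps)

    loss_smooth = (smooth_ce(p_w) + smooth_ce(p_fp)
                   + smooth_ce(p_s1) + smooth_ce(p_s2)) / 4

    # The model is trained by minimizing the sum of both losses.
    return loss_s + loss_smooth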

Training flow Diagram

Quantitative Evaluation

The quantitative evaluation results of the proposed method on PASCAL and COCO are shown below. On PASCAL, the proposed method achieves higher mIoU than UniMatch and LogicDiag when 183 or more labeled images are available, and achieves accuracy comparable to or better than AllSpark and BeyondPixels. Notably, while AllSpark is trained on 513 × 513 pixel images, the proposed method uses smaller 321 × 321 pixel images, demonstrating that high-performance semi-supervised learning can be achieved even with lower-resolution inputs. On COCO, the proposed method consistently outperforms UniMatch, LogicDiag, and AllSpark across all labeled-image settings, and is particularly strong in low-label regimes. While AllSpark and UniMatch are trained on 513 × 513 pixel images, the proposed method uses 400 × 400 pixel inputs; as on PASCAL, this confirms that the proposed method enables effective semi-supervised learning even with smaller image sizes.

Comparison with existing methods for PASCAL (mIoU [%] ↑).
Bold type indicates the best results for each splitting of the labeled images.

Method                              | 1/16 (92) | 1/8 (183) | 1/4 (366) | 1/2 (732) | Full (1,464)
UniMatch [Yang, CVPR 2023]          |   75.2    |   77.19   |   78.8    |   79.9    |   81.2
LogicDiag [Liang, ICCV 2023]        |   73.3    |   76.7    |   77.9    |   79.4    |   -
AllSpark [Wang, CVPR 2024]          |   76.07   |   78.41   |   79.77   |   80.75   |   82.12
BeyondPixels [Howlader, ECCV 2024]  |   77.3    |   78.6    |   79.8    |   80.8    |   81.7
Ours                                |   65.30   |   78.69   |   79.8    |   80.56   |   82.15
Comparison with existing methods for COCO (mIoU [%] ↑).
Bold type indicates the best results for each splitting of the labeled images.

Method                        | 1/512 (232) | 1/256 (463) | 1/128 (925) | 1/64 (1,849)
UniMatch [Yang, CVPR 2023]    |    31.86    |    38.88    |    44.35    |    48.17
LogicDiag [Liang, ICCV 2023]  |    33.1     |    40.3     |    45.4     |    48.8
AllSpark [Wang, CVPR 2024]    |    34.10    |    41.65    |    45.48    |    49.56
Ours                          |    46.06    |    48.20    |    48.98    |    51.20
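All values above are mean Intersection over Union (mIoU). For reference, a minimal sketch of the standard mIoU computation over predicted and ground-truth label maps is given below; it illustrates the metric itself and is not the evaluation code used in the paper.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU between integer label maps pred and gt of the same shape."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))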

Qualitative Evaluation

The following figure shows the segmentation results of each method on COCO with the 1/512 (232) split. Since COCO contains 80 labeled classes, UniMatch and AllSpark, which generate pseudo labels based on the predictions of the model being trained, fail to detect some classes. In contrast, the proposed method, which generates pseudo labels via zero-shot annotation using SAM and CLIP, can detect most of the classes. Although UniMatch and AllSpark sometimes assign incorrect labels to detected segments, the proposed method assigns the correct labels.

Qualitative Diagram

BibTeX

@inproceedings{Saito-ICIP-2025,
         author    = "Saito, N. and Ito, S. and Ito, K. and Aoki, T.",
         title     = "Zero-Shot Pseudo Labels Generation Using SAM and CLIP for Semi-Supervised Semantic Segmentation",
         booktitle = "Proc. Int'l Conf. Image Processing (ICIP)",
         year      = "2025",
         month     = sep
         }