A Self-Training Framework Based on Multi-Scale Attention Fusion for Weakly Supervised Semantic Segmentation
Guoqing Yang
Chuang Zhu*
Yu Zhang
[Paper]
[GitHub]


Abstract

Weakly supervised semantic segmentation (WSSS) based on image-level labels is challenging because it is hard to obtain complete semantic regions. To address this issue, we propose a self-training method that utilizes fused multi-scale class-aware attention maps. We observe that attention maps of different scales contain rich complementary information, especially for large and small objects. Therefore, we first collect information from attention maps of images at different scales and obtain multi-scale attention maps. We then apply denoising and reactivation strategies to enhance potential regions and reduce noisy areas. Finally, the refined attention maps are used to retrain the network. Experiments show that our method enables the model to see the rich semantic information of multi-scale images from single-scale inputs, and it attains 72.4% mIoU on both the PASCAL VOC 2012 validation and test sets. The code is available at https://bupt-ai-cz.github.io/SMAF.
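
The snippet below is a minimal PyTorch-style sketch of how multi-scale class-aware attention maps could be collected and fused; the function name, the choice of scales, and the max-over-scales fusion rule are illustrative assumptions, not the exact implementation from the paper.

import torch
import torch.nn.functional as F

def fuse_multiscale_attention(model, image, scales=(0.5, 1.0, 1.5, 2.0)):
    """Collect class-aware attention maps at several image scales and fuse them.

    `model` is assumed to return a (1, C, h, w) class-aware attention map for a
    (1, 3, H, W) input; max fusion across scales is one possible choice.
    """
    _, H, W = image.shape
    fused = None
    for s in scales:
        # Rescale the input image.
        scaled = F.interpolate(image.unsqueeze(0), scale_factor=s,
                               mode='bilinear', align_corners=False)
        with torch.no_grad():
            attn = model(scaled)                      # (1, C, h, w)
        # Resize each map back to the original resolution before fusing.
        attn = F.interpolate(attn, size=(H, W),
                             mode='bilinear', align_corners=False)
        fused = attn if fused is None else torch.maximum(fused, attn)
    # Normalize each class map to [0, 1].
    fused = fused / (fused.amax(dim=(2, 3), keepdim=True) + 1e-5)
    return fused.squeeze(0)                           # (C, H, W)
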


Overall Framework

Overview of our proposed method. We first pre-train the student branch using an existing WSSS method and initialize the teacher branch. The teacher branch is responsible for generating fused multi-scale attention maps, which are then refined by denoising and reactivation strategies. Finally, the refined multi-scale attention maps are used to train the student branch.
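
For illustration only, here is a hypothetical sketch of one teacher-student self-training step under this framework; the helper names (denoise, reactivate, attention_loss) and the overall loop are assumptions meant to mirror the description above, not the authors' exact procedure.

import torch

def self_training_step(student, teacher, optimizer, image, label,
                       denoise, reactivate, attention_loss):
    teacher.eval()
    with torch.no_grad():
        # Teacher produces fused multi-scale attention maps (see the sketch above).
        fused = fuse_multiscale_attention(teacher, image)
        # Refine them: suppress noisy activations, then reactivate weak but
        # reliable regions, guided by the image-level label.
        target = reactivate(denoise(fused, label), label)

    # Student is trained on single-scale inputs against the refined maps.
    student.train()
    pred = student(image.unsqueeze(0))
    loss = attention_loss(pred, target.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
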



Citation

@article{yang2023self,
  title={A Self-Training Framework Based on Multi-Scale Attention Fusion for Weakly Supervised Semantic Segmentation},
  author={Yang, Guoqing and Zhu, Chuang and Zhang, Yu},
  journal={arXiv preprint arXiv:2305.05841},
  year={2023}
}


Visualization Results

Visual comparison of attention map quality. From top to bottom: original image, ground truth, attention maps generated by EPS [3], and attention maps generated by our method.

Qualitative segmentation results on the PASCAL VOC 2012 val set. From top to bottom: input images, ground truth, and segmentation results of our method.



Results