LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation

Xucong Wang¹ Pengkun Wang¹,² Zhe Zhao¹,³ Liheng Yu¹ Rui Mao¹ Yang Wang¹,²

¹University of Science and Technology of China (USTC) ²Suzhou Institute for Advanced Research, USTC ³City University of Hong Kong

CVPR 2026 Highlight

Abstract

Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are designed and evaluated on well-curated datasets, overlooking a critical post-deployment issue: the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints of edge devices, models are often limited to low-resolution inputs (e.g., below 224×224 for CLIP-ViT-B/16) and therefore produce fewer tokens (with the position embedding resized), which poses a unique challenge to performance robustness.
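
To make the link between input resolution and token count concrete, the following minimal sketch counts the tokens a ViT with 16×16 patches (as in ViT-B/16) processes for a square input, assuming one extra class token. The function name `vit_token_count` and the restriction to side lengths divisible by the patch size are assumptions made for illustration.

```python
def vit_token_count(resolution: int, patch_size: int = 16, extra_tokens: int = 1) -> int:
    """Tokens seen by a ViT for a square input of side `resolution`.

    Assumes non-overlapping patches and `extra_tokens` special tokens
    (one class token by default).
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side ** 2 + extra_tokens


for side in (224, 192, 144, 96):
    print(f"{side}x{side}: {vit_token_count(side)} tokens")
# 224x224: 197 tokens
# 192x192: 145 tokens
# 144x144: 82 tokens
# 96x96: 37 tokens
```

Going from 224×224 to 96×96 shrinks the patch grid from 14×14 to 6×6, so the position embedding has to be resized to the smaller grid and the encoder sees roughly one fifth of the tokens.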

To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture: two student models, fed with inputs at different resolutions, learn from each other. Building on this, we contextualize the students' prompts with resolution-invariant attributes queried from an LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets bridge the encoders of the two students, and on top of them we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring its practical utility.
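
The abstract does not spell out the LLD and HLD objectives, so the sketch below is only a rough illustration of how two students can supervise each other. It assumes a common self-distillation pattern: a mean squared error between the students' intermediate embeddings stands in for a low-level term, and a symmetric KL divergence between their class predictions stands in for a high-level term. All function names, the temperature, and both loss forms are assumptions, not the paper's definitions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def feature_matching_loss(feat_a, feat_b):
    """Assumed stand-in for a low-level term: MSE between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def mutual_prediction_loss(logits_a, logits_b, temperature=2.0):
    """Assumed stand-in for a high-level term: symmetric KL between predictions."""
    p = softmax(logits_a, temperature)
    q = softmax(logits_b, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Toy embeddings and logits for one image seen at high and low resolution.
hr_feat, lr_feat = [0.2, -0.1, 0.5], [0.1, 0.0, 0.3]
hr_logits, lr_logits = [2.0, 0.5, -1.0], [1.2, 0.8, -0.5]

total = feature_matching_loss(hr_feat, lr_feat) + mutual_prediction_loss(hr_logits, lr_logits)
print(f"toy self-distillation loss: {total:.4f}")
```

Because the prediction term is symmetric, neither student is a fixed teacher: each is pulled toward the other, which is one plausible reading of students that "learn from each other".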

Overview

Figure: Overview of the LOREAL framework. (a) An LLM generates several resolution-invariant attributes. (b) Self-distillation framework: visual embeddings fill the prompt attributes via meta-nets, and Low-Level Distillation (LLD) and High-Level Distillation (HLD) drive the self-distillation. Only the meta-nets are learnable, and the two illustrated meta-nets share parameters. LR denotes low resolution. (c) Inference stage: the model takes LR images and contextualizes prompts with the meta-nets.

LOREAL Efficiency and Adaptability

Following the paper, we report an efficiency study (Table 4) on CLIP-ViT-B/16 with MaPLe and MMRL as backbones. LOREAL adds only lightweight meta-nets (+104.5K tunable parameters), at most 5 ms of training time and 2 ms of inference time per sample, and 33 MB of inference memory. In return, it raises harmonic mean (HM) accuracy on the LR Base-to-New benchmark by more than 22 points at input resolution φ = 96² and by more than 4 points at φ = 144². Training and inference times are measured per sample, as in the paper.
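
For readers unfamiliar with the HM metric, the sketch below computes it under the assumption that it follows the usual base-to-new convention, i.e. the harmonic mean of base-class and new-class accuracy. The input accuracies in the usage line are made-up illustration values, not results from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of base-class and new-class accuracy (both in percent)."""
    if base_acc + new_acc == 0:
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Hypothetical accuracies, chosen only to show the behaviour of the metric.
print(round(harmonic_mean(80.0, 60.0), 2))  # 68.57
```

The harmonic mean sits closer to the lower of the two accuracies, so a method cannot score well on HM by doing well on base classes alone.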

**Efficiency study.** Training and inference time per sample; HM on LR-B2N (16-shot, averaged over 11 datasets, 3 runs). Values prefixed with "+" are increases over the baseline in the row directly above.

φ	Method	Tunable params	Train time	Infer. time	Infer. mem.	HM (↑)

96×96	MaPLe	3555K	107 ms	32 ms	612 MB	34.85
96×96	MaPLe + LOREAL	+104.5K	+4 ms	+1 ms	+33 MB	57.25
96×96	MMRL	4992K	113 ms	29 ms	1236 MB	38.90
96×96	MMRL + LOREAL	+104.5K	+5 ms	+1 ms	+33 MB	61.71
144×144	MaPLe	3555K	126 ms	38 ms	844 MB	68.80
144×144	MaPLe + LOREAL	+104.5K	+4 ms	+2 ms	+33 MB	73.72
144×144	MMRL	4992K	145 ms	37 ms	1440 MB	72.43
144×144	MMRL + LOREAL	+104.5K	+5 ms	+1 ms	+33 MB	76.56
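
The trade-off in the efficiency table can be summarized with a little arithmetic. The sketch below uses only values copied from that table and computes the relative increase in tunable parameters alongside the HM gain. The dictionary layout and function names are illustrative choices.

```python
# Values copied from the efficiency table (parameters in thousands).
BASE_PARAMS_K = {"MaPLe": 3555.0, "MMRL": 4992.0}
LOREAL_EXTRA_PARAMS_K = 104.5

# (baseline HM, HM with LOREAL) keyed by (method, input side length).
HM_SCORES = {
    ("MaPLe", 96): (34.85, 57.25),
    ("MMRL", 96): (38.90, 61.71),
    ("MaPLe", 144): (68.80, 73.72),
    ("MMRL", 144): (72.43, 76.56),
}


def param_overhead_percent(method: str) -> float:
    """Extra tunable parameters as a percentage of the baseline's."""
    return 100.0 * LOREAL_EXTRA_PARAMS_K / BASE_PARAMS_K[method]


def hm_gain(method: str, side: int) -> float:
    """Absolute HM improvement in percentage points."""
    baseline, with_loreal = HM_SCORES[(method, side)]
    return with_loreal - baseline


for method, side in HM_SCORES:
    print(
        f"{method} @ {side}x{side}: "
        f"+{param_overhead_percent(method):.2f}% params, "
        f"+{hm_gain(method, side):.2f} HM"
    )
# MaPLe @ 96x96: +2.94% params, +22.40 HM
# MMRL @ 96x96: +2.09% params, +22.81 HM
# MaPLe @ 144x144: +2.94% params, +4.92 HM
# MMRL @ 144x144: +2.09% params, +4.13 HM
```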

**LR-CE benchmark — target average accuracy (%)** over ten datasets (excluding ImageNet source).

Method	φ = 96²	φ = 144²	φ = 192²

CoOp	31.72	63.10	69.20
CoOp + LOREAL	45.77	69.78	74.67
MaPLe	36.07	64.02	69.61
MaPLe + LOREAL	52.77	72.61	77.19
MMA	37.27	65.32	72.28
MMA + LOREAL	54.66	72.99	76.81
MMRL	40.77	69.71	75.30
MMRL + LOREAL	55.08	74.12	79.14
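
To read the LR-CE table as improvements rather than raw accuracies, the following sketch subtracts each baseline from its LOREAL counterpart. The numbers are copied from the table; the data layout and the helper name `gains` are illustrative.

```python
RESOLUTIONS = (96, 144, 192)

# Method -> (baseline accuracies, accuracies with LOREAL), ordered as RESOLUTIONS.
LR_CE = {
    "CoOp": ((31.72, 63.10, 69.20), (45.77, 69.78, 74.67)),
    "MaPLe": ((36.07, 64.02, 69.61), (52.77, 72.61, 77.19)),
    "MMA": ((37.27, 65.32, 72.28), (54.66, 72.99, 76.81)),
    "MMRL": ((40.77, 69.71, 75.30), (55.08, 74.12, 79.14)),
}


def gains(method: str) -> list:
    """Absolute accuracy gains (percentage points) per resolution."""
    baseline, with_loreal = LR_CE[method]
    return [round(w - b, 2) for b, w in zip(baseline, with_loreal)]


for method in LR_CE:
    per_res = ", ".join(
        f"{side}x{side}: +{g:.2f}" for side, g in zip(RESOLUTIONS, gains(method))
    )
    print(f"{method}: {per_res}")
# CoOp: 96x96: +14.05, 144x144: +6.68, 192x192: +5.47
# MaPLe: 96x96: +16.70, 144x144: +8.59, 192x192: +7.58
# MMA: 96x96: +17.39, 144x144: +7.67, 192x192: +4.53
# MMRL: 96x96: +14.31, 144x144: +4.41, 192x192: +3.84
```

For every baseline the gain is largest at 96² and smallest at 192².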

**LR-DG benchmark — target average accuracy (%)** over ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R.

Method	φ = 96²	φ = 144²	φ = 192²

CoOp	15.02	43.87	53.16
CoOp + LOREAL	23.27	48.32	55.93
MaPLe	10.56	41.56	52.59
MaPLe + LOREAL	26.86	48.42	54.78
MMA	14.47	43.69	53.18
MMA + LOREAL	25.52	47.35	54.16
MMRL	15.42	47.34	57.37
MMRL + LOREAL	30.64	50.53	58.47
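
A compact way to see how the benefit depends on resolution is to average the LR-DG gain over the four baselines at each resolution. The sketch below does this with values copied from the table. The averaging itself is an illustrative summary added here, not a statistic reported by the paper.

```python
from statistics import mean

RESOLUTIONS = (96, 144, 192)

# Method -> (baseline accuracies, accuracies with LOREAL), ordered as RESOLUTIONS.
LR_DG = {
    "CoOp": ((15.02, 43.87, 53.16), (23.27, 48.32, 55.93)),
    "MaPLe": ((10.56, 41.56, 52.59), (26.86, 48.42, 54.78)),
    "MMA": ((14.47, 43.69, 53.18), (25.52, 47.35, 54.16)),
    "MMRL": ((15.42, 47.34, 57.37), (30.64, 50.53, 58.47)),
}


def mean_gain_by_resolution() -> dict:
    """Average absolute gain over all baselines, keyed by input side length."""
    result = {}
    for index, side in enumerate(RESOLUTIONS):
        per_method = [
            with_loreal[index] - baseline[index]
            for baseline, with_loreal in LR_DG.values()
        ]
        result[side] = mean(per_method)
    return result


for side, gain in mean_gain_by_resolution().items():
    print(f"{side}x{side}: mean gain +{gain:.2f} points")
```

Under this summary the average gain is roughly 12.7 points at 96², 4.5 at 144², and 1.8 at 192², so the improvement is concentrated at the lowest resolution.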

BibTeX

@InProceedings{Wang_2026_CVPR,
    author    = {Wang, Xucong and Wang, Pengkun and Zhao, Zhe and Yu, Liheng and Mao, Rui and Wang, Yang},
    title     = {LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39152-39163}
}
