LOREAL: Mitigating Low-Resolution Challenges in Vision-Language Models with Attribute-driven Prompt Self-Distillation

Xucong Wang1     Pengkun Wang1,2     Zhe Zhao1,3     Liheng Yu1     Rui Mao1     Yang Wang1,2
1University of Science and Technology of China (USTC)     2Suzhou Institute for Advanced Research, USTC     3City University of Hong Kong

CVPR 2026

Comparison of classic KD, PromptKD, and our LOREAL. The upper and lower parts show the training and inference stages, respectively. TE means the Text Encoder. LR means Low-Resolution. (a) Classic KD of VLMs, where students are fully tuned. (b) PromptKD, which leverages prompts to learn from teachers. Both (a) and (b) are designed for non-LR inference. (c) The proposed LOREAL, a prompt self-distillation scheme that addresses the LR challenges. Here, the two students are the same model fed with different inputs. LOREAL leverages fine-grained attribute guidance and distills at two levels simultaneously to boost the model's robustness to input resolution.

Abstract

Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon: the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints of edge devices, models are often limited to low-resolution inputs (e.g., below 224×224 for CLIP-ViT-B/16) and therefore generate fewer tokens (with the position embedding resized), which poses a unique challenge for performance robustness.
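To make the link between input resolution and token count concrete, the short Python sketch below counts the tokens produced by a ViT with 16×16 patches and one class token, as in CLIP-ViT-B/16. The helper name `vit_token_count` is illustrative and not part of any library; square inputs whose side is a multiple of the patch size are assumed.

```python
def vit_token_count(resolution, patch_size=16, class_token=True):
    """Number of tokens a ViT sees for a square input of side `resolution`.

    Assumes non-overlapping patches and a side length that is a multiple
    of `patch_size` (hypothetical helper, for illustration only).
    """
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    return patches_per_side * patches_per_side + (1 if class_token else 0)


# Token counts at the low resolutions discussed here and at the default 224.
for side in (96, 144, 192, 224):
    print(f"{side}x{side}: {vit_token_count(side)} tokens")
```

At 96×96 the encoder processes 37 tokens instead of the 197 it sees at 224×224, which is why the position embedding has to be resized for low-resolution inputs.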

To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture, in which two student models fed with inputs at different resolutions learn from each other. Building upon this, we contextualize the students' prompts with resolution-invariant attributes queried from an LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets bridge the different encoders of the two students, and on top of them we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring its practical utility.

Overview

Overview of LOREAL

Our LOREAL framework. (a) We leverage the LLM to generate several resolution-invariant attributes. (b) Self-distillation framework. We use the visual embeddings to fill the prompt attributes via meta-nets, then apply Low-Level Distillation (LLD) and High-Level Distillation (HLD) for self-distillation. Only the meta-nets are learnable, and the two illustrated meta-nets share their parameters. LR means Low-Resolution. (c) Inference stage. The model takes LR images and contextualizes prompts with the meta-nets.
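The exact forms of LLD and HLD are not given on this page, so the following is only a generic sketch of a two-level self-distillation objective between a high-resolution and a low-resolution student. Everything in it is an assumption made for illustration: the feature term is a mean squared error, the prediction term is a temperature-softened KL divergence with the high-resolution student as the target, and the names `self_distillation_loss`, `alpha`, and `beta` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def feature_mse(a, b):
    """Mean squared error between two feature vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(hr_feat, lr_feat, hr_logits, lr_logits,
                           temperature=2.0, alpha=1.0, beta=1.0):
    """Assumed two-term objective: feature alignment plus prediction alignment.

    This is NOT the paper's definition of LLD/HLD, only a common pattern.
    """
    low_level = feature_mse(hr_feat, lr_feat)
    high_level = kl_divergence(softmax(hr_logits, temperature),
                               softmax(lr_logits, temperature))
    return alpha * low_level + beta * high_level


# Toy inputs: the two students disagree slightly on features and logits.
loss = self_distillation_loss([0.2, 0.8], [0.1, 0.7], [2.0, 0.5], [1.0, 1.0])
print(f"toy loss: {loss:.4f}")
```

In a sketch like this, both terms vanish when the two students produce identical features and predictions, which is the sense in which such an objective pushes toward resolution-invariant representations.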


LOREAL Efficiency and Adaptability

Following the paper, we report an efficiency study (Table 4) on CLIP-ViT-B/16 with MaPLe and MMRL as backbones: LOREAL only adds lightweight meta-nets (+104.5K tunable parameters) while incurring minimal extra training and inference time and modest memory overhead, yet yields large gains in harmonic mean (HM) accuracy on the LR Base-to-New benchmark at low input resolutions φ ∈ {96×96, 144×144}. Training and inference times are measured per sample as in the paper.

Efficiency study. Tra./Infer. time per sample; HM on LR-B2N (16-shot, averaged over 11 datasets, 3 runs).
φ | Method | Tunable params | Tra. time | Infer. time | Infer. mem. | HM (↑)
96×96 | MaPLe | 3555K | 107 ms | 32 ms | 612 MB | 34.85
96×96 | MaPLe + LOREAL | +104.5K | +4 ms | +1 ms | +33 MB | 57.25
96×96 | MMRL | 4992K | 113 ms | 29 ms | 1236 MB | 38.90
96×96 | MMRL + LOREAL | +104.5K | +5 ms | +1 ms | +33 MB | 61.71
144×144 | MaPLe | 3555K | 126 ms | 38 ms | 844 MB | 68.80
144×144 | MaPLe + LOREAL | +104.5K | +4 ms | +2 ms | +33 MB | 73.72
144×144 | MMRL | 4992K | 145 ms | 37 ms | 1440 MB | 72.43
144×144 | MMRL + LOREAL | +104.5K | +5 ms | +1 ms | +33 MB | 76.56
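As a quick sanity check on the efficiency numbers, the Python sketch below derives the relative parameter overhead and the HM gain for each row pair. The values are copied from the table above; the variable and function names are illustrative only.

```python
# Tunable parameters added by LOREAL, in thousands (from the table).
EXTRA_PARAMS_K = 104.5

# (input side, backbone, backbone tunable params in K, HM baseline, HM with LOREAL)
EFFICIENCY = [
    (96, "MaPLe", 3555, 34.85, 57.25),
    (96, "MMRL", 4992, 38.90, 61.71),
    (144, "MaPLe", 3555, 68.80, 73.72),
    (144, "MMRL", 4992, 72.43, 76.56),
]


def summarize(row):
    """Return the relative parameter overhead (%) and HM gain (points)."""
    _side, _backbone, params_k, hm_base, hm_loreal = row
    return {
        "overhead_pct": 100.0 * EXTRA_PARAMS_K / params_k,
        "hm_gain": hm_loreal - hm_base,
    }


for row in EFFICIENCY:
    stats = summarize(row)
    print(f"{row[0]}x{row[0]} {row[1]}: "
          f"+{stats['overhead_pct']:.2f}% params, +{stats['hm_gain']:.2f} HM")
```

The added meta-nets amount to roughly 2.9% (MaPLe) and 2.1% (MMRL) of each backbone's tunable parameters, while HM rises by about 22 to 23 points at 96×96 and by 4 to 5 points at 144×144.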

For adaptability, we summarize the Low-Resolution Cross-dataset Evaluation (LR-CE) benchmark (Table 2): models are trained on ImageNet (16-shot) and evaluated on LR test sets of ImageNet plus ten target datasets at φ ∈ {96×96, 144×144, 192×192}. The numbers below are the target average (mean accuracy over the ten target datasets). We also report the Low-Resolution Domain Generalization (LR-DG) benchmark (Table 3): models are trained on ImageNet and tested on LR variants of ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R, and the numbers are the target average over those four sets. At φ = 96×96, LOREAL yields an average improvement of about +12.71 percentage points on LR-DG across backbones, as reported in the paper.

LR-CE benchmark — target average accuracy (%) over ten datasets (excluding ImageNet source).
Method | φ = 96×96 | φ = 144×144 | φ = 192×192
CoOp | 31.72 | 63.10 | 69.20
CoOp + LOREAL | 45.77 | 69.78 | 74.67
MaPLe | 36.07 | 64.02 | 69.61
MaPLe + LOREAL | 52.77 | 72.61 | 77.19
MMA | 37.27 | 65.32 | 72.28
MMA + LOREAL | 54.66 | 72.99 | 76.81
MMRL | 40.77 | 69.71 | 75.30
MMRL + LOREAL | 55.08 | 74.12 | 79.14

LR-DG benchmark — target average accuracy (%) over ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R.
Method | φ = 96×96 | φ = 144×144 | φ = 192×192
CoOp | 15.02 | 43.87 | 53.16
CoOp + LOREAL | 23.27 | 48.32 | 55.93
MaPLe | 10.56 | 41.56 | 52.59
MaPLe + LOREAL | 26.86 | 48.42 | 54.78
MMA | 14.47 | 43.69 | 53.18
MMA + LOREAL | 25.52 | 47.35 | 54.16
MMRL | 15.42 | 47.34 | 57.37
MMRL + LOREAL | 30.64 | 50.53 | 58.47
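The average LR-DG improvement quoted earlier can be reproduced from the table with a few lines of Python. The sketch below copies the target-average accuracies from the LR-DG table and averages the per-backbone gains at each resolution; the names `LR_DG` and `mean_gain` are illustrative.

```python
# Target-average accuracy (%) from the LR-DG table: (baseline, with LOREAL),
# keyed by the side length of the square input.
LR_DG = {
    96: {"CoOp": (15.02, 23.27), "MaPLe": (10.56, 26.86),
         "MMA": (14.47, 25.52), "MMRL": (15.42, 30.64)},
    144: {"CoOp": (43.87, 48.32), "MaPLe": (41.56, 48.42),
          "MMA": (43.69, 47.35), "MMRL": (47.34, 50.53)},
    192: {"CoOp": (53.16, 55.93), "MaPLe": (52.59, 54.78),
          "MMA": (53.18, 54.16), "MMRL": (57.37, 58.47)},
}


def mean_gain(results):
    """Average accuracy gain (in points) over all backbones."""
    gains = [with_loreal - baseline for baseline, with_loreal in results.values()]
    return sum(gains) / len(gains)


for side, results in LR_DG.items():
    print(f"{side}x{side}: mean gain = {mean_gain(results):.3f} points")
```

The mean gain is about 12.7 points at 96×96, consistent with the figure reported in the paper, and it shrinks to roughly 4.5 and 1.8 points at 144×144 and 192×192, so the benefit is largest where the resolution is lowest.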

BibTeX

coming soon :)