Prompt Learning (PL) has emerged as a parameter-efficient technique for adapting Vision-Language Models (VLMs) to downstream tasks. However, almost all existing PL methods are designed and evaluated on well-curated datasets, overlooking a critical post-deployment phenomenon: the intrinsic connection between input resolution and storage-memory consumption. Specifically, to satisfy the stringent storage-memory constraints of edge devices, models are often limited to low-resolution inputs (e.g., below 224×224 for CLIP-ViT-B/16) and therefore generate fewer tokens (with the position embedding resized), which poses a unique challenge to performance robustness.
To tackle this issue, we propose LOREAL, an efficient prompt self-distillation framework that learns resolution-invariant representations by excavating attribute semantics. At the heart of LOREAL is a dual-student architecture: two student models, fed with inputs at different resolutions, learn from each other. Building on this, we contextualize the students' prompts with resolution-invariant attributes queried from an LLM, then leverage cross-modality meta-nets to generate attribute semantics. These meta-nets bridge the different encoders of the two students, and on top of them we introduce Low-Level Distillation (LLD) and High-Level Distillation (HLD) to facilitate the learning of cross-resolution representations. Extensive experiments show that LOREAL significantly improves VLMs' performance and robustness under varied resolution settings, underscoring its practical utility.
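To make the link between input resolution and token count concrete, here is a minimal sketch. It assumes the standard ViT-B/16 tokenization (non-overlapping 16×16 patches plus one class token); the helper name `vit_token_count` is hypothetical and not part of this repository.

```python
def vit_token_count(resolution: int, patch_size: int = 16, class_token: bool = True) -> int:
    """Number of tokens a ViT produces for a square input of the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of patch_size")
    patches_per_side = resolution // patch_size
    # One token per patch, plus an optional class token.
    return patches_per_side ** 2 + (1 if class_token else 0)


if __name__ == "__main__":
    for res in (96, 144, 192, 224):
        print(f"{res}x{res}: {vit_token_count(res)} tokens")
```

Under these assumptions a 96×96 input yields 37 tokens versus 197 at 224×224, which is why the position embedding has to be resized for low-resolution inputs.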
Our LOREAL framework. (a): We leverage the LLM to generate several resolution-invariant attributes. (b): Self-distillation framework. We use the visual embeddings to fill the prompt attributes via meta-nets, then apply Low-Level Distillation (LLD) and High-Level Distillation (HLD) for self-distillation. Only the meta-nets are learnable, and the two illustrated meta-nets share parameters. LR stands for Low-Resolution. (c): Inference stage. The model takes LR images and contextualizes prompts with the meta-nets.
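The idea of two students learning from each other can be illustrated with a generic mutual-distillation objective. The sketch below is a symmetric KL divergence between the two students' softmax predictions. It is an assumption for illustration only: this README does not define LLD or HLD, so this is not the paper's loss, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_distillation_loss(logits_a, logits_b, temperature=1.0):
    """Symmetric KL between two students' predictions (generic sketch, not LLD/HLD)."""
    p_a = softmax(logits_a, temperature)
    p_b = softmax(logits_b, temperature)
    return 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))


if __name__ == "__main__":
    # Hypothetical logits from two students that see the same image at two resolutions.
    print(mutual_distillation_loss([2.0, 0.5, -1.0], [1.2, 0.9, -0.3]))
```

The loss is zero when both students agree and grows as their predictions diverge, which is the behaviour a cross-resolution consistency term needs.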
Following the paper, we report an efficiency study (Table 4) on CLIP-ViT-B/16 with MaPLe and MMRL as backbones. LOREAL adds only lightweight meta-nets (+104.5K tunable parameters) and incurs minimal extra training and inference time and modest memory overhead, yet yields large gains in harmonic mean (HM) accuracy on the LR Base-to-New benchmark at low input resolutions φ ∈ {96², 144²}. Rows marked "+ LOREAL" list parameter, time, and memory costs as increments over the corresponding backbone; HM values are absolute. Training and inference times are measured per sample as in the paper.
| φ | Method | Tunable params | Train time | Inference time | Inference memory | HM (↑) |
|---|---|---|---|---|---|---|
| 96×96 | MaPLe | 3555K | 107 ms | 32 ms | 612 MB | 34.85 |
| 96×96 | MaPLe + LOREAL | +104.5K | +4 ms | +1 ms | +33 MB | 57.25 |
| 96×96 | MMRL | 4992K | 113 ms | 29 ms | 1236 MB | 38.90 |
| 96×96 | MMRL + LOREAL | +104.5K | +5 ms | +1 ms | +33 MB | 61.71 |
| 144×144 | MaPLe | 3555K | 126 ms | 38 ms | 844 MB | 68.80 |
| 144×144 | MaPLe + LOREAL | +104.5K | +4 ms | +2 ms | +33 MB | 73.72 |
| 144×144 | MMRL | 4992K | 145 ms | 37 ms | 1440 MB | 72.43 |
| 144×144 | MMRL + LOREAL | +104.5K | +5 ms | +1 ms | +33 MB | 76.56 |
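
For readers unfamiliar with the HM column, the sketch below shows how a harmonic mean of two accuracies is computed. It assumes the usual base-to-new convention, where HM combines base-class and new-class accuracy; this README does not list the per-split accuracies, so the inputs in the example are hypothetical.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies; returns 0.0 if both are zero."""
    total = base_acc + new_acc
    if total == 0:
        return 0.0
    return 2.0 * base_acc * new_acc / total


if __name__ == "__main__":
    # Hypothetical base/new accuracies, not taken from the paper.
    print(f"HM = {harmonic_mean(80.0, 60.0):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on HM by excelling on base classes alone.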
For adaptability, we summarize the Low-Resolution Cross-dataset Evaluation (LR-CE) benchmark (Table 2): models are trained on ImageNet (16-shot) and evaluated on LR test sets of ImageNet plus ten target datasets at φ ∈ {96², 144², 192²}. The first table below gives the target average (mean accuracy over the ten target datasets). We also report the Low-Resolution Domain Generalization (LR-DG) benchmark (Table 3): models are trained on ImageNet and tested on LR variants of ImageNet-V2, ImageNet-S, ImageNet-A, and ImageNet-R; the second table below gives the target average over those four sets. At φ = 96², LOREAL yields an average absolute improvement of about +12.71% on LR-DG across backbones (paper).

LR-CE target average (Table 2):

| Method | φ = 96² | φ = 144² | φ = 192² |
|---|---|---|---|
| CoOp | 31.72 | 63.10 | 69.20 |
| CoOp + LOREAL | 45.77 | 69.78 | 74.67 |
| MaPLe | 36.07 | 64.02 | 69.61 |
| MaPLe + LOREAL | 52.77 | 72.61 | 77.19 |
| MMA | 37.27 | 65.32 | 72.28 |
| MMA + LOREAL | 54.66 | 72.99 | 76.81 |
| MMRL | 40.77 | 69.71 | 75.30 |
| MMRL + LOREAL | 55.08 | 74.12 | 79.14 |

LR-DG target average (Table 3):

| Method | φ = 96² | φ = 144² | φ = 192² |
|---|---|---|---|
| CoOp | 15.02 | 43.87 | 53.16 |
| CoOp + LOREAL | 23.27 | 48.32 | 55.93 |
| MaPLe | 10.56 | 41.56 | 52.59 |
| MaPLe + LOREAL | 26.86 | 48.42 | 54.78 |
| MMA | 14.47 | 43.69 | 53.18 |
| MMA + LOREAL | 25.52 | 47.35 | 54.16 |
| MMRL | 15.42 | 47.34 | 57.37 |
| MMRL + LOREAL | 30.64 | 50.53 | 58.47 |
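
As a sanity check on the reported average improvement, the following sketch recomputes the mean gain from the lowest-resolution column (96×96) of the two tables above. The numbers are copied from those tables; the dictionary and function names are hypothetical.

```python
# (baseline, with LOREAL) target-average accuracy at 96x96, copied from the tables above.
LR_CE_96 = {
    "CoOp": (31.72, 45.77),
    "MaPLe": (36.07, 52.77),
    "MMA": (37.27, 54.66),
    "MMRL": (40.77, 55.08),
}
LR_DG_96 = {
    "CoOp": (15.02, 23.27),
    "MaPLe": (10.56, 26.86),
    "MMA": (14.47, 25.52),
    "MMRL": (15.42, 30.64),
}


def mean_gain(results):
    """Average absolute accuracy gain (in points) across backbones."""
    gains = [with_loreal - baseline for baseline, with_loreal in results.values()]
    return sum(gains) / len(gains)


if __name__ == "__main__":
    print(f"LR-CE mean gain at 96x96: {mean_gain(LR_CE_96):.3f}")
    print(f"LR-DG mean gain at 96x96: {mean_gain(LR_DG_96):.3f}")
```

The LR-DG gains (8.25, 16.30, 11.05, and 15.22 points) average to 12.705, consistent with the roughly +12.71% improvement quoted from the paper.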
coming soon :)