AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model
Authors
Zhiwei Jin
Xiaohui Song
Nan Wang
Yafei Liu
Chao Li
Xin Li
Ruichen Wang
Zhihao Li
Qi Qi
Long Cheng
Dongze Hao
Quanlong Zheng
Yanhao Zhang
Haobo Ji
Jian Ma
Zhitong Zheng
Zhenyi Lin
Haolin Deng
Xin Zou
Xiaojie Yin
Ruilin Wang
Liankai Cai
Haijing Liu
Yuqing Qiu
Ke Chen
Zixian Li
Chi Xie
Huafei Li
Chenxing Li
Chuangchuang Wang
Kai Tang
Zhiguang Zhu
Wenmei Gao
Rui Wang
Jun Wu
Chao Liu
Qin Xie
Chen Chen
Haonan Lu
Abstract
In recent years, cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance, with model sizes reaching hundreds of billions of parameters; however, their resource demands far exceed the memory, power-consumption, and computing budgets of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters, built on the Qwen3 LLM and various visual encoders. We comprehensively describe the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks -- including text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks -- when compared with state-of-the-art models of similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, by combining our cache eviction algorithm, OKV, with customized speculative decoding and compression strategies, we achieve a peak decoding speedup of 6.7x, a memory reduction of up to 30.9%, and compression to 1.8 bits per weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models at https://huggingface.co/OPPOer.