MARVIS: Modality Adaptive Reasoning over VISualizations
Benjamin Feuer (Stanford University), Lennart Purucker (Prior Labs), Oussama Elachqar (Oumi), Chinmay Hegde (New York University)
Architectural Patterns & Composition
MARVIS converts latent embeddings from small specialized ML models into visual representations, then uses a VLM's spatial reasoning to make predictions on non-traditional modalities and long-tail domains. It achieves competitive accuracy without requiring raw data exposure or retraining the underlying specialized models.
Presentation
Talk
Paper Session 8: AI Systems in Practice
Friday, May 29 · 1:20 PM – 1:30 PM
Bayshore Ballroom
Poster
Friday, May 29 · 1:45 PM – 3:15 PM
Carmel / Monterey
Abstract
Predictive applications of machine learning often rely on small (sub-1B-parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance, but lack flexibility. LLMs and VLMs offer versatility, but typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a system that transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret the visualizations and use them to make predictions. MARVIS achieves competitive performance across vision, audio, biological, and tabular domains using a single 3B parameter model, yielding results that beat Gemini 2.0 by 16% on average. MARVIS drastically reduces the gap between LLM/VLM approaches and specialized domain-specific methods, without requiring any domain-specific training. Code and datasets are available at https://github.com/penfever/marvis.
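The pipeline described in the abstract (embeddings, then a visualization, then a VLM query) can be illustrated with a minimal Python sketch. This is not the MARVIS implementation: the function names (`project_2d`, `render_svg`, `build_prompt`), the random linear projection used as a stand-in for a real dimensionality reducer, the SVG output, and the prompt wording are all assumptions made for illustration. The step that sends the rendered image to a VLM is omitted.

```python
import random

# Hypothetical color palette; class names are mapped to these in sorted order.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def project_2d(embeddings, seed=0):
    """Project embeddings to 2D with a fixed random linear map.

    Assumption: a stand-in for whatever reducer a real system would use.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    axes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
    return [tuple(sum(a * x for a, x in zip(axis, emb)) for axis in axes)
            for emb in embeddings]


def render_svg(points, labels, query, size=400, pad=20):
    """Draw labeled points as colored circles and the query as a black square."""
    xs = [p[0] for p in points] + [query[0]]
    ys = [p[1] for p in points] + [query[1]]

    def scale(v, lo, hi):
        span = (hi - lo) or 1.0  # avoid division by zero for degenerate data
        return pad + (v - lo) / span * (size - 2 * pad)

    def to_px(p):
        return scale(p[0], min(xs), max(xs)), scale(p[1], min(ys), max(ys))

    colors = {c: PALETTE[i % len(PALETTE)]
              for i, c in enumerate(sorted(set(labels)))}
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" '
             f'width="{size}" height="{size}">']
    for p, lab in zip(points, labels):
        cx, cy = to_px(p)
        parts.append(f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="4" '
                     f'fill="{colors[lab]}"/>')
    qx, qy = to_px(query)
    parts.append(f'<rect x="{qx - 5:.1f}" y="{qy - 5:.1f}" '
                 f'width="10" height="10" fill="black"/>')
    parts.append("</svg>")
    return "\n".join(parts), colors


def build_prompt(colors):
    """Compose the text that would accompany the image (wording is assumed)."""
    legend = ", ".join(f"{name} = {code}" for name, code in colors.items())
    return ("The plot shows embedded training points colored by class "
            f"({legend}). The black square is the query point. "
            "Which class does the query most likely belong to?")


# Toy embeddings standing in for the output of a small specialized model.
train = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [1.0, 0.9, 1.0], [0.9, 1.0, 1.1]]
labels = ["cat", "cat", "dog", "dog"]
query = [0.95, 0.95, 1.0]

# Project training points and the query together so they share the same axes.
*train_2d, query_2d = project_2d(train + [query])
svg, colors = render_svg(train_2d, labels, query_2d)
prompt = build_prompt(colors)
```

In this sketch only the rendered plot and the prompt would reach the VLM, which mirrors the summary's point that predictions are made without exposing raw data or retraining the embedding model.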