Concept Attribution and Dual Explainability in Vision-Language Models

EasyChair Preprint 16013, 8 pages. Date: March 23, 2026

Abstract

In this paper, we investigate methods for applying explainability to Vision-Language Models (VLMs). The main challenge is the absence of a common representation between text and image, which makes it necessary to define a way of aligning the two information streams. Our research focuses on the Contrastive Language-Image Pretraining (CLIP) model, whose architecture is based on two independent Transformer-based encoders. We propose a Concept Attribution method that addresses the problem of fusing visual encoder signals with textual gradients. The method is further optimized by hybridizing the model with a Large Language Model (LLM), which provides a richer linguistic context and, implicitly, more precise explanations for multimodal reasoning. The fusion process is controlled through parameters that balance the textual and visual influence on the explanations. Furthermore, for an objective validation of the results, we integrate Faithfulness Metrics, responsible both for analyzing the visual response and for refining its visual representation through heatmaps. In addition, we demonstrate the utility of the method by applying it in a complex Dual Explainability system, which validates the coherence between the prompt given by the user and the model's response.

Keyphrases: CLIP, Explainability, MLLM, VLM, transformers
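The abstract describes fusing visual encoder signals with textual gradients under a balance parameter. The sketch below illustrates one plausible reading of that fusion: gradient-times-input attributions computed per image patch and per text token from a CLIP-style similarity score, combined as a convex mixture. The toy encoders, the parameter name alpha, and all variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of cross-modal attribution fusion, assuming the balance
# parameter acts as a convex weight between visual and textual gradient
# signals. Encoders are toy stand-ins for CLIP's two Transformers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

D = 64                      # shared embedding dimension
P, T = 49, 12               # number of image patches / text tokens
vision_proj = torch.nn.Linear(D, D)   # stand-in for the visual encoder head
text_proj = torch.nn.Linear(D, D)     # stand-in for the textual encoder head

patch_feats = torch.randn(P, D, requires_grad=True)   # visual encoder signals
token_feats = torch.randn(T, D, requires_grad=True)   # textual token features

# Pooled embeddings and a CLIP-style cosine similarity score.
img_emb = F.normalize(vision_proj(patch_feats).mean(dim=0), dim=-1)
txt_emb = F.normalize(text_proj(token_feats).mean(dim=0), dim=-1)
score = img_emb @ txt_emb

# Gradients of the similarity w.r.t. each modality's intermediate features.
g_patch, g_token = torch.autograd.grad(score, (patch_feats, token_feats))

# Per-unit saliency: gradient x input, reduced over the feature dimension.
visual_attr = (g_patch * patch_feats).sum(dim=-1)      # one value per patch
textual_attr = (g_token * token_feats).sum(dim=-1)     # one value per token

# Balance parameter: alpha -> 1 emphasizes the visual stream,
# alpha -> 0 emphasizes the textual stream.
alpha = 0.6
fused = torch.cat([alpha * visual_attr, (1 - alpha) * textual_attr])
fused = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)  # normalize for heatmaps
print(fused.shape)  # 49 patch attributions + 12 token attributions
```

With a real CLIP model, `patch_feats` and `token_feats` would be intermediate activations captured via hooks, and the normalized patch scores would be reshaped to the patch grid to render the heatmaps mentioned in the abstract.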
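The abstract also mentions Faithfulness Metrics used to validate the explanations. A common way to operationalize faithfulness is a deletion test: remove the most highly attributed patches and check how quickly the image-text score drops. The sketch below assumes that scheme; the function name, masking strategy, and step count are illustrative choices, not the paper's definition.

```python
# Minimal sketch of a deletion-style faithfulness check: a faithful
# attribution should make the image-text score fall quickly as the
# top-attributed patches are removed.
import torch
import torch.nn.functional as F

def deletion_faithfulness(patch_feats, attributions, score_fn, steps=5, mask_value=0.0):
    """Return the image-text score after deleting the top patches at each step."""
    order = torch.argsort(attributions, descending=True)    # most important first
    scores = [score_fn(patch_feats).item()]                  # score with nothing removed
    feats = patch_feats.clone()
    per_step = max(1, len(order) // steps)
    for i in range(steps):
        deleted = order[: (i + 1) * per_step]
        feats[deleted] = mask_value                           # "delete" the patches
        scores.append(score_fn(feats).item())
    return scores

# Toy usage with a cosine-similarity score against a fixed text embedding.
D, P = 64, 49
txt_emb = F.normalize(torch.randn(D), dim=-1)
patches = torch.randn(P, D)
score_fn = lambda f: (F.normalize(f.mean(dim=0), dim=-1) @ txt_emb)
attr = torch.rand(P)   # per-patch attribution, e.g. from the fusion sketch above
print(deletion_faithfulness(patches, attr, score_fn))
```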

