Adversarial Attacks in Multimodal Systems: A Practitioner's Survey
Journal:
arXiv
Published Date:
May 6, 2025
Abstract
The introduction of multimodal models is a huge step forward in Artificial
Intelligence. A single model is trained to understand multiple modalities:
text, image, video, and audio. Open-source multimodal models have made these
breakthroughs more accessible. However, considering the vast landscape of
adversarial attacks across these modalities, these models also inherit
vulnerabilities of all the modalities, and ultimately, the adversarial threat
amplifies. While broad research is available on possible attacks within or
across these modalities, a practitioner-focused view that outlines attack types
remains absent in the multimodal world. As more Machine Learning Practitioners
adopt, fine-tune, and deploy open-source models in real-world applications,
it's crucial that they can view the threat landscape and take the preventive
actions necessary. This paper addresses the gap by surveying adversarial
attacks targeting all four modalities: text, image, video, and audio. This
survey provides a view of the adversarial attack landscape and presents how
multimodal adversarial threats have evolved. To the best of our knowledge, this
survey is the first comprehensive summarization of the threat landscape in the
multimodal world.