Despite the proven efficacy of large language models (LLMs) such as GPT in numerous applications, concerns have emerged about their exploitation for creating phishing emails or mounting network intrusions, both of which have been shown to be detrimental. The multimodal capabilities of large vision-language models (LVLMs) enable them to grasp visual commonsense knowledge. This study investigates the feasibility of using two widely available commercial LVLMs, LLaVA and multimodal GPT-4, to bypass CAPTCHAs or carry out bot-driven fraud through malicious prompts. The study finds that these LVLMs can interpret and respond to the visual information presented in image-, puzzle-, and text-based CAPTCHAs and reCAPTCHAs, thereby potentially circumventing this challenge-response authentication security measure. This capability suggests that such systems could facilitate unauthorized remote access to secured accounts. Remarkably, these attacks can be executed with the standard, unaltered versions of the LVLMs, eliminating the need for prior adversarial techniques such as jailbreaking.
Details
Title: Seeing the Unseen
Publication Details: 2024 IEEE International Conference on Big Data (BigData), pp. 5664–5673
Resource Type: Conference proceeding
Conference: IEEE International Conference on Big Data (Washington, DC, USA, December 15–18, 2024)