This work introduces a novel approach to strengthening cybersecurity systems against spam email-based cyberattacks. The proposed technique addresses the challenge of training Machine Learning (ML) models with limited data samples by leveraging Bidirectional Encoder Representations from Transformers (BERT) for contextualized embeddings. Unlike traditional embedding methods, BERT provides a nuanced representation of smaller datasets, enabling more effective ML model training. The methodology uses several pretrained BERT models to generate contextualized embeddings from the data samples, and these embeddings are then fed to various ML algorithms for training. The approach demonstrates that, even with scarce data, BERT embeddings significantly enhance model performance compared to conventional embedding approaches such as Word2Vec. The technique is especially advantageous when only a few high-quality instances are available. The proposed work outperforms traditional techniques for mitigating phishing attacks with few data samples, achieving an accuracy of 99.25% when the dataset is embedded with multilingual BERT (M-BERT).
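A minimal sketch of the pipeline described above, assuming the Hugging Face transformers library and scikit-learn; the model name, mean-pooling strategy, classifier choice, and the load_spam_dataset helper are illustrative assumptions, not the authors' exact configuration:

    # Sketch (not the authors' code): embed emails with a pretrained
    # multilingual BERT model, then train a conventional ML classifier
    # on the resulting contextualized embeddings.
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
    bert.eval()

    def embed(texts, batch_size=16):
        """Return one mean-pooled BERT embedding per input text."""
        vectors = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = tokenizer(texts[i:i + batch_size], padding=True,
                                  truncation=True, max_length=256,
                                  return_tensors="pt")
                hidden = bert(**batch).last_hidden_state      # (B, T, 768)
                mask = batch["attention_mask"].unsqueeze(-1)   # (B, T, 1)
                pooled = (hidden * mask).sum(1) / mask.sum(1)  # mean over tokens
                vectors.append(pooled)
        return torch.cat(vectors).numpy()

    # emails, labels = load_spam_dataset()  # hypothetical small labeled sample
    # X = embed(emails)
    # X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2)
    # clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # print("accuracy:", clf.score(X_te, y_te))

The ML classifier here stands in for the "various ML algorithms" mentioned in the abstract; any scikit-learn estimator could be substituted once the BERT embeddings are computed.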
Details
Title
Large Language Model can Reduce the Necessity of Using Large Data Samples for Training Models
Publication Details
Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI), pp. 988-991
Resource Type
Conference proceeding
Conference
IEEE Conference on Artificial Intelligence (CAI) (Santa Clara, California, USA, 05/05/2025–05/07/2025)
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Grant note
1946442, 2433800 / National Institutes of Health (10.13039/100000002)
National Science Foundation (10.13039/100000001)