Cyber-physical systems (CPS) underpin critical domains from hospitals and aircraft to energy grids and factories. When software weaknesses leak into the physical world, failures cascade into safety, economic, and national-security consequences. However, existing vulnerability classification datasets suffer from two critical limitations: Common Weakness Enumeration (CWE) labels that are missing or coarse-grained, and templated phrasing such as "improper input validation" or "insufficient authorization" that leaks across training and test splits and inflates accuracy. We develop a contamination-aware pipeline that makes these flaws measurable and correctable. Our procedure repairs CWE gaps with auditable, hierarchy-aware heuristics, computes per-record contamination scores using near-duplicate detection and boilerplate lexicons, and aggregates weaknesses into the Seven Pernicious Kingdoms (SPK) for stability and interpretability. On a fixed stratified split, a Term Frequency-Inverse Document Frequency (TF-IDF) + Random Forest baseline reaches 76% accuracy, surpassing a frozen five-encoder ensemble. Disabling contamination weighting raises the ensemble to 74%, demonstrating that evaluation governance, not just model architecture, determines outcomes. Recent turbulence in the National Vulnerability Database (NVD), including program transitions and backlogs in 2024–2025, highlights why transparent, reproducible governance is essential. Our work provides an auditable methodology: when contamination is exposed and controlled, classical baselines remain consistently competitive, and governance choices become the decisive variable.
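As an illustration only, a per-record contamination score of the kind the abstract describes could be sketched as below. This is not the authors' implementation: the function names, the 3-word shingle size, the Jaccard similarity measure, the two-phrase lexicon (taken from the examples quoted in the abstract), and the equal weighting `alpha=0.5` are all assumptions chosen for the sketch.

```python
import re

# Hypothetical boilerplate lexicon; the two phrases are the examples
# quoted in the abstract, not the paper's actual lexicon.
BOILERPLATE = ("improper input validation", "insufficient authorization")


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased description."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def contamination_score(record, train_records, alpha=0.5):
    """Blend near-duplicate similarity with boilerplate-phrase presence.

    Both terms lie in [0, 1]; alpha is an assumed mixing weight.
    """
    rec = shingles(record)
    # Near-duplicate term: highest shingle overlap with any training record.
    near_dup = max((jaccard(rec, shingles(t)) for t in train_records), default=0.0)
    # Boilerplate term: fraction of lexicon phrases present in the record.
    lowered = record.lower()
    boiler = sum(p in lowered for p in BOILERPLATE) / len(BOILERPLATE)
    return alpha * near_dup + (1 - alpha) * boiler
```

A score like this could then be used to down-weight or flag test records during evaluation; how the paper actually applies its contamination weighting is not specified in the abstract.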
Details
Title
Contamination-Aware, Taxonomy-Driven Vulnerability Classification for CPS
Publication Details
IEEE International Conference on Big Data (2025), pp. 4298–4304
Resource Type
Conference proceeding
Conference
IEEE International Conference on Big Data (BigData) (Macau, China, 12/08/2025–12/11/2025)