A Deep Learning–Based Project Inspired by Russell et al. (2018)
Modern software systems are plagued by hidden vulnerabilities such as buffer overflows, null pointer dereferences, and improper input validation.
This project implements a deep representation learning approach to detect vulnerabilities in C/C++ functions by directly interpreting lexed source code, moving beyond rule-based static analyzers.
Data Collection
- C/C++ functions from GitHub, Debian packages, and the SATE IV Juliet Test Suite
- 12M+ functions collected, then curated and labeled using static analyzers
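
The curation and labeling step could look roughly like the sketch below. It is illustrative only: the finding format, the `VULNERABLE_CWES` set, and the whitespace-based deduplication are assumptions made for this example, not the project's actual pipeline.

```python
# Hypothetical set of CWE identifiers treated as "vulnerable" for labeling.
# The real mapping from analyzer findings to labels is not given in this README.
VULNERABLE_CWES = {"CWE-119", "CWE-120", "CWE-121", "CWE-469", "CWE-476"}


def label_function(findings):
    """Return 1 if any analyzer finding maps to a CWE of interest, else 0.

    `findings` uses an assumed format: a list of dicts such as
    {"tool": "analyzer_a", "cwe": "CWE-120"}.
    """
    return int(any(f.get("cwe") in VULNERABLE_CWES for f in findings))


def build_labeled_dataset(functions):
    """Drop duplicate functions and attach a binary label to each survivor."""
    seen = set()
    dataset = []
    for func in functions:
        # Collapse whitespace so trivially reformatted copies count as duplicates.
        key = " ".join(func["source"].split())
        if key in seen:
            continue
        seen.add(key)
        dataset.append(
            {"source": func["source"], "label": label_function(func["findings"])}
        )
    return dataset
```

Deduplicating before the train/test split matters because near-identical functions on both sides of the split would inflate evaluation scores.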
Lexical Representation
- Custom C/C++ lexer → reduces vocabulary to ~156 tokens
- Strips comments, normalizes identifiers, standardizes types
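
A minimal sketch of this kind of lexing is shown below. It is not the project's lexer: the keyword list is a small illustrative subset, and collapsing every identifier to a single `ID` placeholder (and literals to `STR`, `CHR`, `NUM`) is a simplifying assumption chosen to show how the vocabulary stays small.

```python
import re

# Small, illustrative subset of C/C++ keywords; a real lexer would cover many more.
KEYWORDS = {
    "if", "else", "for", "while", "return", "int", "char", "void", "float",
    "double", "long", "short", "unsigned", "signed", "struct", "sizeof",
    "const", "static", "break", "continue", "switch", "case", "default",
}

# Strings are tried before comments so that "//" inside a literal is not
# mistaken for a comment. Preprocessor '#' and whitespace are simply skipped.
TOKEN_RE = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*")
  | (?P<char>'(?:\\.|[^'\\])*')
  | (?P<comment>//[^\n]*|/\*[\s\S]*?\*/)
  | (?P<number>\d+(?:\.\d+)?\w*)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<op>->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\||[-+*/%=<>!&|^~?:;,.(){}\[\]])
    """,
    re.VERBOSE,
)


def lex(source):
    """Turn C/C++ source text into a list of normalized tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        text = match.group()
        if kind == "comment":
            continue  # comments carry no signal for the model
        if kind == "string":
            tokens.append("STR")
        elif kind == "char":
            tokens.append("CHR")
        elif kind == "number":
            tokens.append("NUM")
        elif kind == "ident":
            # Keywords are kept verbatim; all other names become a placeholder.
            tokens.append(text if text in KEYWORDS else "ID")
        else:
            tokens.append(text)
    return tokens
```

For example, `lex('char buf[8]; // tmp')` yields `['char', 'ID', '[', 'NUM', ']', ';']`. A fuller lexer would likely keep common library calls such as `strcpy` as distinct tokens rather than mapping them to `ID`, since they are relevant to vulnerabilities.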
Modeling
- Deep neural network trained on function-level lexed code
- Learns semantic representations of vulnerable vs. safe code
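
Before a network can embed lexed tokens, each function has to become a fixed-length sequence of integer IDs. The sketch below shows one common way to do that; the `<pad>` and `<unk>` entries and the `max_len` parameter are assumptions for illustration, not details taken from this project.

```python
PAD, UNK = "<pad>", "<unk>"


def build_vocab(token_sequences):
    """Assign each distinct token an integer ID, reserving 0 and 1."""
    vocab = {PAD: 0, UNK: 1}
    for seq in token_sequences:
        for tok in seq:
            # setdefault leaves existing entries untouched.
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to IDs, truncating or right-padding to exactly max_len."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

The resulting ID sequences are what a token-embedding layer in PyTorch or TensorFlow would take as input, with the padding ID typically masked or mapped to a zero vector.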
Evaluation
- Benchmarked on NIST SATE IV and real-world open-source projects
- Detects multiple CWE categories (buffer overflows, null pointer errors, input validation flaws, etc.)
- Deep learning can learn vulnerability signatures directly from lexed source code, without hand-written rules.
- The approach outperformed traditional static analysis and shallow ML baselines.
- It remained effective across diverse datasets, generalizing to unseen code.
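
Comparisons like these depend on the metric used. The sketch below computes a few standard binary-classification metrics from predicted and true labels; the choice of precision, recall, F1, and the Matthews correlation coefficient (MCC) is an assumption for illustration, since this README does not state which metrics were reported.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # MCC stays informative when vulnerable functions are a small minority.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}
```

Plain accuracy is deliberately left out: when most functions are safe, a model that flags nothing can still score high accuracy while detecting no vulnerabilities.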
| CWE Category | Frequency in Dataset | Detection Capability |
|---|---|---|
| Buffer Overflow (CWE-120/121) | 38.2% | High ✅ |
| Memory Bound Errors (CWE-119) | 18.9% | High ✅ |
| NULL Pointer Dereference (CWE-476) | 9.5% | High ✅ |
| Pointer Misuse (CWE-469) | 2.0% | Moderate |
| Input Validation / Misc. | 31.4% | Variable |
- Languages: Python, C/C++
- Frameworks: PyTorch / TensorFlow
- Tools: Custom lexer, static analyzers, NIST SATE IV dataset
- Concepts: Deep Representation Learning, Token Embeddings, Supervised Classification
- Rebecca L. Russell, Louis Kim, Lei H. Hamilton, Tomo Lazovich, Jacob A. Harer, Onur Ozdemir, Paul M. Ellingwood, Marc W. McConley. Automated Vulnerability Detection in Source Code Using Deep Representation Learning. arXiv:1807.04320, 2018.
This project was built as part of my academic research in secure software engineering. It was inspired by Draper’s work on large-scale ML for vulnerability detection.