xsenyaaax/CodeBERT-nano

---
title: Code Snippet Prediction
emoji: 🧠
colorFrom: indigo
colorTo: pink
sdk: docker
app_file: Dockerfile
pinned: false
short_description: Code Snippet Language Prediction using Transformers
---

Transformer-Based Code Snippet Classifier

This project implements a transformer-based model for classifying code snippets, using a custom tokenizer and PyTorch. It includes training, evaluation, and visualization of attention heads for interpretability.

Overview

  • Custom tokenizer training using the Hugging Face tokenizers library (see the sketch after this list)
  • Transformer model implemented from scratch in PyTorch, following Andrej Karpathy's videos <3
  • Demo built with the Gradio library, mounted on a FastAPI app, with attention visualization via BertViz
  • The implemented models are described in the Models section.
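
As a rough illustration of the first bullet, here is a minimal sketch of training a custom tokenizer with the Hugging Face tokenizers library. The tokenizer model (BPE), vocabulary size, special tokens, and file path are assumptions for the example, not the exact settings used in this repository.

```python
# Minimal sketch: train a custom tokenizer with the Hugging Face `tokenizers` library.
# The BPE model, vocab size, special tokens, and file path below are assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # keeps readable word-like tokens

trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[CLS]", "[PAD]"],
)

# Train from a plain-text file of code snippets (hypothetical path).
tokenizer.train(files=["datasets/snippets.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

ids = tokenizer.encode("def add(a, b): return a + b").ids
```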

Example

Live demo is available on HuggingFace Spaces here or via a direct URL (the direct URL is better for visualizing attention, because HuggingFace clips the available space on the web page).

Below is a screenshot of the demo app in action: app-example

Attention visualization page (sorry for the page looking so bad, I am not that good with CSS :() attention example
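
For reference, here is a minimal sketch of how a Gradio demo can be mounted on a FastAPI app, as mentioned in the overview. The prediction function, interface, and path are illustrative placeholders rather than the repository's actual code.

```python
# Minimal sketch: mount a Gradio demo on a FastAPI app.
# The prediction function and interface below are placeholders.
import gradio as gr
from fastapi import FastAPI

def predict_language(snippet: str) -> str:
    # Placeholder: the real app runs the trained transformer here.
    return "python"

demo = gr.Interface(fn=predict_language, inputs="text", outputs="text")

app = FastAPI()
# Mount the Gradio UI under the FastAPI application.
app = gr.mount_gradio_app(app, demo, path="/")
# Run with: uvicorn app:app
```

Running this with uvicorn serves the FastAPI routes and the Gradio UI from a single process.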

Dataset

The dataset is publicly available here. It contains labeled code snippets used for classification tasks and is also included in the /datasets folder.

All available languages to predict from (see the loading sketch after this list):

  • c
  • c++
  • css
  • html
  • java
  • javascript
  • python
  • r
  • sqlite
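
Given the languages above, here is a minimal sketch of loading the labeled snippets and building a label-to-id mapping. The CSV file name and column names are assumptions; the real files in /datasets may be laid out differently.

```python
# Minimal sketch: load labeled snippets and map language names to class ids.
# File name and column names are assumptions, not the repository's actual layout.
import pandas as pd

LANGUAGES = ["c", "c++", "css", "html", "java",
             "javascript", "python", "r", "sqlite"]
label2id = {lang: i for i, lang in enumerate(LANGUAGES)}

df = pd.read_csv("datasets/code_snippets.csv")   # hypothetical file name
df["label_id"] = df["language"].map(label2id)    # assumed column name
print(df[["language", "label_id"]].head())
```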

Notes

  • This project was done for experimentation and learning. It does not use any large pretrained models.
  • The tokenizer retains readable words to make attention visualization more interpretable.
  • The goal was not to build a very deep or large model but to experiment, so training was done on a CPU.

Models

Model Version and Progress

Throughout the project, multiple versions of the model were developed, each adding complexity and performance improvements. All models predict from the [CLS] token, which is inserted at the first position of the code snippet during tokenization (see the sketch below).
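
A minimal sketch of what "predict from the [CLS] token" means in practice: the classifier head only reads the hidden state at position 0. Shapes and dimensions here are illustrative, not the repository's actual values.

```python
# Minimal sketch: classification from the [CLS] position (index 0).
# Batch size, sequence length, embedding size, and class count are illustrative.
import torch
import torch.nn as nn

hidden = torch.randn(1, 128, 64)   # (batch, seq_len, embed_dim); token 0 is [CLS]
head = nn.Linear(64, 9)            # 9 target languages
logits = head(hidden[:, 0, :])     # predict from the [CLS] representation only
pred = logits.argmax(dim=-1)
```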

Every model implementation can be found in the notebooks/transformer.ipynb Jupyter notebook, with an analysis of the training at the end.

Model Comparison Table

| Version | Description | Accuracy on Validation Data | Complexity | Throughput |
|---|---|---|---|---|
| Basic Model V1 | Embedding + linear classifier | ~11.0% | Low | ~105k tokens/sec |
| Position Embedding Model V2 | Embedding + positional encoding + linear | ~11.0% | Low-Medium | ~100k tokens/sec |
| Attention-Based Model | Embedding + positional encoding + 1 self-attention head + 1 linear | 92% | Medium | ~65k tokens/sec |
| Multi-Head Attention-Based Model | Embedding + positional encoding + multiple self-attention heads + 1 linear | 91% | Medium-High | ~40k tokens/sec |
| Full Transformer Encoder | Multi-layer encoder with attention, FFN, and residuals | 87.6% | High | ~20k tokens/sec |
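
For concreteness, here is a minimal sketch of the "Attention-Based Model" row (embedding + positional encoding + one self-attention head + one linear classifier). Vocabulary size, embedding dimension, and block size are assumptions, and the attention is non-causal since the task is classification rather than generation.

```python
# Minimal sketch of the single-head attention classifier described in the table.
# Vocab size, embedding dim, block size, and class count are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadClassifier(nn.Module):
    def __init__(self, vocab_size=8000, n_embd=64, block_size=256, n_classes=9):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        # one self-attention head
        self.key = nn.Linear(n_embd, n_embd, bias=False)
        self.query = nn.Linear(n_embd, n_embd, bias=False)
        self.value = nn.Linear(n_embd, n_embd, bias=False)
        self.head = nn.Linear(n_embd, n_classes)

    def forward(self, idx):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        k, q, v = self.key(x), self.query(x), self.value(x)
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = F.softmax(att, dim=-1)      # full (non-causal) attention over the snippet
        x = att @ v                       # (B, T, n_embd)
        return self.head(x[:, 0, :])      # classify from the [CLS] position

logits = SingleHeadClassifier()(torch.randint(0, 8000, (4, 32)))
```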
