Lightweight, explainable experimentation on social bot detection using a curated slice of the Cresci 2017 datasets.
TrustNet is a lightweight experimentation space for detecting automated (bot / spam) accounts on Twitter/X using a carefully curated small subset of the wellβknown Cresci 2017 datasets. This repository focuses on approach clarity, feature engineering transparency, and reproducible evaluation.
Modern social platforms face coordinated inauthentic behavior: spambots, follow churners, fake amplifiers. Research datasets (like Cresci 2017) are large; for teaching, prototyping, or rapid iteration a smaller, well-structured sample accelerates idea > model cycles. TrustNet aims to:
- Provide a concise, explainable baseline pipeline.
- Showcase modular feature extraction (profile, content, network, temporal).
- Encourage experimentation with classical ML before jumping to deep architectures.
We extract a reduced slice from the original Cresci 2017 Twitter bot collections:
| File | Purpose | Rows | Label Distribution* |
|---|---|---|---|
Datasets/genuine_users.csv |
Human / legitimate accounts | (small subset) | label = genuine |
Datasets/spam_user.csv |
Spam / bot accounts (selected) | (small subset) | label = spam |
Exact counts intentionally minimized for lightweight experimentation; expand by substituting full Cresci 2017 data.
The original dataset belongs to the authors of the Cresci 2017 paper. Use responsibly; comply with Twitter's TOS and data redistribution norms. This repository only contains derived / subset CSVs for educational purposes.
If you build upon this, cite the Cresci paper:
Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., & Tesconi, M. (2017). The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. Proceedings of the 26th International Conference on World Wide Web Companion.
Although the tiny subset may include fewer, typical Cresci-style fields you can expect / extend:
| Column | Description | Type | Example |
|---|---|---|---|
user_id |
Unique user identifier | string/int | 1234567890 |
screen_name |
Handle | string | someuser |
name_len |
Length of display name | int | 12 |
description_len |
Bio length (chars) | int | 87 |
followers_count |
Followers | int | 240 |
friends_count |
Following | int | 310 |
statuses_count |
Total tweets | int | 1543 |
favourites_count |
Likes given | int | 420 |
listed_count |
Public lists membership | int | 3 |
default_profile_image |
Avatar default? | bool/int | 0 |
created_at |
Account creation time | datetime | 2016-02-11 |
avg_tweets_per_day |
Temporal normalized activity | float | 5.3 |
spam_ratio |
Heuristic: spam-like tokens / total | float | 0.18 |
has_url |
Profile URL flag | int | 1 |
label |
Target class | categorical | genuine / spam |
You can generate additional engineered features (see next section) from raw tweet timelines if you integrate more data later.
| Category | Examples | Rationale |
|---|---|---|
| Profile Metadata | age days, profile image flag, bio length, name entropy | Bots often reuse templates, have lower entropy |
| Activity / Temporal | tweets per day, burstiness, inter-tweet std | Automation yields uniform or extreme bursts |
| Network | followers/friends ratio, reciprocal rate | Spam accounts follow aggressively to gain traction |
| Content (Optional) | URL proportion, hashtag density, lexicon similarity | Promotional / malicious payload density |
| Linguistic (Optional) | average token length, emoji rate | Synthetic text differs in distribution |
Keep features explainable; avoid leaking future knowledge (no post-classification timeline stats).
Baseline recommendation:
- Clean & impute: handle nulls (median for numeric, mode for binary).
- Scale numeric features (StandardScaler or RobustScaler).
- Train several classical classifiers: Logistic Regression, Random Forest, Gradient Boosting, XGBoost / LightGBM (optional), SVM (RBF for non-linear patterns).
- Perform stratified 5-fold cross-validation (avoid accuracy obsession; report precision/recall/F1, macro + per-class; add ROC-AUC).
- Calibrate probabilities if deployment requires risk ranking (e.g.,
CalibratedClassifierCV).
| Metric | Why |
|---|---|
| Recall (spam) | Catch more malicious accounts |
| Precision (spam) | Limit false accusations |
| F1 (macro) | Balanced view across skew |
| ROC-AUC | Ranking quality |
| PR-AUC (spam) | Robust under class imbalance |
Extract features β Evaluate β Inspect misclassifications β Add/Refine features β Re-train.
git clone https://github.com/ashu273k/TrustNet.git
cd "TrustNet dataset"python -m venv .venv
source .venv/bin/activate # Windows (WSL) / Linuxpip install pandas numpy scikit-learn matplotlib seaborn jupyterimport pandas as pd
genuine = pd.read_csv('Datasets/genuine_users.csv')
spam = pd.read_csv('Datasets/spam_user.csv')
df = pd.concat([genuine.assign(label='genuine'), spam.assign(label='spam')])
print(df.head())from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
feature_cols = [c for c in df.columns if c not in ['label','user_id','screen_name','created_at']]
X = df[feature_cols]
y = df['label']
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', LogisticRegression(max_iter=1000))
])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
pipe.fit(X_train, y_train)
from sklearn.metrics import f1_score
print('Test F1 (spam class):', f1_score(y_test, pipe.predict(X_test), pos_label='spam'))| Plot | Insight |
|---|---|
| Followers vs Friends scatter | Aggressive following patterns |
| Distribution of account age | Newborn clusters of bots |
| Boxplot of tweets/day by class | Activity intensity |
| ROC curve ensemble | Comparative classifier tradeoffs |
- Add notebook with full feature pipeline
- Integrate tweet-level content features
- Provide benchmarking script (multi-model comparison)
- Add model persistence + inference script
- Experiment with anomaly detection (e.g., Isolation Forest)
Contributions welcome! Feel free to:
- Open an issue (bug / idea / enhancement)
- Fork & branch (
feat/<short-description>) - Submit PR with concise description & before/after metrics
Keep code modular, add comments for any non-obvious feature transformations, and prefer deterministic seeds where possible.
Released under the MIT License. See LICENSE (add one if missing) for details. Original dataset governed by its own termsβrespect them.
Cresci 2017 authors for releasing foundational bot datasets. The open-source ML community for tooling. You for exploring ethical automation detection.
| Idea | Description |
|---|---|
| Semi-supervised refinement | Use confident predictions to pseudo-label unlabeled accounts |
| Graph features | Build small ego networks and compute clustering coeff, assortativity |
| Temporal signatures | FFT / spectral density of posting intervals |
| Text embeddings | Add sentence-transformer vectors; beware leakage / overfitting |
| Model stacking | Blend linear + tree + anomaly detectors |
Made with curiosity and caution β automate detection, not judgment.