A full-stack machine learning platform for detecting outliers in survey and tabular data using autoencoders. Upload your CSV data through a web interface and receive outlier analysis powered by Vertex AI.
The pipeline consists of the following steps:
- `data_loader` - loads the dataset and returns a processed dataframe that can then be encoded/decoded
- `autoencoder` - sets up the architecture of the autoencoder
- `hyperparameter_tuning` - finds the optimal parameters for the neural network
- `measure_accuracy` - measures how well we can reconstruct each attribute of the dataset
- `find_outliers` - examines which rows are most difficult to reconstruct (likely inattentive/mischievous participants)
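To make the last two steps concrete, here is a toy, self-contained sketch of reconstruction-error scoring: an autoencoder is trained to reproduce its input, and the rows it reconstructs worst become the outlier candidates. The layer sizes and data below are made up for illustration and are not the project's actual model.

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(2)
x = rng.random((500, 12)).astype("float32")  # stand-in for encoded survey rows

# tiny illustrative autoencoder (made-up layer sizes)
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(4, activation="relu", input_shape=(12,)),  # encoder
    tf.keras.layers.Dense(12, activation="sigmoid"),                 # decoder
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=5, verbose=0)

# per-row reconstruction error; the largest values flag outlier candidates
errors = np.mean((x - autoencoder.predict(x, verbose=0)) ** 2, axis=1)
print(np.argsort(errors)[::-1][:20])  # indices of the 20 worst-reconstructed rows
```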
- Web-Based Upload: Drag-and-drop CSV file upload with instant preview
- Automatic Column Filtering: "Rule of 9" filter - automatically keeps only columns with 1-9 unique values (categorical variables suitable for autoencoder analysis); see the sketch after this list
- Cloud ML Pipeline: Training runs on Google Cloud Vertex AI for scalable processing
- Outlier Detection: Identifies rows with highest reconstruction error
- Results Visualization: View outlier scores, dropped columns, and detailed analysis
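A minimal pandas sketch of how the "Rule of 9" filter could be implemented; the function name and input file are illustrative, not the actual implementation:

```python
import pandas as pd

def rule_of_9_columns(df: pd.DataFrame) -> list[str]:
    """Keep columns with 1-9 unique values (treated as categorical)."""
    return [c for c in df.columns if 1 <= df[c].nunique(dropna=True) <= 9]

df = pd.read_csv("survey.csv")  # hypothetical uploaded file
kept = rule_of_9_columns(df)
dropped = [c for c in df.columns if c not in kept]
print(f"kept {len(kept)} columns, dropped {len(dropped)}: {dropped}")
```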
| Layer | Technologies |
|---|---|
| Frontend | React 18, Vite, TypeScript, Tailwind CSS, shadcn/ui |
| API Server | Express.js, TypeScript, Multer |
| ML Pipeline | Python, TensorFlow/Keras, pandas, NumPy |
| Cloud Services | Google Cloud Storage, Pub/Sub, Firestore, Vertex AI |
React UI → Express API → GCS → Pub/Sub → Worker → Vertex AI → Firestore → Results
Data Flow:
1. User uploads a CSV through the React frontend
2. The Express server generates a signed URL; the file uploads directly to GCS
3. Job metadata is saved to Firestore, and a message is published to Pub/Sub
4. The worker receives the message and triggers a Vertex AI CustomContainerTrainingJob
5. The Vertex AI container downloads the data, trains the autoencoder, and computes outlier scores
6. Results are written to Firestore
7. The frontend polls for completion and displays the results
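A hedged sketch of what the worker's dispatch loop (step 4) could look like, assuming a subscription named `job-upload-topic-sub`, a JSON job payload, and a trainer image URI; all three names are hypothetical, and `worker.py` is the source of truth:

```python
import json

from google.cloud import aiplatform, pubsub_v1

aiplatform.init(project="your-project-id", location="us-central1")

def handle_message(message: pubsub_v1.subscriber.message.Message) -> None:
    job = json.loads(message.data)  # hypothetical payload: {"job_id": ...}
    training_job = aiplatform.CustomContainerTrainingJob(
        display_name=f"outlier-job-{job['job_id']}",
        container_uri="gcr.io/your-project-id/outlier-trainer:latest",  # hypothetical image
    )
    # The container handles steps 5-6: download data, train, write results.
    training_job.run(sync=False)
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path("your-project-id", "job-upload-topic-sub")
# Block forever, dispatching one Vertex AI job per Pub/Sub message.
subscriber.subscribe(sub_path, callback=handle_message).result()
```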
- Python 3.10+
- Node.js 18+
- Google Cloud SDK with authenticated service account
- GCP project with the Cloud Storage, Pub/Sub, Firestore, and Vertex AI APIs enabled
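If needed, the four APIs can be enabled in one command (standard Google Cloud service identifiers):

```
gcloud services enable storage.googleapis.com pubsub.googleapis.com firestore.googleapis.com aiplatform.googleapis.com
```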
Root .env:

```
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
GOOGLE_CLOUD_PROJECT=your-project-id
```

frontend/.env:

```
GOOGLE_CLOUD_PROJECT=your-project-id
GCS_BUCKET_NAME=your-bucket-name
PUBSUB_TOPIC=job-upload-topic
```

Terminal 1 - Python Worker:
```
source venv/bin/activate
python worker.py
```

Terminal 2 - Express API:
```
cd frontend
npm install
npm run dev:server
# Runs on http://localhost:5001
```

Terminal 3 - React UI:
```
cd frontend
npm run dev
# Runs on http://localhost:5173
```

The main file is `main.py`. You can run the various entrypoints by choosing the appropriate arguments.
```
python main.py train
```

Parameters to set:

- `--seed`: Seed for reproducibility. Default: `2`.
- `--model_name`: Model to train. Choose between two available values: `AE` for the simple autoencoder and `VAE` for a variational autoencoder.
- `--prior`: Prior to use for the variational autoencoder. Choose between two available values: `gaussian` for a Gaussian prior and `gumbel` for applying the Gumbel softmax.
- `--data`: Dataset to train the model on. Default: `sadc_2017`.
- `--config`: Configuration file for the model training. This should contain the hyperparameters for the model. You can find an example of the configuration file in the `config` folder, among the files that have `simple` as a prefix.
- `--output`: Output folder to save the outputs. Default: `cache/simple_model/`.
The output of this command is a trained model, stored in the output folder under the `autoencoder` subfolder, that can later be used for evaluation. It also generates a `.npy` file of the training history as produced by the Keras `fit` method for future use, as well as plots of the training and validation losses: for `AE` it stores the reconstruction loss plot, whereas for `VAE` it stores the reconstruction loss, the KL loss, and the total cost of the two.
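For example, a `VAE` training run might look like this (the config filename is illustrative; pick one of the `simple`-prefixed files in `config/`):

```
python main.py train --seed 2 --model_name VAE --prior gaussian --data sadc_2017 --config config/simple_vae.yaml --output cache/simple_model/
```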
```
python main.py search_hyperparameters
```

Parameters to set:

- `--seed`: Seed for reproducibility. Default: `2`.
- `--model_name`: Model to search hyperparameters for. Choose between two available values: `AE` for the simple autoencoder and `VAE` for a variational autoencoder.
- `--prior`: Prior to use for the variational autoencoder. Choose between two available values: `gaussian` for a Gaussian prior and `gumbel` for applying the Gumbel softmax.
- `--data`: Dataset to train the model on and search hyperparameters for. Default: `sadc_2017`.
- `--config`: Configuration file for the hyperparameter search. This should contain the hyperparameters for the model. You can find an example of the configuration file in the `config` folder, among the files that have `hp` as a prefix.
- `--output`: Output folder to save the outputs. Default: `cache/simple_model/`.
The output of this command is a YAML file containing the best hyperparameters found during the search, stored in the output folder as `best_hyperparameters.yaml`.
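For example (config filename illustrative; pick one of the `hp`-prefixed files in `config/`):

```
python main.py search_hyperparameters --seed 2 --model_name AE --data sadc_2017 --config config/hp_ae.yaml --output cache/simple_model/
```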
```
python main.py evaluate
```

Parameters to set:

- `--seed`: Seed for reproducibility. Default: `2`.
- `--model_path`: Path to the trained model you have stored. It should be a folder like the one created by the `train` command. Default: `cache/simple_model/autoencoder`.
- `--data`: Dataset to evaluate the model on. Default: `sadc_2017`.
- `--output`: Output folder to save the outputs. Default: `cache/predictions/`.
The output of this command is two `.csv` files: `metrics.csv`, which contains the metrics computed per variable/attribute of the data, and `averages.csv`, which contains the metrics averaged over all variables/attributes. Both are stored in the output folder.
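For example, evaluating the model saved by a default `train` run:

```
python main.py evaluate --seed 2 --model_path cache/simple_model/autoencoder --data sadc_2017 --output cache/predictions/
```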
```
python main.py find_outliers
```

Parameters to set:

- `--seed`: Seed for reproducibility. Default: `2`.
- `--model_path`: Path to the trained model you have stored. It should be a folder like the one created by the `train` command. Default: `cache/simple_model/autoencoder`.
- `--prior`: Prior to use for the variational autoencoder. Choose between two available values: `gaussian` for a Gaussian prior and `gumbel` for applying the Gumbel softmax.
- `--data`: Dataset to evaluate the model on. Default: `sadc_2017`.
- `--k`: Weight of the KL loss in the total loss if the model is `VAE`. Default: `1`.
- `--output`: Output folder to save the outputs. Default: `cache/predictions/`.
The output of this command is a `.csv` file containing the reconstruction loss (or the reconstruction, KL, and total losses in the case of `VAE`) for each row of the data, sorted in decreasing order and stored in the output folder as `errors.csv`.
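Since `errors.csv` is already sorted in decreasing order, the most suspicious rows come first; a quick way to inspect them (assuming the default output path):

```python
import pandas as pd

# errors.csv is sorted by loss in decreasing order, worst rows first
errors = pd.read_csv("cache/predictions/errors.csv")
print(errors.head(20))  # the 20 hardest-to-reconstruct rows
```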
google-auth version mismatch:

```
ERROR: google-auth-oauthlib 1.2.3 requires google-auth<2.42.0
```

Fix: Remove the version pins from `google-auth` and `google-auth-oauthlib` in `requirements.txt` so pip can resolve compatible versions.
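For example, the change in `requirements.txt` would look like this (version numbers illustrative):

```
# before: pinned versions that conflict
google-auth==2.43.0
google-auth-oauthlib==1.2.3

# after: unpinned, letting pip resolve compatible versions
google-auth
google-auth-oauthlib
```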
llvmlite build failure (LLVM not found):

```
CMake Error: LLVMConfig.cmake not found
```

Fix: Install pre-built binaries:

```
pip install --only-binary=:all: llvmlite==0.46.0 numba==0.63.1
```

TensorFlow AVX crash:
The standard `tensorflow==2.15.1` package requires AVX instructions, which are not available on Apple Silicon.

For the worker (no TensorFlow needed):

The `worker.py` script only dispatches jobs to Vertex AI; it doesn't need TensorFlow locally. Ensure no TensorFlow imports exist in `worker.py`.

For local training:

```
pip uninstall tensorflow
pip install tensorflow-macos tensorflow-metal
```

Note: Do not add `tensorflow-macos` to `requirements.txt`; it breaks the cloud build. These packages are for local development only.
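A quick smoke test after installing (should print the version instead of crashing):

```
python -c "import tensorflow as tf; print(tf.__version__)"
```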
If you see errors about missing resources or permissions, verify your credentials point to the correct project:

```
echo $GOOGLE_APPLICATION_CREDENTIALS
cat $GOOGLE_APPLICATION_CREDENTIALS | jq .project_id
```

- Cold Start Latency: Vertex AI jobs take ~10-15 minutes due to container initialization
- No Authentication: API endpoints are unauthenticated (deploy on a private network)
- No Job Cancellation: Cannot cancel running jobs from UI (use GCP Console)