Skip to content

fifamaster1212/PyML

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 

Repository files navigation

PyML | Domain-Specific Language for Machine Learning

PyML is a domain-specific language (DSL) designed to simplify common machine learning workflows. Built on top of Python, pandas, scikit-learn, and matplotlib, PyML aims to provide data scientists and analysts with an intuitive language for tasks such as loading data, preprocessing, training models, evaluating performance, and visualizing results. We use a sample dataset of housing prices from Kaggle in the examples below.

Table of Contents

Introduction

Machine learning pipelines often require a series of repetitive steps: loading datasets, handling missing values, encoding categorical features, splitting data into training and testing sets, selecting models, training, evaluating performance, and generating predictions. PyML was created to streamline these processes through a concise, human-readable DSL.

PyML leverages Python’s scientific stack under the hood, enabling you to harness the power of pandas, numpy, scikit-learn, and matplotlib with straightforward commands. This allows both beginners and experienced practitioners to quickly set up and run ML experiments without the boilerplate of traditional Python scripts.

Features

  • Data Loading: Easily load CSV datasets into memory.
  • Preprocessing: Handle missing values, filter data by conditions, and normalize numeric columns.
  • Feature Selection: Define target and features directly from the DSL.
  • Model Creation: Build linear models (Ridge, Lasso) and Logistic Regression models with simple commands.
  • Training and Evaluation: Train models and evaluate performance using MSE and R² scores.
  • Visualization: Plot data relationships directly for quick insights.
  • Describing Columns: Inspect column statistics and types for better data understanding.
  • Prediction: Generate predictions and save them to CSV files.

Language Grammar

PyML’s grammar is command-driven. Each line represents one operation. Most commands follow the pattern:

load "filename.csv"                 # Load a dataset
target "ColumnName"                 # Set target column
features ["col1", "col2", ...]      # Set feature columns
missing fill 0                      # Fill missing values with 0
split ratio=0.8 shuffle             # Split data into train/test sets
model ridge alpha=1.0               # Create a ridge regression model
train                               # Train the current model
r2 train                            # Calculate R² on train set
r2 test                             # Calculate R² on test set
describe "ColumnName"               # Describe a single column
describe ["col1", "col2"]           # Describe multiple columns
type "ColumnName"                   # Show the data type of a single column
type ["col1", "col2"]               # Show data types of multiple columns
normalize ["col1", "col2"] method=zscore   # Normalize specified columns
plot x="colX" y="colY"              # Plot one column against another
predict save="predictions.csv"      # Predict on test set and save to CSV
feature "NewFeature" = "col1 + col2"  # Add a derived feature

Installation

Prerequisites:

  • Python 3.x
  • pandas, numpy, scikit-learn, matplotlib

Installation Steps:

  1. Clone the Repository:

    git clone https://github.com/fifamaster1212/PyML.git
  2. Install Dependencies:

    cd pyml
    pip install pandas numpy scikit-learn matplotlib

Usage

You can run PyML commands interactively using the provided REPL (main() function) or by passing a script file containing PyML commands line-by-line.

Interactive REPL:

python3 PyMLinterpreter.py

When the REPL starts, type commands one at a time. For example:

>>> load "housing_data.csv"
>>> target "SalePrice"
>>> features ["LotArea", "OverallQual", "YearBuilt"]
>>> split ratio=0.8 shuffle
>>> model ridge alpha=1.0
>>> train
>>> r2 test
>>> predict save="predictions.csv"
>>> exit

Examples

Data Loading and Preprocessing

load "housing_data.csv"
missing fill 0
filter_data "LotArea > 5000 and OverallQual >= 7"
normalize ["LotArea", "GrLivArea"] method=minmax
describe ["LotArea", "GrLivArea"]
type ["LotArea", "OverallQual"]

Explanation:

  • Loads a housing dataset.
  • Fills missing values with 0.
  • Filters rows based on given conditions.
  • Normalizes numeric features.
  • Describes columns and checks their data types.

Model Training and Evaluation

load "housing_data.csv"
target "SalePrice"
features ["LotArea", "OverallQual", "YearBuilt", "GrLivArea"]
split ratio=0.8 shuffle
model ridge alpha=1.0
train
r2 train
r2 test
error
predict save="predictions.csv"

Explanation:

  • Sets the target and features.
  • Splits the data into training and testing sets.
  • Creates a Ridge model, trains it, and evaluates it using R² and MSE.
  • Makes predictions on the test set and saves them.

Feature Engineering

load "housing_data.csv"
feature "TotalArea" = "LotArea + GrLivArea + TotalBsmtSF"
describe "TotalArea"

Explanation:

  • Adds a new feature (TotalArea) calculated from existing columns.
  • Describes the new feature to confirm it was created correctly.

Plotting

load "housing_data.csv"
plot x="OverallQual" y="SalePrice"

Explanation:

  • Loads the data and plots the relationship between overall quality and sale price.

Column Descriptions and Types

load "housing_data.csv"
describe ["LotArea", "SalePrice"]
type ["OverallQual", "YearBuilt"]

Explanation:

  • Provides statistical descriptions for specified columns.
  • Prints the data types of the specified columns.

Examples for Functions with Parameters

Handling Missing Values

missing fill 0
missing drop
missing mean

Explanation:

  • missing fill 0: Replaces all missing values with 0.
  • missing drop: Removes rows with missing values.
  • missing mean: Replaces missing values with the mean of the respective column.

Splitting Data

split ratio=0.8 shuffle
split ratio=0.7 noshuffle

Explanation:

  • split ratio=0.8 shuffle: Splits the data into 80% training and 20% testing with shuffling.
  • split ratio=0.7 noshuffle: Splits the data into 70% training and 30% testing without shuffling.

Normalizing Data

normalize ["LotArea", "GrLivArea"] method=minmax
normalize ["TotalBsmtSF"] method=zscore

Explanation:

  • normalize ["LotArea", "GrLivArea"] method=minmax: Scales specified columns to a 0–1 range.
  • normalize ["TotalBsmtSF"] method=zscore: Scales the specified column using z-score normalization.

Testing Suite

PyML includes a unittest-based testing suite that verifies functionality for loading data, setting targets, selecting features, training models, evaluating performance, and more.

Running Tests:

python3 -m unittest tests.py 

Potential Future Additions

Extended Model Support:

  • Add more models (e.g., decision trees, random forests, gradient boosting).

Enhanced Visualization:

  • Include histograms, scatter plots with regression lines, and correlation heatmaps.

Advanced Preprocessing:

  • Include one-hot encoding, label encoding, and scaling options beyond min-max and z-score.

Time Series and Deep Learning:

  • Add support for time-series analyses and incorporate deep learning frameworks for more advanced tasks.

Improved Error Handling and Logging:

  • Provide more descriptive error messages and logging for easier debugging.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages