A Haskell library (based on eggp, which is in turn based on srtree) for symbolic regression on DataFrames. Automatically discover mathematical expressions that best fit your data using genetic programming with e-graph optimization.
symbolic-regression integrates symbolic regression capabilities into a DataFrame workflow. Given a target column and a dataset, it evolves mathematical expressions that predict the target variable, returning a Pareto front of expressions trading off complexity and accuracy.
## Quick Start

```haskell
ghci> import qualified DataFrame as D
ghci> import qualified DataFrame.Functions as F
ghci> import Symbolic.Regression

-- Load your data
ghci> df <- D.readParquet "./data/mtcars.parquet"

-- Define mpg as a column reference
ghci> let mpg = F.col "mpg"

-- Run symbolic regression to predict 'mpg'.
-- NOTE: all columns must be converted to Double first, e.g.
--   df' = D.derive "some_column" (F.toDouble (F.col @Int "some_column")) df
-- Otherwise, symbolic regression will by default only use the Double columns.
```
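The conversion note above can be applied to several columns at once with a fold. A sketch, reusing only the `D.derive`/`F.toDouble` pattern from the note; the column names `"cyl"`, `"hp"`, and `"gear"` are illustrative (substitute your dataset's Int-typed columns):

```haskell
-- Convert each Int column to Double in place, one D.derive per column.
ghci> let intCols = ["cyl", "hp", "gear"]  -- assumption: these are Int columns
ghci> let dfD = foldr (\c acc -> D.derive c (F.toDouble (F.col @Int c)) acc) df intCols
```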
```haskell
ghci> exprs <- fit defaultRegressionConfig mpg df

-- View discovered expressions (Pareto front from simplest to most complex)
ghci> mapM_ (\(i, e) -> putStrLn $ "Model " ++ show i ++ ": " ++ D.prettyPrint e) (zip [1..] exprs)

-- Create named expressions for different complexity levels
ghci> import qualified Data.Text as T
ghci> let levels = zipWith (F..=) (map (T.pack . ("level_" ++) . show) [1..]) exprs

-- Show the various predictions in our dataframe
ghci> let df' = D.deriveMany levels df

-- Or pick the best one for prediction
ghci> let df'' = D.derive "prediction" (last exprs) df'

-- Display the results
ghci> D.display (D.DisplayOptions 5) df''
```

Customize the search with `RegressionConfig`:
```haskell
data RegressionConfig = RegressionConfig
  { generations               :: Int          -- Number of evolutionary generations (default: 100)
  , maxExpressionSize         :: Int          -- Maximum tree depth/complexity (default: 5)
  , numFolds                  :: Int          -- Cross-validation folds (default: 3)
  , showTrace                 :: Bool         -- Print progress during evolution (default: True)
  , lossFunction              :: Distribution -- MSE, Gaussian, Poisson, etc. (default: MSE)
  , numOptimisationIterations :: Int          -- Parameter optimization iterations (default: 30)
  , numParameterRetries       :: Int          -- Retries for parameter fitting (default: 2)
  , populationSize            :: Int          -- Population size (default: 100)
  , tournamentSize            :: Int          -- Tournament selection size (default: 3)
  , crossoverProbability      :: Double       -- Crossover rate (default: 0.95)
  , mutationProbability       :: Double       -- Mutation rate (default: 0.3)
  , unaryFunctions            :: [...]        -- Unary operations to include
  , binaryFunctions           :: [...]        -- Binary operations to include
  , numParams                 :: Int          -- Number of parameters (-1 for auto)
  , generational              :: Bool         -- Use generational replacement (default: False)
  , simplifyExpressions       :: Bool         -- Simplify output expressions (default: True)
  , maxTime                   :: Int          -- Time limit in seconds (-1 for none)
  , dumpTo                    :: String       -- Save e-graph state to file
  , loadFrom                  :: String       -- Load e-graph state from file
  }
```

For example:

```haskell
myConfig :: RegressionConfig
myConfig = defaultRegressionConfig
  { generations       = 200
  , maxExpressionSize = 7
  , populationSize    = 200
  }
```
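Another sketch, capping wall-clock time instead of generations. It uses only fields from the `RegressionConfig` record, and assumes `Gaussian` is a constructor of `Distribution` as the field comment lists:

```haskell
timedConfig :: RegressionConfig
timedConfig = defaultRegressionConfig
  { generations  = 500
  , maxTime      = 60       -- stop after 60 seconds, even if generations remain
  , lossFunction = Gaussian -- assumption: one of the distributions listed above
  , showTrace    = False    -- quiet run
  }
```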
```haskell
exprs <- fit myConfig targetColumn df
```

`fit` returns a list of expressions representing the Pareto front, ordered by complexity (simplest first). Each expression:
- Is a valid `Expr Double` that can be used with DataFrame operations
- Represents a different trade-off between simplicity and accuracy
- Has optimized numerical constants
Under the hood, `fit` combines:

- Genetic Programming: Evolves a population of expression trees through selection, crossover, and mutation
- E-graph Optimization: Uses equality saturation to discover equivalent expressions and simplify
- Parameter Optimization: Fits numerical constants using nonlinear optimization
- Pareto Selection: Returns expressions across the complexity-accuracy frontier
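The Pareto-selection idea can be illustrated on its own. A minimal sketch (not the library's implementation): sort candidate `(complexity, loss)` pairs by complexity, then keep only those that strictly improve on the best loss seen among simpler candidates:

```haskell
import Data.List (sortOn)

-- Non-dominated filter over (complexity, loss) pairs: after sorting by
-- complexity, keep each candidate whose loss beats every simpler one.
paretoFront :: [(Int, Double)] -> [(Int, Double)]
paretoFront = go (1 / 0) . sortOn fst
  where
    go _ [] = []
    go best ((c, l) : rest)
      | l < best  = (c, l) : go l rest
      | otherwise = go best rest

main :: IO ()
main = print (paretoFront [(3, 0.9), (1, 1.2), (5, 0.4), (4, 0.95), (7, 0.4)])
-- prints [(1,1.2),(3,0.9),(5,0.4)]
```

Every surviving pair is either simpler or more accurate than its neighbours, which is exactly the shape of the list `fit` returns.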
To install symbolic-regression you'll need:
- libz: `sudo apt install libz-dev`
- libnlopt: `sudo apt install libnlopt-dev`
- libgmp: `sudo apt install libgmp-dev`
For Nix users with flakes enabled:
```shell
git clone <repo-url>
cd symbolic-regression
nix develop -c cabal repl
```

Then follow the Quick Start example above.