Skip to content

Wadalisa/Structured-Based-GP

Repository files navigation

Structured-Based-GP

Description of Data Set

Dataset Statistics

Title of Data Set: hepatitis

Features : 20 of which there are 3 binary features, 11 categorical features and 6 continuous features

Observations : 155

Attribute Characteristics: Categorical

Missing/NaN Values: None

Duplicate rows : None

The dataset is pre-processed by encoding binary columns such as the STERIOD field, converting the value, 1:Yes and 0:No. Missing values of the binary columns are imputed with the mode of the columns, if applicable. The continuous columns are then normalized with the use of Min-Max normalization, to balance the data and ensure equality in scale. Missing values are imputed using the median value of the column. The dataset is the split into a 80/20 split. This dataset is used on both the regular GP and the structured GP.

Representation

In genetic programming, a tree-like structure is used to represent the classifier. These trees represent programs, which are then evolved to find the best classifier to make a prediction. This presentation is selected as tree-based structures have adaptable architecture. The representation used in this document is the syntax tree.

The tree node implementation is used in the generation of an individual. The initial population is generated using the growth method. The growth method is attentive to the arity of the node, meaning the arity of the node plays a part in how many arguments/children the node takes.

This method is used to create individuals of population size n.

Fitness Function

Regular GP

The fitness function is calculated using the F1-Score. The F-Score metric makes use of the precision and recall, prefering accuracy in class-imbalaced data. It is often described as the "harmonic mean." The F1-Score is used as due to its 'knowledge' on precision and recall avoiding the over-estimation that a normal mean would cause.

Structured GP

In structured GP, the fitness function also utilizes the F1 score formula above. However, the fitness function used in the structured GP, the structure diversity is taken into account. Prioritizing uniqueness in a population, thus giving individuals with more uniqueness higher structural fitness. Therefore allowing the new fitness function to be both behavioral and structural aware.

Selection Method

Regular GP

The tournament selection applied in this GP, works without replacement. This means that once a winner is selected, they are then removed from the population, to avoid being selected again. This assists with further exploration of the search space as different children can be produced with the help of genetic operators, increasing diversity.

Structured GP

The tournament selection applied in this GP. However, the individuals compete based on the fitness and the diversity of that individual. A score is then calculated using the average similarity of the selected participants.

Genetic Operators

Genetic operators assist with creating 'children' in the population from the parents.

In this report, Crossover and Mutation are used. The combination of these genetic operators allows for both the exploitation and exploration of the search space. Crossover will be use for exploitation in the search space meaning that cross over will combine two 'good' attributes from the parent trees as per fitness. whereas Mutation will be used for exploration to introduce variety and new parts of the search space.

Regular GP

Elitism

Elitism is a genetic operator which is used to keep the individuals with the best fitness scores, for the next generation. 10% of the population is selected during Elitism for the new population.

Swap Mutation Operator

The mutation operator will locate two random points in a individual, and swap both of these points with the other. This is called the swap mutation, which will be applied at a rate to control the exploration done by the GP.

Sub-tree Crossover Operator

The crossover operator will swap out sub-trees of the parents. It locates a random point in the sub-trees of the parent trees then the selected sub-trees will be "crossed over" to the other parent tree. This method is called subtree crossover. The rate is used to determine how much crossover is applied to the population.

Structured GP

Structural Elitism

Elitism is a genetic operator which is used to keep the individuals with the best fitness scores and unique structures, for the next generation.

A Diverse Swap Mutation Operator

The mutation operator will locate two random points in a individual, and swap both of these points with the other. If the average similarity of the population is higher than 50% the rate of mutation gets increased to increase exploration of the search space.

A Structured Sub-tree Crossover Operator

The crossover operator will swap out sub-trees of the parents. A comparison of the parents is made of how unique and diverse they are. Sub-tree Crossover occurs only when the parents selected are found to be over the similarity threshold.

Termination Criterion

The number of generation is the termination of both the regular GP and the structural GP.

Description of The Structured GP

Structured Genetic Programming (SGP) is an advanced evolutionary algorithm that enhances traditional GP by incorporating structural awareness into the selection and variation processes. The provided implements a tree similarity metric that compares program structures at both node and subtree levels, a fitness function combining behavioral performance (F1-score) with structural uniqueness, and adaptive operators that dynamically adjust based on population diversity. The algorithm begins by generating an initial population using a growth method that creates syntax trees with configurable depth, where terminal nodes represent features/constants and internal nodes contain logical/arithmetic operators. During evolution, selection prioritizes individuals that balance high classification accuracy with structural distinctiveness, using tournament selection that scores candidates based on both criteria. Crossover and mutation operations are strategically guided by similarity thresholds - for instance, crossover is suppressed between overly similar parents to maintain diversity, while mutation rates increase when population similarity exceeds 50% to prevent stagnation.

Parameters

Standard GP Parameters

num_features = len(dataset.columns)-1 #number ofcolumns

terminal = [f"x{i}" for i in range(1, num_features+1)] # for n number of x features x1, x2,..., xn, c

function = ['and', 'or', 'not', '>', '<', '==','if','+','-','*','/'] # Logical/comparison operators2

max_depth = 2

population_size = 250

generations = 52

mutation_rate = 0.10

crossover_rate = 0.8

elitism_rate = 0.10

tournament_size = 4

Structured GP Parameters

num_features = len(dataset.columns)-1 #number ofcolumns

terminal = [f"x{i}" for i in range(1, num_features+1)] # for n number of x features x1, x2,..., xn, c

function = ['and', 'or', 'not', '>', '<', '==','if','+','-','*','/'] # Logical/comparison operators

max_depth = 2

population_size = 500

generations = 52

mutation_rate = 0.10

crossover_rate = 0.8

elitism_rate = 0.10

beta=0.7

alpha = 0.3 # Weight for structural fitness

similarity_threshold = 0.6 # For crossover suppression

min_diversity = 0.2 # Threshold to trigger diversity measures

boost_factor=1.5 #to boost diversifing mutation

tournament_size = 4

About

Structured based GP

Topics

Resources

Stars

Watchers

Forks

Languages