Title of Data Set: hepatitis
Features : 20 of which there are 3 binary features, 11 categorical features and 6 continuous features
Observations : 155
Attribute Characteristics: Categorical
Missing/NaN Values: None
Duplicate rows : None
The dataset is pre-processed by encoding binary columns such as the STERIOD field, converting the value, 1:Yes and 0:No. Missing values of the binary columns are imputed with the mode of the columns, if applicable. The continuous columns are then normalized with the use of Min-Max normalization, to balance the data and ensure equality in scale. Missing values are imputed using the median value of the column. The dataset is the split into a 80/20 split. This dataset is used on both the regular GP and the structured GP.
In genetic programming, a tree-like structure is used to represent the classifier. These trees represent programs, which are then evolved to find the best classifier to make a prediction. This presentation is selected as tree-based structures have adaptable architecture. The representation used in this document is the syntax tree.
The tree node implementation is used in the generation of an individual. The initial population is generated using the growth method. The growth method is attentive to the arity of the node, meaning the arity of the node plays a part in how many arguments/children the node takes.
This method is used to create individuals of population size n.
The fitness function is calculated using the F1-Score. The F-Score metric makes use of the precision and recall, prefering accuracy in class-imbalaced data. It is often described as the "harmonic mean." The F1-Score is used as due to its 'knowledge' on precision and recall avoiding the over-estimation that a normal mean would cause.
In structured GP, the fitness function also utilizes the F1 score formula above. However, the fitness function used in the structured GP, the structure diversity is taken into account. Prioritizing uniqueness in a population, thus giving individuals with more uniqueness higher structural fitness. Therefore allowing the new fitness function to be both behavioral and structural aware.
The tournament selection applied in this GP, works without replacement. This means that once a winner is selected, they are then removed from the population, to avoid being selected again. This assists with further exploration of the search space as different children can be produced with the help of genetic operators, increasing diversity.
The tournament selection applied in this GP. However, the individuals compete based on the fitness and the diversity of that individual. A score is then calculated using the average similarity of the selected participants.
Genetic operators assist with creating 'children' in the population from the parents.
In this report, Crossover and Mutation are used. The combination of these genetic operators allows for both the exploitation and exploration of the search space. Crossover will be use for exploitation in the search space meaning that cross over will combine two 'good' attributes from the parent trees as per fitness. whereas Mutation will be used for exploration to introduce variety and new parts of the search space.
Elitism is a genetic operator which is used to keep the individuals with the best fitness scores, for the next generation. 10% of the population is selected during Elitism for the new population.
The mutation operator will locate two random points in a individual, and swap both of these points with the other. This is called the swap mutation, which will be applied at a rate to control the exploration done by the GP.
The crossover operator will swap out sub-trees of the parents. It locates a random point in the sub-trees of the parent trees then the selected sub-trees will be "crossed over" to the other parent tree. This method is called subtree crossover. The rate is used to determine how much crossover is applied to the population.
Elitism is a genetic operator which is used to keep the individuals with the best fitness scores and unique structures, for the next generation.
The mutation operator will locate two random points in a individual, and swap both of these points with the other. If the average similarity of the population is higher than 50% the rate of mutation gets increased to increase exploration of the search space.
The crossover operator will swap out sub-trees of the parents. A comparison of the parents is made of how unique and diverse they are. Sub-tree Crossover occurs only when the parents selected are found to be over the similarity threshold.
The number of generation is the termination of both the regular GP and the structural GP.
Structured Genetic Programming (SGP) is an advanced evolutionary algorithm that enhances traditional GP by incorporating structural awareness into the selection and variation processes. The provided implements a tree similarity metric that compares program structures at both node and subtree levels, a fitness function combining behavioral performance (F1-score) with structural uniqueness, and adaptive operators that dynamically adjust based on population diversity. The algorithm begins by generating an initial population using a growth method that creates syntax trees with configurable depth, where terminal nodes represent features/constants and internal nodes contain logical/arithmetic operators. During evolution, selection prioritizes individuals that balance high classification accuracy with structural distinctiveness, using tournament selection that scores candidates based on both criteria. Crossover and mutation operations are strategically guided by similarity thresholds - for instance, crossover is suppressed between overly similar parents to maintain diversity, while mutation rates increase when population similarity exceeds 50% to prevent stagnation.
num_features = len(dataset.columns)-1 #number ofcolumns
terminal = [f"x{i}" for i in range(1, num_features+1)] # for n number of x features x1, x2,..., xn, c
function = ['and', 'or', 'not', '>', '<', '==','if','+','-','*','/'] # Logical/comparison operators2
max_depth = 2
population_size = 250
generations = 52
mutation_rate = 0.10
crossover_rate = 0.8
elitism_rate = 0.10
tournament_size = 4
num_features = len(dataset.columns)-1 #number ofcolumns
terminal = [f"x{i}" for i in range(1, num_features+1)] # for n number of x features x1, x2,..., xn, c
function = ['and', 'or', 'not', '>', '<', '==','if','+','-','*','/'] # Logical/comparison operators
max_depth = 2
population_size = 500
generations = 52
mutation_rate = 0.10
crossover_rate = 0.8
elitism_rate = 0.10
beta=0.7
alpha = 0.3 # Weight for structural fitness
similarity_threshold = 0.6 # For crossover suppression
min_diversity = 0.2 # Threshold to trigger diversity measures
boost_factor=1.5 #to boost diversifing mutation
tournament_size = 4