In a given iteration, will every strain access the same or different sampled data? (cross validation) #137

jdmoore7 started this conversation in General

I'm using PyGAD for cross-validation hyper-parameter tuning. I sample train/test data inside the fitness function, but I'm unclear whether every strain i in iteration x will see the same sampled data, or whether the sampling will differ across strains within the same iteration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement
import numpy as np
import pygad

gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]

def fitness_function_factory(data, y_name, sample_size):
    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=int(solution[0]),
            min_samples_split=int(solution[1]),
            min_samples_leaf=int(solution[2]),
            min_impurity_decrease=solution[3]
        )

        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)

        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)
        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)

        # Positional indexing, since the sampled indices are row positions.
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        return fitness
    return fitness_function

cross_validation = pygad.GA(gene_space=gene_space,
                            fitness_func=fitness_function_factory(data, y_name, sample_size),
                            num_generations=100,
                            num_parents_mating=4,
                            sol_per_pop=8,
                            parent_selection_type='sss',
                            keep_parents=2,
                            crossover_type="single_point",
                            mutation_type="random",
                            mutation_percent_genes=10)
```
Replies: 1 comment
@jdmoore7,

Sorry for getting back to you late!

The fitness function is called once for each individual solution (or strain, as you said). This means the data sampling will differ from one solution to another, even within the same generation.
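To see the difference concretely, here is a minimal, PyGAD-free sketch in plain NumPy (a hypothetical population of 8 solutions; the names and sizes are illustrative, not from PyGAD itself) contrasting sampling per fitness call with sampling once per generation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, sample_size, pop_size = 100, 10, 8

def sample():
    # One random subset of row positions, like sample_without_replacement().
    return set(rng.choice(n_rows, size=sample_size, replace=False))

# Sampling inside the fitness function: a fresh subset for every solution.
per_solution = [sample() for _ in range(pop_size)]
all_same_inside = all(s == per_solution[0] for s in per_solution)

# Sampling once per generation: every solution sees the identical subset.
shared = sample()
all_same_shared = all(shared == shared for _ in range(pop_size))

print(all_same_inside, all_same_shared)  # False True
```

With per-call sampling, solutions in the same generation are scored on different subsets, so their fitness values are not directly comparable; the per-generation approach below removes that source of noise.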

If you want to sample once for all solutions (strains) in the same generation (iteration), then you can:

  1. For the first generation, do the sampling in code outside the fitness function.
  2. For all later generations, re-sample inside the on_generation() callback.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.random import sample_without_replacement
import numpy as np
import pygad

gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]

# `data` (a DataFrame), `y_name`, and `sample_size` are assumed to be
# defined earlier. Sample here once, for the first generation.
X = data.drop(columns=[y_name])
y = data[y_name]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
train_idx = sample_without_replacement(n_population=len(X_train),
                                       n_samples=sample_size)
test_idx = sample_without_replacement(n_population=len(X_test),
                                      n_samples=sample_size)

def on_generation(ga_instance):
    # Re-sample once per generation; every solution evaluated in the next
    # generation then shares the same train/test subsets.
    global X_train, X_test, y_train, y_test, train_idx, test_idx
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
    train_idx = sample_without_replacement(n_population=len(X_train),
                                           n_samples=sample_size)
    test_idx = sample_without_replacement(n_population=len(X_test),
                                          n_samples=sample_size)

def fitness_function_factory():
    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=int(solution[0]),
            min_samples_split=int(solution[1]),
            min_samples_leaf=int(solution[2]),
            min_impurity_decrease=solution[3]
        )
        # Positional indexing, since the sampled indices are row positions.
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        return fitness
    return fitness_function

cross_validation = pygad.GA(gene_space=gene_space,
                            fitness_func=fitness_function_factory(),
                            on_generation=on_generation,
                            num_generations=100,
                            num_parents_mating=4,
                            sol_per_pop=8,
                            parent_selection_type='sss',
                            keep_parents=2,
                            crossover_type="single_point",
                            mutation_type="random",
                            mutation_percent_genes=10)
```

Note that the callback only runs if it is registered via the `on_generation` parameter of the `pygad.GA` constructor, and `fitness_func` must receive the function returned by the factory, not the factory itself.