Applying Darwinian Evolution to feature selection with Kydavra GeneticAlgorithmSelector

The development of machine learning involves a lot of math. But during the feature-selection phase, math sometimes can’t give an exact answer (because of the structure of the data, its source, and many other causes). That’s where programming tricks, mostly brute-force methods, enter the game :).

Genetic algorithms are a family of algorithms inspired by biological evolution. They basically repeat a cycle — cross, mutate, try — evolving the best combination of states according to a scoring metric. So, let’s get to the code.
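To make that cycle concrete, here is a minimal, self-contained sketch of a genetic algorithm for feature selection. This illustrates the general idea only, not Kydavra’s actual implementation; the toy dataset, the LogisticRegression model, the population size, and the mutation rate are all assumptions chosen for readability:

import random

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data, just to have something to score against.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def fitness(mask):
    # Score a chromosome: train on the masked columns only.
    cols = [i for i, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[:, cols], y_train)
    return accuracy_score(y_test, model.predict(X_test[:, cols]))

def crossover(a, b):
    # Single-point crossover between two parent chromosomes.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(mask, rate=0.1):
    # Flip each gene (feature on/off) with a small probability.
    return [1 - g if random.random() < rate else g for g in mask]

# Evolve a population of 0/1 masks: keep the best, breed the rest.
population = [[random.randint(0, 1) for _ in range(10)] for _ in range(8)]
for _ in range(20):
    parents = sorted(population, key=fitness, reverse=True)[:4]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(4)]
    population = parents + children

best = max(population, key=fitness)
print('best mask:', best, 'score:', fitness(best))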

Using GeneticAlgorithmSelector from the Kydavra library.

To install Kydavra, just write the following command in the terminal:

pip install kydavra

Now you can import the selector and apply it to your dataset as follows:

from kydavra import GeneticAlgorithmSelector

selector = GeneticAlgorithmSelector()
new_columns = selector.select(model, df, 'target')

As with every Kydavra selector, that’s all. Now let’s try it on the Heart Disease dataset.

import pandas as pd

df = pd.read_csv('cleaned.csv')

I highly recommend shuffling your dataset before applying the selector, because it evaluates feature sets with a plain metric score (cross_val_score isn’t implemented in this selector yet).

df = df.sample(frac=1).reset_index(drop=True)

Now we can apply our selector. Note that it has a few parameters (spelled out in the sketch after this list):

  • nb_children (int, default = 4): the number of best children that the algorithm will choose for the next generation.
  • nb_generation (int, default = 200): the number of generations that will be created; technically speaking, the number of iterations.
  • scoring_metric (sklearn scoring metric, default = accuracy_score): the metric used to select the best feature combination.
  • max (boolean, default = True): if set to True, the algorithm selects the combinations with the highest score; if set to False, the lowest.
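Spelled out as keyword arguments, the defaults above would look like this (a sketch, assuming the constructor accepts the parameter names exactly as listed):

from kydavra import GeneticAlgorithmSelector
from sklearn.metrics import accuracy_score

# All defaults written out explicitly (parameter names as listed above).
selector = GeneticAlgorithmSelector(
    nb_children=4,                   # best children kept per generation
    nb_generation=200,               # number of generations (iterations)
    scoring_metric=accuracy_score,   # metric used to rank feature sets
    max=True,                        # prefer the highest-scoring combinations
)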

But for now, we will use the default settings except for scoring_metric: since this is a disease-diagnosis problem, it is better to use precision instead of accuracy.

from kydavra import GeneticAlgorithmSelector
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier

selector = GeneticAlgorithmSelector(scoring_metric=precision_score)
model = RandomForestClassifier()

So now let’s find the best features. GAS (short for GeneticAlgorithmSelector) needs an sklearn model to train during the feature-selection process, the data frame itself, and of course the name of the target column:

selected_cols = selector.select(model, df, 'target')

Now let’s evaluate the result. Before feature selection, the precision score of the Random Forest was 0.805. GAS chose the following features:

['age', 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

These gave a precision score of 0.823, which is a good result, knowing that in the majority of cases it is very hard to push a scoring metric up.
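For reference, here is a hypothetical sketch of how such a before/after comparison could be reproduced. The train/test split, the random seeds, and the evaluate helper are my own assumptions, so the exact numbers will differ from the 0.805 and 0.823 above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

selected_cols = ['age', 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']

X_all = df.drop(columns='target')
y = df['target']

def evaluate(X):
    # Train a fresh Random Forest and measure precision on a held-out split.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    return precision_score(y_test, model.predict(X_test))

print('all features:     ', evaluate(X_all))
print('selected features:', evaluate(X_all[selected_cols]))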

If you want to find out more about genetic algorithms, there are some useful links at the bottom of the article. If you have tried Kydavra and have issues or feedback, please contact me on Medium or fill out this form.

Made with ❤ by Sigmoid

Useful links:
