Utils models
apply_pipeline(df, preprocess_pipeline)
Applies a fitted pipeline to a dataframe
Problem
The pipeline expects as input the same columns, in the same order, as at fit time, even if some of these columns are then dropped (and are therefore useless).
Solution (experimental, 14/04/2021): we add the missing "useless" columns, filled with NaNs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe to preprocess | required |
preprocess_pipeline | ColumnTransformer | Pipeline to use | required |
Raises:
ValueError: If some mandatory columns are missing
Returns:
pd.DataFrame: Preprocessed DataFrame
Source code in template_num/models_training/utils_models.py
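A minimal usage sketch; `my_pipeline_dir` and the dataframe contents are placeholders, and the pipeline is assumed to have been fitted beforehand:

```python
import pandas as pd
from template_num.models_training import utils_models

# Hypothetical folder name containing a previously fitted pipeline
preprocess_pipeline, preprocess_name = utils_models.load_pipeline('my_pipeline_dir')

df = pd.DataFrame({'col_1': [1.0, 2.0], 'col_2': [3.0, 4.0]})  # placeholder data
# Missing "useless" columns are re-added as NaN-filled columns before transforming
preprocessed_df = utils_models.apply_pipeline(df, preprocess_pipeline)
```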
display_train_test_shape(df_train, df_test, df_shape=None)
Displays the size of a train/test split
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_train | DataFrame | Train dataset | required |
df_test | DataFrame | Test dataset | required |
Kwargs:
df_shape (int): Size of the initial dataset, before the split
Raises:
ValueError: If df_shape is not positive
Source code in template_num/models_training/utils_models.py
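A minimal sketch, assuming `df` is a pandas DataFrame loaded earlier:

```python
from template_num.models_training import utils_models

df_train, df_test = utils_models.normal_split(df, test_size=0.25, seed=42)
# df_shape is the size of the initial dataset, used when reporting the split
utils_models.display_train_test_shape(df_train, df_test, df_shape=df.shape[0])
```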
get_columns_pipeline(preprocess_pipeline)
Retrieves the columns expected by a pipeline, and the mandatory ones
Parameters:
Name | Type | Description | Default |
---|---|---|---|
preprocess_pipeline | ColumnTransformer | Preprocessing pipeline | required |
Returns:
list: List of columns in the pipeline
list: List of mandatory ones
Source code in template_num/models_training/utils_models.py
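For instance, to inspect what a reloaded pipeline expects (`my_pipeline_dir` is a placeholder; the comments interpret "mandatory" in the sense used by apply_pipeline above):

```python
from template_num.models_training import utils_models

pipeline, _ = utils_models.get_columns_pipeline  # see sketch below
pipeline, _ = utils_models.load_pipeline('my_pipeline_dir')  # hypothetical folder
columns, mandatory_columns = utils_models.get_columns_pipeline(pipeline)
print(columns)            # every column the pipeline wants as input
print(mandatory_columns)  # columns that cannot simply be NaN-filled
```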
load_model(model_dir, is_path=False)
Loads a model from a path
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_dir | str | Name of the folder containing the model (e.g. model_autres_2019_11_07-13_43_19) | required |
Kwargs:
is_path (bool): If True, model_dir is a folder path instead of a name (permits loading a model from elsewhere)
Returns:
?: Model
dict: Model configurations
Source code in template_num/models_training/utils_models.py
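For instance (the path in the second call is a placeholder):

```python
from template_num.models_training import utils_models

# Load by folder name (the example name comes from the docstring above) ...
model, model_conf = utils_models.load_model('model_autres_2019_11_07-13_43_19')
# ... or from an arbitrary location by passing a full path
model, model_conf = utils_models.load_model('/path/to/some/model_dir', is_path=True)
```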
load_pipeline(pipeline_dir, is_path=False)
Loads a pipeline from the pipelines folder
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pipeline_dir | str | Name of the folder containing the pipeline to get. If None, falls back on "no_preprocess" | required |
Kwargs:
is_path (bool): If True, pipeline_dir is a path to the folder instead of its name (permits loading from elsewhere)
Returns:
Pipeline: Reloaded pipeline
str: Name of the preprocessing used
Source code in template_num/models_training/utils_models.py
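For instance (`my_pipeline_dir` is a placeholder folder name):

```python
from template_num.models_training import utils_models

pipeline, preprocess_name = utils_models.load_pipeline('my_pipeline_dir')
# With None, the function falls back on the "no_preprocess" pipeline
pipeline, preprocess_name = utils_models.load_pipeline(None)
```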
normal_split(df, test_size=0.25, seed=None)
Splits a DataFrame into train and test sets
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
Kwargs:
test_size (float): Proportion of the dataset to put in the test set
seed (int): Random seed
Raises:
ValueError: If test_size is not between 0 and 1
Returns:
DataFrame: Train dataframe
DataFrame: Test dataframe
Source code in template_num/models_training/utils_models.py
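For instance, on a toy dataframe:

```python
import pandas as pd
from template_num.models_training import utils_models

df = pd.DataFrame({'feature': range(100), 'target': [0, 1] * 50})
df_train, df_test = utils_models.normal_split(df, test_size=0.25, seed=42)
print(df_train.shape, df_test.shape)  # roughly (75, 2) and (25, 2)
```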
predict(content, model, inference_batch_size=128, alternative_version=False, **kwargs)
Gets predictions of a model on a dataset
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | DataFrame | New dataset to be predicted | required |
model | ModelClass | Model to use | required |
Kwargs:
inference_batch_size (int): Approximate size of the batches
alternative_version (bool): If True, an alternative version (`tf.function` + `model.__call__`) is used. Should be faster with a low number of inputs. Only useful for Keras models. We advise you to set `alternative_version` to True for APIs, to avoid possible memory leaks with `model.predict` on the newest TensorFlow versions (https://github.com/tensorflow/tensorflow/issues/58676). Inference will probably be much faster too.
Returns:
REGRESSION:
float: prediction
MONO-LABEL CLASSIFICATION:
str: prediction
MULTI-LABELS CLASSIFICATION:
tuple: predictions
If several input elements -> list of the above
Source code in template_num/models_training/utils_models.py
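A minimal sketch; `some_model_dir` and `df_test` are placeholders:

```python
from template_num.models_training import utils_models

model, model_conf = utils_models.load_model('some_model_dir')  # placeholder name
predictions = utils_models.predict(df_test, model, inference_batch_size=128)

# For a Keras model served behind an API, the docstring advises:
predictions = utils_models.predict(df_test, model, alternative_version=True)
```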
predict_with_proba(content, model, inference_batch_size=128, alternative_version=False, **kwargs)
Gets probabilities predictions of a model on a dataset
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | DataFrame | New dataset to be predicted | required |
model | ModelClass | Model to use | required |
Kwargs:
inference_batch_size (int): Approximate size of the batches
alternative_version (bool): If True, an alternative version (`tf.function` + `model.__call__`) is used. Should be faster with a low number of inputs. Only useful for Keras models. We advise you to set `alternative_version` to True for APIs, to avoid possible memory leaks with `model.predict` on the newest TensorFlow versions (https://github.com/tensorflow/tensorflow/issues/58676). Inference will probably be much faster too.
Raises:
ValueError: If the model is not a classifier
Returns:
MONO-LABEL CLASSIFICATION:
List[str]: predictions
List[float]: probabilities
MULTI-LABELS CLASSIFICATION:
List[tuple]: predictions
List[tuple]: probabilities
Source code in template_num/models_training/utils_models.py
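A minimal sketch, with `model` and `df_test` as in the predict example above:

```python
from template_num.models_training import utils_models

# Only valid for classifiers (a ValueError is raised otherwise)
predictions, probas = utils_models.predict_with_proba(df_test, model)
```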
preprocess_model_multilabel(df, y_col, classes=None)
Prepares a dataframe for multi-labels classification
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training dataset. This dataset must already be preprocessed; see the example below | required |
y_col | str or int | Name of the column to be used for training - y | required |

For example, from the original docstring:

```python
# Group by & apply tuple to y_col
x_cols = [col for col in list(df.columns) if col != y_col]
df = pd.DataFrame(df.groupby(x_cols)[y_col].apply(tuple))
```
Kwargs:
classes (list): List of classes to consider
Returns:
DataFrame: Dataframe for training
list: List of 'y' columns
Source code in template_num/models_training/utils_models.py
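A sketch under the assumption that `df` was grouped as in the example above, so that the y column (here the placeholder name 'label') holds one tuple of labels per row:

```python
from template_num.models_training import utils_models

df_mlb, y_cols = utils_models.preprocess_model_multilabel(df, 'label')
print(y_cols)  # the 'y' columns to give to a multi-labels model
```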
remove_small_classes(df, col, min_rows=2)
Deletes the classes with too few elements
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
col | str or int | Column containing the classes | required |
Kwargs:
min_rows (int): Minimal number of rows a class must have to be kept (default: 2)
Raises:
ValueError: If min_rows is not positive
Returns:
pd.DataFrame: New dataset
Source code in template_num/models_training/utils_models.py
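For instance, assuming `df` has a class column named 'target' (a placeholder):

```python
from template_num.models_training import utils_models

# Keep only the classes of 'target' that have at least 5 rows
df_filtered = utils_models.remove_small_classes(df, 'target', min_rows=5)
```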
search_hp_cv(model_cls, model_params, hp_params, scoring_fn, kwargs_fit, n_splits=5)
Searches for hyperparameters - works only with classifiers!
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_cls | ? | Class of models on which to do a hyperparameters search | required |
model_params | dict | Set of "fixed" parameters of the model (e.g. x_col, y_col). Must contain 'multi_label' | required |
hp_params | dict | Set of "variable" parameters on which to do a hyperparameters search | required |
scoring_fn | str or func | Scoring function to maximize. This function must take as input a dictionary containing metrics, e.g. {'F1-Score': 0.85, 'Accuracy': 0.57, 'Precision': 0.64, 'Recall': 0.90} | required |
kwargs_fit | dict | Set of kwargs to input in the fit function. Must contain 'x_train' and 'y_train' | required |
Kwargs:
n_splits (int): Number of folds to use
Raises:
ValueError: If scoring_fn is not a known string
ValueError: If 'multi_label' is not a key in model_params
ValueError: If 'x_train' is not a key in kwargs_fit
ValueError: If 'y_train' is not a key in kwargs_fit
ValueError: If model_params and hp_params share some keys
ValueError: If hp_params values are not all the same length
ValueError: If the number of cross-validation splits is less than or equal to 1
Returns:
ModelClass: Best model, to be "fitted" on the dataset
Source code in template_num/models_training/utils_models.py
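A hedged sketch; `MyModel`, its import path, the hp_params names, and the 'f1' scoring string are placeholders and are not guaranteed to exist in the template:

```python
from template_num.models_training import utils_models
from my_project.models import MyModel  # hypothetical classifier class

kwargs_fit = {'x_train': x_train, 'y_train': y_train}
best_model = utils_models.search_hp_cv(
    model_cls=MyModel,
    model_params={'multi_label': False},                   # must contain 'multi_label'
    hp_params={'param_a': [1, 2], 'param_b': [0.1, 0.2]},  # same-length lists
    scoring_fn='f1',                                       # assumed known scoring string
    kwargs_fit=kwargs_fit,
    n_splits=5,
)
best_model.fit(**kwargs_fit)  # the returned model still has to be fitted
```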
stratified_split(df, col, test_size=0.25, seed=None)
Splits a DataFrame into train and test sets - Stratified strategy
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
col | str or int | Column on which to do the stratified split | required |
Kwargs:
test_size (float): Proportion of the dataset to put in the test set
seed (int): Random seed
Raises:
ValueError: If test_size is not between 0 and 1
Returns:
DataFrame: Train dataframe
DataFrame: Test dataframe
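For instance, assuming `df` has a class column named 'target' (a placeholder):

```python
from template_num.models_training import utils_models

# Class proportions of 'target' are preserved in both splits
df_train, df_test = utils_models.stratified_split(df, 'target', test_size=0.25, seed=42)
```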