Utils models
display_train_test_shape(df_train, df_test, df_shape=None)
Displays the size of a train/test split
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df_train | DataFrame | Train dataset | required |
df_test | DataFrame | Test dataset | required |
Kwargs:
df_shape (int): Size of the initial dataset
Raises:
ValueError: If the object df_shape is not positive
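As a rough illustration of the kind of summary such a function can report, the sketch below computes the share of rows in each set. The helper name `describe_split` and the message format are hypothetical and are not taken from the template; only the parameters and the ValueError condition come from the description above.

```python
import pandas as pd


def describe_split(df_train: pd.DataFrame, df_test: pd.DataFrame, df_shape=None) -> str:
    """Hypothetical helper: summarise the sizes of a train/test split."""
    if df_shape is not None and df_shape < 1:
        raise ValueError("df_shape must be positive")
    # Assumption: fall back to the combined size when the initial size is not given
    total = df_shape if df_shape is not None else len(df_train) + len(df_test)
    return (
        f"train: {len(df_train)} rows ({len(df_train) / total:.0%}), "
        f"test: {len(df_test)} rows ({len(df_test) / total:.0%})"
    )


df_train = pd.DataFrame({"x": range(75)})
df_test = pd.DataFrame({"x": range(25)})
summary = describe_split(df_train, df_test)
print(summary)
```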
Source code in template_nlp/models_training/utils_models.py
get_embedding(embedding_name='cc.fr.300.pkl')
Loads an embedding previously saved as a .pkl file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
embedding_name | str | Name of the embedding file (actually a path relative to template_nlp-data) | 'cc.fr.300.pkl' |
Raises:
FileNotFoundError: If the embedding file does not exist in template_nlp-data
Returns:
dict: Loaded embedding
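The loading step can be pictured as a plain pickle round trip. This is a minimal sketch under assumptions: the helper `load_pickled_embedding`, the explicit `data_dir` argument, and the word-to-vector layout of the dictionary are all hypothetical; the template resolves the file relative to template_nlp-data on its own.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled_embedding(data_dir: Path, embedding_name: str = "cc.fr.300.pkl") -> dict:
    """Hypothetical loader: read a pickled dict from data_dir / embedding_name."""
    embedding_path = data_dir / embedding_name
    if not embedding_path.exists():
        raise FileNotFoundError(f"The embedding file {embedding_path} does not exist")
    with open(embedding_path, "rb") as f:
        return pickle.load(f)


# Round trip with a toy embedding written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    toy = {"bonjour": [0.1, 0.2], "monde": [0.3, 0.4]}  # assumed {word: vector} layout
    with open(data_dir / "toy.pkl", "wb") as f:
        pickle.dump(toy, f)
    loaded = load_pickled_embedding(data_dir, "toy.pkl")
    try:
        load_pickled_embedding(data_dir, "missing.pkl")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Only unpickle files you trust: `pickle.load` can execute arbitrary code contained in the file.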
Source code in template_nlp/models_training/utils_models.py
hierarchical_split(df, col, test_size=0.25, seed=None)
Splits a DataFrame into train and test sets - Hierarchical strategy
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
col | str or int | Column on which to do the hierarchical split | required |
Kwargs:
test_size (float): Proportion representing the size of the expected test set
seed (int): Random seed
Raises:
ValueError: If the object test_size is not between 0 and 1
Returns:
DataFrame: Train dataframe
DataFrame: Test dataframe
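The description does not spell out what the hierarchical strategy does. One plausible reading, shown here purely as an assumption, is a group-wise split: the distinct values of col are split, so that all rows sharing a value end up on the same side. The helper name `group_wise_split` is hypothetical and the template may behave differently.

```python
import pandas as pd


def group_wise_split(df: pd.DataFrame, col, test_size: float = 0.25, seed=None):
    """Sketch (assumed behaviour): keep all rows sharing a value of `col` together."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    # Sample the distinct values of `col`, not the rows themselves
    values = pd.Series(df[col].unique())
    test_values = values.sample(frac=test_size, random_state=seed)
    in_test = df[col].isin(test_values)
    return df[~in_test], df[in_test]


df = pd.DataFrame({"author": list("aabbccdd"), "text": range(8)})
df_train, df_test = group_wise_split(df, "author", test_size=0.25, seed=0)
```

Under this reading, test_size applies to the number of distinct values rather than to the number of rows, so the row proportions are only approximate when groups differ in size.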
Source code in template_nlp/models_training/utils_models.py
load_model(model_dir, is_path=False, **kwargs)
Loads a model from a path or a model name
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_dir | str | Name of the folder containing the model (e.g. model_autres_2019_11_07-13_43_19). It can also be an absolute path if is_path is set to True. | required |
Kwargs:
is_path (bool): Whether model_dir is a folder path rather than a model name (allows loading a model from anywhere)
Returns:
ModelClass: The loaded model
dict: The model configurations
Source code in template_nlp/models_training/utils_models.py
normal_split(df, test_size=0.25, seed=None)
Splits a DataFrame into train and test sets
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
Kwargs:
test_size (float): Proportion representing the size of the expected test set
seed (int): Random seed
Raises:
ValueError: If the object test_size is not between 0 and 1
Returns:
DataFrame: Train dataframe
DataFrame: Test dataframe
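A plain random split with the same parameters can be sketched with pandas alone. The helper name `random_split` is hypothetical, and the template may well rely on another library (for example scikit-learn) internally; this only illustrates the contract described above.

```python
import pandas as pd


def random_split(df: pd.DataFrame, test_size: float = 0.25, seed=None):
    """Sketch: draw a random fraction of rows as test set, keep the rest for training."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    df_test = df.sample(frac=test_size, random_state=seed)
    # Dropping by index label assumes the index of df is unique
    df_train = df.drop(df_test.index)
    return df_train, df_test


df = pd.DataFrame({"text": [f"doc {i}" for i in range(100)]})
df_train, df_test = random_split(df, test_size=0.25, seed=42)
df_train_again, df_test_again = random_split(df, test_size=0.25, seed=42)
```

Passing the same seed twice yields the same split, which is what makes experiments reproducible.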
Source code in template_nlp/models_training/utils_models.py
predict(content, model, model_conf, inference_batch_size=128, alternative_version=False, **kwargs)
Gets predictions of a model on a content
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[str, list] | New content to be predicted | required |
model | ModelClass | Model to use | required |
model_conf | dict | Model configurations | required |
Kwargs:
inference_batch_size (int): Approximate size of the batches
alternative_version (bool): Whether to use an alternative version (tf.function + model.__call__). It should be faster with a small number of inputs and is only useful for Keras models. We advise setting alternative_version to True for APIs to avoid possible memory leaks with model.predict on recent TensorFlow versions (https://github.com/tensorflow/tensorflow/issues/58676). Inference will probably be much faster too.
Returns:
list: Predictions, as a list of strings for mono-label classification or a list of tuples for multi-label classification
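The batch size is described as approximate. One way this can arise, shown here only as an assumption about the behaviour, is to pick the number of batches first and then spread the inputs as evenly as possible. The helper `make_batches` is hypothetical and not part of the template.

```python
import math


def make_batches(content: list, inference_batch_size: int = 128) -> list:
    """Sketch: split content into near-equal batches of roughly inference_batch_size."""
    n_batches = max(1, math.ceil(len(content) / inference_batch_size))
    # Spread the remainder over the first batches so sizes differ by at most one
    base, extra = divmod(len(content), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + base + (1 if i < extra else 0)
        batches.append(content[start:end])
        start = end
    return batches


texts = [f"document {i}" for i in range(300)]
batches = make_batches(texts, inference_batch_size=128)
batch_sizes = [len(batch) for batch in batches]
```

With 300 inputs and a requested size of 128, this yields three batches of 100 rather than two of 128 and one of 44.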
Source code in template_nlp/models_training/utils_models.py
predict_with_proba(content, model, model_conf, inference_batch_size=128, alternative_version=False, **kwargs)
Gets predictions of a model on a content, with probabilities
Parameters:
Name | Type | Description | Default |
---|---|---|---|
content | Union[str, list] | New content to be predicted | required |
model | ModelClass | Model to use | required |
model_conf | dict | Model configurations | required |
Kwargs:
inference_batch_size (int): Approximate size of the batches
alternative_version (bool): Whether to use an alternative version (tf.function + model.__call__). It should be faster with a small number of inputs and is only useful for Keras models. We advise setting alternative_version to True for APIs to avoid possible memory leaks with model.predict on recent TensorFlow versions (https://github.com/tensorflow/tensorflow/issues/58676). Inference will probably be much faster too.
Returns:
MONO-LABEL CLASSIFICATION:
List[str]: predictions
List[float]: probabilities
MULTI-LABELS CLASSIFICATION:
List[tuple]: predictions
List[tuple]: probabilities
Source code in template_nlp/models_training/utils_models.py
preprocess_model_multilabel(df, y_col, classes=None)
Prepares a dataframe for a multi-labels classification
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Training dataset. This dataset must be preprocessed. Example (group by and apply tuple to y_col): x_cols = [col for col in list(df.columns) if col != y_col]; df = pd.DataFrame(df.groupby(x_cols)[y_col].apply(tuple)) | required |
y_col | str or int | Name of the column to be used for training - y | required |
Kwargs:
classes (list): List of classes to consider
Returns:
DataFrame: Dataframe for training
list: List of 'y' columns
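The preprocessing quoted in the description of df can be run on a toy dataset as follows. The second step, which adds one 0/1 column per class, is an assumption inferred from the "List of 'y' columns" return value; it is a sketch of the idea, not the template's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["doc a", "doc a", "doc b", "doc c"],
    "label": ["sport", "news", "news", "sport"],
})
y_col = "label"

# Group by & apply tuple to y_col (the preprocessing expected before the call)
x_cols = [col for col in list(df.columns) if col != y_col]
df = pd.DataFrame(df.groupby(x_cols)[y_col].apply(tuple)).reset_index()

# Assumed output shape: one indicator column per class
classes = sorted({label for labels in df[y_col] for label in labels})
for cls in classes:
    df[cls] = df[y_col].apply(lambda labels, cls=cls: int(cls in labels))

print(df)
```

After the group-by, each row holds a single text together with the tuple of all its labels, which is the input format a multi-label model needs.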
Source code in template_nlp/models_training/utils_models.py
remove_small_classes(df, col, min_rows=2)
Deletes the classes with small numbers of elements
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
col | str or int | Column containing the classes | required |
Kwargs:
min_rows (int): Minimal number of lines in the training set (default: 2)
Raises:
ValueError: If the object min_rows is not positive
Returns:
pd.DataFrame: New dataset
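The filtering can be sketched with a value count. The helper name `drop_small_classes` is hypothetical, and treating min_rows as an inclusive threshold (classes with exactly min_rows rows are kept) is an assumption.

```python
import pandas as pd


def drop_small_classes(df: pd.DataFrame, col, min_rows: int = 2) -> pd.DataFrame:
    """Sketch: keep only rows whose class appears at least min_rows times."""
    if min_rows < 1:
        raise ValueError("min_rows must be positive")
    counts = df[col].value_counts()
    # Assumption: the threshold is inclusive
    kept_classes = counts[counts >= min_rows].index
    return df[df[col].isin(kept_classes)]


df = pd.DataFrame({"label": ["a", "a", "b", "c", "c", "c"], "x": range(6)})
filtered = drop_small_classes(df, "label", min_rows=2)
```

Removing classes with a single row is typically needed before a stratified split, since a class cannot be represented in both sets with only one example.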
Source code in template_nlp/models_training/utils_models.py
search_hp_cv(model_cls, model_params, hp_params, scoring_fn, kwargs_fit, n_splits=5)
Searches for hyperparameters
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_cls | ? | Class of models on which to do a hyperparameters search | required |
model_params | dict | Set of "fixed" parameters of the model (e.g. x_col, y_col). Must contain 'multi_label'. | required |
hp_params | dict | Set of "variable" parameters on which to do a hyperparameters search | required |
scoring_fn | str or func | Scoring function to maximize. This function must take as input a dictionary containing metrics, e.g. {'F1-Score': 0.85, 'Accuracy': 0.57, 'Precision': 0.64, 'Recall': 0.90} | required |
kwargs_fit | dict | Set of kwargs to input in the fit function. Must contain 'x_train' and 'y_train'. | required |
Kwargs:
n_splits (int): Number of folds to use
Raises:
ValueError: If scoring_fn is not a known string
ValueError: If multi_label is not a key in model_params
ValueError: If x_train is not a key in kwargs_fit
ValueError: If y_train is not a key in kwargs_fit
ValueError: If model_params and hp_params share some keys
ValueError: If hp_params values are not the same length
ValueError: If the number of cross-validation splits is less than or equal to 1
Returns:
ModelClass: Best model to be "fitted" on the dataset
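The input constraints listed under Raises can be sketched as below. The helper `build_candidates` is hypothetical. In particular, pairing the i-th value of every hp_params entry into one candidate (rather than taking a full grid) is an assumption inferred from the requirement that all values have the same length. The scoring function follows the documented contract: it receives a dictionary of metrics and returns the value to maximize.

```python
def build_candidates(model_params: dict, hp_params: dict, kwargs_fit: dict, n_splits: int = 5) -> list:
    """Sketch: validate the inputs, then build one parameter set per candidate."""
    if "multi_label" not in model_params:
        raise ValueError("model_params must contain 'multi_label'")
    for key in ("x_train", "y_train"):
        if key not in kwargs_fit:
            raise ValueError(f"kwargs_fit must contain '{key}'")
    shared = set(model_params) & set(hp_params)
    if shared:
        raise ValueError(f"model_params and hp_params share keys: {sorted(shared)}")
    lengths = {len(values) for values in hp_params.values()}
    if len(lengths) > 1:
        raise ValueError("hp_params values must all have the same length")
    if n_splits <= 1:
        raise ValueError("n_splits must be greater than 1")
    n_candidates = lengths.pop() if lengths else 0
    # Assumption: the i-th candidate takes the i-th value of every hyperparameter
    return [
        {**model_params, **{name: values[i] for name, values in hp_params.items()}}
        for i in range(n_candidates)
    ]


def f1_scoring_fn(metrics: dict) -> float:
    """Example scoring function: maximize the F1-Score entry of the metrics dict."""
    return metrics["F1-Score"]


model_params = {"x_col": "text", "y_col": "label", "multi_label": False}
hp_params = {"lr": [0.1, 0.01], "epochs": [5, 10]}  # illustrative hyperparameter names
candidates = build_candidates(model_params, hp_params, {"x_train": [], "y_train": []})
score = f1_scoring_fn({"F1-Score": 0.85, "Accuracy": 0.57, "Precision": 0.64, "Recall": 0.90})
```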
Source code in template_nlp/models_training/utils_models.py
stratified_split(df, col, test_size=0.25, seed=None)
Splits a DataFrame into train and test sets - Stratified strategy
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Dataframe containing the data | required |
col | str or int | Column on which to do the stratified split | required |
Kwargs:
test_size (float): Proportion representing the size of the expected test set
seed (int): Random seed
Raises:
ValueError: If the object test_size is not between 0 and 1
Returns:
DataFrame: Train dataframe
DataFrame: Test dataframe
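A stratified split keeps the class proportions of col roughly equal in both sets. The sketch below does this with pandas by sampling the same fraction inside each class; the helper name `stratified_sample_split` is hypothetical and the template may rely on a different implementation.

```python
import pandas as pd


def stratified_sample_split(df: pd.DataFrame, col, test_size: float = 0.25, seed=None):
    """Sketch: sample the same fraction of rows from every class of `col`."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    # GroupBy.sample draws the fraction within each group (pandas >= 1.1)
    df_test = df.groupby(col).sample(frac=test_size, random_state=seed)
    # Dropping by index label assumes the index of df is unique
    df_train = df.drop(df_test.index)
    return df_train, df_test


df = pd.DataFrame({"label": ["a"] * 8 + ["b"] * 4, "x": range(12)})
df_train, df_test = stratified_sample_split(df, "label", test_size=0.25, seed=0)
test_counts = df_test["label"].value_counts().to_dict()
```

Classes with very few rows can end up with no test example at all, which is one reason to filter small classes before splitting.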