kinactive.model module

Models’ interface and creation pipeline.

class kinactive.model.DFGClassifier(in_model: KinactiveClassifier, out_model: KinactiveClassifier, other_model: KinactiveClassifier, meta_model: KinactiveClassifier)[source]

Bases: ModelBase

A composite model encapsulating three binary classifiers, each predicting its own DFG conformation, and a logistic regression meta-classifier trained on the [in, other, out] probabilities.

Despite being a composite, it behaves like a regular model and provides an interface similar to that of KinactiveClassifier.

__init__(in_model: KinactiveClassifier, out_model: KinactiveClassifier, other_model: KinactiveClassifier, meta_model: KinactiveClassifier)[source]
cv(df: DataFrame, n: int, verbose: bool = True)[source]

Cross-validate the model.

Parameters:
  • df – Data to use for training/testing.

  • n – The number of CV folds.

  • verbose – Output progress bar.

Returns:

A performance estimate aggregated across testing folds.

cv_pred(df: DataFrame, n: int, verbose: bool = True)[source]

Cross-validate the score and predict the data in test folds.

Parameters:
  • df – Input data with features and target columns.

  • n – The number of CV folds to use.

  • verbose – Output progress bar.

Returns:

A tuple with the score and a copy of the supplied dataframe, with fold assignment and model prediction columns added.

generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]

Generate fold indices from the provided data.

Parameters:
  • df – DataFrame with predictors.

  • n – The number of folds.

Returns:

An iterator over tuples of train and test boolean indices that select the train and test observations from df.
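The shape of such boolean indices can be illustrated with a minimal pure-Python sketch. The helper name sketch_fold_masks and the round-robin fold assignment are assumptions made for illustration only; the library yields numpy arrays and may shuffle or stratify the rows differently.

```python
from typing import Iterator


def sketch_fold_masks(n_obs: int, n: int) -> Iterator[tuple[list[bool], list[bool]]]:
    """Yield (train_mask, test_mask) pairs for `n` folds over `n_obs` rows."""
    # Hypothetical round-robin assignment of rows to folds.
    fold_of_row = [i % n for i in range(n_obs)]
    for fold in range(n):
        # A row is a test observation in exactly one fold ...
        test_mask = [f == fold for f in fold_of_row]
        # ... and a train observation in all the others.
        train_mask = [not is_test for is_test in test_mask]
        yield train_mask, test_mask


masks = list(sketch_fold_masks(n_obs=6, n=3))
```

Each mask has one entry per row, so it can be used directly to select rows from the original table.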

predict(df: DataFrame) ndarray[source]

Predict the DFG class. 0 stands for DFGin, 1 for DFGout, and 2 for DFGinter.

Note

This is equivalent to calling predict_full() and selecting the relevant column.

Parameters:

df – A dataset to predict from. Must include all relevant variables.

Returns:

An array of predicted classes.
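A small sketch of decoding the integer classes into labels, using the mapping stated above. The names DFG_LABELS and decode_dfg are hypothetical helpers and not part of the module.

```python
# Class codes as documented for predict(): 0 = DFGin, 1 = DFGout, 2 = DFGinter.
DFG_LABELS = {0: "DFGin", 1: "DFGout", 2: "DFGinter"}


def decode_dfg(classes):
    """Map an iterable of integer class codes to their DFG labels."""
    return [DFG_LABELS[int(c)] for c in classes]


decoded = decode_dfg([0, 2, 1, 0])
```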

predict_full(df: DataFrame) DataFrame[source]

Predict all response variables.

Parameters:

df – A dataset to predict on. Must include all relevant variables.

Returns:

A copy of the df with predictions.

reinit_model()[source]

Reinitialize model.

score(df: DataFrame, **kwargs)[source]

Score the model.

KinactiveClassifier uses f1_score().

KinactiveRegressor uses r2_score().

Parameters:

df – Data to predict from.

Returns:

A single number – model’s performance estimate (the higher the better).

train(df: DataFrame)[source]
  1. Train the binary “in”, “out”, and “other” models.

  2. Use the trained models to predict their response variables.

  3. Use the predicted variables to train the meta model.

Parameters:

df – A dataset to train on. Must include all relevant variables.
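The three training steps can be sketched in pure Python as follows. This is an illustration of the stacking idea under stated assumptions, not the library's implementation: MeanStub, ArgmaxMeta, train_stacked, and the column names are all hypothetical, and rows are plain dicts instead of a DataFrame.

```python
class MeanStub:
    """Hypothetical binary model: predicts the training-set positive rate."""

    def __init__(self, target: str) -> None:
        self.target = target
        self.rate = 0.0

    def train(self, rows: list) -> None:
        self.rate = sum(row[self.target] for row in rows) / len(rows)

    def predict_proba(self, rows: list) -> list:
        return [self.rate for _ in rows]


class ArgmaxMeta:
    """Hypothetical meta model: picks the class with the largest probability."""

    def __init__(self, proba_names: list) -> None:
        self.proba_names = proba_names
        self.trained = False

    def train(self, rows: list) -> None:
        # A real meta-classifier would fit weights on the probability columns.
        self.trained = True

    def predict(self, rows: list) -> list:
        names = self.proba_names
        return [max(range(len(names)), key=lambda i: row[names[i]]) for row in rows]


def train_stacked(rows: list, base_models: list, meta: ArgmaxMeta) -> list:
    # 1. Train each binary model.
    for model in base_models:
        model.train(rows)
    # 2. Predict with the trained models and store the results as new columns.
    augmented = [dict(row) for row in rows]
    for model in base_models:
        for row, proba in zip(augmented, model.predict_proba(rows)):
            row[f"{model.target}_proba"] = proba
    # 3. Train the meta model on the predicted variables.
    meta.train(augmented)
    return augmented


rows = [
    {"is_in": 1, "is_out": 0, "is_other": 0},
    {"is_in": 1, "is_out": 0, "is_other": 0},
    {"is_in": 0, "is_out": 1, "is_other": 0},
    {"is_in": 0, "is_out": 0, "is_other": 1},
]
base_models = [MeanStub("is_in"), MeanStub("is_out"), MeanStub("is_other")]
meta = ArgmaxMeta(["is_in_proba", "is_out_proba", "is_other_proba"])
augmented = train_stacked(rows, base_models, meta)
```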

property dfg_features: list[str]
Returns:

A list of features used by the XGBoost binary “in”, “out”, and “other” models.

property meta_features: list[str]
Returns:

A list of features used by the “meta” LR classifier.

property proba_names: list[str]
Returns:

A list of column names of [in, out, other] probabilities.

property targets: list[str]
Returns:

A sequence of target variables.

class kinactive.model.DFGModels(in_, out, other, meta)

Bases: tuple

in_: KinactiveClassifier

Alias for field number 0

meta: KinactiveClassifier

Alias for field number 3

other: KinactiveClassifier

Alias for field number 2

out: KinactiveClassifier

Alias for field number 1
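A sketch of how a named tuple with these fields behaves, assuming standard tuple semantics. DFGModelsSketch is a hypothetical stand-in, and strings take the place of the classifier objects to keep the example self-contained.

```python
from typing import NamedTuple


class DFGModelsSketch(NamedTuple):
    """Hypothetical stand-in mirroring the field order (in_, out, other, meta)."""

    in_: str
    out: str
    other: str
    meta: str


models = DFGModelsSketch(in_="in-clf", out="out-clf", other="other-clf", meta="meta-clf")

# Fields are reachable by name, by position, or by unpacking.
in_model, out_model, other_model, meta_model = models
```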

class kinactive.model.EarlyStoppingCallback(early_stopping_rounds: int, direction: str = 'minimize')[source]

Bases: object

Early stopping callback for Optuna.

See https://github.com/optuna/optuna/issues/1001#issuecomment-862843041
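The general idea of such a callback can be sketched as below: stop the study once the best value has not improved for a given number of consecutive trials. EarlyStoppingSketch and FakeStudy are hypothetical names, and the bookkeeping shown is an assumption rather than the module's code. Only the best_value attribute and the stop() method of a study are relied upon, and FakeStudy imitates them so that the example runs without optuna.

```python
class EarlyStoppingSketch:
    """Stop a study after `early_stopping_rounds` trials without improvement."""

    def __init__(self, early_stopping_rounds: int, direction: str = "minimize") -> None:
        self.early_stopping_rounds = early_stopping_rounds
        self.direction = direction
        self.best = None
        self.stale_rounds = 0

    def _improved(self, value: float) -> bool:
        if self.best is None:
            return True
        return value < self.best if self.direction == "minimize" else value > self.best

    def __call__(self, study, trial) -> None:
        if self._improved(study.best_value):
            self.best = study.best_value
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
        if self.stale_rounds >= self.early_stopping_rounds:
            study.stop()


class FakeStudy:
    """Hypothetical stand-in exposing only `best_value` and `stop()`."""

    def __init__(self) -> None:
        self.best_value = float("inf")
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


study = FakeStudy()
callback = EarlyStoppingSketch(early_stopping_rounds=2)
n_trials_run = 0
for value in [3.0, 2.0, 2.5, 2.2, 2.1]:
    # Imitate a minimizing study that tracks its best objective value.
    study.best_value = min(study.best_value, value)
    callback(study, trial=None)
    n_trials_run += 1
    if study.stopped:
        break
```

With a real study, a callback of this kind is passed via the callbacks argument of Study.optimize().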

__call__(study: Study, trial: Trial) None[source]

Call self as a function.

__init__(early_stopping_rounds: int, direction: str = 'minimize') None[source]
class kinactive.model.KinactiveClassifier(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]

Bases: KinactiveModel

A model wrapper for classification objective.

generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]

Generate fold indices from the provided data.

Parameters:
  • df – DataFrame with predictors.

  • n – The number of folds.

Returns:

An iterator over tuples of train and test boolean indices that select the train and test observations from df.

predict_proba(df: DataFrame) ndarray[source]

Predict classes’ probabilities.

Parameters:

df – A tabular dataset with features and target columns.

Returns:

The array of predicted probabilities. Its shape depends on the number of targets and classes.

score(df: DataFrame, **kwargs) float[source]

Predict and score using the f1_score() function. For multiclass problems, the average is “micro” by default unless specified otherwise by kwargs.

Parameters:
  • df – A tabular dataset with features and target columns.

  • kwargs – Passed to the scoring function.

Returns:

The resulting score.
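For reference, a pure-Python sketch of the micro-averaged F1 for a single-label multiclass problem. The function micro_f1 is a hypothetical helper written for illustration; the class itself delegates to f1_score().

```python
def micro_f1(y_true: list, y_pred: list) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then combine."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for true, pred in zip(y_true, y_pred):
            if pred == label and true == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif true == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 1]
score = micro_f1(y_true, y_pred)
```

In this single-label setting every misclassification counts once as a false positive and once as a false negative, so the micro-averaged F1 coincides with accuracy (here, 4 correct out of 6).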

class kinactive.model.KinactiveModel(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]

Bases: ModelBase

An interface wrapper around the ML algorithm.

Its methods operate on a DataFrame, applying stored features() and targets() to obtain necessary variables.

See also

make() – a model’s creation pipeline.

__init__(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]
Parameters:
  • model – A model defining fit and predict methods. select_params() assumes it to be either XGBoost or LogisticRegressionClassifier.

  • targets – Target variables’ names.

  • features – Feature variables’ names.

  • params – Initial parameters for the model.

  • use_early_stopping – If True and the model is either XGBClassifier or XGBRegressor, train() splits the provided dataset into training and evaluation parts. It monitors the loss function on the evaluation part and stops adding new trees (thus finishing the training) if the loss has not improved for a number of consecutive steps. The number of early stopping rounds should be provided in params.

  • selector – The feature selector to use in select_params().

cv(df: DataFrame, n: int, verbose: bool = False) float[source]

Cross-validate the model.

Parameters:
  • df – Data to use for training/testing.

  • n – The number of CV folds.

  • verbose – Output progress bar.

Returns:

A performance estimate aggregated across testing folds.

cv_pred(df: DataFrame, n: int, verbose: bool = False) tuple[float, pandas.core.frame.DataFrame][source]

Cross-validate the score and predict the data in test folds.

Parameters:
  • df – Input data with features and target columns.

  • n – The number of CV folds to use.

  • verbose – Output progress bar.

Returns:

A tuple with the score and a copy of the supplied dataframe, with fold assignment and model prediction columns added.

predict(df: DataFrame) ndarray[source]

Make predictions from the provided data.

rank_features(features: Sequence[str] | None, **kwargs)[source]

Rank features using selector.rank().

Parameters:
  • features – A sequence of features. If not provided, will use features().

  • kwargs – Passed to selector.rank().

Returns:

A table with ranked features.

reinit_model()[source]

Reinitialize model.

select_features(df: DataFrame, **kwargs) eBoruta[source]

Select important features and store the selection to features().

Parameters:
  • df – A dataframe with features and targets.

  • kwargs – Passed to the selector.

Returns:

The selector.fit() output.

select_params(df: DataFrame, n_trials: int, direction: str = 'maximize', early_stopping_rounds: int = 0) Study[source]

Optimize hyperparameters.

Parameters:
  • df – Input data with features and target columns.

  • n_trials – The number of optimization rounds.

  • direction – “maximize” or “minimize” the objective.

  • early_stopping_rounds – The number of early stopping rounds to use. Zero means no early stopping.

Returns:

The Study instance from optuna.

train(df: DataFrame)[source]

Train the model on the entirety of the provided data.

property features: list[str]
Returns:

A list of features used to train the model.

property model
Returns:

Current model instance.

params

Model’s parameters.

selector

eBoruta instance

property targets: list[str]
Returns:

A sequence of target variables.

use_early_stopping

Use early stopping via eval set.

class kinactive.model.KinactiveRegressor(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]

Bases: KinactiveModel

A model wrapper for regression objective.

generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]

Generate fold indices from the provided data.

Parameters:
  • df – DataFrame with predictors.

  • n – The number of folds.

Returns:

An iterator over tuples of train and test boolean indices that select the train and test observations from df.

score(df: DataFrame, **kwargs) float[source]

Predict and score using the r2_score() function.

Parameters:
  • df – A tabular dataset with features and target columns.

  • kwargs – Passed to the scoring function.

Returns:

The resulting score.
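For reference, a pure-Python sketch of the coefficient of determination for a single target. The function r2 is a hypothetical helper written for illustration; the class itself delegates to r2_score().

```python
def r2(y_true: list, y_pred: list) -> float:
    """R^2 = 1 - SS_res / SS_tot for a single target variable."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])  # exact predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predicting the mean
```

Exact predictions give a score of one, while always predicting the mean of the observed values gives zero.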

class kinactive.model.ModelBase[source]

Bases: object

An abstract base class for model objects.

abstract cv(df: DataFrame, n: int) float[source]

Cross-validate the model.

Parameters:
  • df – Data to use for training/testing.

  • n – The number of CV folds.

Returns:

A performance estimate aggregated across testing folds.

abstract generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]

Generate fold indices from the provided data.

Parameters:
  • df – DataFrame with predictors.

  • n – The number of folds.

Returns:

An iterator over tuples of train and test boolean indices that select the train and test observations from df.

abstract predict(df: DataFrame)[source]

Make predictions from the provided data.

abstract reinit_model()[source]

Reinitialize model.

abstract score(df: DataFrame) float[source]

Score the model.

KinactiveClassifier uses f1_score().

KinactiveRegressor uses r2_score().

Parameters:

df – Data to predict from.

Returns:

A single number – model’s performance estimate (the higher the better).

abstract train(df: DataFrame)[source]

Train the model on the entirety of the provided data.

abstract property targets: Sequence[str]
Returns:

A sequence of target variables.

class kinactive.model.ModelT(*args, **kwargs)[source]

Bases: Protocol

A minimalistic model interface.

__init__(*args, **kwargs)
fit(x: DataFrame | ndarray, y: ndarray | Series, **kwargs) ModelT[source]

Fit the model.

predict(x: DataFrame | ndarray, **kwargs) ndarray[source]

Predict the results.

predict_proba(x: DataFrame | ndarray, **kwargs) ndarray[source]

Predict classes’ probabilities.
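A sketch of how a structural protocol of this kind works: any object defining the three methods satisfies it, with no inheritance required. ModelLike and MajorityClassifier are hypothetical names, Any replaces the DataFrame and ndarray annotations to keep the example free of dependencies, and the runtime_checkable decorator is added here only to allow the isinstance() check.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ModelLike(Protocol):
    """Hypothetical mirror of a minimal fit/predict/predict_proba interface."""

    def fit(self, x: Any, y: Any, **kwargs: Any) -> "ModelLike": ...

    def predict(self, x: Any, **kwargs: Any) -> Any: ...

    def predict_proba(self, x: Any, **kwargs: Any) -> Any: ...


class MajorityClassifier:
    """Toy model: always predicts the most frequent training label."""

    def fit(self, x, y, **kwargs):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        self.share_ = labels.count(self.majority_) / len(labels)
        return self

    def predict(self, x, **kwargs):
        return [self.majority_ for _ in x]

    def predict_proba(self, x, **kwargs):
        return [self.share_ for _ in x]


model = MajorityClassifier().fit([[0], [1], [2]], [1, 1, 0])
```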

class kinactive.model.ObjectiveFn(*args, **kwargs)[source]

Bases: Protocol

An objective function type.

__call__(trial: Trial, df: DataFrame, model: ModelBase, n_cv: int) float[source]

Call self as a function.

__init__(*args, **kwargs)
kinactive.model.lr_objective(trial: Trial, df: DataFrame, model: KinactiveModel, n_cv: int = 5, use_early_stopping: bool = False) float[source]

A default objective function for the logistic regression model.

It optimizes the following params:

C: [0.0, 1.0]
class_weight: [None, "balanced"]
solver: ["newton-cg", "sag", "saga", "lbfgs"]
multi_class: ["auto", "ovr", "multinomial"]

If solver == "saga", it encodes “l2” as the penalty parameter. Otherwise, it chooses between “l1”, “l2”, and “elasticnet”. If the latter is chosen, it additionally samples the l1_ratio parameter between zero and one.

The options max_iter and n_jobs are hard-coded to 1000 and -1.

After sampling, the process is identical to the xgb_objective().

Parameters:
  • trial – A trial instance used dynamically by optuna. Leave as is.

  • df – A dataset used to fit and test the model.

  • model – The model to optimize the params for.

  • n_cv – The number of CV folds to derive the score.

  • use_early_stopping – Passed to the model.

Returns:

The cross-validated score.

kinactive.model.make(df: DataFrame, targets: list[str], features: list[str], starting_params: dict[str, Any], use_early_stopping: bool = False, early_stopping_rounds_param_sel: int = 0, classifier: bool = True, n_trials_sel_1: int = 50, n_trials_sel_2: int = 50, n_final_cv: int = 10, boruta_kwargs: dict[str, Any] | None = None) tuple[kinactive.model.KinactiveClassifier | kinactive.model.KinactiveRegressor, float, pandas.core.frame.DataFrame][source]

A pipeline to make a new KinActive model. It comprises:

  1. Initializing the model using starting params.

  2. A parameter-selection run.

  3. A feature selection run.

  4. Another parameter selection run.

  5. Cross-validation and prediction on test folds.

  6. Training on the full dataset.
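The order of the steps above can be sketched with a stub that records which methods are called. The mapping of steps to the select_params(), select_features(), cv_pred(), and train() methods is an assumption based on the interface documented for KinactiveModel; RecordingModel and make_sketch are hypothetical and omit model initialization (step 1) and all real computation.

```python
class RecordingModel:
    """Hypothetical stub that records the pipeline calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def select_params(self, df, n_trials: int) -> None:
        self.calls.append(f"select_params({n_trials})")

    def select_features(self, df) -> None:
        self.calls.append("select_features")

    def cv_pred(self, df, n: int):
        self.calls.append(f"cv_pred({n})")
        return 0.0, df  # placeholder score and predictions

    def train(self, df) -> None:
        self.calls.append("train")


def make_sketch(df, model, n_trials_sel_1=50, n_trials_sel_2=50, n_final_cv=10):
    model.select_params(df, n_trials_sel_1)         # step 2
    model.select_features(df)                       # step 3
    model.select_params(df, n_trials_sel_2)         # step 4
    score, df_pred = model.cv_pred(df, n_final_cv)  # step 5
    model.train(df)                                 # step 6
    return model, score, df_pred


model, score, df_pred = make_sketch(df=[], model=RecordingModel())
```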

Parameters:
  • df – A table to train on.

  • targets – The names of the target columns.

  • features – The names of the feature columns.

  • starting_params – The starting model’s parameters.

  • use_early_stopping – Use early stopping to cap the number of trees. The early_stopping_rounds param may be provided via starting_params.

  • early_stopping_rounds_param_sel – The number of early stopping rounds for the hyperparameter optimization. 0 indicates no early stopping.

  • classifier – If True, assume classification objective and init the KinactiveClassifier. Otherwise, assume the regression and init the KinactiveRegressor.

  • n_trials_sel_1 – The number of parameter selection rounds before the feature selection.

  • n_trials_sel_2 – The number of parameter selection rounds after the feature selection.

  • n_final_cv – The number of CV folds for the final CV.

  • boruta_kwargs – Passed to the eBoruta feature selector.

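Returns:

A tuple with the model (KinactiveClassifier or KinactiveRegressor), a score, and a dataframe, as given by the return annotation.
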
kinactive.model.xgb_objective(trial: Trial, df: DataFrame, model: KinactiveModel, n_cv: int = 5, use_early_stopping: bool = False) float[source]

A default objective function for XGB models. It uses the following setup:

learning_rate:     [0, 1]
max_depth:         [4, 16]
gamma:             [0.0, 10.0]
reg_lambda:        [0.0, 10.0]
reg_alpha:         [0.0, 10.0]
colsample_bytree:  [0.4, 1.0]
colsample_bylevel: [0.4, 1.0]

Additionally, for the XGBClassifier it adds:

scale_pos_weight:  [0.0, 10.0]

After the parameters are sampled, they are combined with the existing model parameters via {**model.params, **params}. Then, the model is instantiated with the new parameters and cross-validated using KinactiveModel.cv().
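The merge expression follows standard Python dict unpacking: keys from the right-hand dict override those from the left-hand one, so freshly sampled values take precedence over existing model parameters, while parameters that were not sampled are kept. The parameter values below are hypothetical.

```python
# Hypothetical existing model parameters and freshly sampled values.
model_params = {"n_estimators": 200, "max_depth": 6, "learning_rate": 0.3}
sampled = {"max_depth": 10, "gamma": 1.5}

# Later unpacking wins on key collisions; the inputs are left unchanged.
merged = {**model_params, **sampled}
```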

Parameters:
  • trial – A trial instance used dynamically by optuna. Leave as is.

  • df – A dataset used to fit and test the model.

  • model – The model to optimize the params for.

  • n_cv – The number of CV folds to derive the score.

  • use_early_stopping – Passed to the model.

Returns:

The cross-validated score.