kinactive.model module
Models’ interface and creation pipeline.
- class kinactive.model.DFGClassifier(in_model: KinactiveClassifier, out_model: KinactiveClassifier, other_model: KinactiveClassifier, meta_model: KinactiveClassifier)[source]
Bases: ModelBase
A composite model encapsulating three binary classifiers, each predicting its own DFG conformation, and a logistic regression meta-classifier trained on the [in, other, out] probabilities.
Nevertheless, it behaves like a regular model, providing an interface similar to that of KinactiveClassifier.
- __init__(in_model: KinactiveClassifier, out_model: KinactiveClassifier, other_model: KinactiveClassifier, meta_model: KinactiveClassifier)[source]
- cv(df: DataFrame, n: int, verbose: bool = True)[source]
Cross-validate the model.
- Parameters:
df – Data to use for training/testing.
n – The number of CV folds.
- Returns:
A performance estimate aggregated across testing folds.
- cv_pred(df: DataFrame, n: int, verbose: bool = True)[source]
Cross-validate the score and predict the data in test folds.
- Parameters:
df – Input data with features and target columns.
n – The number of CV folds to use.
verbose – Output progress bar.
- Returns:
A tuple with score and a copy of the supplied dataframe with fold assignment and model prediction columns added.
- generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]
Generate fold indices from the provided data.
- Parameters:
df – DataFrame with predictors.
n – The number of folds.
- Returns:
An iterator over tuples with train and test boolean indices allowing to select train and test observations from df.
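A minimal sketch of what such boolean fold indices can look like. It is not the library's implementation: the helper name `kfold_masks`, the plain shuffled split, and the use of Python lists in place of `numpy.ndarray` are assumptions made for illustration, since the actual splitting strategy is not documented here.

```python
import random
from typing import Iterator


def kfold_masks(
    n_obs: int, n: int, seed: int = 0
) -> Iterator[tuple[list[bool], list[bool]]]:
    """Yield (train_mask, test_mask) pairs for n folds over n_obs rows.

    Hypothetical helper: plain shuffled k-fold, not kinactive's own logic.
    """
    order = list(range(n_obs))
    random.Random(seed).shuffle(order)
    for fold in range(n):
        # Every n-th shuffled position goes to the current test fold.
        test_rows = set(order[fold::n])
        test_mask = [i in test_rows for i in range(n_obs)]
        train_mask = [not is_test for is_test in test_mask]
        yield train_mask, test_mask


rows = ["a", "b", "c", "d", "e", "f", "g"]
folds = list(kfold_masks(len(rows), n=3))
# Boolean masks select observations, much like df[train_mask] would.
train_rows = [r for r, keep in zip(rows, folds[0][0]) if keep]
test_rows = [r for r, keep in zip(rows, folds[0][1]) if keep]
```

Each observation lands in exactly one test fold, and the train mask is the complement of the test mask within every fold.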
- predict(df: DataFrame) ndarray[source]
Predict the DFG class.
0 stands for DFGin, 1 for DFGout, and 2 for DFGinter.
Note
This is equivalent to predict_full() and selecting the relevant column.
- Parameters:
df – A dataset to predict from. Must include all relevant variables.
- Returns:
An array of predicted classes.
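The class encoding can be made explicit with a small lookup. The integer-to-conformation mapping comes from the description of predict(); the names `DFG_LABELS` and `decode` are hypothetical and not part of `kinactive`.

```python
# Integer class -> DFG conformation, as documented for predict().
DFG_LABELS = {0: "DFGin", 1: "DFGout", 2: "DFGinter"}


def decode(predicted: list[int]) -> list[str]:
    """Translate predicted integer classes into conformation names."""
    return [DFG_LABELS[int(c)] for c in predicted]


# Stand-in for the array returned by DFGClassifier.predict(df).
example = decode([0, 2, 1, 0])
```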
- predict_full(df: DataFrame) DataFrame[source]
Predict all response variables.
- Parameters:
df – A dataset to predict on. Must include all relevant variables.
- Returns:
A copy of df with predictions.
- score(df: DataFrame, **kwargs)[source]
Score the model.
KinactiveClassifier uses f1_score(). KinactiveRegressor uses r2_score().
- Parameters:
df – Data to predict from.
- Returns:
A single number – model’s performance estimate (the higher the better).
- train(df: DataFrame)[source]
Train models.
Use trained models to predict their response variables.
Use predicted variables to train the meta model.
- Parameters:
df – A dataset to train on. Must include all relevant variables.
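The three training steps can be sketched as a data flow with toy stand-ins. Everything here is an assumption for illustration: `ThresholdModel` replaces the XGBoost binary classifiers, `LookupMeta` replaces the logistic regression meta-classifier, and the feature names are made up. Only the order of operations follows the description.

```python
from statistics import mean


class ThresholdModel:
    """Toy binary model on one feature (stand-in for an XGBoost classifier)."""

    def __init__(self, feature: str):
        self.feature = feature
        self.cutoff = 0.0

    def train(self, rows, labels):
        # Use the midpoint between the two class means as the decision cutoff.
        pos = [r[self.feature] for r, y in zip(rows, labels) if y]
        neg = [r[self.feature] for r, y in zip(rows, labels) if not y]
        self.cutoff = (mean(pos) + mean(neg)) / 2

    def predict_proba(self, rows):
        return [1.0 if r[self.feature] > self.cutoff else 0.0 for r in rows]


class LookupMeta:
    """Toy meta-model: remembers the most common class per probability pattern."""

    def train(self, probas, classes):
        seen = {}
        for p, c in zip(probas, classes):
            seen.setdefault(tuple(p), []).append(c)
        self.table = {p: max(set(cs), key=cs.count) for p, cs in seen.items()}

    def predict(self, probas):
        return [self.table[tuple(p)] for p in probas]


rows = [
    {"x_in": 0.9, "x_out": 0.1, "x_other": 0.2},
    {"x_in": 0.8, "x_out": 0.2, "x_other": 0.1},
    {"x_in": 0.1, "x_out": 0.9, "x_other": 0.3},
    {"x_in": 0.2, "x_out": 0.8, "x_other": 0.2},
    {"x_in": 0.3, "x_out": 0.2, "x_other": 0.9},
    {"x_in": 0.2, "x_out": 0.3, "x_other": 0.8},
]
classes = [0, 0, 1, 1, 2, 2]  # 0 = in, 1 = out, 2 = other

base = [ThresholdModel("x_in"), ThresholdModel("x_out"), ThresholdModel("x_other")]
# 1. Train each binary model on "is this observation of class k?" labels.
for k, model in enumerate(base):
    model.train(rows, [c == k for c in classes])
# 2. Use the trained models to predict their response variables.
probas = [list(p) for p in zip(*(m.predict_proba(rows) for m in base))]
# 3. Train the meta-model on the predicted variables.
meta = LookupMeta()
meta.train(probas, classes)
predicted = meta.predict(probas)
```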
- property dfg_features: list[str]
- Returns:
A list of features used by the XGBoost binary “in”, “out”, and “other” models.
- property meta_features: list[str]
- Returns:
A list of features used by the “meta” LR classifier.
- property proba_names: list[str]
- Returns:
A list of column names of [in, out, other] probabilities.
- property targets: list[str]
- Returns:
A sequence of target variables.
- class kinactive.model.DFGModels(in_, out, other, meta)
Bases: tuple
- in_: KinactiveClassifier
Alias for field number 0
- meta: KinactiveClassifier
Alias for field number 3
- other: KinactiveClassifier
Alias for field number 2
- out: KinactiveClassifier
Alias for field number 1
- class kinactive.model.EarlyStoppingCallback(early_stopping_rounds: int, direction: str = 'minimize')[source]
Bases: object
Early stopping callback for Optuna.
See https://github.com/optuna/optuna/issues/1001#issuecomment-862843041
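A minimal sketch of how such a callback can work, assuming it follows the usual Optuna callback convention of being called with `(study, trial)` after each trial and relies on `study.best_value` and `study.stop()`. The class name `EarlyStopping` and the `FakeStudy` stand-in are hypothetical; the real class may differ in detail.

```python
import math
import operator


class EarlyStopping:
    """Stop a study once the best value stops improving for a number of trials."""

    def __init__(self, early_stopping_rounds: int, direction: str = "minimize"):
        if direction not in ("minimize", "maximize"):
            raise ValueError(f"invalid direction: {direction}")
        self.early_stopping_rounds = early_stopping_rounds
        self._better = operator.lt if direction == "minimize" else operator.gt
        self._best = math.inf if direction == "minimize" else -math.inf
        self._stale = 0  # consecutive trials without improvement

    def __call__(self, study, trial) -> None:
        if self._better(study.best_value, self._best):
            self._best, self._stale = study.best_value, 0
        else:
            self._stale += 1
        if self._stale >= self.early_stopping_rounds:
            study.stop()


class FakeStudy:
    """Stand-in exposing only the attributes the callback touches."""

    def __init__(self):
        self.best_value = math.inf
        self.stopped = False

    def stop(self):
        self.stopped = True


study, callback = FakeStudy(), EarlyStopping(early_stopping_rounds=2)
trials_run = 0
for value in [0.9, 0.5, 0.6, 0.7, 0.1]:
    study.best_value = min(study.best_value, value)
    callback(study, trial=None)
    trials_run += 1
    if study.stopped:
        break
```

With two stopping rounds, the loop ends after the fourth value because the third and fourth fail to improve on 0.5, so the final 0.1 is never evaluated.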
- class kinactive.model.KinactiveClassifier(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]
Bases: KinactiveModel
A model wrapper for classification objective.
- generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]
Generate fold indices from the provided data.
- Parameters:
df – DataFrame with predictors.
n – The number of folds.
- Returns:
An iterator over tuples with train and test boolean indices allowing to select train and test observations from df.
- predict_proba(df: DataFrame) ndarray[source]
Predict classes’ probabilities.
- Parameters:
df – A tabular dataset with features and target columns.
- Returns:
The array of predicted probabilities. Its shape depends on the number of targets and classes.
- score(df: DataFrame, **kwargs) float[source]
Predict and score using the f1_score() function. For multiclass problems, the average is “micro” by default unless specified otherwise by kwargs.
- Parameters:
df – A tabular dataset with features and target columns.
kwargs – Passed to the scoring function.
- Returns:
The resulting score.
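For intuition, micro-averaged F1 pools true positives, false positives, and false negatives over all classes before computing the score. The pure-Python sketch below is an illustration only (the helper `micro_f1` is hypothetical; the library delegates to `f1_score()`). It also shows that for single-label multiclass data the micro average coincides with accuracy.

```python
from collections import Counter


def micro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted wrongly
            fn[t] += 1  # true class t was missed
    tp_sum, fp_sum, fn_sum = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = micro_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```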
- class kinactive.model.KinactiveModel(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]
Bases: ModelBase
An interface wrapper around the ML algorithm.
Its methods operate on a DataFrame, applying stored features() and targets() to obtain necessary variables.
See also
make() – a model’s creation pipeline.
- __init__(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]
- Parameters:
model – A model defining fit and predict methods. select_params() assumes it to be either XGBoost or LogisticRegressionClassifier.
targets – Target variables’ names.
features – Feature variables’ names.
params – Initial parameters for the model.
use_early_stopping – If True and the model is either XGBClassifier or XGBRegressor, train() will split the provided dataset into training and evaluation parts, use the evaluation part to monitor the loss function, and stop adding new trees (thus finishing training) if the loss has not improved for a number of consecutive steps. The number of early stopping rounds should be provided in params.
selector – The feature selector to use in select_params().
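The stopping rule described for use_early_stopping can be sketched independently of XGBoost. The function below is a hypothetical illustration: it takes the sequence of losses measured on the held-out evaluation part after each boosting round and reports how many rounds would be performed. How the dataset is actually split is not specified here.

```python
def rounds_before_stop(eval_losses: list[float], early_stopping_rounds: int) -> int:
    """Return the number of boosting rounds performed before training stops.

    eval_losses[i] is the evaluation-set loss after round i + 1.
    """
    best, stale = float("inf"), 0
    for round_no, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best, stale = loss, 0  # improvement resets the counter
        else:
            stale += 1
        if stale >= early_stopping_rounds:
            return round_no
    return len(eval_losses)


losses = [0.70, 0.55, 0.50, 0.51, 0.52, 0.53, 0.40]
rounds = rounds_before_stop(losses, early_stopping_rounds=3)
all_rounds = rounds_before_stop(losses, early_stopping_rounds=10)
```

Here training stops after round 6: rounds 4 to 6 all fail to beat the loss of 0.50 reached in round 3, so the later improvement to 0.40 is never seen.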
- cv(df: DataFrame, n: int, verbose: bool = False) float[source]
Cross-validate the model.
- Parameters:
df – Data to use for training/testing.
n – The number of CV folds.
- Returns:
A performance estimate aggregated across testing folds.
- cv_pred(df: DataFrame, n: int, verbose: bool = False) tuple[float, pandas.core.frame.DataFrame][source]
Cross-validate the score and predict the data in test folds.
- Parameters:
df – Input data with features and target columns.
n – The number of CV folds to use.
verbose – Output progress bar.
- Returns:
A tuple with score and a copy of the supplied dataframe with fold assignment and model prediction columns added.
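A rough sketch of the cv_pred idea, using a list of dicts in place of a DataFrame and a majority-class predictor in place of a real model. The column names `fold` and `pred`, the round-robin fold assignment, and the use of accuracy as the score are all assumptions for illustration.

```python
from collections import Counter


def cv_pred_sketch(rows: list[dict], target: str, n: int) -> tuple[float, list[dict]]:
    """Cross-validate a majority-class 'model' and predict each row in its test fold."""
    out = [dict(r) for r in rows]  # copy, so the input is left untouched
    for i, row in enumerate(out):
        row["fold"] = i % n  # hypothetical column name
    for fold in range(n):
        # "Train" on all other folds: find the most common target value.
        train = [r[target] for r in out if r["fold"] != fold]
        majority = Counter(train).most_common(1)[0][0]
        for r in out:
            if r["fold"] == fold:
                r["pred"] = majority  # hypothetical column name
    score = sum(r["pred"] == r[target] for r in out) / len(out)
    return score, out


data = [{"x": i, "y": int(i >= 1)} for i in range(6)]
score, annotated = cv_pred_sketch(data, target="y", n=3)
```

Every row receives exactly one prediction, made by a model that never saw that row during training.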
- rank_features(features: Sequence[str] | None, **kwargs)[source]
Rank features using selector.rank().
- Parameters:
features – A sequence of features. If not provided, will use features().
kwargs – Passed to selector.rank().
- Returns:
A table with ranked features.
- select_features(df: DataFrame, **kwargs) eBoruta[source]
Select important features and store the selection to features().
- Parameters:
df – A dataframe with features and targets.
kwargs – Passed to the selector.
- Returns:
The selector.fit() output.
- select_params(df: DataFrame, n_trials: int, direction: str = 'maximize', early_stopping_rounds: int = 0) Study[source]
Optimize hyperparameters.
- Parameters:
df – Input data with features and target columns.
n_trials – The number of optimization rounds.
direction – “maximize” or “minimize” the objective.
early_stopping_rounds – The number of early stopping rounds to use. Zero means no early stopping.
- Returns:
The Study instance from optuna.
- property features: list[str]
- Returns:
A list of features used to train the model.
- property model
- Returns:
Current model instance.
- params
Model’s parameters.
- selector
eBoruta instance
- property targets: list[str]
- Returns:
A sequence of target variables.
- use_early_stopping
Use early stopping via eval set.
- class kinactive.model.KinactiveRegressor(model: ModelT, targets: Iterable[str], features: Iterable[str] = (), params: dict[str, Any] | None = None, use_early_stopping: bool = False, selector: eBoruta | None = None)[source]
Bases: KinactiveModel
A model wrapper for regression objective.
- generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]
Generate fold indices from the provided data.
- Parameters:
df – DataFrame with predictors.
n – The number of folds.
- Returns:
An iterator over tuples with train and test boolean indices allowing to select train and test observations from df.
- class kinactive.model.ModelBase[source]
Bases: object
An abstract base class for model objects.
- abstract cv(df: DataFrame, n: int) float[source]
Cross-validate the model.
- Parameters:
df – Data to use for training/testing.
n – The number of CV folds.
- Returns:
A performance estimate aggregated across testing folds.
- abstract generate_fold_idx(df: DataFrame, n: int) Iterator[tuple[numpy.ndarray, numpy.ndarray]][source]
Generate fold indices from the provided data.
- Parameters:
df – DataFrame with predictors.
n – The number of folds.
- Returns:
An iterator over tuples with train and test boolean indices allowing to select train and test observations from df.
- abstract score(df: DataFrame) float[source]
Score the model.
KinactiveClassifier uses f1_score(). KinactiveRegressor uses r2_score().
- Parameters:
df – Data to predict from.
- Returns:
A single number – model’s performance estimate (the higher the better).
- abstract property targets: Sequence[str]
- Returns:
A sequence of target variables.
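The abstract interface can be mirrored with the standard `abc` module. `ModelBaseSketch` and `ConstantModel` are hypothetical re-creations for illustration, not the library's classes. The point is that a concrete model must override all four abstract members before it can be instantiated.

```python
from abc import ABC, abstractmethod
from collections.abc import Iterator, Sequence


class ModelBaseSketch(ABC):
    """Hypothetical re-creation of the documented abstract interface."""

    @abstractmethod
    def cv(self, df, n: int) -> float: ...

    @abstractmethod
    def generate_fold_idx(self, df, n: int) -> Iterator[tuple]: ...

    @abstractmethod
    def score(self, df) -> float: ...

    @property
    @abstractmethod
    def targets(self) -> Sequence[str]: ...


class ConstantModel(ModelBaseSketch):
    """Smallest concrete subclass: every abstract member is overridden."""

    def cv(self, df, n):
        return 0.0

    def generate_fold_idx(self, df, n):
        return iter(())

    def score(self, df):
        return 0.0

    @property
    def targets(self):
        return ["y"]


try:
    ModelBaseSketch()  # abstract classes cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False
model = ConstantModel()
```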
- class kinactive.model.ModelT(*args, **kwargs)[source]
Bases: Protocol
A minimalistic model interface.
- __init__(*args, **kwargs)
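A sketch of what a minimal structural model interface can look like with `typing.Protocol`. It assumes the interface boils down to `fit` and `predict` methods, as the KinactiveModel documentation suggests. `ModelLike` and `MeanPredictor` are hypothetical names, and the real protocol's members may differ.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ModelLike(Protocol):
    """Hypothetical minimal model interface: anything with fit() and predict()."""

    def fit(self, X: Any, y: Any) -> Any: ...

    def predict(self, X: Any) -> Any: ...


class MeanPredictor:
    """Satisfies ModelLike structurally, without inheriting from it."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


fitted = MeanPredictor().fit([[1], [2], [3]], [1.0, 2.0, 6.0])
# The runtime check only verifies that the method names are present.
is_model = isinstance(fitted, ModelLike)
is_not_model = isinstance(object(), ModelLike)
```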
- class kinactive.model.ObjectiveFn(*args, **kwargs)[source]
Bases: Protocol
An objective function type.
- __call__(trial: Trial, df: DataFrame, model: ModelBase, n_cv: int) float[source]
Call self as a function.
- __init__(*args, **kwargs)
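Any callable matching the documented `__call__` signature qualifies as an objective. The sketch below mirrors that signature with a hypothetical `ObjectiveLike` protocol and a toy objective. The integer passed as `trial` is a stand-in for an Optuna trial, and `run` is a made-up driver rather than part of the library.

```python
from typing import Any, Protocol


class ObjectiveLike(Protocol):
    """Hypothetical mirror of the documented call signature."""

    def __call__(self, trial: Any, df: Any, model: Any, n_cv: int) -> float: ...


def toy_objective(trial: Any, df: Any, model: Any, n_cv: int = 5) -> float:
    """Toy score that peaks when the (integer) trial stand-in equals 2."""
    return 1.0 - abs(trial - 2) / 10


def run(objective: ObjectiveLike, n_trials: int) -> float:
    """Call the objective n_trials times and keep the best (maximal) score."""
    return max(
        objective(trial=i, df=None, model=None, n_cv=3) for i in range(n_trials)
    )


best = run(toy_objective, n_trials=4)
```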
- kinactive.model.lr_objective(trial: Trial, df: DataFrame, model: KinactiveModel, n_cv: int = 5, use_early_stopping: bool = False) float[source]
A default objective function for the logistic regression model.
It optimizes the following params:
C: [0.0, 1.0]
class_weight: [None, "balanced"]
solver: ["newton-cg", "sag", "saga", "lbfgs"]
multi_class: ["auto", "ovr", "multinomial"]
If solver == "saga", it sets the penalty parameter to “l2”. Otherwise, it chooses between “l1”, “l2”, and “elasticnet”. If the latter is chosen, it additionally samples the l1_ratio parameter between zero and one.
The options max_iter and n_jobs are hard-coded to 1000 and -1.
After sampling, the process is identical to that of xgb_objective().
- Parameters:
trial – A trial instance used dynamically by optuna. Leave as is.
df – A dataset used to fit and test the model.
model – The model to optimize the params for.
n_cv – The number of CV folds to derive the score.
use_early_stopping – Passed to the model.
- Returns:
The cross-validated score.
- kinactive.model.make(df: DataFrame, targets: list[str], features: list[str], starting_params: dict[str, Any], use_early_stopping: bool = False, early_stopping_rounds_param_sel: int = 0, classifier: bool = True, n_trials_sel_1: int = 50, n_trials_sel_2: int = 50, n_final_cv: int = 10, boruta_kwargs: dict[str, Any] | None = None) tuple[kinactive.model.KinactiveClassifier | kinactive.model.KinactiveRegressor, float, pandas.core.frame.DataFrame][source]
A pipeline to make a new KinActive model. It comprises:
Initializing the model using starting params.
A parameter selection run.
A feature selection run.
Another parameter selection run.
Cross-validating and predicting on test folds.
Training on the full dataset.
- Parameters:
df – A table to train on.
targets – The names of the target columns.
features – The names of the feature columns.
starting_params – The starting model’s parameters.
use_early_stopping – Use early stopping to cap the number of trees. The early_stopping_rounds param may be provided via starting_params.
early_stopping_rounds_param_sel – The number of early stopping rounds for the hyperparameter optimization. 0 indicates no early stopping.
classifier – If True, assume classification objective and init the KinactiveClassifier. Otherwise, assume the regression and init the KinactiveRegressor.
n_trials_sel_1 – The number of parameter selection rounds before the feature selection.
n_trials_sel_2 – The number of parameter selection rounds after the feature selection.
n_final_cv – The number of CV folds for the final CV.
boruta_kwargs – Passed to the eBoruta feature selector.
- Returns:
A tuple with the resulting model, a score, and a dataframe, as given by the return annotation.
- kinactive.model.xgb_objective(trial: Trial, df: DataFrame, model: KinactiveModel, n_cv: int = 5, use_early_stopping: bool = False) float[source]
A default objective function for XGB models. It uses the following setup:
learning_rate: [0, 1]
max_depth: [4, 16]
gamma: [0.0, 10.0]
reg_lambda: [0.0, 10.0]
reg_alpha: [0.0, 10.0]
colsample_bytree: [0.4, 1.0]
colsample_bylevel: [0.4, 1.0]
Additionally, for the XGBClassifier it adds:
scale_pos_weight: [0.0, 10.0]
After the parameters are sampled, they are combined with the existing model parameters via {**model.params, **params}. Then, the model is instantiated with the new parameters and cross-validated using KinactiveModel.cv().
- Parameters:
trial – A trial instance used dynamically by optuna. Leave as is.
df – A dataset used to fit and test the model.
model – The model to optimize the params for.
n_cv – The number of CV folds to derive the score.
use_early_stopping – Passed to the model.
- Returns:
The cross-validated score.
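The sampling and merging steps can be sketched without Optuna or XGBoost. In this illustration `random.Random` stands in for the Optuna trial, uniform draws over the listed ranges are an assumption (the real objective may sample differently), and `sample_xgb_params` and `existing` are hypothetical names. The merge expression itself is the one quoted in the description.

```python
import random
from typing import Any


def sample_xgb_params(rng: random.Random, classifier: bool) -> dict[str, Any]:
    """Draw one configuration from the documented ranges (uniform draws assumed)."""
    params: dict[str, Any] = {
        "learning_rate": rng.uniform(0.0, 1.0),
        "max_depth": rng.randint(4, 16),
        "gamma": rng.uniform(0.0, 10.0),
        "reg_lambda": rng.uniform(0.0, 10.0),
        "reg_alpha": rng.uniform(0.0, 10.0),
        "colsample_bytree": rng.uniform(0.4, 1.0),
        "colsample_bylevel": rng.uniform(0.4, 1.0),
    }
    if classifier:
        # Extra parameter sampled only for the classifier.
        params["scale_pos_weight"] = rng.uniform(0.0, 10.0)
    return params


existing = {"n_estimators": 500, "max_depth": 6}  # stand-in for model.params
sampled = sample_xgb_params(random.Random(0), classifier=True)
# Later keys win, so sampled values override existing ones with the same name.
merged = {**existing, **sampled}
regressor_params = sample_xgb_params(random.Random(1), classifier=False)
```

Because the sampled dictionary is unpacked last, a sampled `max_depth` replaces the pre-existing one, while parameters that are not sampled (here `n_estimators`) are carried over unchanged.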