Calculating default variables

In this notebook, we’ll calculate descriptor (variable) sets 1-4:

1. Canonical (UniProt) sequence variables.
2. Structure (PDB) sequence variables.
3. Structure variables (angles, distances, etc.).
4. Ligand variables.

[2]:
import logging
import warnings
from random import sample
from pathlib import Path

# Suppress import warnings
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from kinactive import DefaultFeatures, DB, DBConfig

Provide general configuration.

[3]:
logging.basicConfig(level=logging.INFO)
[4]:
N_PROC = 20
N_CHAINS = 20  # Restrict the number of chains for demonstration

BASE = Path('../data/variable_sets')
BASE.mkdir(exist_ok=True)

DB_PATH = Path('../data/db_v3')
[5]:
paths = list(DB_PATH.glob('*'))
if N_CHAINS is not None:
    # Sample random chains to calculate the variables on.
    paths = sample(paths, N_CHAINS)
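Because `sample` draws a different subset on every run, the shapes reported further down will vary between runs. A minimal sketch of a reproducible alternative follows, assuming a fixed seed is acceptable for a demonstration. `sample_paths` and `SEED` are hypothetical names and are not part of the notebook or of `kinactive`.

```python
from pathlib import Path
from random import Random

SEED = 0  # hypothetical; any fixed integer gives a repeatable subset


def sample_paths(paths, n, seed=SEED):
    """Return up to `n` paths chosen reproducibly; all paths if `n` is None."""
    paths = sorted(paths)  # glob order is not guaranteed, so sort first
    if n is None:
        return paths
    # Cap the sample size: random.sample raises ValueError if n > len(paths).
    return Random(seed).sample(paths, min(n, len(paths)))


# Stand-in paths for illustration; in the notebook this would be DB_PATH.glob('*').
demo = [Path(f"chain_{i}") for i in range(50)]
subset = sample_paths(demo, 20)
```

Sorting before sampling matters here: the same seed only yields the same subset if the input order is the same.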
[6]:
db = DB(DBConfig(io_cpus=N_PROC))
chains = db.load(paths)
INFO:kinactive.db:Got 20 initial paths to read
INFO:kinactive.db:Parsed 20 `Chain`s
[7]:
vs = DefaultFeatures()
?vs.calculate_all_vs
Signature:
vs.calculate_all_vs(
    chains: collections.abc.Sequence[lXtractor.core.chain.chain.Chain],
    map_name: str = 'PK',
    num_proc: int | None = None,
    verbose: bool = True,
    base: pathlib.Path | None = None,
    overwrite: bool = False,
) -> kinactive.features.Results
Docstring:
Calculate default variables. These include four sets::

    #. A default set of sequence variables for canonical sequences.
    #. A default set of sequence variables for structure sequences.
    #. A default set of structure variables.
    #. A default set of ligand variables.

:param chains: A sequence of chains.
:param map_name: A reference name.
:param num_proc: The number of CPUs to use.
:param verbose: Display progress bar.
:param base: Base path to save the results to. If not provided, the
    results are returned but not saved.
:param overwrite: Overwrite existing files. If False, will skip the
    calculation of existing variables.
:return: A named tuple with calculated variables' tables.
File:      ~/Projects/kinactive/kinactive/features.py
Type:      method
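The `base` and `overwrite` parameters describe a compute-or-skip pattern: results are written under `base`, and existing files are left alone unless `overwrite` is set. The sketch below illustrates that general pattern only. It is not `kinactive`'s implementation, and `compute_if_missing` and its arguments are hypothetical.

```python
from pathlib import Path


def compute_if_missing(path, compute, overwrite=False):
    """Run `compute` and write its text to `path`, unless the file already exists.

    Returns True if the calculation ran, False if it was skipped.
    """
    path = Path(path)
    if path.exists() and not overwrite:
        return False  # keep the existing file untouched
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(compute())
    return True
```

With `overwrite=False`, rerunning a long job after an interruption only repeats the steps whose output files are missing.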
[8]:
vs_res = vs.calculate_all_vs(
    chains.collapse_children(), num_proc=N_PROC, base=BASE, overwrite=True
)
INFO:kinactive.features:Calculating sequence variables on canonical seqs
INFO:kinactive.features:Resulting shape: (20, 799)
INFO:kinactive.features:Saved defaults_can_seq_vs.csv to ../data/variable_sets
INFO:kinactive.features:Calculating sequence variables on structure seqs
INFO:kinactive.features:Resulting shape: (186, 799)
INFO:kinactive.features:Saved defaults_str_seq_vs.csv to ../data/variable_sets
INFO:kinactive.features:Calculating ligand variables
INFO:kinactive.features:Resulting shape: (186, 793)
INFO:kinactive.features:Saved default_lig_vs.csv to ../data/variable_sets
INFO:kinactive.features:Calculating structure variables
INFO:kinactive.features:Resulting shape: (186, 1693)
INFO:kinactive.features:Saved default_str_vs.csv to ../data/variable_sets
INFO:kinactive.features:Finished calculations

For reference, calculating all four sets on all domains (that is, without the `N_CHAINS` restriction) takes ~1h on 20 cores.
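The log shows each table being saved as a CSV file under `BASE`. To check what was written without loading the tables in full, the following sketch reports row and column counts per file using only the standard library. It assumes each file has a single header row. `csv_shapes` is a hypothetical helper, and its counts may differ from the logged shapes if identifier or index columns are written alongside the variables.

```python
import csv
from pathlib import Path


def csv_shapes(directory):
    """Map each CSV file name in `directory` to (n_rows, n_columns)."""
    shapes = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])  # first row is assumed to be the header
            n_rows = sum(1 for _ in reader)  # remaining rows are data
        shapes[path.name] = (n_rows, len(header))
    return shapes
```

In this notebook, the call would be `csv_shapes(BASE)`.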