KinActive+lXtractor tutorial

In this tutorial we’ll demonstrate how to parse/prepare and annotate PK domains’ conformational states.

[1]:

from itertools import chain
from pathlib import Path
from tempfile import TemporaryDirectory

from lXtractor.core.chain import ChainIO, ChainInitializer, ChainList, recover as recover_tree
from lXtractor.ext import AlphaFold, PDB, PyHMMer
from lXtractor.variables.structural import *
from kinactive import DBConfig, load_dfg, load_kinactive, calculate

/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/numpy/core/getlimits.py:518: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_clustering.py:35: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _pt_shuffle_rec(i, indexes, index_mask, partition_tree, M, pos):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_clustering.py:54: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def delta_minimization_order(all_masks, max_swap_size=100, num_passes=2):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_clustering.py:63: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _reverse_window(order, start, length):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_clustering.py:69: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _reverse_window_score_gain(masks, order, start, length):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_clustering.py:77: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _mask_delta_score(m1, m2):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/links.py:5: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def identity(x):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/links.py:10: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _identity_inverse(x):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/links.py:15: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def logit(x):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/links.py:20: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _logit_inverse(x):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_masked_model.py:363: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _build_fixed_single_output(averaged_outs, last_outs, outputs, batch_positions, varying_rows, num_varying_rows, link, linearizing_weights):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_masked_model.py:385: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _build_fixed_multi_output(averaged_outs, last_outs, outputs, batch_positions, varying_rows, num_varying_rows, link, linearizing_weights):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_masked_model.py:428: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _init_masks(cluster_matrix, M, indices_row_pos, indptr):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/utils/_masked_model.py:439: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _rec_fill_masks(cluster_matrix, indices_row_pos, indptr, indices, M, ind):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/maskers/_tabular.py:186: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _single_delta_mask(dind, masked_inputs, last_mask, data, x, noop_code):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/maskers/_tabular.py:197: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _delta_masking(masks, x, curr_delta_inds, varying_rows_out,
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/maskers/_image.py:175: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def _jit_build_partition_tree(xmin, xmax, ymin, ymax, zmin, zmax, total_ywidth, total_zwidth, M, clustering, q):
/home/edik/miniconda3/envs/kinactive/lib/python3.10/site-packages/shap/explainers/_partition.py:676: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
  def lower_credit(i, value, M, values, clustering):
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.

Obtaining and preparing initial data

For simplicity, unlike during the creation of the lXt-PK collection, we’ll omit using canonical UniProt sequences and extract and annotate PK domains from arbitrary PK structures.

There are various ways to obtain the initial structures: 1. From local files. 2. From PDB. 3. From AlphaFold database.

In each case, we’ll need to obtain the objects of a class ChainStructure: a container class encapsulating structure and sequence data.

1. Parsing local files

Here, we’ll show how to parse locally stored structures. This could be done by first reading these files into a GenericStructure and supplying the result to ChainStructure.from_structure() method. However, to generalize the case to as many structures as desired, we’ll use ChainInitializer class to handle parsing the structures. There are several things to note regarding the initialization process.

ChainInitializer will recognize the “.cif” format.
It will read structures and split them into polymer chains.
- => This is required to initialize a ChainStructure as it encompasses ChainSequence and GenericStructure holding a single polymer sequence and coordinates.
- => If a structure had chains A and B, there will be two ChainStructures initialized from it.
Increasing num_proc for ChainInitializer parallelizes over the provided inputs.

[2]:

# my_structures is a dir that stores structures in a .cif format
LOCAL_FILES = Path('my_structures/').glob('*.cif')

[3]:

init = ChainInitializer(verbose=True)

[4]:

structures = ChainList(init.from_iterable(LOCAL_FILES))

2. Fetching PDB structures

For demonstration, we’ll fetch PDB structures of two Src kinases.

For this, we’ll use the PDB and AlphaFold interfaces. They are very similar. Thus, we’ll use a single function to fetch and parse the structures.

[5]:

def fetch_and_parse_structures(
    codes, fetcher, init=ChainInitializer(verbose=True), **kwargs
):
    with TemporaryDirectory() as tmp_dir:
        tmp_dir = Path(tmp_dir)
        fetched, failed = fetcher.fetch_structures(codes, dir_=tmp_dir)
        print(f'Files fetched: {fetched}; missed: {failed}')
        parsed = init.from_iterable(tmp_dir.glob('*.cif'), **kwargs)
        structures = ChainList(chain.from_iterable(parsed))
    return structures

[6]:

PDB_CODES = [
    '2SRC', '2OIQ'
]

[7]:

structures = fetch_and_parse_structures(
    PDB_CODES, PDB(num_threads=2, verbose=True)
)

Files fetched: [(('2SRC', 'cif'), PosixPath('/tmp/tmpkvk8tcfo/2SRC.cif')), (('2OIQ', 'cif'), PosixPath('/tmp/tmpkvk8tcfo/2OIQ.cif'))]; missed: []

3. Fetching AlphaFold structures

[8]:

UNIPROT_CODES = [
    'P00519',  # Abl1
    'O60674',  # Jak2
]

[9]:

structures = fetch_and_parse_structures(
    UNIPROT_CODES, AlphaFold(num_threads=2, verbose=True)
)

Files fetched: [(('P00519', 'cif'), PosixPath('/tmp/tmpx6k39spg/P00519.cif')), (('O60674', 'cif'), PosixPath('/tmp/tmpx6k39spg/O60674.cif'))]; missed: []

Extracting PK domains

If you execute cells in order, the structures will be a ChainList with the predicted structures of Abl1 and Jak2 protein kinases. It behaves like a regular list, but has some additional useful methods defined for Chain*-type objects, like the ChainStructure-type objects here.

[10]:

structures

[10]:

[ChainStructure(P00519:A|1-1130), ChainStructure(O60674:A|1-1132)]

Next, we’ll use PyHMMer to extract domain sequences from the obtained structures.

Each ChainStructure contains .seq attribute with ChainSequence corresponding to the PDB structure chain’s sequence, which PyHMMer will use to extract the domain sequence from. The discovered domain boundaries will enable subsetting the array with coordinates. Internally, this is handled by the ChainStructure.spawn_child() method. Each Chain*-type object (Chain, ChainSequence, and ChainStructure) has this method.

[11]:

cfg = DBConfig()
cfg

[11]:

DBConfig(verbose=True, target_dir=PosixPath('db'), pdb_dir=PosixPath('pdb/structures'), pdb_dir_info=PosixPath('pdb/info'), seq_dir=PosixPath('uniprot/fasta'), max_fetch_trials=2, io_cpus=1, init_cpus=1, init_map_numbering_cpus=1, profile=PosixPath('/home/edik/Projects/kinactive/kinactive/resources/Pkinase.hmm'), pk_map_name='PK', pk_min_score=50, pk_min_seq_domain_size=150, pk_min_str_domain_size=100, pk_min_cov_hmm=0.7, pk_min_cov_seq=0.7, pk_min_str_seq_match=0.9, min_seq_size=150, max_seq_size=3000, pdb_fmt='cif', pdb_num_fetch_threads=10, pdb_str_min_size=100, uniprot_chunk_size=100, uniprot_num_fetch_threads=10)

[12]:

pyhmm = PyHMMer(cfg.profile)
domains = ChainList(pyhmm.annotate(
    structures,
    new_map_name=cfg.pk_map_name,
    min_score=cfg.pk_min_score,
    min_size=cfg.pk_min_str_domain_size,
    min_cov_hmm=cfg.pk_min_cov_hmm,
    min_cov_seq=cfg.pk_min_cov_seq
))
domains

[12]:

[ChainStructure(PK_2|548-802<-(O60674:A|1-1132)), ChainStructure(PK_3|851-1119<-(O60674:A|1-1132)), ChainStructure(PK_1|244-491<-(P00519:A|1-1130))]

We didn’t have to use an additional variable for the extracted domains. Each subsequence is treated as a level of annotation corresponding to a sub-segment of an original sequence. Each ChainStructure in structures is modified in-place, where “modification” is simply assigning a .child attribute corresponding to the extracted domain.

[13]:

s1 = structures[0]
s1.children

[13]:

[ChainStructure(PK_1|244-491<-(P00519:A|1-1130))]

[14]:

# Get the first level of annotation (.children of each structure)
domains2 = next(structures.iter_children())

[15]:

domains2

[15]:

[ChainStructure(PK_1|244-491<-(P00519:A|1-1130)), ChainStructure(PK_2|548-802<-(O60674:A|1-1132)), ChainStructure(PK_3|851-1119<-(O60674:A|1-1132))]

Calculating structural variables

Next, we’ll calculate structural variables on the annotated domains. The model itself provides a list of feature names necessary for the encompassed models. The cool part about lXtractor’s variables is the property eval(var.id) == var, i.e., each variable can be initialized through its string representation constructed via .id property.

[16]:

def init_features(feature_names):
    return list(map(eval, feature_names))

[17]:

dfg, ka = load_dfg(), load_kinactive()

Trying to unpickle estimator LogisticRegression from version 1.1.2 when using version 1.2.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations

[18]:

var_names = list(set(dfg.dfg_features) | set(ka.features))
print(f'Variables: {var_names[:5]}, ... ({len(var_names)})')
features = init_features(var_names)
features[:5]

Variables: ["Dist(p1=125,p2=141,a1='CB',a2='CB',com=False)", "Dist(p1=78,p2=142,a1='CB',a2='CZ',com=False)", 'PseudoDihedral(p1=158,p2=159,p3=160,p4=161)', 'PseudoDihedral(p1=102,p2=103,p3=104,p4=105)', "Dist(p1=14,p2=142,a1='CB',a2='CB',com=False)"], ... (234)

[18]:

[Dist(p1=125,p2=141,a1='CB',a2='CB',com=False),
 Dist(p1=78,p2=142,a1='CB',a2='CZ',com=False),
 PseudoDihedral(p1=158,p2=159,p3=160,p4=161),
 PseudoDihedral(p1=102,p2=103,p3=104,p4=105),
 Dist(p1=14,p2=142,a1='CB',a2='CB',com=False)]

We’ll supply the chains and features into a calculate method that will handle variables’ calculation and aggregation into a single table. A very important argument here is the map_name: since feature positions are given in the PK HMM nodes’ numbering, providing map_name is needed to supply the seq-to-HMM numbering mappings into the calculator.

[19]:

vs = calculate(structures.collapse_children(), features, map_name=cfg.pk_map_name)
vs.shape

[19]:

(3, 235)

Predicting the conformational labels

As usual, given proper setup, this is the easiest part.

The predict method will output raw numerical labels.

[20]:

dfg.predict(vs), ka.predict(vs)

[20]:

(array([0, 0, 0]), array([1, 0, 1]))

For a complete result, we’ll use the vs table with the predict_full() that will keep all intermediate variables and give them proper names.

[21]:

vs = dfg.predict_full(vs)  # Predict all DFG labels
vs['ObjectID'] = [c.id for c in structures.collapse_children()]  # Add object IDs
vs['IsActive'] = ka.predict(vs).astype(bool)  # Predict Active/Inactive
vs[[c for c in vs.columns if '(' not in c]]  # Check the predictions

[21]:

	ObjectID	in_proba	out_proba	other_proba	in_meta_proba	DFG_pred	IsActive
0	ChainStructure(PK_1\|244-491<-(P00519:A\|1-1130))	0.999849	0.000307	0.000676	1.0	in	True
1	ChainStructure(PK_2\|548-802<-(O60674:A\|1-1132))	0.999385	0.000646	0.005701	1.0	in	False
2	ChainStructure(PK_3\|851-1119<-(O60674:A\|1-1132))	0.999801	0.000365	0.000655	1.0	in	True

Saving the data

[22]:

BASE = Path('where/to/save')
BASE.mkdir(exist_ok=True, parents=True)

vs now contains both calculated variables and predictions.

[23]:

vs.to_csv(BASE / 'vs_and_pred.csv', index=False)

To save the extracted domains, we’ll use ChainIO class. The key here is that each Chain*-type objects defines the .save method. ChainIO.write accepts keyword arguments passed to .save. By default, the annotations are not saved. One can enable saving all extracted child structures by providing write_children=True).

[24]:

io = ChainIO(verbose=True)
list(io.write(structures, BASE / 'lXt', write_children=True))

[24]:

[PosixPath('where/to/save/lXt/ChainStructure(P00519:A|1-1130)'),
 PosixPath('where/to/save/lXt/ChainStructure(O60674:A|1-1132)')]

[25]:

! tree where/to/save

where/to/save
├── lXt
│   ├── ChainStructure(O60674:A|1-1132)
│   │   ├── meta.tsv
│   │   ├── segments
│   │   │   ├── PK_2
│   │   │   │   ├── meta.tsv
│   │   │   │   ├── sequence.tsv
│   │   │   │   └── structure.cif
│   │   │   └── PK_3
│   │   │       ├── meta.tsv
│   │   │       ├── sequence.tsv
│   │   │       └── structure.cif
│   │   ├── sequence.tsv
│   │   └── structure.cif
│   └── ChainStructure(P00519:A|1-1130)
│       ├── meta.tsv
│       ├── segments
│       │   └── PK_1
│       │       ├── meta.tsv
│       │       ├── sequence.tsv
│       │       └── structure.cif
│       ├── sequence.tsv
│       └── structure.cif
└── vs_and_pred.csv

8 directories, 16 files

One can use the same io object to load the data. We’ll also load child segments by setting search_children=True.

[26]:

loaded = ChainList(io.read_chain_str(BASE / 'lXt', search_children=True)).apply(recover_tree)

[27]:

loaded

[27]:

[ChainStructure(O60674:A|1-1132), ChainStructure(P00519:A|1-1130)]

[28]:

loaded.collapse_children()

[28]:

[ChainStructure(PK_3|851-1119<-(O60674:A|1-1132)), ChainStructure(PK_2|548-802<-(O60674:A|1-1132)), ChainStructure(PK_1|244-491<-(P00519:A|1-1130))]

Conclusion

In this tutorial, we went through a simplified pipeline of extracting and annotating PK domains from two AlphaFold-predicted structures. The key difference with the lXt-PK data collection is that the latter also employed canonical UniProt sequences and worked with Chain-type objects. These encapsulate both the canonical sequence under .seq and a ChainList of associated structures under .structures.

This tutorial may serve as an entry point for those wishing to create a collection of PK domains and derive their DFG and Active/Inactive labels. Users may change the initial data and customize the process, play around with intermediate data and so on. If you struggle or find a bug, feel free `to raise an issue <>`__.

[29]:

! rm -r where/to/save