Building an initial collection of PK domains

Here, we’ll build a collection of protein kinase (PK) domains from scratch. We’ll start from the UniProt sequences listed in the SIFTS database, which maps UniProt IDs to PDB IDs. We’ll search these sequences for PK domains and fetch the associated PDB structures for the successful hits. We’ll then transfer the discovered domain boundaries to the PDB structures and extract each sequence and structure domain. The accompanying paper describes this process in more detail. Also, don’t hesitate to inspect the docs (they also link to the relevant source code) or raise an issue.

The time needed to complete this notebook depends on the internet connection and the machine used. Here, we’ll use a laptop with a 24-core 13th-gen Intel processor and 32 GB of RAM.

[1]:
import logging
import warnings
from pathlib import Path

from kinactive import DB, DBConfig
[2]:
logging.basicConfig(level=logging.INFO)
[3]:
DATA = Path('../data')  # A path to the directory where data will be stored.
DATA.mkdir(exist_ok=True)
REPRODUCE = False  # Set to True to restrict the build to the ID lists loaded below
N_SEQ_DOMAINS = 3  # Restrict the number of processed canonical sequence domains for demonstration

if REPRODUCE:
    from kinactive.io import load_txt_lines
    # Replace with your paths if needed
    uni_list_path = Path('../data/submit/IDlists/UniProt_ids.txt')
    pdb_list_path = Path('../data/submit/IDlists/PDB_ids.txt')

    uni_ids = load_txt_lines(uni_list_path)
    pdb_ids = load_txt_lines(pdb_list_path)
else:
    uni_ids, pdb_ids = None, None

cfg = DBConfig(
    verbose=True,
    target_dir=DATA / 'lXt-PK-test',
    pdb_dir=DATA / 'pdb' / 'cif',
    pdb_dir_info=DATA / 'pdb' / 'info',
    seq_dir=DATA / 'uniprot' / 'fasta',
    io_cpus=10,
    init_map_numbering_cpus=10,
    init_cpus=10
)
db = DB(cfg)

The DB is built according to the settings specified in a DBConfig dataclass. Consult the docs to see what the various options mean.

[4]:
?DBConfig
Init signature:
DBConfig(
    verbose: bool = True,
    target_dir: pathlib.Path = PosixPath('db'),
    pdb_dir: pathlib.Path = PosixPath('pdb/structures'),
    pdb_dir_info: pathlib.Path = PosixPath('pdb/info'),
    seq_dir: pathlib.Path = PosixPath('uniprot/fasta'),
    max_fetch_trials: int = 2,
    io_cpus: int = 1,
    init_cpus: int = 1,
    init_map_numbering_cpus: int = 1,
    init_add_structure_cpus: int = 1,
    init_tolerate_failures: bool = True,
    profile: pathlib.Path = PosixPath('/home/edik/Projects/kinactive/kinactive/resources/PF00069.hmm'),
    tk2pk: pathlib.Path = PosixPath('/home/edik/Projects/kinactive/kinactive/resources/tk2pk.json'),
    pk_map_name: str = 'PK',
    pk_min_score: float = 50,
    pk_min_seq_domain_size: int = 150,
    pk_min_str_domain_size: int = 100,
    pk_min_cov_hmm: float = 0.5,
    pk_min_cov_seq: float = 0.5,
    pk_min_str_seq_match: float = 0.8,
    min_seq_size: int = 150,
    max_seq_size: int = 5000,
    pdb_fmt: str = 'mmtf.gz',
    pdb_num_fetch_threads: int = 10,
    pdb_str_min_size: int = 100,
    uniprot_chunk_size: int = 100,
    uniprot_num_fetch_threads: int = 10,
) -> None
Docstring:
Database config.

Default parameters were used to create lXt-PK data collection.
To reproduce locally, you may change the paths (``*_dir*``) and adjust
the number of cpus (``*_cpus``).
File:           ~/Projects/kinactive/kinactive/config.py
Type:           type
Subclasses:
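The docstring above suggests adjusting the `*_cpus` fields to the local machine. A minimal sketch of deriving them from the available core count might look as follows. The helper `suggest_cpu_settings` and the choice to leave two cores free are assumptions made for illustration and are not part of kinactive; only the field names are taken from the signature above.

```python
import os


def suggest_cpu_settings(reserve: int = 2) -> dict[str, int]:
    """Suggest values for the ``*_cpus`` fields of ``DBConfig``.

    Hypothetical helper: leaves ``reserve`` cores free, but never
    returns fewer than one CPU per setting.
    """
    available = os.cpu_count() or 1  # cpu_count() may return None
    n = max(1, available - reserve)
    return {
        'io_cpus': n,
        'init_cpus': n,
        'init_map_numbering_cpus': n,
        'init_add_structure_cpus': n,
    }


cpu_settings = suggest_cpu_settings()
print(cpu_settings)
```

The resulting dictionary could then be unpacked into the config, for example `DBConfig(target_dir=DATA / 'lXt-PK-test', **cpu_settings)`.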
[5]:
?db.build
Signature:
db.build(
    uniprot_ids: collections.abc.Collection[str] | None = None,
    pdb_chain_ids: collections.abc.Collection[str] | None = None,
    n_domains: int = 0,
) -> lXtractor.chain.list.ChainList[lXtractor.chain.chain.Chain]
Docstring:
Build a new lXt-PK data collection.

:param uniprot_ids: An optional list of UniProt IDs to restrict
    the db to.
:param pdb_chain_ids: An optional collection of PDB chains to restrict
    the db to. Format: "{PDB_ID}:{ChainID}".
:param n_domains: Use n random sequence domains. It is helpful for
    testing the pipeline.
:return: A :class:`ChainList` of :class:`Chain` objects having at least
    one child PK domain with at least one PK domain structure passing
    the filtering thresholds.
File:      ~/Projects/kinactive/kinactive/db.py
Type:      method
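The `pdb_chain_ids` argument expects identifiers in the `"{PDB_ID}:{ChainID}"` format. Below is a small sketch for checking such identifiers before passing them to `db.build`. The helper `split_pdb_chain_id` is hypothetical, and the pattern assumes classic four-character PDB IDs (a digit followed by three alphanumeric characters); kinactive itself may accept or reject other forms. The input `'1XH9:A'` is only an example.

```python
import re

# Assumption: a classic 4-character PDB ID, then ':' and a non-empty chain ID.
PDB_CHAIN_RE = re.compile(r'^([0-9][A-Za-z0-9]{3}):([A-Za-z0-9]+)$')


def split_pdb_chain_id(value: str) -> tuple[str, str]:
    """Split "PDB_ID:ChainID" into its parts or raise ``ValueError``."""
    match = PDB_CHAIN_RE.match(value.strip())
    if match is None:
        raise ValueError(f'Expected "PDB_ID:ChainID", got {value!r}')
    pdb_id, chain_id = match.groups()
    # Upper-case the PDB ID only; chain IDs are case-sensitive.
    return pdb_id.upper(), chain_id


print(split_pdb_chain_id('1XH9:A'))  # ('1XH9', 'A')
```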
[6]:
%%time

db.build(uni_ids, pdb_ids, n_domains=N_SEQ_DOMAINS);
INFO:kinactive.db:311 remaining sequences to fetch.
INFO:kinactive.db:Got 63947 seqs from ../data/uniprot/fasta
INFO:kinactive.db:Filtered to 51863 seqs in [150, 5000]
INFO:kinactive.db:Annotating domains
INFO:kinactive.db:Discovered 715 sequences with domain hits
INFO:kinactive.db:Initial TK hits: 702
INFO:kinactive.db:Initial PK hits: 743
INFO:kinactive.db:Transferring PK profile maps to TK hits
INFO:kinactive.db:Filtered to 705 sequences with at least one valid domain with conforming to config criteria.
INFO:kinactive.db:Final TK hits: 214
INFO:kinactive.db:Final PK hits: 508
INFO:kinactive.db:Sampled to 3 random initial domains.
INFO:kinactive.db:Fetching info for 92 PDB IDs.
INFO:kinactive.db:Filtered to 92 X-ray PDB IDs out of 92.
INFO:kinactive.db:Fetching 92 X-ray structures
WARNING:lXtractor.core.structure:Structure 1XH9 has 37 uncategorized atoms.
INFO:kinactive.db:Initialized 3 `Chain` objects.
INFO:kinactive.db:Filtered to 137 out of 137 domain structures having >=100 extracted domain size and >=0.8 canonical seq match fraction.
INFO:kinactive.db:Filtered to 3 out of 3 domains with at least one valid structure.
INFO:kinactive.db:Filtered to 3 chains out of 3 with at least one extracted domains.
CPU times: user 1min 8s, sys: 2.96 s, total: 1min 11s
Wall time: 2min 20s
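The last filtering steps in the log keep domain structures with at least 100 extracted residues and a canonical sequence match fraction of at least 0.8, which corresponds to `pk_min_str_domain_size` and `pk_min_str_seq_match` in `DBConfig`. The sketch below illustrates this kind of threshold filter on made-up records. The `DomainStructure` class, its fields, and the way the match fraction is computed here are assumptions for illustration and do not reproduce kinactive's implementation.

```python
from dataclasses import dataclass


@dataclass
class DomainStructure:
    """Hypothetical record describing one extracted structure domain."""
    name: str
    n_residues: int  # residues in the extracted domain
    n_matching: int  # residues identical to the canonical sequence


def passes(d: DomainStructure, min_size: int = 100, min_match: float = 0.8) -> bool:
    """Apply the size threshold first, then the sequence match threshold."""
    if d.n_residues < min_size:
        return False
    return d.n_matching / d.n_residues >= min_match


# Toy data, not taken from the actual collection.
domains = [
    DomainStructure('toy_1', 250, 245),
    DomainStructure('toy_2', 90, 90),    # too short
    DomainStructure('toy_3', 200, 120),  # low sequence match
]
kept = [d.name for d in domains if passes(d)]
print(kept)  # ['toy_1']
```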
[7]:
%%time

if len(db.chains) > 0:
    db.save(overwrite=True)
INFO:kinactive.db:Saved summary file initial_seq_summary.csv to ../data/lXt-PK-test
INFO:kinactive.db:Saved summary file initial_str_summary.csv to ../data/lXt-PK-test
INFO:kinactive.db:Saved summary file domain_seq_summary.csv to ../data/lXt-PK-test
INFO:kinactive.db:Saved summary file domain_str_summary.csv to ../data/lXt-PK-test
CPU times: user 3.14 s, sys: 611 ms, total: 3.75 s
Wall time: 7.2 s
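The log above shows four summary CSV files written to the target directory. The following is a minimal sketch for a quick look at such files using only the standard library. The helper `summarize_csvs` is hypothetical, the files are assumed to be comma-separated, and no column names are assumed: the header is simply read from each file.

```python
import csv
from pathlib import Path


def summarize_csvs(directory: Path) -> dict[str, tuple[list[str], int]]:
    """Map each ``*_summary.csv`` file name to (header, number of data rows)."""
    result = {}
    for path in sorted(directory.glob('*_summary.csv')):
        with path.open(newline='') as handle:
            reader = csv.reader(handle)  # assumes comma-separated values
            header = next(reader, [])    # empty list for an empty file
            n_rows = sum(1 for _ in reader)
        result[path.name] = (header, n_rows)
    return result


target = Path('../data/lXt-PK-test')  # target_dir used in this notebook
if target.is_dir():
    for name, (header, n_rows) in summarize_csvs(target).items():
        print(name, n_rows, header)
```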