Building an initial collection of PK domains
Here, we’ll build a collection of protein kinase (PK) domains from scratch. We’ll start from the UniProt sequences listed in the SIFTS database, which maps UniProt IDs to PDB IDs. We’ll find PK domains in these sequences and fetch the associated PDB structures for successful hits. We’ll then transfer the discovered domain boundaries to the PDB structures and extract each sequence and structure domain. The accompanying paper describes this process in more detail. Also, don’t hesitate to inspect the docs (which also link to the relevant source code) or raise an issue.
The time needed to complete this notebook depends on the internet connection and the machine used. Here, we’ll use a laptop with a 24-core 13th-gen Intel processor and 32 GB of RAM.
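The boundary-transfer step mentioned in the introduction can be pictured with a small, self-contained sketch. It is not how `kinactive` or `lXtractor` implement it: the function name `transfer_boundaries` and the plain dictionary mapping sequence positions to structure residue numbers are assumptions made purely for illustration.

```python
from typing import Optional


def transfer_boundaries(
    seq_start: int,
    seq_end: int,
    seq_to_str: dict[int, Optional[int]],
) -> Optional[tuple[int, int]]:
    """Map a sequence domain [seq_start, seq_end] onto structure numbering.

    `seq_to_str` (hypothetical) maps canonical sequence positions to
    structure residue numbers; None marks residues missing in the structure.
    """
    mapped = (seq_to_str.get(pos) for pos in range(seq_start, seq_end + 1))
    resolved = [num for num in mapped if num is not None]
    if not resolved:
        # No residue of the domain is present in the structure.
        return None
    return resolved[0], resolved[-1]


# Toy mapping: sequence position 12 is unresolved in the structure.
toy_map = {10: 110, 11: 111, 12: None, 13: 113, 14: 114, 15: 115}
boundaries = transfer_boundaries(11, 14, toy_map)
print(boundaries)  # (111, 114)
```

The sketch only shows the idea of carrying boundaries through a numbering map; the real pipeline also applies coverage and size filters to the extracted domains.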
[1]:
import logging
import warnings
from pathlib import Path
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    from kinactive import DB, DBConfig
[2]:
logging.basicConfig(level=logging.INFO)
[3]:
DATA = Path('../data') # A path to the directory where data will be stored.
DATA.mkdir(exist_ok=True)
REPRODUCE = False
N_SEQ_DOMAINS = 3 # Restrict the number of processed canonical sequence domains for demonstration
if REPRODUCE:
    from kinactive.io import load_txt_lines
    # Replace with your paths if needed
    uni_list_path = Path('../data/submit/IDlists/UniProt_ids.txt')
    pdb_list_path = Path('../data/submit/IDlists/PDB_ids.txt')
    uni_ids = load_txt_lines(uni_list_path)
    pdb_ids = load_txt_lines(pdb_list_path)
else:
    uni_ids, pdb_ids = None, None
cfg = DBConfig(
    verbose=True,
    target_dir=DATA / 'lXt-PK',
    pdb_dir=DATA / 'pdb' / 'cif',
    pdb_dir_info=DATA / 'pdb' / 'info',
    seq_dir=DATA / 'uniprot' / 'fasta',
    io_cpus=10,
    init_map_numbering_cpus=10,
    init_cpus=10
)
db = DB(cfg)
The DB is built according to the settings specified in a DBConfig dataclass. Consult the docs to see what the various options mean.
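Because the configuration is a dataclass, a common pattern is to derive a tuned copy instead of editing values by hand. The sketch below uses a hypothetical stand-in class, `SketchConfig`, that mirrors only a few `DBConfig` field names so that it runs without `kinactive` installed; the helper `with_cpus` and its `reserve` parameter are likewise assumptions, not part of the library.

```python
import os
from dataclasses import dataclass, replace
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stand-in mirroring a few DBConfig fields."""

    target_dir: Path = Path("db")
    io_cpus: int = 1
    init_cpus: int = 1
    init_map_numbering_cpus: int = 1


def with_cpus(cfg: SketchConfig, reserve: int = 2) -> SketchConfig:
    """Return a copy of cfg that uses all but `reserve` of the available cores."""
    n = max(1, (os.cpu_count() or 1) - reserve)
    # dataclasses.replace creates a new instance and leaves cfg untouched.
    return replace(cfg, io_cpus=n, init_cpus=n, init_map_numbering_cpus=n)


base = SketchConfig(target_dir=Path("../data") / "lXt-PK")
tuned = with_cpus(base)
print(tuned)
```

Leaving a couple of cores free is just one reasonable choice for a laptop; the notebook itself fixes each of the three `*_cpus` values at 10.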
[4]:
?DBConfig
Init signature:
DBConfig(
    verbose: bool = True,
    target_dir: pathlib.Path = PosixPath('db'),
    pdb_dir: pathlib.Path = PosixPath('pdb/structures'),
    pdb_dir_info: pathlib.Path = PosixPath('pdb/info'),
    seq_dir: pathlib.Path = PosixPath('uniprot/fasta'),
    max_fetch_trials: int = 2,
    io_cpus: int = 1,
    init_cpus: int = 1,
    init_map_numbering_cpus: int = 1,
    profile: pathlib.Path = PosixPath('/home/edik/Projects/kinactive/kinactive/resources/Pkinase.hmm'),
    pk_map_name: str = 'PK',
    pk_min_score: float = 50,
    pk_min_seq_domain_size: int = 150,
    pk_min_str_domain_size: int = 100,
    pk_min_cov_hmm: float = 0.7,
    pk_min_cov_seq: float = 0.7,
    pk_min_str_seq_match: float = 0.9,
    min_seq_size: int = 150,
    max_seq_size: int = 3000,
    pdb_fmt: str = 'cif',
    pdb_num_fetch_threads: int = 10,
    pdb_str_min_size: int = 100,
    uniprot_chunk_size: int = 100,
    uniprot_num_fetch_threads: int = 10,
) -> None
Docstring:
Database config.
Default parameters were used to create lXt-PK data collection.
To reproduce locally, you may change the paths (``*_dir*``) and adjust
the number of cpus (``*_cpus``).
File: ~/Projects/kinactive/kinactive/config.py
Type: type
Subclasses:
[5]:
?db.build
Signature:
db.build(
    uniprot_ids: collections.abc.Collection[str] | None = None,
    pdb_chain_ids: collections.abc.Collection[str] | None = None,
    n_domains: int = 0,
) -> lXtractor.core.chain.list.ChainList[lXtractor.core.chain.chain.Chain]
Docstring:
Build a new lXt-PK data collection.
:param uniprot_ids: An optional list of UniProt IDs to restrict
    the db to.
:param pdb_chain_ids: An optional collection of PDB chains to restrict
    the db to. Format: "{PDB_ID}:{ChainID}".
:param n_domains: Use n random sequence domains. It is helpful for
    testing the pipeline.
:return: A :class:`ChainList` of :class:`Chain` objects having at least
    one child PK domain with at least one PK domain structure passing
    the filtering thresholds.
File: ~/Projects/kinactive/kinactive/db.py
Type: method
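Since `pdb_chain_ids` expects strings in the "{PDB_ID}:{ChainID}" format, it can help to check a hand-made list before passing it on. The following is a minimal sketch, not a `kinactive` utility: the helper `split_valid_chain_ids` is hypothetical, and the pattern assumes classic four-character PDB IDs and alphanumeric chain IDs, which may be stricter or looser than what the library accepts.

```python
import re

# Assumption: a four-character PDB ID (a digit followed by three alphanumerics),
# a colon, and a non-empty alphanumeric chain ID.
CHAIN_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}:[A-Za-z0-9]+")


def split_valid_chain_ids(ids):
    """Split IDs into those matching "{PDB_ID}:{ChainID}" and the rest."""
    valid, invalid = [], []
    for raw in ids:
        item = raw.strip()  # tolerate stray whitespace from text files
        if CHAIN_ID_PATTERN.fullmatch(item):
            valid.append(item)
        else:
            invalid.append(item)
    return valid, invalid


valid, invalid = split_valid_chain_ids(["1ABC:A", "2xyz:B ", "1ABC_A", "3DEF"])
print(valid)    # ['1ABC:A', '2xyz:B']
print(invalid)  # ['1ABC_A', '3DEF']
```

The example IDs are made up for the demonstration and do not refer to entries used in this notebook.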
[6]:
%%time
db.build(uni_ids, pdb_ids, n_domains=N_SEQ_DOMAINS);
INFO:kinactive.db:205 remaining sequences to fetch.
INFO:kinactive.db:Got 61750 seqs from ../data/uniprot/fasta
INFO:kinactive.db:Filtered to 49701 seqs in [150, 3000]
INFO:kinactive.db:Found 680 PK domains within 666 seqs.
INFO:kinactive.db:Sampled to 3 random initial domains.
INFO:kinactive.db:Fetching info for 19 PDB IDs.
INFO:kinactive.db:Filtered to 18 X-ray PDB IDs out of 19.
INFO:kinactive.db:Fetching 18 X-ray structures
INFO:kinactive.db:Initialized 2 `Chain` objects.
INFO:kinactive.db:Filtered to 29 out of 29 domain structures having >=100 extracted domain size and >=0.9 canonical seq match fraction.
INFO:kinactive.db:Filtered to 2 out of 2 domains with at least one valid structures.
INFO:kinactive.db:Filtered to 2 chains out of 2 with at least one extracted domains.
CPU times: user 10 s, sys: 732 ms, total: 10.7 s
Wall time: 42.5 s
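The log above reports that structure domains are kept when they have an extracted size of at least 100 and a canonical sequence match fraction of at least 0.9. A minimal sketch of such a threshold filter is shown below. The `DomainStructure` record, the `passes` helper, and the toy values are hypothetical; how the library actually computes the match fraction is not covered here.

```python
from dataclasses import dataclass


@dataclass
class DomainStructure:
    """Hypothetical record of an extracted structure domain."""

    id: str
    size: int          # number of residues in the extracted domain
    seq_match: float   # fraction of residues matching the canonical sequence


def passes(domain: DomainStructure, min_size: int = 100, min_seq_match: float = 0.9) -> bool:
    """Apply both thresholds; the defaults follow the values printed in the log."""
    return domain.size >= min_size and domain.seq_match >= min_seq_match


# Made-up candidates: "b" is too short, "c" matches the sequence too poorly.
candidates = [
    DomainStructure("a", 250, 0.98),
    DomainStructure("b", 80, 0.99),
    DomainStructure("c", 260, 0.85),
]
kept = [d for d in candidates if passes(d)]
print(f"Filtered to {len(kept)} out of {len(candidates)} domain structures")
```

In the run above every one of the 29 domain structures passed, so the filter removed nothing; the toy data simply shows both ways a candidate can fail.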
[7]:
%%time
if len(db.chains) > 0:
    db.save(overwrite=True)
INFO:kinactive.db:Saved summary file initial_seq_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file initial_str_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file domain_seq_summary.csv to ../data/lXt-PK
INFO:kinactive.db:Saved summary file domain_str_summary.csv to ../data/lXt-PK
CPU times: user 94.8 ms, sys: 67.7 ms, total: 162 ms
Wall time: 1.87 s
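After saving, a quick way to check the four summary files is to read their headers and count their rows. The sketch below relies only on the standard library and assumes the files are ordinary comma-separated tables; the helpers `summarize_csv` and `summarize_dir` are hypothetical, and the column names in the demo are placeholders, not the real summary columns.

```python
import csv
import tempfile
from pathlib import Path


def summarize_csv(path: Path) -> tuple[list[str], int]:
    """Return the header row and the number of data rows of a CSV file."""
    with path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


def summarize_dir(target_dir: Path) -> dict[str, tuple[list[str], int]]:
    """Summarize every ``*_summary.csv`` file found directly in target_dir."""
    if not target_dir.is_dir():
        return {}
    return {
        path.name: summarize_csv(path)
        for path in sorted(target_dir.glob("*_summary.csv"))
    }


# Demo on a throwaway directory; the toy file stands in for a real summary.
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "domain_seq_summary.csv"
    toy.write_text("col_a,col_b\n1,2\n3,4\n")
    demo = summarize_dir(Path(tmp))

print(demo)
```

For the collection built in this notebook, the same function could be pointed at the target directory, for example `summarize_dir(Path('../data/lXt-PK'))`.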