Higher-Level API¶
The discussion so far has focused on operations on a single file level. However, Lahuta
also provides a higher level abstraction for processing multiple files at once. This is done through processors
class. There are two different processor classes that differ in the way they consume the input data.
Download several GPCRs from AlphaFold¶
from lahuta.api.utils import URLs
from lahuta.api import download_structures
# Download GPCR structures
# b2, s5, m1, ox1, cx4
gpcrs = [
'AF-P07550-F1-model_v4',
'AF-A4D2N2-F1-model_v4',
'AF-P35372-F1-model_v4',
'AF-O43613-F1-model_v4',
'AF-P61073-F1-model_v4'
]
downloads = download_structures(
pdb_ids=gpcrs, # list of AlphaFold IDs
url=URLs.AlphaFold, # URL to AlphaFold
pdb_or_cif='cif', # file format to download
dir_loc='data' # directory to save files
)
downloads
{'AF-P07550-F1-model_v4': 'data/af-p07550-f1-model_v4.cif', 'AF-A4D2N2-F1-model_v4': 'data/af-a4d2n2-f1-model_v4.cif', 'AF-P35372-F1-model_v4': 'data/af-p35372-f1-model_v4.cif', 'AF-O43613-F1-model_v4': 'data/af-o43613-f1-model_v4.cif', 'AF-P61073-F1-model_v4': 'data/af-p61073-f1-model_v4.cif'}
CashedFileProcessor
¶
Iterates over a given input list of files or over the contents of a directory and stores the results of a function applied to each file. By default, it will just store the file path. We will write a function that reads in the file and computes NeighborPairs
for each file.
from lahuta.api import CachedFileProcessor
from lahuta.core import Luni, NeighborPairs
def process_neighbors(file: str) -> NeighborPairs:
"""Process a single file."""
luni = Luni(file)
return luni.compute_neighbors()
processor = CachedFileProcessor(
file_list=list(downloads.values()), # file paths of downloaded files
worker=process_neighbors, # worker to process files (this is the default if not specified)
)
processor.process(n_jobs=2) # number of jobs to run in parallel
Access Results¶
processor.results
{'af-p07550-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2804 atoms and 13158 pairs>, 'af-a4d2n2-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2585 atoms and 11791 pairs>, 'af-p35372-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2903 atoms and 13553 pairs>, 'af-o43613-f1-model_v4.cif': <Lahuta NeighborPairs class containing 3010 atoms and 13770 pairs>, 'af-p61073-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2531 atoms and 12431 pairs>}
Note: File Processors require a function/callable that determines what to do with the files. Above we wrote a function that compute atom neighbors for each file. We are free to write any function that takes a file path as input and returns a result. The results are stored in a dictionary where the keys are the file paths and the values are the results of the function.
Let's write a function that returns the protein sequences
def process_seqs(file: str) -> Luni:
"""Process a single file."""
luni = Luni(file)
return luni.sequence
processor = CachedFileProcessor(
file_list=list(downloads.values()), # file paths of downloaded files
worker=process_seqs, # worker to use
)
processor.process()
processor.results
{'af-p07550-f1-model_v4.cif': 'MGQPGNGSAFLLAPNGSHAPDHDVTQERDEVWVVGMGIVMSLIVLAIVFGNVLVITAIAKFERLQTVTNYFITSLACADLVMGLAVVPFGAAHILMKMWTFGNFWCEFWTSIDVLCVTASIETLCVIAVDRYFAITSPFKYQSLLTKNKARVIILMVWIVSGLTSFLPIQMHWYRATHQEAINCYANETCCDFFTNQAYAIASSIVSFYVPLVIMVFVYSRVFQEAKRQLQKIDKSEGRFHVQNLSQVEQDGRTGHGLRRSSKFCLKEHKALKTLGIIMGTFTLCWLPFFIVNIVHVIQDNLIRKEVYILLNWIGYVNSGFNPLIYCRSPDFRIAFQELLCLRRSSLKAYGNGYSSNGNTGEQSGYHVEQEKENKLLCEDLPGTEDFVGHQGTVPSDNIDSQGRNCSTNDSLL', 'af-a4d2n2-f1-model_v4.cif': 'MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAATFAWNLLVLATILRVRTFHRVPHNLVASMAVSDVLVAALVMPLSLVHELSGRRWQLGRRLCQLWIACDVLCCTASIWNVTAIALDRYWSITRHMEYTLRTRKCVSNVMIALTWALSAVISLAPLLFGWGETYSEGSEECQVSREPSYAVFSTVGAFYLPLCVVLFVYWKIYKAAKFRVGSRKTNSVSPISEAVEVKDSAKQPQMVFTVRHATVTFQPEGDTWREQKEQRAALMVGILIGVFVLCWIPFFLTELISPLCSCDIPAIWKSIFLWLGYSNSFFNPLIYTAFNKNYNSAFKNFFSRQH', 'af-p35372-f1-model_v4.cif': 'MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGGRDSLCPPTGSPSMITAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALATSTLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDFRTPRNAKIINVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFIFAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHIYVIIKALVTIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSNIEQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP', 'af-o43613-f1-model_v4.cif': 'MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYVAVFVVALVGNTLVCLAVWRNHHMRTVTNYFIVNLSLADVLVTAICLPASLLVDITESWLFGHALCKVIPYLQAVSVSVAVLTLSFIALDRWYAICHPLLFKSTARRARGSILGIWAVSLAIMVPQAAVMECSSVLPELANRTRLFSVCDERWADDLYPKIYHSCFFIVTYLAPLGLMAMAYFQIFRKLWGRQIPGTTSALVRNWKRPSDQLGDLEQGLSGEPQPRARAFLAEVKQMRARRKTAKMLMVVLLVFALCYLPISVLNVLKRVFGMFRQASDREAVYACFTFSHWLVYANSAANPIIYNFLSGKFREQFKAAFSCCLPGLGPCGSLKAPSPRSSASHKSLSLQSRCSISKISEHVVLTSVTTVLP', 'af-p61073-f1-model_v4.cif': 'MEGISIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVADLLFVITLPFWAVDAVANWYFGNFLCKAVHVIYTVNLYSSVLILAFISLDRYLAIVHATNSQRPRKLLAEKVVYVGVWIPALLLTIPDFIFANVSEADDRYICDRFYPNDLWVVVFQFQHIMVGLILPGIVILSCYCIIISKLSHSKGHQKRKALKTTVILILAFFACWLPYYIGISIDSFILLEIIKQGCEFENTVHKWISITEALAFFHCCLNPILYAFLGAKFKTSAQHALTSVSRGSSLKILSKGKRGGHSSVSTESESSSFHSS'}
Let's write these sequences to a FASTA file, since we'll need them later.
import io
def write_seq_to_fasta(results: dict[str, str], file_name: str) -> None:
with io.StringIO() as buffer:
for key, value in results.items():
buffer.write(f">{key.split('.')[0]}\n{value}\n")
content = buffer.getvalue()
with open(file_name, 'w') as f:
f.write(content)
return None
# Write sequences to FASTA file
write_seq_to_fasta(processor.results, 'data/sequences.fasta')
# Lahuta provides an easier way to parser and save FASTA files
from lahuta.msa import MSAParser
MSAParser(sequences=processor.results).save('data/sequences.fasta')
FileProcessor
¶
from lahuta.api import FileProcessor
# set operations for neighbor pairs
from lahuta.api import union, intersection, difference, symmetric_difference
processor = FileProcessor(
file_list=list(downloads.values())[:2], # file paths of downloaded files
)
processor.process(union)
<Lahuta NeighborPairs class containing 2944 atoms and 23774 pairs>
processor.process(intersection)
<Lahuta NeighborPairs class containing 1007 atoms and 1175 pairs>
Note that the input can be any callable. Given the simple syntax supported by NeighborPairs, it is very easy to define custom functions that can be applied to each loaded file iteratively. For instance, defining a function that computes the union of contacts:
processor.process(lambda x, y: x + y) # equivalent to `union`
<Lahuta NeighborPairs class containing 2944 atoms and 23774 pairs>
In the above examples we iterate over a list of files, load the data and compute atom neighbors. We then iteratively take the union, intersection, etc. of the neighbor pairs:
luni = Luni('file1.cif')
luni2 = Luni('file2.cif')
...
ns = luni.compute_neighbors()
ns2 = luni2.compute_neighbors()
...
ns = ns.union(ns2)
ns = ns.union(ns3)
ns = ns.union(ns4)
ns = ns.union(ns5)
...
FileProcessor
makes these operations easy to use and memory efficient because it does not store the results.
An important Note
when working with multiple files¶
Comparing NeighborPairs
from different proteins requires an additional step.
Different protein files will have different residue numbering, different chain information and different atom data. As such, direclty comparing the `NeighborPairs`` objects will not work. Instead, we need to first align the proteins, map their indices, and only then compare the NeighborPairs objects. This will be discussed in the next section.
For example, if we try to compute the union of all the example GPCRs, we'd get an error:
processor = FileProcessor(
file_list=list(downloads.values()), # file paths of downloaded files
)
# will result in an error
# processor.process(union)