Higher-Level API¶

The discussion so far has focused on operations at the level of a single file. Lahuta also provides a higher-level abstraction for processing multiple files at once, through processor classes. There are two processor classes, which differ in how they consume the input data.

Download several GPCRs from AlphaFold¶

In [2]:


from lahuta.api.utils import URLs
from lahuta.api import download_structures

In [3]:






# Download GPCR structures
# b2, s5, m1, ox1, cx4
gpcrs = [
    'AF-P07550-F1-model_v4', 
    'AF-A4D2N2-F1-model_v4', 
    'AF-P35372-F1-model_v4', 
    'AF-O43613-F1-model_v4', 
    'AF-P61073-F1-model_v4'
]

In [4]:






downloads = download_structures(
    pdb_ids=gpcrs,              # list of AlphaFold IDs
    url=URLs.AlphaFold,         # URL to AlphaFold
    pdb_or_cif='cif',           # file format to download
    dir_loc='data'              # directory to save files
)

downloads

Out[4]:

{'AF-P07550-F1-model_v4': 'data/af-p07550-f1-model_v4.cif',
 'AF-A4D2N2-F1-model_v4': 'data/af-a4d2n2-f1-model_v4.cif',
 'AF-P35372-F1-model_v4': 'data/af-p35372-f1-model_v4.cif',
 'AF-O43613-F1-model_v4': 'data/af-o43613-f1-model_v4.cif',
 'AF-P61073-F1-model_v4': 'data/af-p61073-f1-model_v4.cif'}

`CachedFileProcessor`¶

`CachedFileProcessor` iterates over a given list of files, or over the contents of a directory, and stores the result of a function applied to each file. By default, it stores only the file path. Below, we write a function that reads a file and computes its NeighborPairs.

In [5]:


from lahuta.api import CachedFileProcessor

In [6]:


from lahuta.core import Luni, NeighborPairs

def process_neighbors(file: str) -> NeighborPairs:
    """Process a single file."""
    luni = Luni(file)
    return luni.compute_neighbors()

In [7]:






processor = CachedFileProcessor(
    file_list=list(downloads.values()), # file paths of downloaded files
    worker=process_neighbors,           # worker to process files (this is the default if not specified)
)
processor.process(n_jobs=2)             # number of jobs to run in parallel

Access Results¶

In [8]:


processor.results

Out[8]:

{'af-p07550-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2804 atoms and 13158 pairs>,
 'af-a4d2n2-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2585 atoms and 11791 pairs>,
 'af-p35372-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2903 atoms and 13553 pairs>,
 'af-o43613-f1-model_v4.cif': <Lahuta NeighborPairs class containing 3010 atoms and 13770 pairs>,
 'af-p61073-f1-model_v4.cif': <Lahuta NeighborPairs class containing 2531 atoms and 12431 pairs>}

Note: File processors require a function (any callable) that determines what to do with each file. Above, we wrote a function that computes atom neighbors for each file. We are free to write any function that takes a file path as input and returns a result. The results are stored in a dictionary whose keys are the file names and whose values are the results of the function.
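The worker pattern described in this note can be sketched in plain Python. This is only an illustration of the idea, not Lahuta's implementation: `apply_worker` and `count_lines` are hypothetical names introduced here, and a simple line count stands in for a real structural analysis.

```python
import os
import tempfile
from typing import Any, Callable


def apply_worker(files: list[str], worker: Callable[[str], Any]) -> dict[str, Any]:
    """Apply `worker` to each path and key the results by file name (hypothetical helper)."""
    return {os.path.basename(path): worker(path) for path in files}


def count_lines(file: str) -> int:
    """Stand-in worker: any function that takes a file path would work here."""
    with open(file) as f:
        return sum(1 for _ in f)


# Build two small throwaway files so the sketch runs anywhere
tmp_dir = tempfile.mkdtemp()
paths = []
for name, n_lines in [("a.cif", 3), ("b.cif", 5)]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write("line\n" * n_lines)
    paths.append(path)

results = apply_worker(paths, count_lines)
print(results)
```

Swapping `count_lines` for another function changes what is stored without touching the loop, which is the same separation of concerns the processors rely on.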

Let's write a function that returns the protein sequence of each file:

In [9]:






def process_seqs(file: str) -> str:
    """Return the protein sequence of a single file."""
    luni = Luni(file)
    return luni.sequence

In [10]:






processor = CachedFileProcessor(
    file_list=list(downloads.values()), # file paths of downloaded files
    worker=process_seqs,            # worker to use
)
processor.process()

In [11]:


processor.results

Out[11]:

{'af-p07550-f1-model_v4.cif': 'MGQPGNGSAFLLAPNGSHAPDHDVTQERDEVWVVGMGIVMSLIVLAIVFGNVLVITAIAKFERLQTVTNYFITSLACADLVMGLAVVPFGAAHILMKMWTFGNFWCEFWTSIDVLCVTASIETLCVIAVDRYFAITSPFKYQSLLTKNKARVIILMVWIVSGLTSFLPIQMHWYRATHQEAINCYANETCCDFFTNQAYAIASSIVSFYVPLVIMVFVYSRVFQEAKRQLQKIDKSEGRFHVQNLSQVEQDGRTGHGLRRSSKFCLKEHKALKTLGIIMGTFTLCWLPFFIVNIVHVIQDNLIRKEVYILLNWIGYVNSGFNPLIYCRSPDFRIAFQELLCLRRSSLKAYGNGYSSNGNTGEQSGYHVEQEKENKLLCEDLPGTEDFVGHQGTVPSDNIDSQGRNCSTNDSLL',
 'af-a4d2n2-f1-model_v4.cif': 'MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAATFAWNLLVLATILRVRTFHRVPHNLVASMAVSDVLVAALVMPLSLVHELSGRRWQLGRRLCQLWIACDVLCCTASIWNVTAIALDRYWSITRHMEYTLRTRKCVSNVMIALTWALSAVISLAPLLFGWGETYSEGSEECQVSREPSYAVFSTVGAFYLPLCVVLFVYWKIYKAAKFRVGSRKTNSVSPISEAVEVKDSAKQPQMVFTVRHATVTFQPEGDTWREQKEQRAALMVGILIGVFVLCWIPFFLTELISPLCSCDIPAIWKSIFLWLGYSNSFFNPLIYTAFNKNYNSAFKNFFSRQH',
 'af-p35372-f1-model_v4.cif': 'MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGGRDSLCPPTGSPSMITAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALATSTLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDFRTPRNAKIINVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFIFAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHIYVIIKALVTIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSNIEQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP',
 'af-o43613-f1-model_v4.cif': 'MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYVAVFVVALVGNTLVCLAVWRNHHMRTVTNYFIVNLSLADVLVTAICLPASLLVDITESWLFGHALCKVIPYLQAVSVSVAVLTLSFIALDRWYAICHPLLFKSTARRARGSILGIWAVSLAIMVPQAAVMECSSVLPELANRTRLFSVCDERWADDLYPKIYHSCFFIVTYLAPLGLMAMAYFQIFRKLWGRQIPGTTSALVRNWKRPSDQLGDLEQGLSGEPQPRARAFLAEVKQMRARRKTAKMLMVVLLVFALCYLPISVLNVLKRVFGMFRQASDREAVYACFTFSHWLVYANSAANPIIYNFLSGKFREQFKAAFSCCLPGLGPCGSLKAPSPRSSASHKSLSLQSRCSISKISEHVVLTSVTTVLP',
 'af-p61073-f1-model_v4.cif': 'MEGISIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIVGNGLVILVMGYQKKLRSMTDKYRLHLSVADLLFVITLPFWAVDAVANWYFGNFLCKAVHVIYTVNLYSSVLILAFISLDRYLAIVHATNSQRPRKLLAEKVVYVGVWIPALLLTIPDFIFANVSEADDRYICDRFYPNDLWVVVFQFQHIMVGLILPGIVILSCYCIIISKLSHSKGHQKRKALKTTVILILAFFACWLPYYIGISIDSFILLEIIKQGCEFENTVHKWISITEALAFFHCCLNPILYAFLGAKFKTSAQHALTSVSRGSSLKILSKGKRGGHSSVSTESESSSFHSS'}

Let's write these sequences to a FASTA file, since we'll need them later.

In [12]:






def write_seq_to_fasta(results: dict[str, str], file_name: str) -> None:
    """Write sequences to a FASTA file, one record per entry in `results`."""
    with open(file_name, 'w') as f:
        for key, value in results.items():
            # Use the file name up to its first '.' as the record header
            f.write(f">{key.split('.')[0]}\n{value}\n")

In [13]:


# Write sequences to FASTA file
write_seq_to_fasta(processor.results, 'data/sequences.fasta')

In [14]:


# Lahuta provides an easier way to parse and save FASTA files
from lahuta.msa import MSAParser
MSAParser(sequences=processor.results).save('data/sequences.fasta')
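Before reusing a FASTA file, it can be worth reading it back to confirm that the headers and sequences survived the round trip. The sketch below uses only the standard library; `read_fasta` is a hypothetical helper written for this example (it is not part of Lahuta), and the two short sequences are made up.

```python
import os
import tempfile


def read_fasta(file_name: str) -> dict[str, str]:
    """Parse a FASTA file into {header: sequence} (hypothetical helper)."""
    records: dict[str, str] = {}
    header = None
    with open(file_name) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = ""
            elif header is not None:
                # A sequence may span several lines; join them
                records[header] += line
    return records


# Round-trip check on a throwaway file with made-up sequences
toy = {"seq_a": "MGQPG", "seq_b": "MDLPV"}
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "w") as f:
    for name, seq in toy.items():
        f.write(f">{name}\n{seq}\n")

assert read_fasta(path) == toy
```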

`FileProcessor`¶

In [15]:


from lahuta.api import FileProcessor

# set operations for neighbor pairs
from lahuta.api import union, intersection, difference, symmetric_difference

In [16]:


processor = FileProcessor(
    file_list=list(downloads.values())[:2], # file paths of downloaded files
)

In [17]:


processor.process(union)

Out[17]:

<Lahuta NeighborPairs class containing 2944 atoms and 23774 pairs>

In [18]:


processor.process(intersection)

Out[18]:

<Lahuta NeighborPairs class containing 1007 atoms and 1175 pairs>

Note that the input can be any callable. Given the simple syntax supported by NeighborPairs, it is very easy to define custom functions that can be applied to each loaded file iteratively. For instance, defining a function that computes the union of contacts:

In [19]:


processor.process(lambda x, y: x + y) # equivalent to `union`

Out[19]:

<Lahuta NeighborPairs class containing 2944 atoms and 23774 pairs>

In the above examples we iterate over a list of files, load the data and compute atom neighbors. We then iteratively take the union, intersection, etc. of the neighbor pairs:

luni = Luni('file1.cif')
luni2 = Luni('file2.cif')
...

ns = luni.compute_neighbors()
ns2 = luni2.compute_neighbors()
...

ns = ns.union(ns2)
ns = ns.union(ns3)
ns = ns.union(ns4)
ns = ns.union(ns5)
...

FileProcessor makes these operations easy to use. It is also memory efficient, because it does not store the results for each file.
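The iterative combination shown above is a fold (reduce) over the files. The sketch below illustrates that pattern with the standard library only. It is not Lahuta's implementation: `fold_files` is a hypothetical helper, and frozensets of made-up residue pairs stand in for NeighborPairs objects. Because the inputs are produced lazily by a generator, only the running result and the item currently being combined need to be held at any time.

```python
from functools import reduce
from typing import Callable, Iterable


def fold_files(
    files: Iterable[str],
    load: Callable[[str], frozenset],
    combine: Callable[[frozenset, frozenset], frozenset],
) -> frozenset:
    """Load one file at a time and fold it into a running result (hypothetical helper)."""
    return reduce(combine, (load(f) for f in files))


# Toy "contact sets" keyed by file name; real code would parse each file instead
toy_contacts = {
    "file1.cif": frozenset({(1, 2), (2, 3)}),
    "file2.cif": frozenset({(2, 3), (3, 4)}),
    "file3.cif": frozenset({(2, 3), (4, 5)}),
}

all_contacts = fold_files(toy_contacts, toy_contacts.__getitem__, frozenset.union)
shared_contacts = fold_files(toy_contacts, toy_contacts.__getitem__, frozenset.intersection)
print(sorted(all_contacts))
print(sorted(shared_contacts))
```

Any two-argument callable can be passed as `combine`, which mirrors the way `union`, `intersection`, or a lambda is passed to `processor.process`.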

An important note when working with multiple files¶

Comparing NeighborPairs from different proteins requires an additional step.

Different protein files will have different residue numbering, different chain information, and different atom data. As such, directly comparing the `NeighborPairs` objects will not work. Instead, we first need to align the proteins, map their indices, and only then compare the NeighborPairs objects. This will be discussed in the next section.

For example, if we try to compute the union of all the example GPCRs, we'd get an error:

In [64]:






processor = FileProcessor(
    file_list=list(downloads.values()), # file paths of downloaded files
)
# will result in an error
# processor.process(union)
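The numbering problem can be illustrated with a toy example that does not use Lahuta at all. The residue numbers and the `b_to_a` mapping below are made up, and plain Python sets stand in for NeighborPairs. Unlike Lahuta, plain sets do not raise an error here; they simply return a meaningless answer until the indices are mapped onto a common numbering.

```python
# Toy contacts as pairs of residue numbers; not real data
contacts_a = {(10, 14), (20, 55)}
contacts_b = {(12, 16), (22, 57)}  # same contacts, numbering shifted by 2

# Comparing raw residue numbers finds nothing in common
naive_shared = contacts_a & contacts_b

# Hypothetical mapping from protein B numbering to protein A numbering,
# of the kind a sequence alignment would provide
b_to_a = {12: 10, 16: 14, 22: 20, 57: 55}
mapped_b = {(b_to_a[i], b_to_a[j]) for i, j in contacts_b}

# After mapping, both contacts are recognized as shared
mapped_shared = contacts_a & mapped_b

print(naive_shared)
print(sorted(mapped_shared))
```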
