Tracing Contacts Across Evolutionary Time¶

We now have all the incredients to compute contacts between different proteins and meaningfully compare them. This is a very powerful feature as it allows us to use the feature rich operations supported by NeighborPairs objects and apply them to contacts between different proteins. All thanks to the Multiple Sequence Alignment (MSA) which serves as a common reference frame for all proteins.

There are two things to keep in mind:

We still have a difference of resolutions. The MSA is computed at the residue level, but contacts are computed at the atom level. This means we are on our own when it comes to mapping atoms for each residue. Luckily, Lahuta helps us with this as well.
MSA are only meaningful for proteins, whereas biomolecular systems (input files) can and usually do contain other molecules such as DNA, RNA, ligands, etc. This means that we can only use the MSA to map protein residues. Again, Lahuta helps us with this as well.

How does Lahuta handle these two issues?

When mapping atoms using a provided sequence, Lahuta will produce a LabeledNeighborPairs object, rather than a NeighborPairs object. This new object is better suited for comparing contacts between different proteins as it provides several convenience methods. We'll go over the specifics in a bit.
MSA are only meaningful for proteins and nucleic acids. Lahuta performs two different types of mappings:
1. proteins are mapped according to the provided sequence.
2. non proteins are mapped by shifting their residue index by the number of residues in the protein.

Let's see how this works in practice.

In [2]:

Copied!

from lahuta import Luni
from lahuta.msa import MSAParser

from lahuta.api import CachedFileProcessor
from lahuta import Luni
from lahuta.msa import MSAParser

from lahuta.api import CachedFileProcessor

In [3]:

Copied!

def func(file_path: str) -> str:
    luni = Luni(file_path)
    return luni.sequence
def func(file_path: str) -> str:
    luni = Luni(file_path)
    return luni.sequence

In [4]:

Copied!





processor = CachedFileProcessor(
    file_list=['data/af-a4d2n2-f1-model_v4.cif', 'data/af-o43613-f1-model_v4.cif'], 
    worker=func,
)
processor.process()
processor = CachedFileProcessor(
    file_list=['data/af-a4d2n2-f1-model_v4.cif', 'data/af-o43613-f1-model_v4.cif'], 
    worker=func,
)
processor.process()

In [5]:

Copied!

processor.results
processor.results

Out[5]:

{'af-a4d2n2-f1-model_v4.cif': 'MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAATFAWNLLVLATILRVRTFHRVPHNLVASMAVSDVLVAALVMPLSLVHELSGRRWQLGRRLCQLWIACDVLCCTASIWNVTAIALDRYWSITRHMEYTLRTRKCVSNVMIALTWALSAVISLAPLLFGWGETYSEGSEECQVSREPSYAVFSTVGAFYLPLCVVLFVYWKIYKAAKFRVGSRKTNSVSPISEAVEVKDSAKQPQMVFTVRHATVTFQPEGDTWREQKEQRAALMVGILIGVFVLCWIPFFLTELISPLCSCDIPAIWKSIFLWLGYSNSFFNPLIYTAFNKNYNSAFKNFFSRQH',
 'af-o43613-f1-model_v4.cif': 'MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYVAVFVVALVGNTLVCLAVWRNHHMRTVTNYFIVNLSLADVLVTAICLPASLLVDITESWLFGHALCKVIPYLQAVSVSVAVLTLSFIALDRWYAICHPLLFKSTARRARGSILGIWAVSLAIMVPQAAVMECSSVLPELANRTRLFSVCDERWADDLYPKIYHSCFFIVTYLAPLGLMAMAYFQIFRKLWGRQIPGTTSALVRNWKRPSDQLGDLEQGLSGEPQPRARAFLAEVKQMRARRKTAKMLMVVLLVFALCYLPISVLNVLKRVFGMFRQASDREAVYACFTFSHWLVYANSAANPIIYNFLSGKFREQFKAAFSCCLPGLGPCGSLKAPSPRSSASHKSLSLQSRCSISKISEHVVLTSVTTVLP'}

In [6]:

Copied!

parser = MSAParser('data/custom_alignment.fasta')
parser.sequences
parser = MSAParser('data/custom_alignment.fasta')
parser.sequences

Out[6]:

{'af-p07550-f1-model_v4': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'),
 'af-a4d2n2-f1-model_v4': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'),
 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'),
 'af-o43613-f1-model_v4': Seq('MEPSA---------TPGAQMGVPPGSR------------EPSPVPPDYED-EFL...---'),
 'af-p61073-f1-model_v4': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---')}

In [7]:

Copied!

sys1_name, sys2_name = 'af-p07550-f1-model_v4', 'af-a4d2n2-f1-model_v4'
sys1 = Luni(f'data/{sys1_name}.cif')
sys2 = Luni(f'data/{sys2_name}.cif')
sys1_name, sys2_name = 'af-p07550-f1-model_v4', 'af-a4d2n2-f1-model_v4'
sys1 = Luni(f'data/{sys1_name}.cif')
sys2 = Luni(f'data/{sys2_name}.cif')

In [8]:

Copied!

ns1 = sys1.compute_neighbors()
ns2 = sys2.compute_neighbors()
ns1 = sys1.compute_neighbors()
ns2 = sys2.compute_neighbors()

In [9]:

Copied!

ns1_mapped = ns1.map(parser.sequences[sys1_name])
ns2_mapped = ns2.map(parser.sequences[sys2_name])
ns1_mapped = ns1.map(parser.sequences[sys1_name])
ns2_mapped = ns2.map(parser.sequences[sys2_name])

Now we can perform any operation on the mapped neighbors, with the certainty that they are meaningful. For example, let's compute the intersection:

In [10]:

Copied!

# intersection 
ns1_mapped & ns2_mapped
# intersection 
ns1_mapped & ns2_mapped

Out[10]:

<Lahuta LabeledNeighborPairs class containing 1391 pairs>

Note how the return value is a LabeledNeighborPairs object, which is similar to a NeighborPairs object, but allows for a more customizable labeling of atoms

Understanding `LabeledNeighborPairs`¶

In [11]:

Copied!

ns1_mapped.pairs
ns1_mapped.pairs

Out[11]:

array([[('C', '41', 'MET'), ('N', '43', 'GLN')],
       [('O', '41', 'MET'), ('N', '43', 'GLN')],
       [('C', '42', 'GLY'), ('N', '44', 'PRO')],
       ...,
       [('C', '505', 'ASP'), ('N', '507', 'LEU')],
       [('C', '506', 'SER'), ('N', '508', 'LEU')],
       [('O', '506', 'SER'), ('N', '508', 'LEU')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

In [12]:

Copied!

ns2_mapped.pairs
ns2_mapped.pairs

Out[12]:

array([[('C', '31', 'MET'), ('N', '33', 'LEU')],
       [('O', '31', 'MET'), ('N', '33', 'LEU')],
       [('C', '32', 'ASP'), ('N', '34', 'PRO')],
       ...,
       [('C', '437', 'ARG'), ('O', '439', 'HIS')],
       [('O', '437', 'ARG'), ('N', '439', 'HIS')],
       [('O', '437', 'ARG'), ('O', '439', 'HIS')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

These are the atoms that are shared between the two systems:

In [13]:

Copied!

shared = ns1_mapped & ns2_mapped
shared.pairs
shared = ns1_mapped & ns2_mapped
shared.pairs

Out[13]:

array([[('N', '72', 'VAL'), ('N', '75', 'VAL')],
       [('CA', '72', 'VAL'), ('N', '75', 'VAL')],
       [('CA', '72', 'VAL'), ('CB', '75', 'VAL')],
       ...,
       [('O', '413', 'LEU'), ('N', '415', 'TYR')],
       [('O', '413', 'LEU'), ('CA', '415', 'TYR')],
       [('O', '413', 'LEU'), ('C', '415', 'TYR')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Why do we have to use this new object? Why can't we just use the NeighborPairs object? The reason is that NeighborPairs stores atoms as an object tied to the read-in system. This means that many operations will not be defined (e.g. the union of two NeighborPairs objects will entail atoms from two different systems, which is not defined). The LabeledNeighborPairs object, on the other hand, stores atoms as a tuple of (atom_name, resids, resnames).

In [15]:

Copied!

shared.pairs['names']
shared.pairs['names']

Out[15]:

array([['N', 'N'],
       ['CA', 'N'],
       ['CA', 'CB'],
       ...,
       ['O', 'N'],
       ['O', 'CA'],
       ['O', 'C']], dtype='<U25')

In [16]:

Copied!

shared.pairs['resids']
shared.pairs['resids']

Out[16]:

array([['72', '75'],
       ['72', '75'],
       ['72', '75'],
       ...,
       ['413', '415'],
       ['413', '415'],
       ['413', '415']], dtype='<U25')

In [17]:

Copied!

shared.pairs['resnames']
shared.pairs['resnames']

Out[17]:

array([['VAL', 'VAL'],
       ['VAL', 'VAL'],
       ['VAL', 'VAL'],
       ...,
       ['LEU', 'TYR'],
       ['LEU', 'TYR'],
       ['LEU', 'TYR']], dtype='<U25')

Selections, Filters, and Operations¶

When mapping atoms, Lahuta produces a full mapping of all atoms in the system. We do, however, have quite a lot of control over what atoms and what information for each atom is stored. For example, let's say we do not care about residue names. This is quite sensible, since the MSA will align residues based on much more sophisticated criteria than their names.

Let's calculate the intersection again, but this time we'll only use the atom names and residue ids:

In [18]:

Copied!

ns1_mapped.remove('resnames') # remove resnames from ns1
ns1_mapped.remove('resnames') # remove resnames from ns1

Out[18]:

<Lahuta LabeledNeighborPairs class containing 13158 pairs>

In [19]:

Copied!

shared_no_resnames = ns1_mapped.remove('resnames') & ns2_mapped.remove('resnames')
shared_no_resnames.pairs
shared_no_resnames = ns1_mapped.remove('resnames') & ns2_mapped.remove('resnames')
shared_no_resnames.pairs

Out[19]:

array([[('C', '41', ''), ('N', '43', '')],
       [('O', '41', ''), ('N', '43', '')],
       [('C', '42', ''), ('N', '44', '')],
       ...,
       [('C', '437', ''), ('N', '439', '')],
       [('C', '437', ''), ('CA', '439', '')],
       [('O', '437', ''), ('N', '439', '')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Note how the results also lack resnames information. Such operations, therefore, are lossy. Also note that you usually need to perform the same operation on both systems, otherwise you might get unexpected results.

For example, if we do not remove the resnames from the second system, we get zero overlap between contacts:

In [20]:

Copied!

ns1_mapped.remove('resnames') & ns2_mapped
ns1_mapped.remove('resnames') & ns2_mapped

Out[20]:

<Lahuta LabeledNeighborPairs class containing 0 pairs>

See the examples below for more details on how to perform different operations with `LabeledNeighborPairs`` objects¶

Remove names and resnames (leaving only resids)

In [21]:

Copied!

ns1_mapped.remove('names').remove('resnames').pairs
ns1_mapped.remove('names').remove('resnames').pairs

Out[21]:

array([[('', '41', ''), ('', '43', '')],
       [('', '42', ''), ('', '44', '')],
       [('', '43', ''), ('', '45', '')],
       ...,
       [('', '504', ''), ('', '506', '')],
       [('', '505', ''), ('', '507', '')],
       [('', '506', ''), ('', '508', '')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Remove resnames only for MET & GLN residues:

In [22]:

Copied!

ns1_mapped.select(resnames=['MET', 'GLN']).remove('resnames').pairs
ns1_mapped.select(resnames=['MET', 'GLN']).remove('resnames').pairs

Out[22]:

array([[('C', '41', ''), ('N', '43', '')],
       [('O', '41', ''), ('N', '43', '')],
       [('C', '42', 'GLY'), ('N', '44', 'PRO')],
       ...,
       [('C', '505', 'ASP'), ('N', '507', 'LEU')],
       [('C', '506', 'SER'), ('N', '508', 'LEU')],
       [('O', '506', 'SER'), ('N', '508', 'LEU')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Remove all names for all atoms except for MET and GLN residues.

Two ways to do this:

In [24]:

Copied!

# Using inverse
ns1_mapped.select(resnames=['MET', 'GLN']).inverse().remove('names').pairs
# Using inverse
ns1_mapped.select(resnames=['MET', 'GLN']).inverse().remove('names').pairs

Out[24]:

array([[('C', '41', 'MET'), ('N', '43', 'GLN')],
       [('O', '41', 'MET'), ('N', '43', 'GLN')],
       [('', '42', 'GLY'), ('', '44', 'PRO')],
       ...,
       [('', '504', 'ASN'), ('', '506', 'SER')],
       [('', '505', 'ASP'), ('', '507', 'LEU')],
       [('', '506', 'SER'), ('', '508', 'LEU')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

In [25]:

Copied!

# Using exclude
ns1_mapped.exclude(resnames=['MET', 'GLN']).remove('names').pairs
# Using exclude
ns1_mapped.exclude(resnames=['MET', 'GLN']).remove('names').pairs

Out[25]:

array([[('C', '41', 'MET'), ('N', '43', 'GLN')],
       [('O', '41', 'MET'), ('N', '43', 'GLN')],
       [('', '42', 'GLY'), ('', '44', 'PRO')],
       ...,
       [('', '504', 'ASN'), ('', '506', 'SER')],
       [('', '505', 'ASP'), ('', '507', 'LEU')],
       [('', '506', 'SER'), ('', '508', 'LEU')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Remove all backbone atoms but only for protein residues

In [26]:

Copied!

protein_resnames = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL']
ns1_mapped.select(resnames=protein_resnames).select(names=['CA', 'C', 'N', 'O']).remove('names').pairs[:15]
protein_resnames = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL']
ns1_mapped.select(resnames=protein_resnames).select(names=['CA', 'C', 'N', 'O']).remove('names').pairs[:15]

Out[26]:

array([[('', '41', 'MET'), ('', '43', 'GLN')],
       [('', '42', 'GLY'), ('', '44', 'PRO')],
       [('', '43', 'GLN'), ('', '45', 'GLY')],
       [('', '44', 'PRO'), ('', '46', 'ASN')],
       [('', '45', 'GLY'), ('', '47', 'GLY')],
       [('', '46', 'ASN'), ('', '48', 'SER')],
       [('', '47', 'GLY'), ('', '49', 'ALA')],
       [('', '48', 'SER'), ('', '50', 'PHE')],
       [('', '48', 'SER'), ('CB', '50', 'PHE')],
       [('', '49', 'ALA'), ('', '51', 'LEU')],
       [('', '50', 'PHE'), ('', '52', 'LEU')],
       [('CD1', '50', 'PHE'), ('CG', '52', 'LEU')],
       [('CD1', '50', 'PHE'), ('CD2', '52', 'LEU')],
       [('', '51', 'LEU'), ('', '53', 'ALA')],
       [('', '52', 'LEU'), ('', '54', 'PRO')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Here is a more complex example:

Select hydrophobic residues and rename them to HIDROPHOBIC

In [27]:

Copied!

# Note how we have to perform the selection twice
hydrophobic_resnames = ['ALA', 'ILE', 'LEU', 'MET', 'PHE', 'PRO', 'TRP', 'VAL']
ns1_mapped.select(resnames=hydrophobic_resnames).rename('resnames', lambda x: 'HIDROPHOBIC').select(resnames=['HIDROPHOBIC']).rename('names', lambda x: 'X').pairs
# Note how we have to perform the selection twice
hydrophobic_resnames = ['ALA', 'ILE', 'LEU', 'MET', 'PHE', 'PRO', 'TRP', 'VAL']
ns1_mapped.select(resnames=hydrophobic_resnames).rename('resnames', lambda x: 'HIDROPHOBIC').select(resnames=['HIDROPHOBIC']).rename('names', lambda x: 'X').pairs

Out[27]:

array([[('X', '41', 'HIDROPHOBIC'), ('N', '43', 'GLN')],
       [('C', '42', 'GLY'), ('X', '44', 'HIDROPHOBIC')],
       [('C', '43', 'GLN'), ('N', '45', 'GLY')],
       ...,
       [('C', '505', 'ASP'), ('X', '507', 'HIDROPHOBIC')],
       [('C', '506', 'SER'), ('X', '508', 'HIDROPHOBIC')],
       [('O', '506', 'SER'), ('X', '508', 'HIDROPHOBIC')]],
      dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])

Finally, Note that these selection and filtering operations are applied at the single label level. When we remove, for example, backbone atoms, we are esentially replacing the label with an empty string. We are, however, not dropping the entire contact. Lahuta also makes sure duplicate entries are removed.

Tracing Contacts Across Evolutionary Time¶

Understanding LabeledNeighborPairs¶

Selections, Filters, and Operations¶

See the examples below for more details on how to perform different operations with `LabeledNeighborPairs`` objects¶

Understanding `LabeledNeighborPairs`¶