Tracing Contacts Across Evolutionary Time¶
We now have all the incredients to compute contacts between different proteins and meaningfully
compare them. This is a very powerful feature as it allows us to use the feature rich operations supported by NeighborPairs
objects and apply them to contacts between different proteins. All thanks to the Multiple Sequence Alignment (MSA) which serves as a common reference frame for all proteins.
There are two things to keep in mind:
- We still have a difference of resolutions. The MSA is computed at the residue level, but contacts are computed at the atom level. This means we are on our own when it comes to mapping atoms for each residue. Luckily,
Lahuta
helps us with this as well. - MSA are only meaningful for proteins, whereas biomolecular systems (input files) can and usually do contain other molecules such as DNA, RNA, ligands, etc. This means that we can only use the MSA to map protein residues. Again,
Lahuta
helps us with this as well.
How does Lahuta
handle these two issues?
- When mapping atoms using a provided sequence,
Lahuta
will produce aLabeledNeighborPairs
object, rather than aNeighborPairs
object. This new object is better suited for comparing contacts between different proteins as it provides several convenience methods. We'll go over the specifics in a bit. - MSA are only meaningful for proteins and nucleic acids.
Lahuta
performs two different types of mappings:proteins
are mapped according to the provided sequence.non proteins
are mapped by shifting their residue index by the number of residues in the protein.
Let's see how this works in practice.
from lahuta import Luni
from lahuta.msa import MSAParser
from lahuta.api import CachedFileProcessor
def func(file_path: str) -> str:
luni = Luni(file_path)
return luni.sequence
processor = CachedFileProcessor(
file_list=['data/af-a4d2n2-f1-model_v4.cif', 'data/af-o43613-f1-model_v4.cif'],
worker=func,
)
processor.process()
processor.results
{'af-a4d2n2-f1-model_v4.cif': 'MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAATFAWNLLVLATILRVRTFHRVPHNLVASMAVSDVLVAALVMPLSLVHELSGRRWQLGRRLCQLWIACDVLCCTASIWNVTAIALDRYWSITRHMEYTLRTRKCVSNVMIALTWALSAVISLAPLLFGWGETYSEGSEECQVSREPSYAVFSTVGAFYLPLCVVLFVYWKIYKAAKFRVGSRKTNSVSPISEAVEVKDSAKQPQMVFTVRHATVTFQPEGDTWREQKEQRAALMVGILIGVFVLCWIPFFLTELISPLCSCDIPAIWKSIFLWLGYSNSFFNPLIYTAFNKNYNSAFKNFFSRQH', 'af-o43613-f1-model_v4.cif': 'MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYVAVFVVALVGNTLVCLAVWRNHHMRTVTNYFIVNLSLADVLVTAICLPASLLVDITESWLFGHALCKVIPYLQAVSVSVAVLTLSFIALDRWYAICHPLLFKSTARRARGSILGIWAVSLAIMVPQAAVMECSSVLPELANRTRLFSVCDERWADDLYPKIYHSCFFIVTYLAPLGLMAMAYFQIFRKLWGRQIPGTTSALVRNWKRPSDQLGDLEQGLSGEPQPRARAFLAEVKQMRARRKTAKMLMVVLLVFALCYLPISVLNVLKRVFGMFRQASDREAVYACFTFSHWLVYANSAANPIIYNFLSGKFREQFKAAFSCCLPGLGPCGSLKAPSPRSSASHKSLSLQSRCSISKISEHVVLTSVTTVLP'}
parser = MSAParser('data/custom_alignment.fasta')
parser.sequences
{'af-p07550-f1-model_v4': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'), 'af-a4d2n2-f1-model_v4': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'), 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'), 'af-o43613-f1-model_v4': Seq('MEPSA---------TPGAQMGVPPGSR------------EPSPVPPDYED-EFL...---'), 'af-p61073-f1-model_v4': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---')}
sys1_name, sys2_name = 'af-p07550-f1-model_v4', 'af-a4d2n2-f1-model_v4'
sys1 = Luni(f'data/{sys1_name}.cif')
sys2 = Luni(f'data/{sys2_name}.cif')
ns1 = sys1.compute_neighbors()
ns2 = sys2.compute_neighbors()
ns1_mapped = ns1.map(parser.sequences[sys1_name])
ns2_mapped = ns2.map(parser.sequences[sys2_name])
Now we can perform any operation on the mapped neighbors, with the certainty that they are meaningful. For example, let's compute the intersection:
# intersection
ns1_mapped & ns2_mapped
<Lahuta LabeledNeighborPairs class containing 1391 pairs>
Note how the return value is a LabeledNeighborPairs
object, which is similar to a NeighborPairs
object, but allows for a more customizable labeling of atoms
Understanding LabeledNeighborPairs
¶
ns1_mapped.pairs
array([[('C', '41', 'MET'), ('N', '43', 'GLN')], [('O', '41', 'MET'), ('N', '43', 'GLN')], [('C', '42', 'GLY'), ('N', '44', 'PRO')], ..., [('C', '505', 'ASP'), ('N', '507', 'LEU')], [('C', '506', 'SER'), ('N', '508', 'LEU')], [('O', '506', 'SER'), ('N', '508', 'LEU')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
ns2_mapped.pairs
array([[('C', '31', 'MET'), ('N', '33', 'LEU')], [('O', '31', 'MET'), ('N', '33', 'LEU')], [('C', '32', 'ASP'), ('N', '34', 'PRO')], ..., [('C', '437', 'ARG'), ('O', '439', 'HIS')], [('O', '437', 'ARG'), ('N', '439', 'HIS')], [('O', '437', 'ARG'), ('O', '439', 'HIS')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
These are the atoms that are shared between the two systems:
shared = ns1_mapped & ns2_mapped
shared.pairs
array([[('N', '72', 'VAL'), ('N', '75', 'VAL')], [('CA', '72', 'VAL'), ('N', '75', 'VAL')], [('CA', '72', 'VAL'), ('CB', '75', 'VAL')], ..., [('O', '413', 'LEU'), ('N', '415', 'TYR')], [('O', '413', 'LEU'), ('CA', '415', 'TYR')], [('O', '413', 'LEU'), ('C', '415', 'TYR')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Why do we have to use this new object? Why can't we just use the NeighborPairs
object? The reason is that NeighborPairs
stores atoms as an object tied to the read-in system. This means that many operations will not be defined (e.g. the union of two NeighborPairs
objects will entail atoms from two different systems, which is not defined). The LabeledNeighborPairs
object, on the other hand, stores atoms as a tuple of (atom_name, resids, resnames)
.
shared.pairs['names']
array([['N', 'N'], ['CA', 'N'], ['CA', 'CB'], ..., ['O', 'N'], ['O', 'CA'], ['O', 'C']], dtype='<U25')
shared.pairs['resids']
array([['72', '75'], ['72', '75'], ['72', '75'], ..., ['413', '415'], ['413', '415'], ['413', '415']], dtype='<U25')
shared.pairs['resnames']
array([['VAL', 'VAL'], ['VAL', 'VAL'], ['VAL', 'VAL'], ..., ['LEU', 'TYR'], ['LEU', 'TYR'], ['LEU', 'TYR']], dtype='<U25')
Selections, Filters, and Operations¶
When mapping atoms, Lahuta
produces a full mapping of all atoms in the system. We do, however, have quite a lot of control over what atoms and what information for each atom is stored. For example, let's say we do not care about residue names. This is quite sensible, since the MSA will align residues based on much more sophisticated criteria than their names.
Let's calculate the intersection again, but this time we'll only use the atom names and residue ids:
ns1_mapped.remove('resnames') # remove resnames from ns1
<Lahuta LabeledNeighborPairs class containing 13158 pairs>
shared_no_resnames = ns1_mapped.remove('resnames') & ns2_mapped.remove('resnames')
shared_no_resnames.pairs
array([[('C', '41', ''), ('N', '43', '')], [('O', '41', ''), ('N', '43', '')], [('C', '42', ''), ('N', '44', '')], ..., [('C', '437', ''), ('N', '439', '')], [('C', '437', ''), ('CA', '439', '')], [('O', '437', ''), ('N', '439', '')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Note how the results also lack resnames
information. Such operations, therefore, are lossy. Also note that you usually need to perform the same operation on both systems, otherwise you might get unexpected results.
For example, if we do not remove the resnames
from the second system, we get zero overlap between contacts:
ns1_mapped.remove('resnames') & ns2_mapped
<Lahuta LabeledNeighborPairs class containing 0 pairs>
See the examples below for more details on how to perform different operations with `LabeledNeighborPairs`` objects¶
Remove names
and resnames
(leaving only resids
)
ns1_mapped.remove('names').remove('resnames').pairs
array([[('', '41', ''), ('', '43', '')], [('', '42', ''), ('', '44', '')], [('', '43', ''), ('', '45', '')], ..., [('', '504', ''), ('', '506', '')], [('', '505', ''), ('', '507', '')], [('', '506', ''), ('', '508', '')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Remove resnames
only for MET & GLN residues:
ns1_mapped.select(resnames=['MET', 'GLN']).remove('resnames').pairs
array([[('C', '41', ''), ('N', '43', '')], [('O', '41', ''), ('N', '43', '')], [('C', '42', 'GLY'), ('N', '44', 'PRO')], ..., [('C', '505', 'ASP'), ('N', '507', 'LEU')], [('C', '506', 'SER'), ('N', '508', 'LEU')], [('O', '506', 'SER'), ('N', '508', 'LEU')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Remove all names
for all atoms except for MET
and GLN
residues.
Two ways to do this:
# Using inverse
ns1_mapped.select(resnames=['MET', 'GLN']).inverse().remove('names').pairs
array([[('C', '41', 'MET'), ('N', '43', 'GLN')], [('O', '41', 'MET'), ('N', '43', 'GLN')], [('', '42', 'GLY'), ('', '44', 'PRO')], ..., [('', '504', 'ASN'), ('', '506', 'SER')], [('', '505', 'ASP'), ('', '507', 'LEU')], [('', '506', 'SER'), ('', '508', 'LEU')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
# Using exclude
ns1_mapped.exclude(resnames=['MET', 'GLN']).remove('names').pairs
array([[('C', '41', 'MET'), ('N', '43', 'GLN')], [('O', '41', 'MET'), ('N', '43', 'GLN')], [('', '42', 'GLY'), ('', '44', 'PRO')], ..., [('', '504', 'ASN'), ('', '506', 'SER')], [('', '505', 'ASP'), ('', '507', 'LEU')], [('', '506', 'SER'), ('', '508', 'LEU')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Remove all backbone
atoms but only for protein residues
protein_resnames = ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL']
ns1_mapped.select(resnames=protein_resnames).select(names=['CA', 'C', 'N', 'O']).remove('names').pairs[:15]
array([[('', '41', 'MET'), ('', '43', 'GLN')], [('', '42', 'GLY'), ('', '44', 'PRO')], [('', '43', 'GLN'), ('', '45', 'GLY')], [('', '44', 'PRO'), ('', '46', 'ASN')], [('', '45', 'GLY'), ('', '47', 'GLY')], [('', '46', 'ASN'), ('', '48', 'SER')], [('', '47', 'GLY'), ('', '49', 'ALA')], [('', '48', 'SER'), ('', '50', 'PHE')], [('', '48', 'SER'), ('CB', '50', 'PHE')], [('', '49', 'ALA'), ('', '51', 'LEU')], [('', '50', 'PHE'), ('', '52', 'LEU')], [('CD1', '50', 'PHE'), ('CG', '52', 'LEU')], [('CD1', '50', 'PHE'), ('CD2', '52', 'LEU')], [('', '51', 'LEU'), ('', '53', 'ALA')], [('', '52', 'LEU'), ('', '54', 'PRO')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Here is a more complex example:
Select hydrophobic residues and rename them to HIDROPHOBIC
# Note how we have to perform the selection twice
hydrophobic_resnames = ['ALA', 'ILE', 'LEU', 'MET', 'PHE', 'PRO', 'TRP', 'VAL']
ns1_mapped.select(resnames=hydrophobic_resnames).rename('resnames', lambda x: 'HIDROPHOBIC').select(resnames=['HIDROPHOBIC']).rename('names', lambda x: 'X').pairs
array([[('X', '41', 'HIDROPHOBIC'), ('N', '43', 'GLN')], [('C', '42', 'GLY'), ('X', '44', 'HIDROPHOBIC')], [('C', '43', 'GLN'), ('N', '45', 'GLY')], ..., [('C', '505', 'ASP'), ('X', '507', 'HIDROPHOBIC')], [('C', '506', 'SER'), ('X', '508', 'HIDROPHOBIC')], [('O', '506', 'SER'), ('X', '508', 'HIDROPHOBIC')]], dtype=[('names', '<U25'), ('resids', '<U25'), ('resnames', '<U25')])
Finally, Note that these selection and filtering operations are applied at the single label level. When we remove, for example, backbone atoms, we are esentially replacing the label with an empty string. We are, however, not dropping the entire contact. Lahuta
also makes sure duplicate entries are removed.