Working with Multiple Sequence Alignment (MSA)¶
Lahuta
has built-in support for complex operations on NeighborPairs objects derived from different proteins.
To achieve this, Lahuta
needs to map the indices of the atoms in NeighborPairs
objects to corresponding indices derived from a Multiple Sequence Alignmet (MSA). The basic idea is that the MSA provides a mapping between different proteins. Importantly, this mapping is at the level of amino acids, not atoms. Contacts, however, are defined at the level of atoms.
There are many things that happen in the background, however, most of the details are abstracted away. Lahuta
provides a simple API to achieve this. This API is quite flexible and allows for many different use cases.
The previous notebook described how to easily get the protein sequences from list of input files or a directory. This notebook will describe built-in support for handling MSAs. This includes reading/parsing MSAs, creating a MSA alignment from read sequences, and even using a custom MSA.
Reading/Parsing MSA files¶
Let's read the fasta file we saved in the previous notebook. To read/parse a fasta file, Lahuta
provides the MSAParser
class.
from lahuta.msa import MSAParser
parser = MSAParser('data/sequences.fasta')
parser.sequences
{'af-p07550-f1-model_v4': Seq('MGQPGNGSAFLLAPNGSHAPDHDVTQERDEVWVVGMGIVMSLIVLAIVFGNVLV...SLL'), 'af-a4d2n2-f1-model_v4': Seq('MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAAT...RQH'), 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...PLP'), 'af-o43613-f1-model_v4': Seq('MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYV...VLP'), 'af-p61073-f1-model_v4': Seq('MEGISIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIV...HSS')}
Note that these sequences are not aligned yet. You can align them yourself or use the alignment provided by the parser.
Let's align the sequences using MAFFT
. This will give us a new parser object with aligned sequences.
# suppress info logging
alig_parser = parser.align(backend='mafft', n_jobs=4) # backend: 'mafft' or 'muscle'. Default: 'mafft'
alig_parser.sequences
INFO:root:Alignment completed. Output file located at: /tmp/tmpqax0olmm
{'af-p07550-f1-model_v4': Seq('----------------MG---------QPGNGSAFLLA-----PNGSHAPDH--...LL-'), 'af-a4d2n2-f1-model_v4': Seq('----------------MD---------LPVNLTSFSLS-----TPSPLETNHS-...---'), 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALA---YSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTD...---'), 'af-o43613-f1-model_v4': Seq('MEPSATPG------AQMG---------VPPGSRE----------PSPVPPDYED...VLP'), 'af-p61073-f1-model_v4': Seq('MEGISIYT-SDNYTEEMGSGDYDS-------------------MKEPC--FREE...---')}
Using your own MSAs¶
Let's read in a custom alignment:
# load pre-prepared alignment
prep_alig = MSAParser('data/gpcrdb_sample.fasta')
# align gpcr sequences using the custom alignment as reference
custom_alignment = parser.align(ref_alignment=prep_alig.sequences, n_jobs=2)
custom_alignment.sequences
INFO:root:Alignment completed. Output file located at: /tmp/tmppue89mev
{'5ht5a_human': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'), 'adrb2_human': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'), 'oprm_human': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'), 'ox2r_human': Seq('--------------MSGTKLEDSPPCRNWSSASELNETQEPFLNPTDYDDEEFL...---'), 'cxcr4_human': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---'), 'af-p07550-f1-model_v4': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'), 'af-a4d2n2-f1-model_v4': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'), 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'), 'af-o43613-f1-model_v4': Seq('MEPSA---------TPGAQMGVPPGSR------------EPSPVPPDYED-EFL...---'), 'af-p61073-f1-model_v4': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---')}
To get only the aligned sequences (without the reference sequence) we simply substract the relevant objects.
Let's do that and save the output to a fasta file, which we'll need in the next tutorial.
custom_alig_diff = (custom_alignment - prep_alig)
custom_alig_diff.save('data/custom_alignment.fasta')
Both addition (union) and subtraction (difference) of two MSAParser objects are supported.
The result is a new MSAParser object.
Controlling the alignment process¶
The syntax above is quite convenient and should be sufficient for most use cases.
However, if you want to have more control over the alignmet process you can directly use the underlying backends. Lahuta
provides two backends: Mafft
and Muscle
. These are essentially wrappers around the command line tools. They are not intended to be feature complete. However, they should be sufficient for most use cases.
from lahuta.msa import Mafft, Muscle
Mafft¶
mafft = Mafft(parser.sequences)
mafft.run(n_jobs=1)
# uncomment to see the result
# print(mafft.result)
INFO:root:Alignment completed. Output file located at: /tmp/tmpiwssfwk_
# Get the command used to run mafft
mafft.command
'mafft --thread 1 /tmp/tmpq_6dvpah'
Muscle¶
muscle = Muscle(parser.sequences)
muscle.run()
# uncomment to see the result
# print(muscle.result)
INFO:root:Alignment completed. Output file located at: /tmp/tmpojbhtoz5_aligned.fasta
# Get the command used to run muscle
muscle.command
'muscle -align /tmp/tmpojbhtoz5 -output /tmp/tmpojbhtoz5_aligned.fasta'
Setting Advanced Parameters¶
It is possible to also provide advanced parameters to the backends. Again, we note that the backends are not feature complete, however, with the ability to set advanced parameters, you should be able to achieve most use cases.
muscle = Muscle(parser.sequences)
muscle.set_advanced_alignment(strategy="stratified", replicates=4)
muscle.run()
# print(muscle.result)
INFO:root:Alignment completed. Output file located at: /tmp/tmpwyvann6n_aligned.fasta
Setting Advanced Parameters for MAFFT¶
mafft = Mafft(parser.sequences)
mafft.set_advanced_options('auto', maxiterate='1000', retree='2')
mafft.run()
# print(mafft.result)
INFO:root:Alignment completed. Output file located at: /tmp/tmpbj_l2erw
Next notebook¶
In the next notebook we will put everything together.
We will combine our knowledge of NeighborPairs
and MSAParser
to perform complex operations on NeighborPairs
objects derived from different proteins.