Working with Multiple Sequence Alignment (MSA)¶

Lahuta has built-in support for complex operations on NeighborPairs objects derived from different proteins.

To achieve this, Lahuta needs to map the indices of the atoms in NeighborPairs objects to corresponding indices derived from a Multiple Sequence Alignmet (MSA). The basic idea is that the MSA provides a mapping between different proteins. Importantly, this mapping is at the level of amino acids, not atoms. Contacts, however, are defined at the level of atoms.

There are many things that happen in the background, however, most of the details are abstracted away. Lahuta provides a simple API to achieve this. This API is quite flexible and allows for many different use cases. The previous notebook described how to easily get the protein sequences from list of input files or a directory. This notebook will describe built-in support for handling MSAs. This includes reading/parsing MSAs, creating a MSA alignment from read sequences, and even using a custom MSA.

Reading/Parsing MSA files¶

Let's read the fasta file we saved in the previous notebook. To read/parse a fasta file, Lahuta provides the MSAParser class.

In [2]:

Copied!

from lahuta.msa import MSAParser
from lahuta.msa import MSAParser

In [3]:

Copied!

parser = MSAParser('data/sequences.fasta')
parser.sequences
parser = MSAParser('data/sequences.fasta')
parser.sequences

Out[3]:

{'af-p07550-f1-model_v4': Seq('MGQPGNGSAFLLAPNGSHAPDHDVTQERDEVWVVGMGIVMSLIVLAIVFGNVLV...SLL'),
 'af-a4d2n2-f1-model_v4': Seq('MDLPVNLTSFSLSTPSPLETNHSLGKDDLRPSSPLLSVFGVLILTLLGFLVAAT...RQH'),
 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...PLP'),
 'af-o43613-f1-model_v4': Seq('MEPSATPGAQMGVPPGSREPSPVPPDYEDEFLRYLWRDYLYPKQYEWVLIAAYV...VLP'),
 'af-p61073-f1-model_v4': Seq('MEGISIYTSDNYTEEMGSGDYDSMKEPCFREENANFNKIFLPTIYSIIFLTGIV...HSS')}

Note that these sequences are not aligned yet. You can align them yourself or use the alignment provided by the parser.

Let's align the sequences using MAFFT. This will give us a new parser object with aligned sequences.

In [4]:

Copied!

# suppress info logging
alig_parser = parser.align(backend='mafft', n_jobs=4) # backend: 'mafft' or 'muscle'. Default: 'mafft'
alig_parser.sequences
# suppress info logging
alig_parser = parser.align(backend='mafft', n_jobs=4) # backend: 'mafft' or 'muscle'. Default: 'mafft'
alig_parser.sequences

INFO:root:Alignment completed. Output file located at: /tmp/tmpqax0olmm

Out[4]:

{'af-p07550-f1-model_v4': Seq('----------------MG---------QPGNGSAFLLA-----PNGSHAPDH--...LL-'),
 'af-a4d2n2-f1-model_v4': Seq('----------------MD---------LPVNLTSFSLS-----TPSPLETNHS-...---'),
 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALA---YSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTD...---'),
 'af-o43613-f1-model_v4': Seq('MEPSATPG------AQMG---------VPPGSRE----------PSPVPPDYED...VLP'),
 'af-p61073-f1-model_v4': Seq('MEGISIYT-SDNYTEEMGSGDYDS-------------------MKEPC--FREE...---')}

Using your own MSAs¶

Let's read in a custom alignment:

In [5]:

Copied!

# load pre-prepared alignment
prep_alig = MSAParser('data/gpcrdb_sample.fasta')
# load pre-prepared alignment
prep_alig = MSAParser('data/gpcrdb_sample.fasta')

In [6]:

Copied!

# align gpcr sequences using the custom alignment as reference
custom_alignment = parser.align(ref_alignment=prep_alig.sequences, n_jobs=2)
custom_alignment.sequences
# align gpcr sequences using the custom alignment as reference
custom_alignment = parser.align(ref_alignment=prep_alig.sequences, n_jobs=2)
custom_alignment.sequences

INFO:root:Alignment completed. Output file located at: /tmp/tmppue89mev

Out[6]:

{'5ht5a_human': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'),
 'adrb2_human': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'),
 'oprm_human': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'),
 'ox2r_human': Seq('--------------MSGTKLEDSPPCRNWSSASELNETQEPFLNPTDYDDEEFL...---'),
 'cxcr4_human': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---'),
 'af-p07550-f1-model_v4': Seq('-----------------------------------------MGQPGNGSAFLLA...SLL'),
 'af-a4d2n2-f1-model_v4': Seq('-------------------------------MDLPVNLTSFSLSTPSPLETNHS...---'),
 'af-p35372-f1-model_v4': Seq('MDSSAAPTNASNCTDALAYSSCSPAPSPGSWVNLSHLDGNLSDPCGPNRTDLGG...---'),
 'af-o43613-f1-model_v4': Seq('MEPSA---------TPGAQMGVPPGSR------------EPSPVPPDYED-EFL...---'),
 'af-p61073-f1-model_v4': Seq('-------------------------------------MEGISIYTSDNYTEEMG...---')}

To get only the aligned sequences (without the reference sequence) we simply substract the relevant objects.
Let's do that and save the output to a fasta file, which we'll need in the next tutorial.

In [7]:

Copied!

custom_alig_diff = (custom_alignment - prep_alig)
custom_alig_diff.save('data/custom_alignment.fasta')
custom_alig_diff = (custom_alignment - prep_alig)
custom_alig_diff.save('data/custom_alignment.fasta')

Both addition (union) and subtraction (difference) of two MSAParser objects are supported.
The result is a new MSAParser object.

Controlling the alignment process¶

The syntax above is quite convenient and should be sufficient for most use cases.

However, if you want to have more control over the alignmet process you can directly use the underlying backends. Lahuta provides two backends: Mafft and Muscle. These are essentially wrappers around the command line tools. They are not intended to be feature complete. However, they should be sufficient for most use cases.

In [8]:

Copied!

from lahuta.msa import Mafft, Muscle
from lahuta.msa import Mafft, Muscle

Mafft¶

In [9]:

Copied!





mafft = Mafft(parser.sequences)
mafft.run(n_jobs=1)
# uncomment to see the result
# print(mafft.result)
mafft = Mafft(parser.sequences)
mafft.run(n_jobs=1)
# uncomment to see the result
# print(mafft.result)

INFO:root:Alignment completed. Output file located at: /tmp/tmpiwssfwk_

In [10]:

Copied!

# Get the command used to run mafft
mafft.command
# Get the command used to run mafft
mafft.command

Out[10]:

'mafft --thread 1 /tmp/tmpq_6dvpah'

Muscle¶

In [11]:

Copied!





muscle = Muscle(parser.sequences)
muscle.run()
# uncomment to see the result
# print(muscle.result)
muscle = Muscle(parser.sequences)
muscle.run()
# uncomment to see the result
# print(muscle.result)

INFO:root:Alignment completed. Output file located at: /tmp/tmpojbhtoz5_aligned.fasta

In [12]:

Copied!

# Get the command used to run muscle
muscle.command
# Get the command used to run muscle
muscle.command

Out[12]:

'muscle -align /tmp/tmpojbhtoz5 -output /tmp/tmpojbhtoz5_aligned.fasta'

Setting Advanced Parameters¶

It is possible to also provide advanced parameters to the backends. Again, we note that the backends are not feature complete, however, with the ability to set advanced parameters, you should be able to achieve most use cases.

In [13]:

Copied!





muscle = Muscle(parser.sequences)
muscle.set_advanced_alignment(strategy="stratified", replicates=4)
muscle.run()
# print(muscle.result)
muscle = Muscle(parser.sequences)
muscle.set_advanced_alignment(strategy="stratified", replicates=4)
muscle.run()
# print(muscle.result)

INFO:root:Alignment completed. Output file located at: /tmp/tmpwyvann6n_aligned.fasta

Setting Advanced Parameters for MAFFT¶

In [14]:

Copied!





mafft = Mafft(parser.sequences)
mafft.set_advanced_options('auto', maxiterate='1000', retree='2')
mafft.run()
# print(mafft.result)
mafft = Mafft(parser.sequences)
mafft.set_advanced_options('auto', maxiterate='1000', retree='2')
mafft.run()
# print(mafft.result)

INFO:root:Alignment completed. Output file located at: /tmp/tmpbj_l2erw

Next notebook¶

In the next notebook we will put everything together.

We will combine our knowledge of NeighborPairs and MSAParser to perform complex operations on NeighborPairs objects derived from different proteins.