MSA
Contents
MSA¶
What is an MSA?¶
An MSA (multiple sequence alignment) is a file that contains a sequence of amino acids and a number of aligned variant sequences. The aim is to stored information about the evolution of the main sequence. Using this for training allows to reduce the number of model parameters, and use evolution information in the model as explained in the MSA Transformers papers.
The MSA are generated using hh-suite tools, searching similar sequences in the UniClust30 database.
Note
The generated MSA must have the .a3m extension to be used in bio-transformers
Below, you can see an example of an MSA file. The first line is the main sequence. The following sequences are the most probable variants sequences. The - character in the msa sequences corresponds to the alignment character.
>> head msa_example.a3m
>sp|O24396|PURA_WHEAT Adenylosuccinate synthetase, chloroplastic (Fragment) OS=Triticum aestivum OX=4565 PE=1 SV=1
AAAAAGRGRSFSPAAPAPSSVRLPGRQAPAPAAASALAVEADPAADRVSSLSQVSGVLGSQWGDEGKGKLVDVLAPRFDIVARCQGGANAGHTIYNSEGKKFALHLVPSGILHEGTLCVVGNGAVIHVPGFFGEIDGLQSNGVSCDGRILVSDRAHLLFDLHQTVDGLREAELANSFIGTTKRGIGPCYSSKVTRNGLRVCDLRHMDTFGDKLDVLFEDAAARFEGFKYSKGMLKEEVERYKRFAERLEPFIADTVHVLNESIRQKKKILVEGGQATMLDIDFGTYPFVTSSSPSAGGICTGLGIAPRVIGDLIGVVKAYTTRVGSGPFPTELLGEEGDVLRKAGMEFGTTTGRPRRCGWLDIVALKYCCDINGFSSLNLTKLDVLSGLPEIKLGVSYNQMDGEKLQSFPGDLDTLEQVQVNYEVLPGWDSDISSVRSYSELPQAARRYVERIEELAGVPVHYIGVGPGRDALIYK
>UniRef100_A0A0N4UWB0 Adenylosuccinate synthetase n=1 Tax=Enterobius vermicularis TaxID=51028 RepID=A0A0N4UWB0_ENTVE
--------------------------------------------MNDQKRKAPVIVILGAQFGDEGKGKIVDFLIEKekIQLTARCQGGNNAGHTVV-VNGRKSDFHLLPTGIINEDCYNIIGNGVVVNLDALFKEIEHNEIDKLNgWEKRLMISELAHLVTSMHMQADGQQEKSLSSEKIGTTSKGIGPTYSTKCFRNGIRVGELlGDFEAFSAKFRSLAAFYLKQFPGIEVN---VEEELDNYKKHAVCLKRLgiVGDTITYLDEMRAQGKAILVEGANGAMLDIDFGsflytffchsgTYPFVTSSNATVGGAVTGLGIPPTAITEIIGVVKAYETRVGSGPFPTEQQGKIGEDLQSIGHEVGVTTGRKRRCGWLDLFLLKRSSVINGFTALALTKLDILDNFDEIKVATGYR-IDGKSLKAPPSCAADWSRIELEYKTFSGWKDDVSKIRSFNELPENCKTYVKFIEGFVGVPIKWIGVGEDREALIVM
>UniRef100_A0A139AVD1 Adenylosuccinate synthetase n=1 Tax=Gonapodya prolifera (strain JEL478) TaxID=1344416 RepID=A0A139AVD1_GONPJ
-----------------------------------------ATG-------NKAVVVLGAQWGDEGKGKLVDILTQQADLVARCQGGNNAGHTIV-VDGVKFDFHMLPSGLLGaPSTVSLVGSGVVLHLPSFFEEVKKTESKGVSCANRLFVSDRCHLVFDLHQIVDGLKEGELAShkQEIGTTKKGIGPAYSSKASRGGVRVHHLiaPDFAEFESRFRQMAANKKRRYGDFPYD---VDAEVERYRQYRDLIRPYVVDSVTYVHKALQEGKRVLVEGANAVMLDIDFGTFPYVTSSNTTIGGVCTGLGLPPKSIGKVIGVVKAYTTRVGAGPFPTEQLNEVGEHLQTVGAEFGVTTGRKRRCGWLDAAVLRWSHMINGYDSINLTKLDILDGLPTLRIGIAYKHrATGQVYETFPADLHLLEECDVIYEELPGWKESIGGCKSWDALPENARKYVERIEQLVGVNVEYIGVGVSRDSMITK
Caution
finetune method for MSA transormers and pass_mode=masked are not available.
How to use msa-transformers?¶
The library supports esm_msa1_t12_100M_UR50S backend which is based on the MSA Transformers papers.
Instead of passing a list of sequences of the path of a fasta file, you can pass the path to the folder where the .a3m files are stored.
from biotransformers import BioTransformers
msa_folder = "msa_folder"
bio_trans = BioTransformers("esm_msa1_t12_100M_UR50S",num_gpus=1)
msa_embeddings = bio_trans.compute_embeddings(sequences=msa_folder, pool_mode=("cls","mean"), n_seqs_msa=128)
As an MSA is composed of multiple sequences, the results have an extra-dimension, which corresponds to all the aligned sequences.
msa_embeddings['cls'].shape
If 100 msa files are in the folder, we have the following dimension for the embeddings. We take le <CLS> token of each of the 128 sequences in the files.
>> (100, 128, 768)
Caution
All files in the msa folder must have at least n_seqs_msa sequences in each MSA. If you want to remove files with less than n_seqs_msa arguments, you can use the biotransformers.utils.msa_utils.msa_to_remove function.
msa_to_remove("data_msa_sample/", n_seq=128)
>>
3/8 have insufficient number of sequences in MSA.
['data/data_msa_sample/seq130_swissprot.a3m',
'data/data_msa_sample/seq109_swissprot.a3m',
'data/data_msa_sample/seq124_swissprot.a3m']