biotransformers.wrappers.transformers_wrappers

This module defines a parent class for transformer wrappers, from which child classes specific to a given transformers implementation can inherit. It derives probabilities, embeddings and log-likelihoods from input sequences, and exposes some properties of the transformer model.

Module Contents

Classes

TransformersWrapper

Abstract class that uses a pretrained transformers model to evaluate a protein likelihood as well as other insights.

Attributes

log

PathMsaFolder

TokenProbsDict

SequenceProbsList

biotransformers.wrappers.transformers_wrappers.log
biotransformers.wrappers.transformers_wrappers.PathMsaFolder
biotransformers.wrappers.transformers_wrappers.TokenProbsDict
biotransformers.wrappers.transformers_wrappers.SequenceProbsList
class biotransformers.wrappers.transformers_wrappers.TransformersWrapper(model_dir: str, language_model_cls: Type[biotransformers.wrappers.language_model.LanguageModel], num_gpus: int = 0)

Abstract class that uses a pretrained transformers model to evaluate a protein likelihood as well as other insights.

init_ray_workers(self)

Initialization of ray workers

delete_ray_workers(self)

Delete ray workers to free RAM

get_vocabulary_mask(self, tokens_list: List[str]) -> numpy.ndarray

Returns a mask over the model tokens.
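
As a rough illustration of what a mask over the model tokens could look like, the sketch below marks which entries of a vocabulary belong to tokens_list. The helper name `vocabulary_mask`, the toy vocabulary, and the use of a plain list of booleans instead of a numpy.ndarray are all assumptions made for this example, not the wrapper's actual implementation.

```python
from typing import List


def vocabulary_mask(vocabulary: List[str], tokens_list: List[str]) -> List[bool]:
    """Return True for each vocabulary entry that appears in tokens_list."""
    allowed = set(tokens_list)
    return [token in allowed for token in vocabulary]


# Hypothetical toy vocabulary: two special tokens followed by three amino acids.
toy_vocabulary = ["<pad>", "<mask>", "A", "C", "D"]
mask = vocabulary_mask(toy_vocabulary, ["A", "D"])
print(mask)  # [False, False, True, False, True]
```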

_get_num_batch_iter(self, model_inputs: Dict[str, Any], batch_size: int) -> int

Get the number of batches when splitting model_inputs into chunks of size batch_size.

_generate_chunks(self, model_inputs: Dict[str, Any], batch_size: int) -> Generator[Dict[str, Iterable], None, None]

Yield dictionaries of tensors, one per chunk of size batch_size.
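
The two batching helpers above can be pictured with the following minimal sketch. It assumes that every value in model_inputs has the same length along its first dimension and uses nested lists in place of tensors; the function names `num_batch_iter` and `generate_chunks` are hypothetical stand-ins for the private methods.

```python
import math
from typing import Any, Dict, Generator, List


def num_batch_iter(model_inputs: Dict[str, List[Any]], batch_size: int) -> int:
    """Number of chunks needed to cover all sequences (ceiling division)."""
    num_seqs = len(next(iter(model_inputs.values())))
    return math.ceil(num_seqs / batch_size)


def generate_chunks(
    model_inputs: Dict[str, List[Any]], batch_size: int
) -> Generator[Dict[str, List[Any]], None, None]:
    """Yield one dictionary per chunk, slicing every entry the same way."""
    num_seqs = len(next(iter(model_inputs.values())))
    for start in range(0, num_seqs, batch_size):
        yield {key: value[start:start + batch_size] for key, value in model_inputs.items()}


# Toy inputs for three sequences of length two.
toy_inputs = {
    "input_ids": [[1, 2], [3, 4], [5, 6]],
    "attention_mask": [[1, 1], [1, 1], [1, 0]],
}
chunks = list(generate_chunks(toy_inputs, batch_size=2))
```

With three sequences and a batch size of two, the last chunk holds the single remaining sequence.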

_repeat_and_mask_inputs(self, model_inputs: Dict[str, torch.Tensor]) -> Tuple[Dict[str, torch.Tensor], List[List]]

Create a new tensor by repeating each sequence and masking each of its tokens in turn

Parameters

model_inputs – shape -> (num_seqs, max_seq_len)

Returns

  • model_inputs: shape -> (sum_tokens, max_seq_len)

  • masked_ids_list: len -> (num_seqs)

Return type

Tuple[Dict[str, torch.Tensor], List[List]]
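
The shapes documented for _repeat_and_mask_inputs suggest one masked copy of a sequence per token. The sketch below illustrates that idea on nested lists. The ids `MASK_ID` and `PAD_ID`, the function name `repeat_and_mask`, and the choice to skip padding positions are assumptions for illustration; the real method may treat special tokens differently.

```python
from typing import List, Tuple

MASK_ID = 0   # hypothetical id of the mask token
PAD_ID = -1   # hypothetical id of the padding token


def repeat_and_mask(input_ids: List[List[int]]) -> Tuple[List[List[int]], List[List[int]]]:
    """Repeat each sequence once per non-padding token, masking a different token each time."""
    repeated: List[List[int]] = []
    masked_ids_list: List[List[int]] = []
    for seq in input_ids:
        positions = [i for i, token in enumerate(seq) if token != PAD_ID]
        for pos in positions:
            masked_copy = list(seq)
            masked_copy[pos] = MASK_ID
            repeated.append(masked_copy)
        masked_ids_list.append(positions)
    return repeated, masked_ids_list


# Two sequences padded to max_seq_len = 3; the second has only two real tokens.
rows, masked = repeat_and_mask([[5, 6, 7], [8, 9, PAD_ID]])
```

Here `rows` has five entries (the total number of tokens) and `masked` has one list of masked positions per input sequence.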

_mask_inputs_tokens(self, model_inputs: Dict[str, torch.Tensor], token_position: Optional[List[int]]) -> Dict[str, torch.Tensor]

Create a new tensor by masking a specific token in each sequence

Parameters
  • model_inputs (Dict[str, torch.Tensor]) – model inputs to mask

  • token_position (Optional[List[int]]) – the position of the token to mask in each sequence

Returns

model inputs with the token at token_position masked

Return type

Dict[str, torch.Tensor]

_gather_masked_outputs(self, model_outputs: torch.Tensor, masked_ids_list: List[List]) -> torch.Tensor

Gather all the masked outputs to recover the original tensor shape

Parameters
  • model_outputs (torch.Tensor) – shape -> (sum_tokens, max_seq_len, vocab_size)

  • masked_ids_list (List[List]) – len -> (num_seqs)

Returns

model_outputs: shape -> (num_seqs, max_seq_len, vocab_size)

Return type

torch.Tensor
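
One way to read the shapes above: each of the sum_tokens rows contributes the output at its own masked position, and these are regrouped per sequence. The sketch below follows that reading with nested lists. The function name `gather_masked_outputs`, the explicit `max_seq_len` and `vocab_size` arguments, and the zero fill for positions that were never masked are assumptions for this example.

```python
from typing import List

Vector = List[float]


def gather_masked_outputs(
    model_outputs: List[List[Vector]],
    masked_ids_list: List[List[int]],
    max_seq_len: int,
    vocab_size: int,
) -> List[List[Vector]]:
    """Regroup per-row outputs so that each sequence keeps the output at each masked position."""
    gathered: List[List[Vector]] = []
    row = 0
    for positions in masked_ids_list:
        # Positions that were never masked (e.g. padding) keep a zero vector here.
        seq_out: List[Vector] = [[0.0] * vocab_size for _ in range(max_seq_len)]
        for pos in positions:
            seq_out[pos] = model_outputs[row][pos]
            row += 1
        gathered.append(seq_out)
    return gathered


# Toy outputs: 5 rows, max_seq_len = 3, vocab_size = 1, value encodes (row, position).
toy_outputs = [[[float(10 * r + p)] for p in range(3)] for r in range(5)]
gathered = gather_masked_outputs(toy_outputs, [[0, 1, 2], [0, 1]], max_seq_len=3, vocab_size=1)
```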

_model_evaluation(self, model_inputs: Dict[str, torch.Tensor], batch_size: int = 1, **kwargs) -> Tuple[torch.Tensor, torch.Tensor]

Compute logits and embeddings

Function which computes logits and embeddings based on a list of sequences, a provided batch size and an inference configuration. The output is obtained by computing a forward pass through the model (“forward inference”).

Parameters
  • model_inputs (Dict[str, torch.Tensor]) – model inputs built from the list of sequences

  • batch_size (int) – number of sequences to consider for each forward pass

Returns

  • logits [num_seqs, max_len_seqs, vocab_size]

  • embeddings [num_seqs, max_len_seqs+1, embedding_size]

Return type

Tuple[torch.Tensor, torch.Tensor]

_compute_logits(self, model_inputs: Dict[str, torch.Tensor], batch_size: int, pass_mode: str, **kwargs) -> torch.Tensor

Intermediate function to compute logits

Parameters
  • model_inputs[str] (torch.Tensor) – shape -> (num_seqs, max_seq_len)

  • batch_size (int) – number of sequences to consider for the forward pass

  • pass_mode (str) – Mode of model evaluation (‘forward’ or ‘masked’)

Returns

logits: shape -> (num_seqs, max_seq_len, vocab_size)

Return type

torch.Tensor

compute_logits(self, sequences: Union[List[str], str], batch_size: int = 1, pass_mode: str = 'forward', silent: bool = False, n_seqs_msa: int = 6) -> List[numpy.ndarray]

Function that computes the logits from sequences.

It returns a list of logits arrays, one per sequence. If working with MSA, it returns a list of logits for each sequence of the MSA.

Parameters
  • sequences – List of sequences, path to a FASTA file, or path to a folder of MSAs in a3m format.

  • batch_size – number of sequences to consider for the forward pass

  • pass_mode – Mode of model evaluation (‘forward’ or ‘masked’)

  • silent – whether to hide the progress bar in the console

  • n_seqs_msa – number of sequences to consider in an MSA file.

Returns

logits in np.ndarray format

Return type

List[np.ndarray]

compute_probabilities(self, sequences: Union[List[str], str], batch_size: int = 1, tokens_list: List[str] = None, pass_mode: str = 'forward', silent: bool = False, n_seqs_msa: int = 6, masked_token_position: Optional[List[int]] = None) -> Union[SequenceProbsList, List[SequenceProbsList]]

Function that computes the probabilities over amino-acids from sequences.

It takes as inputs a list of sequences and returns a list of dictionaries. Each dictionary contains the probabilities over the natural amino-acids for each position in the sequence. The keys represent the positions (indexed starting with 0) and the values are dictionaries of probabilities over the natural amino-acids for this position.

When working with MSA, it returns a list of dictionaries for each sequence in the MSA. In these dictionaries, the keys are the amino-acids and the values are the corresponding probabilities.

Both ProtBert and ESM models have more tokens than the 20 natural amino-acids (for instance MASK or PAD tokens). It might not be of interest to take these tokens into account when computing probabilities or log-likelihood. By default we remove them and compute probabilities only over the 20 natural amino-acids. This behavior can be overridden through the tokens_list argument, which lets the user choose the tokens to consider when computing probabilities.

Parameters
  • sequences – List of sequences, path to a FASTA file, or path to a folder of MSAs in a3m format.

  • batch_size – number of sequences to consider for the forward pass

  • tokens_list – List of tokens to consider

  • pass_mode – Mode of model evaluation (‘forward’ or ‘masked’)

  • silent – whether to hide the progress bar

  • n_seqs_msa – number of sequences to consider in an MSA file.

  • masked_token_position – List of positions of a specific token to mask for each sequence. Positions are indexed from 1 to N for a sequence of length N. The number of positions to mask should be equal to the number of sequences.

Returns

dictionaries of probabilities per seq

Return type

List[Dict[int, Dict[str, float]]]
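
The restriction to a chosen set of tokens described for compute_probabilities can be pictured with the sketch below, which applies a softmax over only the logits of the tokens in tokens_list. This is one plausible way to obtain probabilities that sum to 1 over the natural amino-acids; the function name `restricted_softmax` and the toy logits are assumptions, and the wrapper may compute its probabilities differently.

```python
import math
from typing import Dict, List


def restricted_softmax(logits: Dict[str, float], tokens_list: List[str]) -> Dict[str, float]:
    """Softmax over the logits of the tokens in tokens_list only."""
    kept = {token: logits[token] for token in tokens_list}
    top = max(kept.values())  # subtract the maximum for numerical stability
    exps = {token: math.exp(value - top) for token, value in kept.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Toy logits for one position: the special token has the largest logit
# but is ignored because it is not in tokens_list.
toy_logits = {"<mask>": 5.0, "A": 1.0, "C": 1.0, "D": 0.0}
probs = restricted_softmax(toy_logits, ["A", "C", "D"])
```

Applied at every position, this yields the per-position dictionaries of amino-acid probabilities described above.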

compute_loglikelihood(self, sequences: Union[List[str], str], batch_size: int = 1, tokens_list: List[str] = None, pass_mode: str = 'forward', silent: bool = False, normalize: bool = True, masked_token_position: Optional[List[int]] = None) -> List[float]

Function that computes loglikelihoods of sequences. It returns a list of float values.

Both ProtBert and ESM models have more tokens than the 20 natural amino-acids (for instance MASK or PAD tokens). It might not be of interest to take these tokens into account when computing probabilities or log-likelihood. By default we remove them and compute probabilities only over the 20 natural amino-acids. This behavior can be overridden through the tokens_list argument, which lets the user choose the tokens to consider when computing probabilities. By default, all loglikelihoods are normalized by the sequence length. For example, a loglikelihood of -0.35 means that each amino acid is predicted with a probability of about 0.7 on average, since exp(-0.35) ≈ 0.70.

Parameters
  • sequences – List of sequences

  • batch_size – Batch size

  • tokens_list – List of tokens to consider

  • pass_mode – Mode of model evaluation (‘forward’ or ‘masked’)

  • silent – display or not progress bar

  • normalize – If True, loglikelihoods are normalized by sequence length.

  • masked_token_position – List of positions of a specific token to mask for each sequence. Positions are indexed from 1 to N for a sequence of length N. The number of positions to mask should be equal to the number of sequences.

Returns

list of loglikelihoods, one per sequence

Return type

List[float]
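
The link between a normalized loglikelihood and an average per-residue probability, mentioned in the description of compute_loglikelihood, can be checked with a small sketch. It assumes the loglikelihood is the sum of the natural logarithms of the probabilities assigned to each residue, divided by the sequence length when normalize is True; the helper name `loglikelihood` and the toy probabilities are hypothetical.

```python
import math
from typing import List


def loglikelihood(token_probs: List[float], normalize: bool = True) -> float:
    """Sum of log-probabilities, optionally divided by the sequence length."""
    total = sum(math.log(p) for p in token_probs)
    return total / len(token_probs) if normalize else total


# A toy sequence of four residues, each predicted with probability 0.7.
ll = loglikelihood([0.7, 0.7, 0.7, 0.7])

# Exponentiating the normalized value gives back the (geometric) mean probability.
mean_prob = math.exp(ll)
```

Under this assumption the "average" probability is a geometric mean, so a single poorly predicted residue lowers the score more than an arithmetic mean would suggest.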

compute_mutation_score(self, sequences: Union[List[str], str], mutations: List[List[str]], batch_size: int = 1, tokens_list: List[str] = None, silent: bool = False) -> List[float]

Function that computes mutation scores of sequences. It returns a list of float values.

This function is used to score a mutation between two amino acids as described in https://www.biorxiv.org/content/10.1101/2021.07.09.450648v1.full.pdf. This metric is maximized in ESM-1V to assess the interest of a mutation. The mutational score is based on the masked marginal probability (L forward passes), where we introduce a mask at the mutation position and compute the log difference of probability between the native and mutated sequences.

score -> Sum(log(p(xi=xi_mutate|x-M))-log(p(xi=xi_native|x-M))) over M (M is a mutation set)

The function takes as input a list of mutations for each sequence to evaluate. Each entry is a list of single mutations, so you can provide multiple mutations for a single sequence.

Example

mutation: “A8E” to mutate amino acid A to E at position 8. Mutations are indexed from 1 to N for a sequence of length N.

Below is a mutations list for 3 sequences. We have to provide 3 lists of single mutations. mutations: [[“A3U”,“A8E”],[“B7I”],[“I124”,“E1J”]]

Parameters
  • sequences – List of sequences

  • batch_size – Batch size

  • tokens_list – List of tokens to consider

  • silent – display or not progress bar

  • mutations – List of mutations for each sequence to evaluate. Each entry is a list of single mutations. Mutations are indexed from 1 to N for a sequence of length N.

Returns

list of mutation scores, one per sequence

Return type

List[float]
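
The scoring formula given for compute_mutation_score can be sketched as follows. The example parses mutation strings such as “A8E” and sums the log-probability differences at the mutated positions. The helper names `parse_mutation` and `mutation_score`, and the `masked_probs` dictionary (1-indexed position to amino-acid probabilities obtained with that position masked), are assumptions for illustration; in the wrapper these probabilities come from masked forward passes.

```python
import math
import re
from typing import Dict, List, Tuple


def parse_mutation(mutation: str) -> Tuple[str, int, str]:
    """Split a mutation such as 'A8E' into (native, 1-indexed position, mutated)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"Cannot parse mutation {mutation!r}")
    native, position, mutated = match.groups()
    return native, int(position), mutated


def mutation_score(
    sequence: str,
    mutations: List[str],
    masked_probs: Dict[int, Dict[str, float]],
) -> float:
    """Sum of log p(mutated) - log p(native) over the mutated positions."""
    score = 0.0
    for mutation in mutations:
        native, position, mutated = parse_mutation(mutation)
        if sequence[position - 1] != native:
            raise ValueError(f"{mutation!r} does not match residue {sequence[position - 1]!r}")
        probs = masked_probs[position]
        score += math.log(probs[mutated]) - math.log(probs[native])
    return score


# Toy case: at masked position 2, the native A is twice as likely as E.
toy_probs = {2: {"A": 0.5, "E": 0.25}}
score = mutation_score("MAKE", ["A2E"], toy_probs)
```

A negative score, as in this toy case, means the model finds the mutated residue less likely than the native one at that position.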

compute_embeddings(self, sequences: Union[List[str], str], batch_size: int = 1, pool_mode: Tuple[str, Ellipsis] = ('cls', 'mean', 'full'), silent: bool = False, n_seqs_msa: int = 6) -> Dict[str, Union[List[numpy.ndarray], numpy.ndarray]]

Function that computes embeddings of sequences.

The embedding of one sequence has a shape (sequence_length, embedding_size), where embedding_size equals 768 or 1024. We may therefore want to use an aggregation function, specified in pool_mode, to aggregate the tensor along the num_tokens dimension. This can, for instance, avoid exhausting the machine's RAM when computing embeddings for a large number of sequences.

‘mean’ signifies that we take the mean over the num_tokens dimension. ‘cls’ means that only the class token embedding is used.

This function returns a dictionary of lists. The dictionary will have one key per pool-mode that has been specified. The corresponding value is a list of embeddings, one per sequence in sequences.

When working with MSA, an extra dimension is added to the final tensor.

Parameters
  • sequences – List of sequences, path to a FASTA file, or path to a folder of MSAs in a3m format.

  • batch_size – batch size

  • pool_mode – Mode of pooling (‘cls’, ‘mean’, ‘full’)

  • silent – whether to hide the progress bar

  • n_seqs_msa – number of sequences to consider in an MSA file.

Returns

dict matching pool-mode and list of embeddings

Return type

Dict[str, List[np.ndarray]]
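
The pooling modes of compute_embeddings can be illustrated on a single sequence with nested lists. The sketch assumes that the class token is the first row of the per-token embeddings, that ‘mean’ averages over all rows, and that ‘full’ returns the embeddings unaggregated; the function name `pool_embedding` and these conventions are assumptions for this example, not confirmed details of the wrapper.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def pool_embedding(token_embeddings: Matrix, pool_mode: Sequence[str]) -> Dict[str, object]:
    """Return one entry per requested pool mode for a single sequence."""
    pooled: Dict[str, object] = {}
    if "cls" in pool_mode:
        pooled["cls"] = token_embeddings[0]  # assumes the class token comes first
    if "mean" in pool_mode:
        num_tokens = len(token_embeddings)
        pooled["mean"] = [sum(column) / num_tokens for column in zip(*token_embeddings)]
    if "full" in pool_mode:
        pooled["full"] = token_embeddings
    return pooled


# Toy embeddings: 3 tokens, embedding_size = 2.
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = pool_embedding(toy_embeddings, ("cls", "mean"))
```

Both ‘cls’ and ‘mean’ reduce the output to a single vector of size embedding_size per sequence, which is what keeps memory usage low.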

compute_accuracy(self, sequences: Union[List[str], str], batch_size: int = 1, pass_mode: str = 'forward', silent: bool = False, n_seqs_msa: int = 6) -> float

Compute model accuracy from the input sequences

When working with MSA, the accuracy is computed over all the tokens of the MSA sequences.

Parameters
  • sequences (Union[List[str],str]) – List of sequences, path to a FASTA file, or path to a folder of MSAs in a3m format.

  • batch_size (int, optional) – number of sequences to consider for the forward pass. Defaults to 1.

  • pass_mode (str, optional) – Mode of model evaluation (‘forward’ or ‘masked’). Defaults to “forward”.

  • silent – whether to hide the progress bar

  • n_seqs_msa – number of sequences to consider in an MSA file.

Returns

model’s accuracy over the given sequences

Return type

float
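
A simple way to think about the accuracy returned by compute_accuracy is the fraction of positions at which the most likely predicted token equals the true residue. The sketch below computes that fraction from already-decoded predictions. The function name `token_accuracy` and the assumption that predictions are taken as the argmax token at each position are hypothetical.

```python
from typing import List


def token_accuracy(predicted: List[List[str]], sequences: List[str]) -> float:
    """Fraction of positions where the predicted token equals the true residue."""
    correct = 0
    total = 0
    for prediction, sequence in zip(predicted, sequences):
        for predicted_token, true_token in zip(prediction, sequence):
            correct += int(predicted_token == true_token)
            total += 1
    return correct / total


# Toy case: 4 of the 5 residues are predicted correctly.
accuracy = token_accuracy([["M", "A", "K"], ["G", "G"]], ["MAK", "GA"])
```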

load_model(self, checkpoint_path: str)

Load model from a lightning checkpoint.

Parameters

checkpoint_path – path to lightning checkpoint

finetune(self, train_sequences: Union[List[str], str], validation_sequences: Union[List[str], str], num_data_workers: int = 4, lr: float = 1e-05, warmup_updates: int = 1024, warmup_init_lr: float = 1e-07, epochs: int = 10, acc_batch_size: int = 50, masking_ratio: float = 0.025, masking_prob: float = 0.8, random_token_prob: float = 0.15, toks_per_batch: int = 2048, crop_sizes: Tuple[int, int] = (512, 1024), accelerator: str = 'ddp', amp_level: str = 'O2', precision: int = 16, logs_save_dir: str = 'logs', logs_name_exp: str = 'finetune_masked', checkpoint: Optional[str] = None, save_last_checkpoint: bool = True)

Function to finetune a model on a specific dataset

This function finetunes the chosen model on a dataset of sequences with PyTorch Lightning. You can modify the masking ratio of amino acids in the arguments for better convergence. Be careful with the accelerator that you use: the DDP accelerator launches multiple Python processes and should not be used in a notebook.

More information on GPU/accelerator compatibility here:

https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html

The wisest choice would be to use DDP for multi-gpu training.

Parameters
  • train_sequences – Could be a list of sequences or the path of a fasta file with multiple seqRecords

  • validation_sequences – Could be a list of sequences or the path of a fasta file with multiple seqRecords

  • num_data_workers – number of cpus workers per gpu to load data

  • lr – learning rate for training phase. Defaults to 1.0e-5.

  • warmup_updates – Number of warmup updates, i.e. the number of steps during which the learning rate increases. Defaults to 1024.

  • warmup_init_lr – Initial lr for warming_update. Defaults to 1e-7.

  • epochs – number of epoch for training. Defaults to 10.

  • acc_batch_size – number of batches of toks_per_batch tokens to accumulate before applying gradients.

  • masking_ratio – ratio of tokens to be masked. Defaults to 0.025.

  • masking_prob – probability that the chosen token is replaced with a mask token. Defaults to 0.8.

  • random_token_prob – probability that the chosen token is replaced with a random token. Defaults to 0.15.

  • toks_per_batch – Maximum number of tokens to consider in a batch. Defaults to 2048. This argument sets the number of sequences in a batch, which is computed dynamically. The effective batch size is then obtained through gradient accumulation (see acc_batch_size).

  • crop_sizes – range of lengths to crop dynamically sequences when sampling them

  • extra_toks_per_seq – Defaults to 2.

  • accelerator – type of accelerator for multi-GPU processing (DDP recommended)

  • amp_level – allows mixed precision. Defaults to ‘O2’

  • precision – reducing precision decreases the GPU memory needed. Defaults to 16 (float16)

  • logs_save_dir – Directory in which to save the logs. Defaults to ‘logs’.

  • logs_name_exp – Name of the experiment in the logs.

  • checkpoint – Path to a checkpoint file to restore a training session.

  • save_last_checkpoint – Save the last checkpoint and the 2 best trained models to restore a training session. Takes a large amount of time and memory.
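
The roles of masking_ratio, masking_prob and random_token_prob can be pictured with a BERT-style corruption sketch. It assumes that a fraction masking_ratio of positions is selected, that each selected token becomes a mask token with probability masking_prob or a random token with probability random_token_prob, and that it is otherwise left unchanged. The function name `mask_tokens`, the `<mask>` string, the seeded generator, and the "at least one position" rule are assumptions for illustration and may differ from the actual finetuning code.

```python
import random
from typing import List, Tuple


def mask_tokens(
    tokens: List[str],
    vocabulary: List[str],
    masking_ratio: float = 0.025,
    masking_prob: float = 0.8,
    random_token_prob: float = 0.15,
    mask_token: str = "<mask>",
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Corrupt a fraction of the tokens and return the result with the selected positions."""
    rng = random.Random(seed)
    num_to_select = max(1, round(len(tokens) * masking_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_to_select))
    corrupted = list(tokens)
    for pos in positions:
        draw = rng.random()
        if draw < masking_prob:
            corrupted[pos] = mask_token
        elif draw < masking_prob + random_token_prob:
            corrupted[pos] = rng.choice(vocabulary)
        # Otherwise the selected token is kept unchanged.
    return corrupted, positions


# Toy sequence of 40 residues with a 10% masking ratio, so 4 positions are selected.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
tokens = list("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV")
corrupted, positions = mask_tokens(tokens, amino_acids, masking_ratio=0.1)
```

During finetuning the model would be trained to recover the original tokens at the selected positions only.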