biotransformers.wrappers.esm_wrappers
This module defines a class that inherits from the LanguageModel class and is specific to the ESM models developed by FAIR (https://github.com/facebookresearch/esm).
Module Contents¶
Classes¶
ESMWrapper | Class that uses an ESM type of pretrained transformers model to evaluate a protein likelihood as well as other insights.
Attributes¶
- biotransformers.wrappers.esm_wrappers.log¶
- biotransformers.wrappers.esm_wrappers.path_msa_folder¶
- class biotransformers.wrappers.esm_wrappers.ESMWrapper(model_dir: str, device: str)¶
Bases: biotransformers.wrappers.language_model.LanguageModel
Class that uses an ESM type of pretrained transformers model to evaluate a protein likelihood as well as other insights.
- property model(self) → torch.nn.Module¶
Return torch model.
- set_model(self, model: torch.nn.Module)¶
Set torch model.
- property clean_model_id(self) → str¶
Clean model ID (in case the model directory is not a clean ID)
- property model_vocabulary(self) → List[str]¶
Returns the whole vocabulary list
- property vocab_size(self) → int¶
Returns the size of the whole vocabulary
- property mask_token(self) → str¶
Representation of the mask token (as a string)
- property pad_token(self) → str¶
Representation of the pad token (as a string)
- property begin_token(self) → str¶
Representation of the beginning of sentence token (as a string)
- property end_token(self) → str¶
Representation of the end of sentence token (as a string)
- property does_end_token_exist(self) → bool¶
Returns True if an end-of-sequence token exists
- property token_to_id(self)¶
Returns a function which maps tokens to IDs
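The idea of a property that returns a mapping function, rather than a value, can be illustrated with a minimal sketch. `ToyVocabulary` and its token list are hypothetical and not part of biotransformers; the real wrapper derives its mapping from the ESM alphabet.

```python
from typing import Callable, Dict, List


class ToyVocabulary:
    """Hypothetical stand-in for a wrapper; not part of biotransformers."""

    def __init__(self, tokens: List[str]) -> None:
        # Assign each token a consecutive integer ID in list order.
        self._ids: Dict[str, int] = {tok: i for i, tok in enumerate(tokens)}

    @property
    def token_to_id(self) -> Callable[[str], int]:
        # Return a function, so callers can reuse it to map many tokens.
        return lambda token: self._ids[token]


vocab = ToyVocabulary(["<pad>", "<mask>", "A", "C", "D"])
to_id = vocab.token_to_id
ids = [to_id(tok) for tok in ["A", "D", "<mask>"]]
```

Here `ids` holds the positions of the three tokens in the toy list; an unknown token would raise `KeyError` in this sketch.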
- property embeddings_size(self)¶
Returns size of the embeddings
- process_sequences_and_tokens(self, sequences_list: List[str]) → Dict[str, torch.Tensor]¶
Transforms token strings into IDs; the mapping depends on the model used
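A rough sketch of what such a transformation can look like is given here, using plain Python lists instead of torch tensors so that it runs without dependencies. The toy vocabulary, the function name `encode_sequences`, and the dictionary keys `input_ids` and `attention_mask` are assumptions made for illustration; the documentation above does not specify the keys the real method returns.

```python
from typing import Dict, List

# Hypothetical toy vocabulary; real ESM alphabets are larger and ordered differently.
VOCAB = ["<cls>", "<pad>", "<eos>", "<mask>", "A", "C", "D", "E"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode_sequences(sequences: List[str]) -> Dict[str, List[List[int]]]:
    """Map each residue to an ID, add begin/end tokens, and pad to equal length."""
    encoded = [
        [TOKEN_TO_ID["<cls>"]]
        + [TOKEN_TO_ID[residue] for residue in seq]
        + [TOKEN_TO_ID["<eos>"]]
        for seq in sequences
    ]
    max_len = max(len(ids) for ids in encoded)
    pad_id = TOKEN_TO_ID["<pad>"]
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = encode_sequences(["ACD", "E"])
```

Padding to the longest sequence gives every row the same length, which is what allows the result to be stacked into a single tensor in a real implementation.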
- model_pass(self, model_inputs: Dict[str, torch.Tensor], batch_size: int, silent: bool = False, pba: ray.actor.ActorHandle = None) → Tuple[torch.Tensor, torch.Tensor]¶
Computes logits and embeddings from a list of sequences, a provided batch size, and an inference configuration. The output is obtained by computing a forward pass through the model (“forward inference”).
The data generator differs between single-device and multi-GPU (multi_gpus) inference. A tqdm progress bar is used and is updated by the worker. The progress bar is instantiated before ray.remote.
- Parameters
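model_inputs (Dict[str, torch.Tensor]) – dictionary of input tensors, matching the return type of process_sequences_and_tokens
batch_size (int) – size of the batch
silent (bool) – whether to suppress the progress bar (default False)
pba (ray.actor.ActorHandle) – tqdm progress bar for the Ray actor (default None)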
- Returns
logits [num_seqs, max_len_seqs, vocab_size]
embeddings [num_seqs, max_len_seqs+1, embedding_size]
- Return type
Tuple[torch.Tensor, torch.Tensor]
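The batching pattern described for model_pass can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: `batched_pass` and `toy_forward` are hypothetical names, plain lists stand in for torch tensors, and a `print` call stands in for the tqdm progress bar and Ray actor.

```python
from typing import Callable, Dict, List, Tuple

Batch = Dict[str, List[List[int]]]


def batched_pass(
    model_inputs: Batch,
    forward: Callable[[Batch], Tuple[List, List]],
    batch_size: int,
    silent: bool = False,
) -> Tuple[List, List]:
    """Run `forward` on consecutive slices of the inputs and concatenate results."""
    num_seqs = len(next(iter(model_inputs.values())))
    all_logits: List = []
    all_embeddings: List = []
    for start in range(0, num_seqs, batch_size):
        # Slice every input field with the same index range.
        batch = {k: v[start:start + batch_size] for k, v in model_inputs.items()}
        logits, embeddings = forward(batch)
        all_logits.extend(logits)
        all_embeddings.extend(embeddings)
        if not silent:
            done = min(start + batch_size, num_seqs)
            print(f"processed {done}/{num_seqs} sequences")
    return all_logits, all_embeddings


def toy_forward(batch: Batch) -> Tuple[List, List]:
    # Hypothetical stand-in for a model: one "logit" row and one "embedding" per sequence.
    logits = [[float(t) for t in ids] for ids in batch["input_ids"]]
    embeddings = [[sum(ids) / len(ids)] for ids in batch["input_ids"]]
    return logits, embeddings


inputs = {"input_ids": [[0, 4, 2], [0, 5, 2], [0, 6, 2]]}
logits, embeddings = batched_pass(inputs, toy_forward, batch_size=2, silent=True)
```

With three sequences and a batch size of 2, the loop runs twice (two sequences, then one), and the concatenated outputs keep the original sequence order.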
- get_alphabet_dataloader(self)¶
Defines an alphabet mapping for methods shared between ProtBert and ESM