biotransformers.wrappers.rostlab_wrapper

This script defines a class which inherits from the LanguageModel class and is specific to the Rostlab models (e.g. ProtBert and ProtBert-BFD), which are hosted on Hugging Face:
  • ProtBert: https://huggingface.co/Rostlab/prot_bert
  • ProtBert BFD: https://huggingface.co/Rostlab/prot_bert_bfd

Module Contents

Classes

RostlabWrapper

Class that uses a Rostlab type of pretrained transformers model to evaluate a protein likelihood as well as other insights.

Attributes

log

biotransformers.wrappers.rostlab_wrapper.log
class biotransformers.wrappers.rostlab_wrapper.RostlabWrapper(model_dir: str, device)

Bases: biotransformers.wrappers.language_model.LanguageModel

Class that uses a Rostlab type of pretrained transformers model to evaluate a protein likelihood as well as other insights.

property model(self) → torch.nn.Module

Return torch model.

set_model(self, model: torch.nn.Module)

Set torch model.

property clean_model_id(self) → str

Clean model ID (in case the model directory is not)

property model_vocabulary(self) → List[str]

Returns the whole vocabulary list

property vocab_size(self) → int

Returns the whole vocabulary size

property mask_token(self) → str

Representation of the mask token (as a string)

property pad_token(self) → str

Representation of the pad token (as a string)

property begin_token(self) → str

Representation of the beginning of sentence token (as a string)

property end_token(self) → str

Representation of the end of sentence token (as a string).

property does_end_token_exist(self) → bool

Returns true if an end-of-sequence token exists

property token_to_id(self)

Returns a function which maps tokens to IDs
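As an illustration of a property that returns a mapping function, the following minimal sketch builds such a function from a vocabulary list. The helper name make_token_to_id and the toy vocabulary are hypothetical and not taken from the library; the actual wrapper presumably relies on its tokenizer instead.

```python
from typing import Callable, Dict, List


def make_token_to_id(vocabulary: List[str]) -> Callable[[str], int]:
    """Return a function mapping a token to its index in `vocabulary`.

    Hypothetical helper for illustration; not part of biotransformers.
    """
    lookup: Dict[str, int] = {token: idx for idx, token in enumerate(vocabulary)}

    def token_to_id(token: str) -> int:
        # Raises KeyError for tokens that are not in the vocabulary.
        return lookup[token]

    return token_to_id


# Toy vocabulary, assumed for this example only.
toy_vocab = ["[PAD]", "[MASK]", "A", "C", "D"]
to_id = make_token_to_id(toy_vocab)
```

Returning a function rather than a dictionary lets callers apply the mapping without knowing how the vocabulary is stored.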

property embeddings_size(self) → int

Returns size of the embeddings

process_sequences_and_tokens(self, sequences_list: List[str]) → Dict[str, torch.tensor]

Transforms token strings into IDs; the exact processing depends on the model used
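The ProtBert model cards on Hugging Face describe inputs as amino acids separated by spaces, with the rare residues U, Z, O and B replaced by X. A minimal sketch of that preprocessing step follows. Whether this wrapper performs exactly these steps internally is an assumption, and prepare_protbert_sequence and prepare_batch are hypothetical helper names.

```python
import re
from typing import List


def prepare_protbert_sequence(sequence: str) -> str:
    """Format one protein sequence the way the ProtBert model cards show.

    Hypothetical helper for illustration; not part of biotransformers.
    """
    # Map the rare amino acids U, Z, O and B to the unknown residue X.
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    # Insert a space between residues so each one becomes its own token.
    return " ".join(cleaned)


def prepare_batch(sequences_list: List[str]) -> List[str]:
    """Apply the formatting to every sequence in a list."""
    return [prepare_protbert_sequence(seq) for seq in sequences_list]
```

The formatted strings would then be passed to a tokenizer, which produces the dictionary of ID tensors.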

model_pass(self, model_inputs: Dict[str, torch.tensor], batch_size: int, silent: bool = False, pba: ray.actor.ActorHandle = None) → Tuple[torch.Tensor, torch.Tensor]

Function which computes logits and embeddings based on a dict of sequence tensors, a provided batch size and an inference configuration. The output is obtained by computing a forward pass through the model (“forward inference”).

Parameters
  • model_inputs (Dict[str, torch.tensor]) – dict of sequence tensors passed to the model

  • batch_size (int) – size of the batch

  • silent (bool) – whether or not to display the progress bar

  • pba (ray.actor.ActorHandle) – tqdm progress bar for ray actor

Returns

  • logits [num_seqs, max_len_seqs, vocab_size]

  • embeddings [num_seqs, max_len_seqs+1, embedding_size]

Return type

Tuple[torch.Tensor, torch.Tensor]
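The role of batch_size can be sketched without the library: the input dictionary is cut into slices of at most batch_size sequences, a forward function is applied to each slice, and the per-sequence results are collected in order. The sketch below uses plain Python lists in place of tensors. The keys input_ids and attention_mask follow the usual Hugging Face tokenizer output and are assumed here, and iter_batches, batched_pass and toy_forward are hypothetical names, not the wrapper's implementation.

```python
from typing import Callable, Dict, Iterator, List

Inputs = Dict[str, List[List[int]]]


def iter_batches(model_inputs: Inputs, batch_size: int) -> Iterator[Inputs]:
    """Yield slices of `model_inputs` holding at most `batch_size` sequences."""
    if not model_inputs:
        return
    num_seqs = len(next(iter(model_inputs.values())))
    for start in range(0, num_seqs, batch_size):
        yield {key: rows[start:start + batch_size] for key, rows in model_inputs.items()}


def batched_pass(model_inputs: Inputs, batch_size: int,
                 forward: Callable[[Inputs], List[int]]) -> List[int]:
    """Apply `forward` batch by batch and collect the results in input order."""
    outputs: List[int] = []
    for batch in iter_batches(model_inputs, batch_size):
        outputs.extend(forward(batch))
    return outputs


def toy_forward(batch: Inputs) -> List[int]:
    # Stand-in for a model: counts the unmasked positions of each sequence.
    return [sum(mask) for mask in batch["attention_mask"]]


# Toy inputs, assumed for this example only.
toy_inputs = {
    "input_ids": [[2, 5, 6, 3], [2, 7, 3, 0], [2, 8, 9, 3]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]],
}
result = batched_pass(toy_inputs, batch_size=2, forward=toy_forward)
```

With a real model, the forward function would return logits and embeddings per batch, and the collected batches would be concatenated along the first dimension to reach the shapes listed under Returns.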

get_alphabet_dataloader(self)

Defines an alphabet mapping so that common methods can be shared between ProtBert and ESM