# Finetuning

## How to finetune a model?

`bio-transformers` uses pytorch-lightning to load a pre-trained model and finetune it on your own datasets. The `finetune` method automatically scales across your visible GPUs to train in parallel, using the selected accelerator.

It is strongly recommended to use the `DDP` accelerator for training: [ddp](https://pytorch.org/docs/stable/notes/ddp.html). Note that `DDP` launches several Python processes. As a consequence, a model should be finetuned in a separate script and not mixed with inference functions such as `compute_loglikelihood` or `compute_embeddings`, to avoid GPU conflicts.

The model is finetuned by randomly masking a proportion of the amino acids in each sequence, as is commonly done in most state-of-the-art papers. By default, 15% of amino acids will be masked.

```{caution}
This method is designed to run on GPU; make sure you have a proper CUDA installation. Refer to this section for more information.
```

Do not train a model with the `DDP` **accelerator** in a notebook. Do not mix training and inference functions such as `compute_accuracy` or `compute_loglikelihood` in the same script, except with the `DP` accelerator.
With `DDP`, load the finetuned model in a separate script, as below.

```python
from biotransformers import BioTransformers

bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=1)
bio_trans.load_model("logs/finetune_masked/version_X/esm1_t6_43M_UR50S_finetuned.pt")
acc_after = bio_trans.compute_accuracy(..., batch_size=32)
```

## Parameters

The function can handle a fasta file or a list of sequences directly:

- **train_sequences**: Either a list of sequences or the path to a fasta file with SeqRecords.

The following arguments are important for the training:

- **lr**: the default learning rate (keep it low: < 5e-4)
- **warmup_updates**: the number of steps (optimizer steps, not epochs) over which the learning rate increases from **warmup_init_lr** to **lr**.
- **epochs**: number of epochs for training. Defaults to 10.
- **batch_size**: This size is only used internally to compute **accumulate_grad_batches** for gradient accumulation (TO BE UPDATED). **toks_per_batch** dynamically determines the number of sequences in a batch, in order to avoid GPU saturation.
- **acc_batch_size**: Number of batches to consider before computing the gradient.
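The interplay of these arguments can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names `warmup_lr`, `accumulation_steps` and `batches_by_token_budget` are hypothetical, the warmup is assumed to be linear and followed by a constant rate, **accumulate_grad_batches** is assumed to be `acc_batch_size // batch_size`, and the token budget is assumed to apply to the padded batch (longest sequence times number of sequences).

```python
def warmup_lr(step, lr=1.0e-5, warmup_init_lr=1e-7, warmup_updates=1024):
    """Learning rate at a given optimizer step, assuming a linear warmup."""
    if step >= warmup_updates:
        # Assumption: the rate stays at `lr` once the warmup is over.
        return lr
    return warmup_init_lr + (lr - warmup_init_lr) * step / warmup_updates


def accumulation_steps(acc_batch_size=256, batch_size=16):
    """Assumed relation between the two batch sizes (not confirmed by the docs)."""
    return max(1, acc_batch_size // batch_size)


def batches_by_token_budget(sequences, toks_per_batch=2000):
    """Group sequence indices so each padded batch fits in the token budget."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches, current = [], []
    for i in order:
        # Sequences are visited shortest first, so the incoming sequence is
        # the longest of the batch and sets the padded length.
        padded_tokens = len(sequences[i]) * (len(current) + 1)
        if current and padded_tokens > toks_per_batch:
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(warmup_lr(0), warmup_lr(512), warmup_lr(1024))
    print(accumulation_steps())
    print(batches_by_token_budget(["A" * 100, "A" * 150, "A" * 50, "A" * 200], 300))
```

Under these assumptions, short sequences end up in larger batches and long sequences in smaller ones, which is why the number of sequences per batch varies while memory use stays bounded.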

Three arguments customize the masking function used to build the training dataset:

- **masking_ratio** : ratio of tokens to be masked. Defaults to 0.025.
- **random_token_prob**: the probability that the chosen token is replaced with a random token.
- **masking_prob**: the probability that the chosen token is replaced with a mask token.
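To make the role of the three masking arguments concrete, here is a minimal sketch of BERT-style masking on a single protein sequence. It is not the library's masking function: `mask_sequence`, `AMINO_ACIDS` and `MASK` are hypothetical names, and the values passed in the usage lines (0.15, 0.8, 0.1) are illustrative choices rather than documented defaults.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "<mask>"  # hypothetical mask token, for illustration only


def mask_sequence(sequence, masking_ratio, masking_prob, random_token_prob, seed=0):
    """Return (tokens, positions): the corrupted tokens and the chosen positions."""
    rng = random.Random(seed)
    tokens = list(sequence)
    # Select a fraction `masking_ratio` of the positions (at least one).
    n_chosen = max(1, round(len(tokens) * masking_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_chosen))
    for pos in positions:
        draw = rng.random()
        if draw < masking_prob:
            tokens[pos] = MASK  # replaced with the mask token
        elif draw < masking_prob + random_token_prob:
            tokens[pos] = rng.choice(AMINO_ACIDS)  # replaced with a random token
        # Otherwise the token is left unchanged but remains a prediction target.
    return tokens, positions


if __name__ == "__main__":
    seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
    tokens, positions = mask_sequence(seq, 0.15, 0.8, 0.1)
    print(positions)
    print("".join(t if t != MASK else "_" for t in tokens))
```

In this sketch, the model would be trained to recover the original amino acid at each chosen position, whether that position shows a mask token, a random token, or the original token.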

All results are saved in the logs directory:

- **logs_save_dir**: Directory where the logs are saved. Defaults to logs.
- **logs_name_exp**: Name of the experiment in the logs.
- **checkpoint**: Path to a checkpoint file from which to restore a training session.
- **save_last_checkpoint**: Save the last checkpoint and the 2 best models
to restore the training session. This takes a large amount of time and memory.
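Because the finetuned weights end up in a versioned sub-directory, it can be convenient to locate the most recent one programmatically. The sketch below assumes the layout `<logs_save_dir>/<logs_name_exp>/version_<N>/<name>.pt`, which is only inferred from the example path used on this page; `latest_finetuned_checkpoint` is a hypothetical helper, not part of `bio-transformers`, and its default arguments are taken from that same example path.

```python
from pathlib import Path


def latest_finetuned_checkpoint(logs_save_dir="logs", logs_name_exp="finetune_masked"):
    """Return the first .pt file of the highest-numbered version directory, or None."""
    exp_dir = Path(logs_save_dir) / logs_name_exp
    versions = []
    for directory in exp_dir.glob("version_*"):
        suffix = directory.name.split("_", 1)[1]
        # Keep only directories named version_<integer>.
        if directory.is_dir() and suffix.isdigit():
            versions.append((int(suffix), directory))
    # Sort numerically so that version_10 comes after version_2.
    for _, directory in sorted(versions, reverse=True):
        checkpoints = sorted(directory.glob("*.pt"))
        if checkpoints:
            return checkpoints[0]
    return None


if __name__ == "__main__":
    path = latest_finetuned_checkpoint()
    print(path if path is not None else "No finetuned model found")
```

The returned path can then be passed to `load_model` in a separate evaluation script.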

## Example : training script

Training on some swissprot sequences. Training only works on GPU.

```python
import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray

data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Train on short sequences (fewer than 200 amino acids)
length = np.array(list(map(len, X))) < 200
train_seq = X[length][:15000]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)

bio_trans.finetune(
    train_seq,
    lr=1.0e-5,
    warmup_init_lr=1e-7,
    toks_per_batch=2000,
    epochs=20,
    batch_size=16,
    acc_batch_size=256,
    warmup_updates=1024,
    accelerator="ddp",
    checkpoint=None,
    save_last_checkpoint=False,
)
```

## Example : evaluation script

You can easily assess the quality of your finetuning with the provided functions, such as `compute_accuracy`.

```python
import biodatasets
import numpy as np
from biotransformers import BioTransformers
import ray


data = biodatasets.load_dataset("swissProt")
X, y = data.to_npy_arrays(input_names=["sequence"])
X = X[0]

# Keep sequences shorter than 200 AA and evaluate on a slice
# that was not used for training (the training script used [:15000]).
length = np.array(list(map(len, X))) < 200
train_seq = X[length][15000:20000]

ray.init()
bio_trans = BioTransformers("esm1_t6_43M_UR50S", num_gpus=4)
acc_before = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy before finetuning : {acc_before}")
```

```python
>> Accuracy before finetuning : 0.46
```

```python
bio_trans.load_model("logs/finetune_masked/version_X/esm1_t6_43M_UR50S_finetuned.pt")
acc_after = bio_trans.compute_accuracy(train_seq, batch_size=32)
print(f"Accuracy after finetuning : {acc_after}")
```

```python
>> Accuracy after finetuning : 0.76
```