Training models#
This tutorial teaches you how to train OpenProtein.AI’s Property Regression models. You can use the trained models to predict properties of new sequence variants and to design libraries of optimized sequences.
What you need before getting started#
You need an uploaded dataset and its assay object to create a training job. For more information, see Uploading your data.
Training a model#
Use the assay object to create a training job:
[ ]:
# `assay` is the assay object described in Uploading your data
train = session.train.create_training_job(
    assay,
    measurement_name=["isobutyramide_normalized_fitness"],
    model_name="mymodel",  # name the resulting model
)
train_id = train.id
train
Job(status=<JobStatus.PENDING: 'PENDING'>, job_id='421fc451-c60e-40ce-9be0-681f6b6be3c7', job_type='/workflow/train', created_date=datetime.datetime(2024, 5, 9, 5, 59, 54, 601657), start_date=None, end_date=None, prerequisite_job_id='51169089-7402-45cb-86d7-85111267ac72', progress_message=None, progress_counter=None, num_records=None, sequence_length=346)
[ ]:
train.refresh()
train.status
<JobStatus.PENDING: 'PENDING'>
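If you would rather poll than block, the `refresh()` and `status` calls above can be wrapped in a small loop. The sketch below is illustrative only: `poll_until_done`, `FakeJob`, and `FakeStatus` are hypothetical names, the stub stands in for a real job so the snippet runs on its own, and the set of terminal status values is an assumption (only `PENDING` and `SUCCESS` appear in this tutorial).

```python
import time
from enum import Enum


class FakeStatus(Enum):
    """Hypothetical stand-in for the library's JobStatus enum."""
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"


class FakeJob:
    """Hypothetical stub mimicking refresh() and status; succeeds after N refreshes."""

    def __init__(self, refreshes_needed=3):
        self.status = FakeStatus.PENDING
        self._remaining = refreshes_needed

    def refresh(self):
        self._remaining -= 1
        if self._remaining <= 0:
            self.status = FakeStatus.SUCCESS


def poll_until_done(job, done_values=("SUCCESS",), interval=0.01, timeout=5.0):
    """Refresh `job` until its status value is in `done_values`.

    Returns the final status value as a string, or raises TimeoutError
    if `timeout` seconds pass first. `done_values` is an assumption.
    """
    deadline = time.monotonic() + timeout
    while True:
        job.refresh()
        if job.status.value in done_values:
            return job.status.value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.status.value} after {timeout}s")
        time.sleep(interval)


final_status = poll_until_done(FakeJob())
print(final_status)
```

With a real job you would pass `train` in place of `FakeJob()` and choose a longer interval. For the common case, `train.wait()` already does the waiting for you.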
Wait for the results before proceeding:
[ ]:
results = train.wait(verbose=False)
You can plot the training loss at each step as a scatterplot:
[ ]:
import seaborn as sns  # plotting
import matplotlib.pyplot as plt
losses = results["isobutyramide_normalized_fitness"]  # one value per training step
sns.scatterplot(x=range(len(losses)), y=losses)
plt.xlabel("Steps")
plt.ylabel("Loss");
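Beyond eyeballing the plot, you can check numerically whether the loss has flattened. The following is a minimal sketch, assuming each entry of `results["isobutyramide_normalized_fitness"]` is a numeric loss value, as the plot above implies. The `losses` list is made-up placeholder data, and `has_plateaued` with its `window` and `rel_tol` parameters is a hypothetical helper, not part of the OpenProtein.AI API.

```python
def has_plateaued(losses, window=5, rel_tol=0.01):
    """Return True if the mean loss over the last `window` steps differs from
    the mean over the preceding `window` steps by less than `rel_tol` (relative)."""
    if len(losses) < 2 * window:
        return False  # not enough steps to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    if previous == 0:
        return recent == 0
    return abs(previous - recent) / abs(previous) < rel_tol


# Placeholder values for illustration only; replace with
# results["isobutyramide_normalized_fitness"] from your own training job.
losses = [1.0, 0.6, 0.4, 0.3, 0.25,
          0.201, 0.2008, 0.2006, 0.2004, 0.2002,
          0.2001, 0.2001, 0.2, 0.2, 0.2]

print(has_plateaued(losses))       # compares steps 10-14 with steps 5-9
print(has_plateaued(losses[:10]))  # compares steps 5-9 with steps 0-4
```

The window size and tolerance are arbitrary choices here. A loss that is still falling at the end of training may mean the model would benefit from more steps.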
Request a cross-validation job to see the training results in more detail:
[ ]:
cvjob = train.crossvalidate()
cvjob.status
<JobStatus.PENDING: 'PENDING'>
[ ]:
cvdata = cvjob.wait(verbose=True)
Waiting: 100%|██████████| 100/100 [00:00<00:00, 1457.25it/s, status=SUCCESS]
[ ]:
import pandas as pd
cvdata = pd.DataFrame(cvdata)
cvdata.head()
 | row_index | sequence | measurement_name | y | y_mu | y_var |
---|---|---|---|---|---|---|
0 | 0 | WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMK... | isobutyramide_normalized_fitness | -0.5174 | -0.515400 | 0.000019 |
1 | 1 | WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMK... | isobutyramide_normalized_fitness | -0.5154 | -0.517399 | 0.000025 |
2 | 2 | WRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKIAEMIVGMK... | isobutyramide_normalized_fitness | -0.5154 | -0.517399 | 0.000025 |
3 | 4 | MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKYAEMIVGMK... | isobutyramide_normalized_fitness | -0.7448 | -0.328170 | 0.052515 |
4 | 5 | MRHGDISSSNDTVGVAVVNYKMPRLHTAAEVLDNARKPAEMIVGMK... | isobutyramide_normalized_fitness | -0.7805 | -0.596381 | 0.071689 |
[ ]:
sns.regplot(x=cvdata.y.to_list(), y=cvdata.y_mu.to_list())
plt.xlabel("Y")
plt.ylabel("Y-hat")
plt.title("Cross validation results");
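To complement the regression plot, you can summarize the cross-validation results with a few numbers. The sketch below is not part of the OpenProtein.AI API. It rebuilds only the five rows shown in the `cvdata.head()` output above so that it runs on its own, and it assumes that `y_mu` and `y_var` are the predicted mean and a Gaussian predictive variance. In practice you would run the same calculations on the full `cvdata` frame, since five rows are too few to be meaningful.

```python
import numpy as np
import pandas as pd

# The five rows shown in the cvdata.head() output above (sequences omitted).
cv = pd.DataFrame({
    "y":     [-0.5174, -0.5154, -0.5154, -0.7448, -0.7805],
    "y_mu":  [-0.515400, -0.517399, -0.517399, -0.328170, -0.596381],
    "y_var": [0.000019, 0.000025, 0.000025, 0.052515, 0.071689],
})

# Root-mean-square error between measured and predicted values.
rmse = float(np.sqrt(((cv["y"] - cv["y_mu"]) ** 2).mean()))

# Spearman rank correlation, computed as the Pearson correlation of the
# ranks so that only pandas and NumPy are needed.
spearman = float(cv["y"].rank().corr(cv["y_mu"].rank()))

# Fraction of measurements inside mean +/- 1.96 standard deviations,
# assuming y_var is a Gaussian predictive variance.
half_width = 1.96 * np.sqrt(cv["y_var"])
coverage = float(((cv["y"] - cv["y_mu"]).abs() <= half_width).mean())

print(f"RMSE={rmse:.3f}  Spearman={spearman:.3f}  coverage={coverage:.2f}")
```

Rank correlation is often the more relevant number when the model is used to prioritize variants, while the coverage figure gives a rough check on whether the predicted variances are calibrated.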
Next steps#
Our Train API page contains more information about training your models.
You can use your trained model to perform a single site analysis or design sequences. See Using single site analysis and Designing sequences for details.