PyTorch on Google Cloud: How to train PyTorch models on AI Platform

PyTorch is an open source machine learning and deep learning library, primarily developed by Facebook, that is used in a growing number of use cases to automate machine learning tasks such as image recognition, natural language processing, translation, and recommendation systems. PyTorch has been used predominantly in research, but in recent years it has also gained considerable traction in industry thanks to its ease of use and deployment.

Google Cloud AI Platform is a fully managed end-to-end platform for data science and machine learning on Google Cloud. Drawing on Google's expertise in artificial intelligence, AI Platform offers a flexible, scalable, and reliable platform for running your machine learning workloads. AI Platform has built-in support for PyTorch through Deep Learning Containers. These containers are optimized for performance, tested for compatibility, and ready to deploy.

In this new series of blog posts, "PyTorch on Google Cloud", we aim to share how to build, train, and deploy PyTorch models at scale and how to create reproducible machine learning pipelines on Google Cloud.

Why use PyTorch on Google Cloud AI Platform?

Cloud AI Platform provides flexible and scalable hardware and secure infrastructure to train and deploy PyTorch-based deep learning models.

  • Flexibility. AI Platform Notebooks and AI Platform Training let you design your compute resources to match any workload, while the platform manages most of the dependencies, networking, and monitoring. Spend your time building models instead of worrying about infrastructure.
  • Scalability. Run your experiments on AI Platform Notebooks using pre-built PyTorch containers or custom containers, then scale your code by training models on GPUs or TPUs with the high availability of AI Platform Training.
  • Security. AI Platform uses the same global-scale technical infrastructure as the rest of Google, which is designed to provide security through the entire information processing life cycle.
  • Support. AI Platform collaborates closely with PyTorch and NVIDIA to ensure strong compatibility between AI Platform and NVIDIA GPUs, including PyTorch framework support.

The following is a quick reference to the support status of PyTorch on Google Cloud

In this article, we will:

  1. Use AI Platform Notebooks to set up a PyTorch development environment in a JupyterLab notebook
  2. Build a sentiment classification model with PyTorch and train it on AI Platform Training

You can find the supporting code for this blog post in the GitHub repository and the Jupyter notebook.

Let's get started!

Use case and dataset

In this article, we will use PyTorch to fine-tune a transformer model (BERT-base) from the Hugging Face Transformers library for a sentiment analysis task. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model that is pre-trained on a large corpus of unlabeled text in a self-supervised manner. We will start experimenting with the IMDB sentiment classification dataset on AI Platform Notebooks. We recommend using an AI Platform Notebooks instance with modest compute for development and experimentation. Once we are satisfied with the local experiments in the notebook, we will show how to submit the same Jupyter notebook to the AI Platform Training service to scale up training with larger GPU shapes. AI Platform Training streamlines the training pipeline by spinning up the infrastructure for training and tearing it down once training completes, without requiring you to manage the infrastructure.

In the next article, we will show how to deploy and serve these PyTorch models on the AI Platform Prediction service.

Create a development environment on AI Platform Notebooks

We will use a JupyterLab notebook on AI Platform Notebooks as the development environment. Before you start, you must set up a project on Google Cloud Platform and enable the AI Platform Notebooks API.

Please note that you are charged when you create an AI Platform Notebooks instance. You pay only for the time your notebook instance is up and running. You can choose to stop the instance, which saves your work and charges you only for disk storage until you restart the instance. Please delete the instance when you are finished.

You can create an AI Platform Notebooks instance in one of two ways:

  1. Use a pre-built PyTorch image from the AI Platform Deep Learning VM (DLVM) images, or
  2. Use a custom container with your own packages

Create a notebook instance using the pre-built PyTorch DLVM image

An AI Platform Notebooks instance is an AI Platform Deep Learning VM instance with a JupyterLab notebook environment enabled and ready to use. AI Platform Notebooks offers a PyTorch image family that supports multiple PyTorch versions. You can create a new notebook instance from the Google Cloud Console or the command line interface (CLI). We will use the gcloud CLI to create a notebook instance with an NVIDIA Tesla T4 GPU. You can run the gcloud command from Cloud Shell or from any terminal with the Cloud SDK installed.
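The original command is not reproduced in this text, so the following is only a sketch of what such a gcloud invocation could look like, assembled in Python so the arguments are easy to inspect before running them. The instance name, zone, machine type, and image family (pytorch-1-7-cu110-notebooks) are assumptions for illustration; check them against the current gcloud reference before use.

```python
import shlex


def notebook_create_command(instance_name, location="us-central1-a"):
    """Build the argv list for creating a PyTorch notebook instance with one T4 GPU.

    All values below are illustrative assumptions, not taken from the article.
    """
    return [
        "gcloud", "notebooks", "instances", "create", instance_name,
        "--vm-image-project=deeplearning-platform-release",
        "--vm-image-family=pytorch-1-7-cu110-notebooks",  # assumed image family
        "--machine-type=n1-standard-8",                   # assumed machine type
        f"--location={location}",                         # assumed zone
        "--accelerator-type=NVIDIA_TESLA_T4",
        "--accelerator-core-count=1",
        "--install-gpu-driver",
    ]


command = notebook_create_command("example-instance")
# Print a copy-pasteable shell command instead of executing it.
print(shlex.join(command))
```

Printing the command rather than executing it keeps the sketch free of side effects; pass the list to subprocess.run once the values match your project.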

To interact with the new notebook instance, go to the AI Platform Notebooks page in the Google Cloud Console and click the "OPEN JUPYTERLAB" link next to the new instance. The link becomes active once the instance is ready.

Most of the libraries needed for experimenting with PyTorch are already installed on the new instance through the pre-built PyTorch DLVM image. To install additional dependencies, run %pip install in a notebook cell. For the sentiment classification use case, we will install additional packages such as the Hugging Face transformers and datasets libraries.

Notebook instance with custom container

An alternative to installing dependencies with pip in the notebook instance is to package them in a Docker container image derived from an AI Platform Deep Learning Container image, creating a custom container. You can use this custom container to create AI Platform Notebooks instances or AI Platform Training jobs. The following steps are an example of creating a notebook instance with a custom container.

1. Create a Dockerfile that uses one of the AI Platform Deep Learning Container images as the base image (here we use the PyTorch 1.7 GPU image) and installs the packages or frameworks you need. For the sentiment classification use case, these include the transformers and datasets libraries.

2. Use Cloud Build from a terminal or Cloud Shell to build the image from the Dockerfile, and note the resulting image location, {project_id}/{image_name}.

3. Use the command line to create a notebook instance with the custom image built in step 2.
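The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the article's original commands: the base image name, the gcr.io registry host, the "latest" tag, the machine type, and the helper names (image_uri, build_command, notebook_command) are all hypothetical choices made for this example.

```python
import shlex

# Step 1: a minimal Dockerfile. The base image name is an assumption.
DOCKERFILE = """\
FROM gcr.io/deeplearning-platform-release/pytorch-gpu.1-7
RUN pip install transformers datasets
"""


def image_uri(project_id, image_name):
    """Image location, assuming the Container Registry host gcr.io."""
    return f"gcr.io/{project_id}/{image_name}"


def build_command(project_id, image_name):
    """Step 2: build the image with Cloud Build from the current directory."""
    return ["gcloud", "builds", "submit", "--tag", image_uri(project_id, image_name), "."]


def notebook_command(instance_name, project_id, image_name, location="us-central1-a"):
    """Step 3: create a notebook instance from the custom container image."""
    return [
        "gcloud", "notebooks", "instances", "create", instance_name,
        f"--container-repository={image_uri(project_id, image_name)}",
        "--container-tag=latest",        # assumed tag
        "--machine-type=n1-standard-8",  # assumed machine type
        f"--location={location}",        # assumed zone
    ]


build = build_command("my-project", "pytorch-custom")
create = notebook_command("example-instance", "my-project", "pytorch-custom")
print(DOCKERFILE)
print(shlex.join(build))
print(shlex.join(create))
```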

Train a PyTorch model on AI Platform

After creating an AI Platform Notebooks instance, you can start experimenting. Let's look at the model specifics for this use case.

Model specifics

To analyze the sentiment of movie reviews in the IMDB dataset, we will fine-tune a pre-trained BERT model from Hugging Face. Fine-tuning involves taking a model that has already been trained for a given task and adapting it to another, similar task. Specifically, the adaptation copies all layers of the pre-trained model, including weights and parameters, except the output layer. A new output classifier layer is then added to predict the labels of the current task. The final step is to train the output layer from scratch while the parameters of all layers of the pre-trained model are frozen. This makes it possible to learn from the pre-trained representations and "fine-tune" the higher-level feature representations so that they are more relevant to the specific task, which in this case is sentiment analysis.

For the sentiment analysis scenario here, the pre-trained BERT model already encodes a lot of information about the language, because it was trained on a large corpus of English data in a self-supervised manner. We now only need to adjust it slightly, using its outputs as features for the sentiment classification task. This means faster development iterations on a much smaller dataset, instead of training a task-specific natural language processing (NLP) model on a larger training dataset.

Pre-trained model with classification layer. The blue box represents the pre-trained BERT encoder module. The output of the encoder is pooled into a linear layer whose number of outputs equals the number of target labels (classes).

To train the sentiment classification model, we will:

  • Preprocess and transform (tokenize) the review data
  • Load the pre-trained BERT model and add a sequence classification head for sentiment analysis
  • Fine-tune the BERT model for sentence classification

The code for preprocessing the data and fine-tuning the pre-trained BERT model, together with a detailed explanation of these tasks, is in the Jupyter notebook.

Notice that in that code the weights of the encoder (also known as the base model) are not frozen. This is why a very small learning rate (2e-5) is chosen: it avoids losing the pre-trained representations. The learning rate and other hyperparameters are captured in the TrainingArguments object. During training, we capture only the accuracy metric. You can modify the compute_metrics function to capture and report other metrics.
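As a rough illustration of an accuracy-only metric function (not the authors' code, which lives in the notebook), the sketch below assumes the evaluation loop hands compute_metrics a pair of per-class logits and integer labels. It uses only plain Python so that it also works on nested lists; the sample logits and labels are made up.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair.

    logits: one row of per-class scores per example.
    labels: the integer class id of each example.
    """
    logits, labels = eval_pred
    # The predicted class is the index of the largest score in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Hypothetical logits for four reviews (class 0 = negative, class 1 = positive).
sample_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
sample_labels = [1, 0, 0, 0]
metrics = compute_metrics((sample_logits, sample_labels))
print(metrics)  # three of the four predictions match the labels
```

To report further metrics such as precision or recall, add more keys to the returned dictionary.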

We will discuss integration with the Cloud AI Platform hyperparameter tuning service in the next article of this series.

Train the model on Cloud AI Platform

Although you can run local experiments on your AI Platform Notebooks instance, larger datasets or models often require vertically scaled compute resources or horizontally distributed training. The most effective way to do this is with the AI Platform Training service. AI Platform Training takes care of creating the designated compute resources for the task, running the training task, and ensuring that the compute resources are deleted once training completes.

Before you can run the training application on AI Platform Training, the training application code and its required dependencies must be packaged and uploaded to a Google Cloud Storage bucket that your Google Cloud project can access. There are two ways to package the application and run it on AI Platform Training:

  1. Package the application and its Python dependencies manually using Python setup tools
  2. Use custom containers to package the dependencies in a Docker container

You can structure your training code in any way you like. Refer to the GitHub repository or the Jupyter notebook for our recommended approach to structuring training code.

Training with Python packaging

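For this sentiment classification task, we must package the training code together with its standard Python dependencies (for example, the Hugging Face transformers and datasets libraries installed earlier) in a setuptools setup script, conventionally named setup.py. Within that script, the package-discovery function (find_packages in setuptools) includes the training code in the package as a dependency.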

Now, with the gcloud SDK installed, you can use the gcloud command from Cloud Shell or a terminal to submit the training job to Cloud AI Platform Training. The gcloud ai-platform jobs submit training command stages the training application on a GCS bucket and submits the training job. We attach 2 NVIDIA Tesla T4 GPUs to the training job to speed up training.
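A submission of this kind might look like the sketch below, which builds the argument list in Python and prints it. The job name, bucket path, region, machine type, pre-built training image URI, and the trainer/trainer.task package layout are assumptions for illustration and should be replaced with the values from your own project and the repository.

```python
import shlex


def submit_package_job(job_name, job_dir, region="us-central1"):
    """Build the argv for a training job that runs a packaged Python module on 2 T4 GPUs."""
    return [
        "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
        f"--region={region}",
        # Assumed pre-built PyTorch training image.
        "--master-image-uri=gcr.io/cloud-ml-public/training/pytorch-gpu.1-7",
        "--scale-tier=CUSTOM",
        "--master-machine-type=n1-standard-8",                # assumed machine type
        "--master-accelerator=type=nvidia-tesla-t4,count=2",  # 2 x T4, as in the text
        f"--job-dir={job_dir}",                               # GCS path for staging and outputs
        "--package-path=./trainer",                           # assumed package directory
        "--module-name=trainer.task",                         # assumed entry-point module
    ]


job_command = submit_package_job("bert_sentiment_001", "gs://my-bucket/bert_sentiment_001")
print(shlex.join(job_command))
```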

Use custom containers for training

To create a training job with a custom container, you must define a Dockerfile that installs the dependencies required by the training job. You then build and test your Docker image locally to verify it before using it with AI Platform Training.

Before submitting the training job, you need to push the image to Google Cloud Container Registry. You can then use the gcloud ai-platform jobs submit training command to submit the training job to Cloud AI Platform Training.
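The build, push, and submit sequence could be sketched as follows. The image URI, tag, job name, region, and machine configuration are placeholders chosen for this example, and the helper name custom_container_commands is hypothetical; the commands are printed rather than executed.

```python
import shlex


def custom_container_commands(project_id, image_name, job_name, region="us-central1"):
    """Return the docker and gcloud commands for a custom-container training job."""
    image = f"gcr.io/{project_id}/{image_name}:latest"  # assumed registry host and tag
    return [
        # Build the training image from the Dockerfile in the current directory.
        ["docker", "build", "-f", "Dockerfile", "-t", image, "."],
        # Push the image to Container Registry so the training service can pull it.
        ["docker", "push", image],
        # Submit the job, pointing the master worker at the custom image.
        [
            "gcloud", "ai-platform", "jobs", "submit", "training", job_name,
            f"--region={region}",
            f"--master-image-uri={image}",
            "--scale-tier=CUSTOM",
            "--master-machine-type=n1-standard-8",                # assumed machine type
            "--master-accelerator=type=nvidia-tesla-t4,count=2",  # assumed GPU shape
        ],
    ]


steps = custom_container_commands("my-project", "bert-trainer", "bert_sentiment_002")
for step in steps:
    print(shlex.join(step))
```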

Once the job is submitted, you can monitor the status and progress of the training job in the Google Cloud console or using the gcloud command, as shown in the figure below.

You can also monitor the job status and view the job logs from the Google Cloud AI Platform jobs console.

Let's run prediction calls on the trained model locally with a few examples (please refer to the notebook for the complete code). The next article in this series will show how to deploy this model on the AI Platform Prediction service.
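The notebook contains the actual prediction code. As a small, library-free illustration of the last step of such a call, the sketch below turns the two raw scores (logits) a binary sentiment classifier returns into a label and a confidence. The label order and the sample logits are assumptions, not outputs of the trained model.

```python
import math

# Assumed class order: index 0 = negative, index 1 = positive.
LABELS = ["negative", "positive"]


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one review.
label, confidence = predict_label([-1.3, 2.1])
print(label, round(confidence, 3))
```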

Clean up the notebook environment

After you finish experimenting, you can stop or delete the AI Platform Notebooks instance. Delete the instance to prevent further charges. If you want to save your work, you can choose to stop the instance instead.

What's next?

In this article, we explored Cloud AI Platform Notebooks as a fully customizable IDE for PyTorch model development. We then trained the model on the Cloud AI Platform Training service, a fully managed service for training machine learning models at scale.
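Both clean-up options can be expressed as gcloud commands. The sketch below only builds and prints them; the instance name and zone are placeholders, and the helper name cleanup_command is hypothetical.

```python
import shlex


def cleanup_command(action, instance_name, location="us-central1-a"):
    """Build the gcloud argv to stop (keep the disk) or delete a notebook instance."""
    if action not in ("stop", "delete"):
        raise ValueError("action must be 'stop' or 'delete'")
    return ["gcloud", "notebooks", "instances", action, instance_name,
            f"--location={location}"]


stop_cmd = cleanup_command("stop", "example-instance")      # keeps your work on disk
delete_cmd = cleanup_command("delete", "example-instance")  # ends all charges for the instance
print(shlex.join(stop_cmd))
print(shlex.join(delete_cmd))
```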


  • Introduction to AI Platform Notebooks
  • Code and accompanying notebook

In the next article in this series, we will look at hyperparameter tuning on Cloud AI Platform and at deploying the PyTorch model on the AI Platform Prediction service. We encourage you to explore the Cloud AI Platform capabilities we have covered.

Please stay tuned. Thank you for reading! Have questions or want to chat? Find the authors here: Rajesh [Twitter | LinkedIn] and Vaibhav [LinkedIn].

Thanks to Amy Unruh and Karl Weinmeister for their help and comments.