Build and Deploy Machine Learning Pipelines using BERT, RoBERTa, and Amazon SageMaker
3 Projects Description:
Automate a natural language processing task by building an end-to-end machine learning pipeline using Hugging Face’s highly optimized implementation of the state-of-the-art BERT algorithm with Amazon SageMaker Pipelines. The pipeline will first transform the dataset into BERT-readable features and store the features in the Amazon SageMaker Feature Store. It will then fine-tune a text classification model on the dataset using a Hugging Face pre-trained model, which has learned to understand human language from millions of Wikipedia documents. Finally, your pipeline will evaluate the model’s accuracy and deploy the model only if the accuracy exceeds a given threshold.
My Solutions: Practical Data Science Projects from Coursera, DeepLearning.AI and Amazon Web Services
Description: In these projects you will handle massive datasets that do not fit on your local hardware and could originate from multiple sources. We will use Amazon SageMaker to Build, Train and Deploy our Machine Learning pipeline using BERT and RoBERTa models. One of the biggest benefits of developing and running data science projects in the cloud is the agility and elasticity that the cloud offers to scale up and out at minimal cost. These projects are designed for data-focused developers, scientists, and analysts who are familiar with the Python and SQL programming languages and want to learn how to build, train, and deploy scalable, end-to-end ML pipelines, both automated and human-in-the-loop, in the AWS cloud.
ML Pipeline using Amazon SageMaker
Feature transformation with Amazon SageMaker processing job and Feature Store
Start with the raw Women’s Clothing Reviews dataset and prepare it to train a BERT-based natural language processing (NLP) model. The model will be used to classify customer reviews into positive (1), neutral (0) and negative (-1) sentiment.
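The text above names the three sentiment classes but not how they are derived. A minimal sketch, assuming each review carries a 1 to 5 star rating and that ratings of 1 or 2 count as negative, 3 as neutral, and 4 or 5 as positive (both the rating column and these cut-offs are assumptions, and `star_rating_to_sentiment` is a hypothetical helper name):

```python
def star_rating_to_sentiment(star_rating: int) -> int:
    """Map a 1-5 star rating to a sentiment class.

    Assumed mapping (not stated in the project description):
    1-2 stars -> -1 (negative), 3 stars -> 0 (neutral), 4-5 stars -> 1 (positive).
    """
    if star_rating not in (1, 2, 3, 4, 5):
        raise ValueError(f"star_rating must be an integer from 1 to 5, got {star_rating!r}")
    if star_rating <= 2:
        return -1  # negative
    if star_rating == 3:
        return 0  # neutral
    return 1  # positive


# Label one review per possible rating.
ratings = [1, 2, 3, 4, 5]
sentiments = [star_rating_to_sentiment(r) for r in ratings]
```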
Convert the original review text into machine-readable features used by BERT. To perform the required feature transformation you will configure an Amazon SageMaker processing job that runs a custom Python script.
1. Configure the SageMaker Feature Store
2. Transform the Dataset
3. Inspect the transformed Data
4. Query the Feature Store
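To make "machine-readable features" concrete, the sketch below shows the general shape of such features: a fixed-length list of token ids plus a mask that marks real tokens versus padding. It is a toy stand-in, not the project's processing script. The vocabulary, the whitespace tokenizer, the feature names `input_ids` and `input_mask`, and the sequence length are all assumptions; a real BERT tokenizer uses a WordPiece vocabulary of roughly 30,000 entries.

```python
from typing import Dict, List

# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "i": 4, "love": 5, "this": 6, "dress": 7,
}


def to_features(review_body: str, max_seq_length: int = 8) -> Dict[str, List[int]]:
    """Turn review text into fixed-length token ids and an attention mask."""
    # Whitespace tokenization is a simplification of WordPiece tokenization.
    tokens = review_body.lower().split()
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]

    input_ids = [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks a real token

    # Pad both lists up to the fixed sequence length.
    padding = max_seq_length - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * padding
    input_mask += [0] * padding  # 0 marks padding
    return {"input_ids": input_ids, "input_mask": input_mask}


features = to_features("I love this dress")
```

Every record has the same length regardless of review length, which is what lets the features be batched for training and stored as uniform rows.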
SageMaker pipelines to Build, Train and Deploy RoBERTa text classifier
Train a text classifier using a variant of BERT called RoBERTa (a Robustly Optimized BERT Pretraining Approach) within a PyTorch model run as a SageMaker Training Job.
Train a review classifier with BERT and Amazon SageMaker
5. Configure dataset
6. Configure model hyper-parameters
7. Set up evaluation metrics, debugger and profiler
8. Train model
9. Analyze debugger results
10. Deploy and test the model
SageMaker pipelines to train a BERT-Based text classifier
- Define and run a pipeline using a directed acyclic graph (DAG) with specific pipeline parameters and model hyper-parameters
- Define a processing step that cleans, balances, transforms, and splits our dataset into train, validation, and test datasets
- Define a training step that trains a model using the train and validation datasets
- Define a processing step that evaluates the trained model’s performance on the test dataset
- Define a register model step that creates a model package from the trained model
- Define a conditional step that checks the model’s performance and conditionally registers the model for deployment
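The steps above form a directed acyclic graph: each step can run only after the steps it depends on have finished. A minimal sketch of that idea using Python's standard-library `graphlib` (Python 3.9+) rather than the SageMaker SDK; the step names and the exact dependency edges are assumptions made for illustration:

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
# Step names and edges are hypothetical, chosen to mirror the list above.
PIPELINE_DAG = {
    "Processing": set(),
    "Train": {"Processing"},                  # needs the train/validation splits
    "EvaluateModel": {"Train", "Processing"},  # needs the model and the test split
    "AccuracyCondition": {"EvaluateModel"},    # needs the evaluation metrics
    "RegisterModel": {"AccuracyCondition"},    # runs only if the condition passes
}

# A topological sort yields an order in which every step follows its dependencies.
execution_order = list(TopologicalSorter(PIPELINE_DAG).static_order())
print(execution_order)
```

A pipeline service performs the same kind of ordering for you; adding a cycle to the dictionary makes `TopologicalSorter` raise `graphlib.CycleError`, which is why the graph must stay acyclic.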
This notebook focuses on the following features of Amazon SageMaker Pipelines:
- Pipelines – a directed acyclic graph (DAG) of steps and conditions to orchestrate SageMaker jobs and resource creation
- Processing job steps – a simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model explainability
- Training job steps – an iterative process that teaches a model to make predictions on new data by presenting examples from a training dataset
- Conditional step execution – provides conditional execution of branches in a pipeline
- Registering models – register a model in a model registry to create deployable models in Amazon SageMaker
- Parameterized pipeline executions – allows pipeline executions to vary by supplied parameters
- Model endpoint – hosts the model as a REST endpoint to serve predictions from new data
- Configure dataset and processing step
- Configure training step
- Configure model-evaluation step
- Configure register model step
- Create model for deployment step
- Check accuracy condition step
- Create and start pipeline
- List pipeline artifacts
- Approve and deploy model
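Parameterized pipeline executions, mentioned in the feature list above, amount to a set of named defaults that an individual run may override. A minimal plain-Python sketch of that behavior; the parameter names and default values are hypothetical and this is not the SageMaker SDK's parameter API:

```python
from typing import Any, Dict, Optional

# Hypothetical pipeline parameters with default values.
DEFAULT_PARAMETERS: Dict[str, Any] = {
    "epochs": 1,
    "learning_rate": 1e-5,
    "min_accuracy": 0.5,
}


def resolve_parameters(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge per-execution overrides into the defaults, rejecting unknown names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULT_PARAMETERS)
    if unknown:
        raise ValueError(f"Unknown pipeline parameters: {sorted(unknown)}")
    # Later keys win, so overrides replace the matching defaults.
    return {**DEFAULT_PARAMETERS, **overrides}


# One execution trains longer; everything else keeps its default.
run_parameters = resolve_parameters({"epochs": 3})
```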
The pipeline that you will create follows a typical machine learning application pattern of pre-processing, training, evaluation, and model registration.
In the processing step, you will perform feature engineering to transform the review_body text into BERT embeddings using the pre-trained BERT model and split the dataset into train, validation and test files. The transformed dataset is stored in a feature store. To optimize for TensorFlow training, the transformed dataset files are saved using the TFRecord format in Amazon S3.
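The split into train, validation and test files can be sketched as a seeded shuffle followed by slicing. The 90/5/5 proportions, the seed, and the `split_dataset` helper are assumptions for illustration; the project description does not state the actual split ratios.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    rows: Sequence,
    train_fraction: float = 0.90,       # assumed proportion
    validation_fraction: float = 0.05,  # assumed proportion
    seed: int = 42,
) -> Tuple[List, List, List]:
    """Shuffle rows reproducibly and split them into train, validation and test lists."""
    if not 0 < train_fraction + validation_fraction < 1:
        raise ValueError("train and validation fractions must leave room for a test split")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible

    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
```

Because the three slices partition one shuffled list, no review can appear in more than one split.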
In the training step, you will fine-tune the BERT model to the customer reviews dataset and add a new classification layer to predict the sentiment for a given review.
In the evaluation step, you will take the trained model and a test dataset as input, and produce a JSON file containing classification evaluation metrics.
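As a rough illustration of that output, the sketch below computes accuracy from true and predicted labels and serializes it to JSON. The nested `metrics` / `accuracy` / `value` layout and the `build_evaluation_report` helper are assumptions; the source only says that the file contains classification evaluation metrics.

```python
import json
from typing import Dict, Sequence


def build_evaluation_report(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict:
    """Compute accuracy and wrap it in a JSON-serializable report (assumed layout)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for truth, prediction in zip(y_true, y_pred) if truth == prediction)
    accuracy = correct / len(y_true)
    return {"metrics": {"accuracy": {"value": accuracy}}}


# Three of the four predictions match the true sentiment labels.
report = build_evaluation_report(y_true=[1, 0, -1, 1], y_pred=[1, 0, 1, 1])
report_json = json.dumps(report)
print(report_json)
```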
In the condition step, you will register the trained model if the accuracy of the model, as determined by our evaluation step, exceeds a given threshold value.
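The logic of the condition step can be sketched in plain Python: read the accuracy from the evaluation JSON and compare it with the threshold. The JSON layout, the threshold values, and the `should_register` helper are assumptions for illustration, not the pipeline's actual condition definition.

```python
import json


def should_register(evaluation_json: str, min_accuracy: float) -> bool:
    """Return True only if the reported accuracy exceeds the threshold."""
    report = json.loads(evaluation_json)
    # Assumed layout: {"metrics": {"accuracy": {"value": <float>}}}
    accuracy = report["metrics"]["accuracy"]["value"]
    return accuracy > min_accuracy


example_report = '{"metrics": {"accuracy": {"value": 0.75}}}'
register_low_bar = should_register(example_report, min_accuracy=0.70)
register_high_bar = should_register(example_report, min_accuracy=0.90)
```

With the example report, the model would be registered against the 0.70 threshold and skipped against the 0.90 threshold.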