Jose Luis Silva, Ph.D.

Physicist || Ph.D. || Founder
  • Ph.D. in Physics
    UU 🇸🇪
  • Postdoc in AI:
    LiU 🇸🇪
  • Founder
    Aicavity Academy 🇸🇪
  • Co-Founder:
    Oxaala LTDA. 🇸🇪 🇧🇷
  • Life:
    Brazilian-Swedish 🇧🇷 🇸🇪
Research Interests:
  • Artificial Intelligence & Machine Learning
  • Graphs, Computer Vision & NLP
  • Data Science, Analytics & Decision Making
  • Deep Learning & Reinforcement Learning

Build, Train & Deploy Machine Learning Pipelines using BERT, RoBERTa and Amazon SageMaker

May 13, 2022

Description of the 3 Projects:

Automate a natural language processing task by building an end-to-end machine learning pipeline using Hugging Face's highly optimized implementation of the state-of-the-art BERT algorithm with Amazon SageMaker Pipelines. The pipeline will first transform the dataset into BERT-readable features and store the features in the Amazon SageMaker Feature Store. It will then fine-tune a text classification model on the dataset using a Hugging Face pre-trained model, which has learned to understand human language from millions of Wikipedia documents. Finally, your pipeline will evaluate the model's accuracy and deploy the model only if the accuracy exceeds a given threshold.

My Solutions: Practical Data Science Projects from Coursera, DeepLearning.AI and Amazon Web Services

– My Certificate –

Description: In these projects you will be handling massive datasets that do not fit on your local hardware and could originate from multiple sources. We will use Amazon SageMaker to build, train and deploy our machine learning pipeline using BERT and RoBERTa models. One of the biggest benefits of developing and running data science projects in the cloud is the agility and elasticity that the cloud offers to scale up and out at minimal cost. The course is designed for data-focused developers, scientists, and analysts who are familiar with the Python and SQL programming languages and want to learn how to build, train, and deploy scalable, end-to-end ML pipelines – both automated and human-in-the-loop – in the AWS cloud.

ML Pipeline using Amazon SageMaker

Project 1:

Feature transformation with Amazon SageMaker processing job and Feature Store

Start with the raw Women’s Clothing Reviews dataset and prepare it to train a BERT-based natural language processing (NLP) model. The model will be used to classify customer reviews into positive (1), neutral (0) and negative (-1) sentiment.
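The three sentiment classes can be sketched as a small labelling function. This is only an illustration: it assumes (the text above does not say so) that each review carries a 1 to 5 star rating and that 1-2 stars count as negative, 3 as neutral and 4-5 as positive. The name `rating_to_sentiment` is hypothetical.

```python
def rating_to_sentiment(star_rating: int) -> int:
    """Map a 1-5 star rating to -1 (negative), 0 (neutral) or 1 (positive).

    The cut-offs are an assumption made for this sketch.
    """
    if not 1 <= star_rating <= 5:
        raise ValueError(f"star rating out of range: {star_rating}")
    if star_rating <= 2:
        return -1
    if star_rating == 3:
        return 0
    return 1


# Label one review per possible rating.
labels = [rating_to_sentiment(rating) for rating in (1, 2, 3, 4, 5)]
print(labels)
```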

Convert the original review text into machine-readable features used by BERT. To perform the required feature transformation, you will configure an Amazon SageMaker processing job that runs a custom Python script.
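To give a feel for what "machine-readable features" means here, the following toy sketch turns a review into fixed-length token ids plus an attention mask. It is not the script used in the project: the vocabulary, the ids, the whitespace tokenization and the `to_features` helper are all made up for illustration, whereas a real job would use the tokenizer that ships with the pre-trained model.

```python
from typing import Dict, List

# Toy vocabulary with invented ids; a real pipeline loads the model's own vocabulary.
VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "love": 4, "this": 5, "dress": 6,
}


def to_features(text: str, max_seq_length: int = 8) -> Dict[str, List[int]]:
    """Turn text into fixed-length input_ids and an attention mask."""
    # Whitespace splitting stands in for real sub-word tokenization.
    tokens = text.lower().split()
    # Reserve two positions for the [CLS] and [SEP] markers.
    tokens = tokens[: max_seq_length - 2]
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
    ids.append(VOCAB["[SEP]"])
    padding = max_seq_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * padding,
        # 1 marks a real token, 0 marks padding.
        "input_mask": [1] * len(ids) + [0] * padding,
    }


features = to_features("Love this dress")
print(features)
```

Every review ends up with the same length, which is what lets the features be batched for training.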


1. Configure the SageMaker Feature Store

2. Transform the Dataset

3. Inspect the transformed Data

4. Query the Feature Store
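As a rough sketch of what ends up in the Feature Store: each record needs a record identifier and an event time, and, as far as I know, records are written as a list of `FeatureName`/`ValueAsString` pairs. The feature names and values below (`review_id`, `date`, `input_ids`, `sentiment`) are hypothetical, and the helper only builds the record in memory without calling AWS.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def to_feature_store_record(row: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a flat dict into a list of FeatureName/ValueAsString pairs."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
    ]


# Event time as an ISO-8601 UTC string.
event_time = datetime(2022, 5, 13, tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

row = {
    "review_id": "r-0001",                  # hypothetical record identifier
    "date": event_time,                     # hypothetical event-time feature
    "input_ids": [2, 4, 5, 6, 3, 0, 0, 0],  # list features are stored as strings
    "sentiment": 1,
}
record = to_feature_store_record(row)
print(record)
```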

Project 2:

SageMaker pipelines to Build, Train and Deploy RoBERTa text classifier

Train a text classifier using a variant of BERT called RoBERTa – a Robustly Optimized BERT Pretraining Approach – within a PyTorch model run as a SageMaker Training Job.

Train a review classifier with BERT and Amazon SageMaker

1. Configure dataset

2. Configure model hyper-parameters

3. Set up evaluation metrics, debugger and profiler

4. Train model

5. Analyze debugger results

6. Deploy and test the model
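Evaluation metrics for a training job are typically declared as name/regex pairs that are matched against the training log. The sketch below mimics that matching locally with Python's `re` module. The metric names and the log line format are assumptions for illustration, not the ones used in the project.

```python
import re
from typing import Dict, List

# Name/Regex pairs in the shape used for training-job metric definitions.
# The names and patterns here are invented for this sketch.
metric_definitions: List[Dict[str, str]] = [
    {"Name": "validation:loss", "Regex": r"val_loss: ([0-9.]+)"},
    {"Name": "validation:accuracy", "Regex": r"val_acc: ([0-9.]+)"},
]


def extract_metrics(log_line: str) -> Dict[str, float]:
    """Apply each regex to a log line and collect the captured values."""
    found: Dict[str, float] = {}
    for definition in metric_definitions:
        match = re.search(definition["Regex"], log_line)
        if match:
            found[definition["Name"]] = float(match.group(1))
    return found


# Hypothetical line printed by the training script.
metrics = extract_metrics("epoch 1 val_loss: 0.7421 val_acc: 0.6812")
print(metrics)
```

The training script only has to print its metrics in a predictable format; the regexes do the rest.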

Project 3:

SageMaker pipelines to train a BERT-Based text classifier

  • Define and run a pipeline using a directed acyclic graph (DAG) with specific pipeline parameters and model hyper-parameters
  • Define a processing step that cleans, balances, transforms, and splits our dataset into train, validation, and test datasets
  • Define a training step that trains a model using the train and validation datasets
  • Define a processing step that evaluates the trained model’s performance on the test dataset
  • Define a register model step that creates a model package from the trained model
  • Define a conditional step that checks the model’s performance and conditionally registers the model for deployment
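The idea of a pipeline as a directed acyclic graph can be sketched with the standard library alone (Python 3.9+ for `graphlib`). The step names below are shorthand for the steps listed above and the dependencies are my reading of that list. This is not the SageMaker Pipelines API, just a way to see how the order of execution falls out of the dependencies.

```python
from graphlib import TopologicalSorter

# Each key is a step; its set lists the steps that must finish first.
pipeline_dag = {
    "processing": set(),
    "training": {"processing"},
    "evaluation": {"training", "processing"},
    "check_accuracy": {"evaluation"},
    "register_model": {"check_accuracy"},
}

# A topological sort yields an order in which every step runs after its dependencies.
execution_order = list(TopologicalSorter(pipeline_dag).static_order())
print(execution_order)
```

If a dependency cycle were introduced by mistake, `TopologicalSorter` would raise `graphlib.CycleError`, which is the "acyclic" part of the DAG being enforced.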


This notebook focuses on the following features of Amazon SageMaker Pipelines:

  • Pipelines – a directed acyclic graph (DAG) of steps and conditions to orchestrate SageMaker jobs and resource creation
  • Processing job steps – a simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model explainability
  • Training job steps – an iterative process that teaches a model to make predictions on new data by presenting examples from a training dataset
  • Conditional step execution – provides conditional execution of branches in a pipeline
  • Registering models – register a model in a model registry to create deployable models in Amazon SageMaker
  • Parameterized pipeline executions – allows pipeline executions to vary by supplied parameters
  • Model endpoint – hosts the model as a REST endpoint to serve predictions from new data

Notebook outline:

  • Configure dataset and processing step
  • Configure training step
  • Configure model-evaluation step
  • Configure register model step
  • Create model for deployment step
  • Check accuracy condition step
  • Create and start pipeline
  • List pipeline artifacts
  • Approve and deploy model

BERT Pipeline

The pipeline that you will create follows a typical machine learning application pattern of pre-processing, training, evaluation, and model registration.

In the processing step, you will perform feature engineering to transform the review_body text into BERT embeddings using the pre-trained BERT model and split the dataset into train, validation and test files. The transformed dataset is stored in a feature store. To optimize for TensorFlow training, the transformed dataset files are saved using the TFRecord format in Amazon S3.
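The split into train, validation and test files can be sketched as a seeded shuffle followed by two cuts. The 90/5/5 proportions, the seed and the `split_dataset` helper are assumptions for illustration; the text above does not state the proportions used.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(
    rows: Sequence,
    train: float = 0.90,
    validation: float = 0.05,
    seed: int = 42,
) -> Tuple[List, List, List]:
    """Shuffle rows and cut them into train, validation and test lists."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test split")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test split
    )


train_rows, validation_rows, test_rows = split_dataset(range(1000))
print(len(train_rows), len(validation_rows), len(test_rows))
```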

In the training step, you will fine-tune the BERT model to the customer reviews dataset and add a new classification layer to predict the sentiment for a given review_body.

In the evaluation step, you will take the trained model and a test dataset as input, and produce a JSON file containing classification evaluation metrics.
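A minimal sketch of such an evaluation step, assuming accuracy is the metric of interest and using a hypothetical JSON layout (`metrics.accuracy.value`). The real evaluation script would first run the trained model on the test dataset to obtain the predictions.

```python
import json
from typing import Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> str:
    """Compute accuracy and return it as a JSON report string."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(truth == prediction for truth, prediction in zip(y_true, y_pred))
    # The nesting below is an assumed layout, chosen only for this sketch.
    report = {"metrics": {"accuracy": {"value": correct / len(y_true)}}}
    return json.dumps(report)


# Four test reviews, three of them classified correctly.
evaluation_json = evaluate([1, 0, -1, 1], [1, 0, 1, 1])
print(evaluation_json)
```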

In the condition step, you will register the trained model if the accuracy of the model, as determined by our evaluation step, exceeds a given threshold value.
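The accuracy gate boils down to one comparison. In this sketch the evaluation report is assumed to be a JSON document with the accuracy stored under `metrics.accuracy.value` (a hypothetical layout), and `should_register` is a made-up helper name. "Exceeds" is read as strictly greater than the threshold.

```python
import json


def should_register(evaluation_json: str, min_accuracy: float) -> bool:
    """Return True only if the reported accuracy exceeds the threshold."""
    report = json.loads(evaluation_json)
    accuracy = report["metrics"]["accuracy"]["value"]
    return accuracy > min_accuracy


# Hypothetical report from an evaluation step.
example_report = '{"metrics": {"accuracy": {"value": 0.75}}}'
print(should_register(example_report, min_accuracy=0.70))
print(should_register(example_report, min_accuracy=0.75))
```

With a 0.75 accuracy the model passes a 0.70 threshold but not a 0.75 one, since the comparison is strict.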
