One of the most time-consuming parts of machine learning is cleaning up and analyzing your data. Whether you’re dealing with missing data or inconsistent formatting, or just want to visualize what you’re working with, these steps can be tedious.

SageMaker Data Wrangler helps tackle that problem by providing an easy-to-use interface to import, analyze and transform your data. Ultimately, you can export the flow or the transformed data so it can be reused or incorporated into the broader machine learning lifecycle.

In this hands-on tutorial, I give a brief overview of where Data Wrangler fits into the ML ecosystem, and then show you where to get the Titanic dataset (https://www.openml.org/d/40945) that we’re using for the demo. Then, launching SageMaker Studio, I show you how to create a new Data Wrangler Flow, which is the pipeline that “holds” the other steps. From there, we import the Titanic dataset, analyze and transform it, and then look at options for exporting.

Links and code used in this video:
• Titanic dataset: https://www.openml.org/d/40945
• Python Pandas and PySpark SQL code used in the demo: https://docs.google.com/document/d/1TFibtT8cnw2hkX64f-h0yOxnpIcmhwacrMbo9WTGohk/edit?usp=sharing
• You might also be interested in Getting Started with SageMaker Studio: https://youtu.be/91z9s7iboeM

If you’re interested in getting AWS certifications, check out these full courses. They include lots of hands-on demos, quizzes and full practice exams. Use FRIENDS10 for a 10% discount!
– AWS Certified Cloud Practitioner: https://academy.zerotomastery.io/a/aff_n20ghyn4/external?affcode=441520_lm7gzk-d
– AWS Certified Solutions Architect Associate: https://academy.zerotomastery.io/a/aff_464yrtnn/external?affcode=441520_lm7gzk-d

00:00 – A brief look at the need for SageMaker Data Wrangler
02:15 – Overviewing the Titanic passenger survival dataset used in this video
04:19 – Launching SageMaker Studio
04:40 – Creating a new Data Wrangler Flow
05:45 – Uploading the Titanic dataset to an S3 bucket
06:39 – Importing the Titanic dataset from S3 into Data Wrangler
07:17 – Editing data types in Data Wrangler
07:48 – Adding data analysis in Data Wrangler
08:18 – Adding a table summary analysis in Data Wrangler
09:38 – Adding a histogram analysis in Data Wrangler
10:45 – Adding data transforms in Data Wrangler
11:02 – Adding a data transform to drop columns in Data Wrangler
12:12 – Adding a data transform to handle missing data in Data Wrangler
12:55 – Adding a custom transform with Python (Pandas) in Data Wrangler
13:30 – Encoding data using a custom transform with Python (Pandas) in Data Wrangler (see the Pandas sketch after the timestamps)
15:04 – Adding a custom transform with PySpark SQL in Data Wrangler (see the SQL sketch after the timestamps)
15:40 – Exporting a Data Wrangler Flow
17:48 – IMPORTANT! Deleting your SageMaker resources
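
For reference, here’s a minimal sketch of the kind of Python (Pandas) custom transform used around 12:55–13:30 (the exact demo code is in the Google Doc linked above). In a Data Wrangler custom transform set to Python (Pandas), the current dataset is typically exposed as a DataFrame named `df`, and the modified `df` becomes the step’s output; the column names here (“age”, “sex”) assume the OpenML Titanic dataset, so adjust them to match your own flow.

    # Sketch of a Python (Pandas) custom transform for the Titanic data.
    # Assumption: Data Wrangler provides the current dataset as a DataFrame
    # named `df`, and the transformed `df` flows to the next step.
    import pandas as pd  # usually already available in the transform environment

    # Fill missing ages with the median age (similar in spirit to the
    # built-in "Handle missing" transform shown at 12:12).
    df["age"] = df["age"].fillna(df["age"].median())

    # One-hot encode the categorical "sex" column into numeric indicator
    # columns (the encoding step at 13:30).
    df = pd.get_dummies(df, columns=["sex"])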
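
And a rough sketch of a SQL (PySpark SQL) custom transform like the one at 15:04. Here the current dataset is typically referenced as a table named `df`, and the query result becomes the new dataset; again, the column names assume the Titanic data and the actual demo code is in the linked doc.

    -- Keep a few modeling-relevant columns and drop rows with no age.
    -- Assumption: the current dataset is queryable as the table `df`.
    SELECT survived, pclass, sex, age, fare, embarked
    FROM df
    WHERE age IS NOT NULL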
