
Amazon SageMaker Data Wrangler: Simplify Data Preparation for Machine Learning

This article provides a comprehensive guide on how to access and utilize Amazon SageMaker Data Wrangler, covering prerequisites, data preparation, and model training using the Titanic dataset. It includes step-by-step instructions for importing data, applying transformations, and exporting data flows.
  • main points
    1. Detailed step-by-step instructions for using Data Wrangler
    2. Practical examples using the Titanic dataset
    3. Comprehensive coverage of data preparation and model training
  • unique insights
    1. Integration of Data Wrangler with Amazon S3 for data import
    2. Use of built-in transformations and custom Python code for data cleaning
  • practical applications
    • The article provides practical guidance for preparing data for machine learning effectively, making it valuable for both beginners and experienced users.
  • key topics
    1. Data preparation using Data Wrangler
    2. Model training with XGBoost
    3. Integration with Amazon S3
  • key insights
    1. Hands-on tutorial with a real dataset
    2. Clear instructions for both novice and advanced users
    3. Focus on practical applications of data preparation tools
  • learning outcomes
    1. Understanding how to access and use Amazon SageMaker Data Wrangler
    2. Ability to prepare data for machine learning models
    3. Knowledge of integrating Data Wrangler with AWS services

Introduction to Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler is a powerful tool within Amazon SageMaker Studio Classic designed to streamline and simplify the data preparation process for machine learning (ML) projects. It provides a user-friendly, visual interface that allows data scientists and ML engineers to efficiently import, analyze, transform, and export data. By using Data Wrangler, users can significantly reduce the time and effort required to prepare data, enabling them to focus more on model development and deployment. This comprehensive guide will walk you through the essential aspects of Data Wrangler, from setting it up to leveraging its advanced features for data manipulation and model training.

Prerequisites for Using Data Wrangler

Before you can start using Amazon SageMaker Data Wrangler, make sure you have met the following prerequisites:

1. **Amazon EC2 Instance**: You need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance. Refer to the AWS documentation for available instance types and how to request increased quotas if needed.
2. **Security and Permissions**: Configure the permissions outlined in the security and permissions documentation. This ensures that you have the access rights required to use Data Wrangler and related AWS services.
3. **Firewall Access**: If your organization uses a firewall that blocks internet traffic, ensure that you have access to the following URLs:
   * `https://ui.prod-1.data-wrangler.sagemaker.aws/`
   * `https://ui.prod-2.data-wrangler.sagemaker.aws/`
   * `https://ui.prod-3.data-wrangler.sagemaker.aws/`
   * `https://ui.prod-4.data-wrangler.sagemaker.aws/`
4. **Active Studio Classic Instance**: You need an active Studio Classic instance. Follow the instructions in the Amazon SageMaker AI Domain Overview to launch a new instance if you don't already have one, and ensure that the KernelGateway application is in a 'Ready' state before proceeding.
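If you want to confirm the KernelGateway application's state from code rather than the console, here is a minimal sketch using boto3. The domain ID and user profile name below are hypothetical placeholders, and the console's 'Ready' state corresponds to an `InService` status in the API:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical identifiers; replace with your own domain and user profile.
response = sagemaker.list_apps(
    DomainIdEquals="d-xxxxxxxxxxxx",
    UserProfileNameEquals="my-user-profile",
)
for app in response["Apps"]:
    # The KernelGateway app should report "InService" (shown as 'Ready'
    # in the console) before you launch Data Wrangler.
    print(app["AppType"], app["AppName"], app["Status"])
```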

Accessing Data Wrangler in SageMaker Studio Classic

Once you have completed the prerequisites, you can access Data Wrangler within SageMaker Studio Classic as follows:

1. **Log in to Studio Classic**: Use your credentials to log in to SageMaker Studio Classic. Refer to the Amazon SageMaker AI Domain Overview for more information.
2. **Select Studio**: Navigate to the Studio interface.
3. **Launch Application**: Choose 'Studio' from the application dropdown list.
4. **Go to Home**: Select the home icon to access the main dashboard.
5. **Choose Data**: Click on the 'Data' option.
6. **Select Data Wrangler**: Choose 'Data Wrangler' to launch the application.

Alternatively, you can create a new Data Wrangler flow:

1. **Select File**: In the top navigation bar, choose 'File'.
2. **Choose New**: Select 'New'.
3. **Select Data Wrangler Flow**: Choose 'Data Wrangler Flow'.

You can rename the new directory and `.flow` file as needed. Note that the initial loading of Data Wrangler may take a few minutes, and a loading carousel might appear until the KernelGateway application is ready.

Exploring Data Wrangler Features: A Titanic Dataset Walkthrough

To help you understand how to use Data Wrangler, this section provides a walkthrough using the Titanic dataset. This dataset contains information about passengers on the Titanic, including their survival status, age, gender, and class. By following this walkthrough, you will learn how to import, analyze, transform, and export data using Data Wrangler.

**Steps in the Walkthrough:**

1. **Open Data Wrangler Flow**: Open a new Data Wrangler flow and choose to use a sample dataset, or upload the Titanic dataset to Amazon S3 and import it into Data Wrangler.
2. **Analyze the Dataset**: Use Data Wrangler's analysis tools to explore the dataset and gain insights.
3. **Define Data Flow**: Use Data Wrangler's data transformation features to define a data flow.
4. **Export the Flow**: Export your flow to a Jupyter notebook to create a Data Wrangler job.
5. **Process Data**: Process your data and start a SageMaker training job to train an XGBoost binary classifier.

Importing and Preparing Data with Data Wrangler

You can import the Titanic dataset into Data Wrangler using one of the following methods:

1. **Import Directly from Data Wrangler Flow**: Open the flow and select 'Use Sample Dataset'.
2. **Upload to Amazon S3**: Upload the dataset to an Amazon S3 bucket and then import it into Data Wrangler.

To upload the dataset to Amazon S3:

1. **Download the Titanic Dataset**: Download the Titanic dataset to your local machine.
2. **Upload to S3**: Upload the dataset to an Amazon S3 bucket in the AWS Region you intend to use for this demonstration. You can use the Amazon S3 console to drag and drop the file, or upload it programmatically (see the boto3 sketch at the end of this section).

Once the dataset is uploaded to Amazon S3, you can import it into Data Wrangler:

1. **Select Import Data**: In the data flow tab, select the 'Import Data' button or the 'Import' tab.
2. **Choose Amazon S3**: Select 'Amazon S3'.
3. **Locate the Dataset**: Use the import dataset table to find the bucket where you added the Titanic dataset. Select the CSV file to open the details pane.
4. **Configure Details**: Ensure the file type is CSV and check the box indicating that the first row is the header. You can also give the dataset a friendly name, such as 'Titanic-train'.
5. **Import**: Select the 'Import' button.

After importing the dataset, it will appear in the data flow tab. Double-click the node to enter the node details view, where you can add transformations or analyses.
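As an alternative to the console drag-and-drop, here is a minimal upload sketch using boto3; the bucket name, key, and local file path are placeholders to replace with your own values:

```python
import boto3

# Placeholder bucket, key, and local path; substitute your own values.
bucket = "my-sagemaker-demo-bucket"
key = "data-wrangler-demo/titanic.csv"

s3 = boto3.client("s3")
# Upload the Titanic CSV so Data Wrangler can import it from Amazon S3.
s3.upload_file(Filename="titanic.csv", Bucket=bucket, Key=key)
print(f"Uploaded to s3://{bucket}/{key}")
```

Make sure the bucket is in the same AWS Region you intend to use for the demonstration, as noted in the upload steps above.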

Analyzing and Visualizing Data

Data Wrangler provides built-in transformation and visualization capabilities to analyze, clean, and transform your data. The right panel in the node details view lists all built-in transformations and a section for adding custom transformations.

**Creating a Data Quality and Insights Report**

To gain insights into your data, create a data quality and insights report. This report helps you identify issues such as missing values and outliers, and alerts you to potential problems such as target leakage or class imbalance.

**Creating a Table Summary**

1. **Add Analysis**: Select the '+' next to the data type step in the data flow and choose 'Add Analysis'.
2. **Select Table Summary**: In the analysis area, choose 'Table Summary' from the dropdown list.
3. **Name the Summary**: Give the table summary a name.
4. **Preview**: Select 'Preview' to see a preview of the table.
5. **Save**: Select 'Save' to add it to your data flow. The analysis will appear under 'All Analyses'.

From the statistics provided, you can make observations such as the average fare and the presence of missing values in columns like 'cabin', 'embarked', and 'age'.
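For orientation, the kind of statistics the Table Summary surfaces can be approximated in plain pandas. A minimal sketch, assuming a local copy of the dataset saved as `titanic.csv` (this is a local approximation, not Data Wrangler's own output):

```python
import pandas as pd

# Assumes a local copy of the Titanic dataset named "titanic.csv".
df = pd.read_csv("titanic.csv")

# Per-column statistics, roughly what the Table Summary analysis reports.
print(df.describe(include="all"))

# Missing-value counts; expect gaps in 'cabin', 'embarked', and 'age'.
print(df.isnull().sum())
```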

Transforming Data with Data Wrangler

After analyzing your data, you can clean and prepare it for training by adding transformations to the data flow. Here are some common transformations:

**Deleting Unused Columns**

1. **Add Transformation**: Select the '+' next to the data type step in the data flow and choose 'Add Transformation'.
2. **Select Manage Columns**: In the 'All Steps' column, choose 'Add Step' and then select 'Manage Columns' from the standard transformations list. Ensure 'Drop column' is selected.
3. **Choose Columns to Delete**: Select the columns you don't want to use for training, such as 'cabin', 'ticket', 'name', 'sibsp', 'parch', 'home.dest', 'boat', and 'body'.
4. **Preview and Add**: Select 'Preview' to verify the columns are removed, then select 'Add'.

**Cleaning Missing Values**

1. **Select Handle Missing Values**: Choose 'Handle missing values'.
2. **Choose Drop Missing Values**: Select 'Drop missing values' as the transformer.
3. **Select Input Column**: Choose the column with missing values, such as 'age'.
4. **Preview and Add**: Select 'Preview' to see the new data frame, then select 'Add' to add the transformation to your flow.

**Custom Transformations with Pandas**

You can also use custom transformations with Pandas to perform more complex data manipulations. For example, you can one-hot encode the categorical columns:

```python
import pandas as pd

# One-hot encode the categorical columns and append the result to df.
dummies = []
cols = ['pclass', 'sex', 'embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
encoded = pd.concat(dummies, axis=1)
df = pd.concat((df, encoded), axis=1)
```

**Custom Transformations with SQL**

You can use SQL to select specific columns for further analysis:

```sql
SELECT survived, age, fare, 1, 2, 3, female, male, C, Q, S FROM df;
```
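For reference, the two built-in transforms at the start of this section ('Drop column' and 'Drop missing values') correspond roughly to the following pandas operations. This is a local sketch that assumes the dataset is loaded as `df` with the column names used in this walkthrough; it is not the code Data Wrangler generates:

```python
import pandas as pd

df = pd.read_csv("titanic.csv")  # assumed local copy of the dataset

# Rough equivalent of the 'Drop column' transform.
df = df.drop(columns=["cabin", "ticket", "name", "sibsp",
                      "parch", "home.dest", "boat", "body"])

# Rough equivalent of 'Drop missing values' on the 'age' input column.
df = df.dropna(subset=["age"])
```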

Exporting Data Flows and Integrating with SageMaker

Once you have created your data flow, you can export it for further use. A common option is to export it to a Data Wrangler job notebook. This automatically creates a Jupyter notebook configured to run a SageMaker Processing job that executes your Data Wrangler data flow.

**Exporting to a Data Wrangler Job Notebook**

1. **Save the Data Flow**: Select 'File' and then 'Save Data Wrangler Flow'.
2. **Return to the Data Flow Tab**: Go back to the data flow tab and select the last step in your data flow.
3. **Select Export**: Choose 'Export' and then 'Amazon S3 (via Jupyter Notebook)'. This opens a Jupyter notebook.
4. **Select Kernel**: Choose any Python 3 (Data Science) kernel.
5. **Run the Notebook**: Run the cells in the notebook until you reach the 'Kick off Training SageMaker Job (optional)' section.

You can monitor the status of your Data Wrangler job in the 'Processing' tab of the SageMaker AI console, and you can also use Amazon CloudWatch to monitor the job.
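If you prefer to check job status programmatically rather than in the console, here is a minimal sketch using boto3; the listing parameters are just one reasonable choice, and the exact job name is printed by the export notebook:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# List the most recent SageMaker Processing jobs; the Data Wrangler job
# created by the export notebook will appear here once it starts.
response = sagemaker.list_processing_jobs(
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=5,
)
for job in response["ProcessingJobSummaries"]:
    print(job["ProcessingJobName"], job["ProcessingJobStatus"])
```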

Training an XGBoost Classifier with Prepared Data

After preparing your data with Data Wrangler, you can train an XGBoost binary classifier using either a Jupyter notebook or Amazon SageMaker Autopilot. Autopilot can automatically train and optimize models on the data transformed directly from your Data Wrangler flow.

**Training with a Jupyter Notebook**

In the same notebook where you launched the Data Wrangler job, you can extract the prepared data and train an XGBoost binary classifier with minimal additional data preparation.

1. **Upgrade Necessary Modules**: Use pip to upgrade the necessary modules, and remove the `_SUCCESS` marker file so it is not read as data:

```bash
! pip install --upgrade awscli awswrangler boto3 scikit-learn
! aws s3 rm {output_path} --recursive --exclude "*" --include "*_SUCCESS*"
```

2. **Read Data from Amazon S3**: Use awswrangler to recursively read all CSV files from the S3 prefix, then split the data into features and labels:

```python
import awswrangler as wr

df = wr.s3.read_csv(path=output_path, dataset=True)
X, y = df.iloc[:, :-1], df.iloc[:, -1]
```

3. **Create DMatrices and Perform Cross-Validation**: Create DMatrices (the native data structure for XGBoost) and run cross-validation with the binary-classification objective:

```python
import xgboost as xgb

dmatrix = xgb.DMatrix(data=X, label=y)
params = {
    "objective": "binary:logistic",
    "learning_rate": 0.1,
    "max_depth": 5,
    "alpha": 10,
}
xgb.cv(
    dtrain=dmatrix,
    params=params,
    nfold=3,
    num_boost_round=50,
    early_stopping_rounds=10,
    metrics="rmse",
    as_pandas=True,
    seed=123,
)
```
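Continuing from the snippet above, a natural next step (not shown in the original walkthrough) is to train a final booster once cross-validation looks reasonable. The `num_boost_round` value below is an assumed placeholder; in practice, take it from the best iteration observed during cross-validation:

```python
import xgboost as xgb

# Train a final booster on the full DMatrix; num_boost_round is an
# assumed value, ideally taken from the best cross-validation iteration.
model = xgb.train(params=params, dtrain=dmatrix, num_boost_round=50)

# Predicted survival probabilities for the training rows.
preds = model.predict(dmatrix)
print(preds[:5])
```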

Updating and Closing Data Wrangler

To get the latest features and fixes, update the Data Wrangler Studio Classic application regularly; refer to the documentation on closing and updating Studio Classic applications. When you have finished using Data Wrangler, shut down the running instances to avoid incurring additional costs; refer to the documentation on closing Data Wrangler for instructions on shutting down the application and its associated instances.

 Original link: https://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/data-wrangler-getting-started.html
