Mastering Exploratory Data Analysis for the AWS Machine Learning Specialty Exam: Part II - EDA

Introduction:
Welcome back to our comprehensive guide series for mastering the AWS Machine Learning Specialty exam. In our first blog, we delved into Domain 1: Data Engineering. Now we turn our focus to Domain 2: Exploratory Data Analysis (EDA). This domain is crucial because it covers preparing, analyzing, and visualizing data, the essential steps that come before modeling, and understanding these concepts will significantly enhance your ability to work with machine learning datasets.

EDA is a critical step in the machine learning workflow. It involves examining and visualizing data to understand its structure, detect anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. In this blog, we cover everything you need to know for Domain 2 of the AWS Machine Learning Specialty exam, including data sanitization, feature engineering, and data analysis and visualization using AWS tools.

Sanitize and Prepare Data for Modeling:
Sanitizing data means cleaning and preprocessing raw data so it is in the best possible shape for analysis and modeling. Real-world data often contains noise, missing values, duplicates, and inconsistencies that can negatively impact model performance, so this step focuses on handling missing values, removing duplicates, and managing outliers.
Techniques:
1. Handling Missing Values: Missing values can be managed by imputing them with the mean, median, or mode, or by removing the rows or columns that contain them.
- Mean/Mode Imputation: Replace missing values with the mean (for numerical data) or mode (for categorical data).
- Forward/Backward Fill: Fill missing values using the previous or next observed value (see the sketch after the imputation example below).
- Deletion: Remove rows or columns with significant missing values if they are not critical for the analysis.
Example: Suppose you have a customer transaction dataset with missing values in the “Age” column. You can use mean imputation to fill in the missing ages with the average age of the dataset.
import pandas as pd
# Load the raw transaction data
data = pd.read_csv('customer_purchases.csv')
# Fill missing ages with the mean and missing incomes with the median
data['age'] = data['age'].fillna(data['age'].mean())
data['income'] = data['income'].fillna(data['income'].median())
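Forward/backward fill and deletion can be handled just as briefly. Here is a minimal sketch, assuming the rows can be ordered by a 'transaction_date' column (the column names are illustrative):
# Sort so that fills propagate in a meaningful order
data = data.sort_values('transaction_date')
# Forward fill: carry the last observed income forward
data['income'] = data['income'].ffill()
# Backward fill: use the next observed value for any remaining gaps
data['income'] = data['income'].bfill()
# Deletion: drop any rows where income is still missing
data = data.dropna(subset=['income'])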
2. Removing Duplicates: Duplicate records can skew analysis results, so identifying and removing them is essential.
Example: In an online retail dataset, you might find duplicate entries for the same transaction due to data collection errors. Removing these duplicates ensures an accurate sales analysis.
# Drop exact duplicate rows
data.drop_duplicates(inplace=True)
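If duplicates are defined only by certain columns, you can deduplicate on that subset instead; a brief sketch, assuming a hypothetical 'transaction_id' column:
# Keep the first occurrence of each transaction ID (column name is illustrative)
data.drop_duplicates(subset=['transaction_id'], keep='first', inplace=True)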
3. Outlier Detection: Identify and handle outliers using statistical methods like the Z-score or interquartile range (IQR). Outliers can be treated by capping, flooring, or using more sophisticated techniques such as clustering.
Example: In a dataset of housing prices, an extremely high or low price compared to the median might be an outlier. Using the IQR method, you can identify these outliers and decide whether to remove or transform them.
Example:
# IQR method: keep values within 1.5 * IQR of the first and third quartiles
q1 = data["income"].quantile(0.25)
q3 = data["income"].quantile(0.75)
iqr = q3 - q1
data = data[(data["income"] >= q1 - 1.5 * iqr) & (data["income"] <= q3 + 1.5 * iqr)]
Perform Feature Engineering:
Feature engineering involves creating new features from existing ones to improve the performance of machine learning models. This includes creating interaction terms, polynomial features, or aggregating features.
Creating New Features: For a dataset containing transaction timestamps, you might extract new features such as the day of the week, hour of the day, and whether the transaction occurred on a weekend.
Example: Generate new features from existing ones.
# Parse the timestamp and derive calendar-based features
data['transaction_date'] = pd.to_datetime(data['transaction_date'])
data['day_of_week'] = data['transaction_date'].dt.dayofweek
data['hour_of_day'] = data['transaction_date'].dt.hour
data['is_weekend'] = data['day_of_week'].isin([5, 6]).astype(int)
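The interaction terms and aggregated features mentioned earlier can be built the same way. A minimal sketch, assuming hypothetical 'customer_id' and 'amount' columns:
# Interaction term: combine two existing numeric features
data['age_income_interaction'] = data['age'] * data['income']
# Aggregated feature: total and average spend per customer
customer_stats = data.groupby('customer_id')['amount'].agg(['sum', 'mean']).reset_index()
customer_stats.columns = ['customer_id', 'total_spend', 'avg_spend']
data = data.merge(customer_stats, on='customer_id', how='left')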
Transforming Features: Feature transformation includes scaling, normalizing, and encoding categorical variables. Tools like AWS Glue, Amazon SageMaker, and Data Wrangler make these tasks easier.
Example:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Standardize income to zero mean and unit variance
scaler = StandardScaler()
data['income_scaled'] = scaler.fit_transform(data[['income']])
# One-hot encode the day of the week
encoder = OneHotEncoder(sparse_output=False)
encoded_features = encoder.fit_transform(data[['day_of_week']])
encoded_df = pd.DataFrame(encoded_features, columns=encoder.get_feature_names_out(['day_of_week']), index=data.index)
data = pd.concat([data, encoded_df], axis=1)
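Normalization (rescaling values to a fixed range) is often used alongside standardization; a brief sketch using scikit-learn's MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
# Rescale income to the [0, 1] range
minmax = MinMaxScaler()
data['income_normalized'] = minmax.fit_transform(data[['income']])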
Analyze and Visualize Data for Machine Learning:
Analyzing Data: Descriptive statistics and correlation analysis are essential for understanding data distributions and relationships between variables.
Descriptive Statistics: Include measures such as mean, median, mode, standard deviation, and percentiles.
Example:
print(data.describe())
Correlation Analysis: This helps in understanding the linear relationship between variables using correlation coefficients.
Example:
# Compute pairwise correlations between the numeric columns
correlation_matrix = data.corr(numeric_only=True)
print(correlation_matrix)
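A heatmap of the correlation matrix is a common way to scan these relationships at a glance; a small sketch using seaborn:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the correlation matrix as an annotated heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()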
Visualizing Data: Visualization tools are vital for EDA. AWS provides several services to facilitate data visualization:
Amazon QuickSight:
For creating interactive dashboards and visualizations.

SageMaker Studio and SageMaker Data Wrangler:
For detailed data exploration and visualization within Jupyter notebooks.
Example: In an Amazon SageMaker Studio notebook, you can visualize data distributions and relationships:
import seaborn as sns
import matplotlib.pyplot as plt
# Visualize distribution of income
sns.histplot(data['income'], kde=True)
plt.title('Income Distribution')
plt.show()
# Visualize relationship between age and income
sns.scatterplot(x='age', y='income', data=data)
plt.title('Age vs. Income')
plt.show()
AWS Tools for Exploratory Data Analysis:

Amazon SageMaker:
- SageMaker Studio: An integrated development environment for data exploration and visualization.
- SageMaker Data Wrangler: Simplifies data preparation and feature engineering.
AWS Glue:
- AWS Glue DataBrew: A visual data preparation tool for cleaning and normalizing data.
- AWS Glue ETL Jobs: Automates data transformation tasks.
Amazon Athena:
- Athena: Allows querying data directly in Amazon S3 using standard SQL (see the example after the Kinesis snippet below).

Amazon Redshift:
- Redshift: For running complex queries on large datasets.
AWS Cloud9:
- Cloud9: An IDE for developing data analysis scripts.
AWS Lambda:
- Lambda: For running small data preprocessing and transformation scripts.
Amazon EMR:
- EMR (Elastic MapReduce): Provides a managed Hadoop framework for large-scale data processing.
Amazon RDS:
- RDS (Relational Database Service): For storing structured data and running SQL queries.
Amazon Kinesis:
- Kinesis Data Streams: For capturing, processing, and analyzing real-time streaming data.
- Kinesis Data Analytics: For running real-time analytics on data streams using SQL.
- Kinesis Data Firehose: For loading streaming data into AWS data stores.
Example Using Amazon Kinesis:
import boto3
# Create a Kinesis client
kinesis = boto3.client('kinesis', region_name='us-west-2')
# Put a data record into the stream
response = kinesis.put_record(
    StreamName='my-data-stream',
    Data=b'Hello, Kinesis!',
    PartitionKey='partitionkey'
)
print(response)
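Example Using Amazon Athena:
As a rough sketch, you can start a SQL query against data in S3 from boto3; the database, table, and result bucket names below are placeholders:
import boto3
# Create an Athena client
athena = boto3.client('athena', region_name='us-west-2')
# Start a SQL query against a table defined in the Glue Data Catalog
response = athena.start_query_execution(
    QueryString='SELECT customer_id, SUM(amount) AS total_spend FROM transactions GROUP BY customer_id',
    QueryExecutionContext={'Database': 'my_analytics_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'}
)
print(response['QueryExecutionId'])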
Key Takeaways for Readers:
- Data Cleaning: Learn techniques for handling missing values, removing duplicates, and managing outliers to ensure your data is ready for analysis.
- Feature Engineering: Understand how to create new features and transform existing ones to enhance your machine learning models.
- Data Analysis: Utilize descriptive statistics and correlation analysis to gain insights into your data.
- Data Visualization: Leverage AWS tools like Amazon QuickSight, SageMaker Studio, and Data Wrangler for effective data visualization.
- AWS Tools: Familiarize yourself with a variety of AWS tools that support EDA, including AWS Glue, Amazon Athena, Amazon Redshift, AWS Cloud9, AWS Lambda, Amazon EMR, and Amazon Kinesis.
Conclusion:
EDA is a foundational aspect of machine learning that involves cleaning, transforming, analyzing, and visualizing data. AWS provides a comprehensive suite of tools to support EDA, making it easier to prepare and understand your data. By mastering these tools and techniques, you’ll be well-prepared for the AWS Machine Learning Specialty exam and equipped to handle real-world machine learning projects.
References:
- Amazon SageMaker: Amazon SageMaker Documentation
- AWS Glue: AWS Glue Documentation
- Amazon QuickSight: Amazon QuickSight Documentation
- Amazon Athena: Amazon Athena Documentation
- Amazon Redshift: Amazon Redshift Documentation
- AWS Cloud9: AWS Cloud9 Documentation
- AWS Lambda: AWS Lambda Documentation
- Amazon EMR: Amazon EMR Documentation
- Amazon RDS: Amazon RDS Documentation
- Amazon Kinesis: Amazon Kinesis Documentation
Stay tuned for our next blog, where we’ll dive into Domain 3: Modeling, covering techniques and best practices for building effective machine learning models. Happy studying!