Data Lab Manual
AI Masterclass: Practical Exercises for Manufacturing Data Analysis
Module Overview
This manual provides hands-on experience with real manufacturing datasets. Each lab follows a consistent structure designed for practical learning. By the end of this module, you will be able to:
- Load and explore industrial datasets using Python and pandas
- Perform statistical analysis on manufacturing process data
- Visualize torque-angle curves and vibration patterns
- Prepare datasets for machine learning applications
- Interpret analytical results in a business context
- Build basic classification models for quality prediction
- Make data-driven decisions for quality control and predictive maintenance
- Communicate analytical findings to plant operations teams
Prerequisites:
- Basic Python knowledge (or willingness to learn)
- Python 3.x environment with pandas, numpy, matplotlib, scikit-learn
- OR KNIME Analytics Platform installed
- Understanding of basic statistics (mean, standard deviation, correlation)
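The software prerequisites can be verified up front with a short check script (an illustrative sketch; note that scikit-learn is imported as `sklearn` but installed under the name `scikit-learn`):

```python
import importlib

# Import name -> pip install name (scikit-learn installs as 'sklearn').
packages = {"pandas": "pandas", "numpy": "numpy",
            "matplotlib": "matplotlib", "sklearn": "scikit-learn"}

for import_name, pip_name in packages.items():
    try:
        mod = importlib.import_module(import_name)
        print(f"{import_name}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{import_name}: missing - run 'pip install {pip_name}'")
```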
| Dataset | Rows | Columns | Size | Purpose |
|---|---|---|---|---|
| conditionMonitoring.csv | 2,000 | 69 | 1.6 MB | Vibration analysis |
| processTemperature.xlsx | 200 | 5 | Small | Thermal modeling |
| angleTorque.csv | 4,000 | 1,002 | 19.9 MB | Torque-angle curves |
| processData.csv | 4,000 | 26 | 588 KB | ML features |
| breakaway.csv | 219 | 3 | ~20 KB | Quality metrics |
Lab Exercises
Vibration Analysis (Beginner)
Analyze frequency-domain vibration data from production machinery to detect equipment health issues.
Temperature Analysis (Beginner)
Build regression models to predict and optimize process temperatures in manufacturing operations.
Torque-Angle Curves (Intermediate)
Analyze tightening operation quality through torque-angle curve feature extraction and visualization.
ML Classification (Intermediate)
Build Random Forest classifiers for quality prediction using engineered features from process data.
Quality Control (Beginner)
Apply Statistical Process Control (SPC) methods including control charts and capability indices.
Capstone Project (Advanced)
Multi-dataset integration challenge combining insights from all labs.
Condition Monitoring - Vibration Analysis
Vibration monitoring is a cornerstone of predictive maintenance in manufacturing. By analyzing frequency-domain vibration data from accelerometers mounted on production machinery, we can detect early signs of bearing wear, misalignment, imbalance, and other mechanical faults before they cause costly failures or quality defects.
- Understand frequency-domain vibration analysis
- Identify patterns across X, Y, Z acceleration axes
- Calculate statistical features for anomaly detection
- Visualize vibration spectra
- Interpret vibration signatures in manufacturing context
- Build simple anomaly detection using statistical methods
| Column Name | Type | Description | Units | Range |
|---|---|---|---|---|
| Condition | String | Machine operating state | Categorical | Off, On, etc. |
| xAcc010Hz - xAcc120Hz | Float | X-axis acceleration at frequencies 10-120 Hz | m/s² | 0-20 |
| yAcc010Hz - yAcc120Hz | Float | Y-axis acceleration at frequencies 10-120 Hz | m/s² | 0-15 |
| zAcc010Hz - zAcc120Hz | Float | Z-axis acceleration at frequencies 10-120 Hz | m/s² | 0-170 |
Step 1: Load and Explore the Data
Load the dataset and perform basic exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset
df = pd.read_csv('Dataset - conditionMonitoring.csv')
# Basic exploration
print(f"Dataset Shape: {df.shape}")
print(f"\nColumn Names:\n{df.columns.tolist()}")
print(f"\nData Types:\n{df.dtypes}")
print(f"\nFirst 5 Rows:\n{df.head()}")
print(f"\nBasic Statistics:\n{df.describe()}")
Step 2: Understand the Column Structure
Separate columns by axis and extract frequencies
# Separate columns by axis
x_cols = [col for col in df.columns if col.startswith('xAcc')]
y_cols = [col for col in df.columns if col.startswith('yAcc')]
z_cols = [col for col in df.columns if col.startswith('zAcc')]
print(f"X-axis columns: {len(x_cols)}")
print(f"Y-axis columns: {len(y_cols)}")
print(f"Z-axis columns: {len(z_cols)}")
# Extract frequencies from column names
frequencies = [int(col[4:7]) for col in x_cols]
print(f"\nFrequencies (Hz): {frequencies}")
# Examine the Condition column
print(f"\nCondition values:\n{df['Condition'].value_counts()}")
Step 3: Visualize Vibration Spectrum
Plot average vibration spectrum for each axis
# Plot average vibration spectrum for each axis
fig, axes = plt.subplots(3, 1, figsize=(12, 10))
# Calculate mean values for each frequency
x_means = df[x_cols].mean().values
y_means = df[y_cols].mean().values
z_means = df[z_cols].mean().values
# X-axis spectrum
axes[0].bar(frequencies, x_means, color='blue', alpha=0.7)
axes[0].set_title('X-Axis Average Vibration Spectrum')
axes[0].set_xlabel('Frequency (Hz)')
axes[0].set_ylabel('Acceleration (m/s²)')
# Y-axis spectrum
axes[1].bar(frequencies, y_means, color='green', alpha=0.7)
axes[1].set_title('Y-Axis Average Vibration Spectrum')
axes[1].set_xlabel('Frequency (Hz)')
axes[1].set_ylabel('Acceleration (m/s²)')
# Z-axis spectrum
axes[2].bar(frequencies, z_means, color='red', alpha=0.7)
axes[2].set_title('Z-Axis Average Vibration Spectrum')
axes[2].set_xlabel('Frequency (Hz)')
axes[2].set_ylabel('Acceleration (m/s²)')
plt.tight_layout()
plt.show()
Step 4: Statistical Feature Extraction
Calculate statistical features for anomaly detection
# Calculate statistical features for each sample
def extract_vibration_features(row, axis_cols):
    values = row[axis_cols].values
    return {
        'mean': np.mean(values),
        'std': np.std(values),
        'max': np.max(values),
        'min': np.min(values),
        'range': np.max(values) - np.min(values),
        'rms': np.sqrt(np.mean(values**2))
    }
# Extract features for X-axis
x_features = df.apply(lambda row: extract_vibration_features(row, x_cols), axis=1)
x_features_df = pd.DataFrame(x_features.tolist())
x_features_df.columns = ['x_' + col for col in x_features_df.columns]
print("X-axis Feature Statistics:")
print(x_features_df.describe())
Step 5: Anomaly Detection Exercise
Simple anomaly detection using Z-score
from scipy import stats
# Calculate overall vibration energy (RMS across all frequencies)
df['total_rms'] = np.sqrt((df[x_cols]**2).mean(axis=1) +
                          (df[y_cols]**2).mean(axis=1) +
                          (df[z_cols]**2).mean(axis=1))
# Calculate Z-scores
df['rms_zscore'] = stats.zscore(df['total_rms'])
# Identify potential anomalies (|Z| > 2)
anomalies = df[np.abs(df['rms_zscore']) > 2]
print(f"Number of potential anomalies: {len(anomalies)}")
print(f"Anomaly percentage: {len(anomalies)/len(df)*100:.2f}%")
Understanding Normal Results
- X and Y axes: Low, consistent readings (typically 2-10 m/s²)
- Z-axis: Dominated by gravity component at low frequencies (~150 m/s² at 10 Hz)
- Standard deviation: Relatively low within each frequency band (< 5 m/s² variation)
Problem Pattern Indicators
| Pattern | Cause | Action |
|---|---|---|
| Elevated readings at 1x running speed | Imbalance | Schedule balancing |
| Peaks at 2x running speed | Misalignment | Check coupling alignment |
| Multiple harmonic peaks | Bearing defect | Inspect bearing immediately |
| Broadband increase across all frequencies | Looseness | Check mounting bolts |
| Gradual increase over time | Wear progression | Monitor closely |
| Error | Cause | Solution |
|---|---|---|
| FileNotFoundError | Incorrect file path | Verify file location with os.listdir() |
| KeyError: 'xAcc010Hz' | Column name mismatch | Check exact column names with df.columns |
| ValueError: cannot convert float NaN | Missing data | Use df.dropna() or df.fillna(0) |
| MemoryError | Dataset too large | Read in chunks or use dtype specifications |
Exercise 1: Compare vibration signatures between different machine conditions. Create a visualization showing how the frequency spectrum changes when the machine is On versus Off.
Exercise 2: Divide the frequency spectrum into bands (10-40 Hz, 45-80 Hz, 85-120 Hz) and calculate the energy contribution of each band.
Exercise 3: Build an Isolation Forest model to detect anomalies in the vibration data. Compare results with the Z-score method.
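As a starting point for the Isolation Forest exercise, the sketch below runs both detectors side by side on synthetic data shaped like the total_rms column from Step 5 (the synthetic values, anomaly cluster, and contamination rate are assumptions, not properties of the real dataset):

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for total_rms: mostly normal operation plus a small
# cluster of injected anomalies (all values assumed for illustration).
rng = np.random.default_rng(42)
total_rms = np.concatenate([rng.normal(5.0, 0.5, 1950),
                            rng.normal(12.0, 1.0, 50)])
df = pd.DataFrame({"total_rms": total_rms})

# Isolation Forest flags points that are unusually easy to isolate.
iso = IsolationForest(contamination=0.025, random_state=0)
df["iso_flag"] = iso.fit_predict(df[["total_rms"]]) == -1  # True = anomaly

# Z-score method from the lab, for comparison (same |Z| > 2 threshold).
df["z_flag"] = np.abs(stats.zscore(df["total_rms"])) > 2

agreement = (df["iso_flag"] == df["z_flag"]).mean()
print(f"Isolation Forest anomalies: {df['iso_flag'].sum()}")
print(f"Z-score anomalies: {df['z_flag'].sum()}")
print(f"Agreement: {agreement:.1%}")
```

Unlike the Z-score, Isolation Forest makes no normality assumption, so the two methods can disagree on borderline points.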
Key Takeaways
- Frequency-domain analysis reveals specific mechanical issues
- Different axes show different fault signatures
- Statistical features (RMS, std) enable automated monitoring
- Z-score is a simple but effective anomaly detection method
- Vibration monitoring prevents costly unplanned downtime
Process Temperature Analysis
Thermal management is critical in manufacturing processes. Understanding the relationship between power consumption, cooling systems, ambient conditions, and resulting process temperatures enables optimization of energy usage and early detection of cooling system degradation.
- Perform correlation analysis between process variables
- Build regression models to predict temperature
- Visualize multivariate relationships
- Identify optimal operating conditions
- Interpret regression coefficients in manufacturing context
| Column Name | Type | Description | Units | Range |
|---|---|---|---|---|
| id | Integer | Unique observation identifier | N/A | 1-200 |
| power_kW | Float | Electrical power consumption | kilowatts | 0.9-9.7 |
| fan_RPM | Float | Cooling fan rotational speed | RPM | 750-2720 |
| ambientTemp_C | Float | Surrounding air temperature | °C | 20.7-22.7 |
| processTemp_C | Float | Measured process temperature | °C | 36.6-67.3 |
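To preview the workflow before opening the real file, the sketch below fits a linear regression on synthetic data that mimics the schema above; the assumed physics (temperature rising with power, falling with fan speed) and all coefficients are illustrative, not measured:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data mimicking the schema above; for the real lab, load with
# pd.read_excel('processTemperature.xlsx') instead.
rng = np.random.default_rng(0)
n = 200
power = rng.uniform(0.9, 9.7, n)      # power_kW
fan = rng.uniform(750, 2720, n)       # fan_RPM
ambient = rng.uniform(20.7, 22.7, n)  # ambientTemp_C
# Assumed physics: temperature rises with power, falls with fan speed.
temp = 30 + 3.5 * power - 0.004 * fan + 0.8 * ambient + rng.normal(0, 1, n)
df = pd.DataFrame({"power_kW": power, "fan_RPM": fan,
                   "ambientTemp_C": ambient, "processTemp_C": temp})

# Correlation of every variable with the target
print(df.corr()["processTemp_C"].round(3))

# Fit and evaluate a linear regression
X = df[["power_kW", "fan_RPM", "ambientTemp_C"]]
y = df["processTemp_C"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"R²: {r2:.3f}, RMSE: {rmse:.2f} °C")

# Coefficients read as °C per unit change of each input
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:+.4f}")
```

The same coefficient reading applies to the real data: each coefficient is the predicted temperature change per unit of that input, holding the others fixed.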
| Error | Cause | Solution |
|---|---|---|
| openpyxl not found | Missing Excel library | pip install openpyxl |
| ValueError: Input contains NaN | Missing data in features | df.dropna() before splitting |
| Singular matrix | Multicollinearity | Remove highly correlated features |
Key Takeaways
- Correlation analysis reveals variable relationships
- Linear regression quantifies feature impacts
- Coefficient interpretation connects to physical meaning
- Model optimization enables energy savings
- R² and RMSE measure prediction quality
Torque-Angle Curve Analysis
In automotive and precision assembly, tightening operations must meet exact specifications. Torque-angle curves capture the complete signature of each tightening, enabling detection of defects that simple torque-only measurements miss.
- Load and visualize high-dimensional curve data
- Extract meaningful features from time-series curves
- Identify quality patterns in tightening operations
- Compare good vs defective tightening signatures
- Build classification models for quality prediction
| Column Name | Type | Description | Units | Range |
|---|---|---|---|---|
| Result | String | Quality outcome | Categorical | OK, NOK |
| MaxTorque | Float | Maximum torque achieved | Nm | Varies |
| Torque_0 to Torque_999 | Float | Torque at each angle step | Nm | 0-max |
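The feature-extraction idea can be sketched on synthetic curves that mimic the schema above. The curve shapes, defect rate, and the 5 Nm "snug" threshold are invented for illustration; for the lab, load the real file with pd.read_csv on angleTorque.csv instead:

```python
import numpy as np
import pandas as pd

# Synthetic torque-angle curves standing in for angleTorque.csv: OK curves
# run down at low torque and clamp up late, NOK curves ramp up too early.
rng = np.random.default_rng(1)
angles = np.arange(1000)

def make_curve(defective):
    ramp_start = 400 if defective else 700
    return np.where(angles < ramp_start,
                    2 + rng.normal(0, 0.1, 1000),     # rundown plateau
                    2 + 0.1 * (angles - ramp_start))  # clamp-up ramp

rows = []
for i in range(200):
    defective = (i % 10 == 0)  # ~10% NOK, an illustrative imbalance
    c = make_curve(defective)
    rows.append(["NOK" if defective else "OK", c.max(), *c])

cols = ["Result", "MaxTorque"] + [f"Torque_{i}" for i in range(1000)]
df = pd.DataFrame(rows, columns=cols)

# Collapse each 1,000-point curve into a handful of features
torque_cols = [f"Torque_{i}" for i in range(1000)]
features = pd.DataFrame({
    "max_torque": df[torque_cols].max(axis=1),
    "mean_torque": df[torque_cols].mean(axis=1),
    # First angle step where torque exceeds 5 Nm (assumed snug threshold)
    "snug_angle": (df[torque_cols].values > 5).argmax(axis=1),
})
print(features.groupby(df["Result"]).mean().round(2))
```

Grouping the extracted features by Result shows how a few well-chosen numbers can separate OK from NOK signatures without keeping all 1,000 points.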
| Error | Cause | Solution |
|---|---|---|
| MemoryError | Large dataset (19.9 MB) | Read in chunks: pd.read_csv(..., chunksize=1000) |
| Slow plotting | Too many curves | Sample subset: df.sample(100) |
Key Takeaways
- Torque-angle curves reveal hidden quality issues
- Feature extraction reduces dimensionality
- Curve shape analysis enables classification
- Real-time monitoring prevents defect escapes
- High-dimensional data requires careful handling
Engineered Features Classification
Machine learning classification enables automated quality prediction based on process parameters. This lab covers the complete workflow from feature analysis through model evaluation, with emphasis on handling imbalanced datasets common in manufacturing.
- Analyze pre-engineered feature sets
- Handle imbalanced classification problems
- Build and tune Random Forest models
- Interpret feature importance
- Evaluate with appropriate metrics
| Column Name | Type | Description | Units | Range |
|---|---|---|---|---|
| Result | String | Quality outcome | Categorical | OK, NOK |
| Feature_1 to Feature_25 | Float | Engineered process features | Various | Normalized |
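A minimal Random Forest workflow on synthetic data shaped like the schema above; the quality signal planted in Feature_1 and Feature_2, and the OK/NOK imbalance, are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for processData.csv: 25 features, imbalanced labels.
rng = np.random.default_rng(7)
n = 2000
X = pd.DataFrame(rng.normal(size=(n, 25)),
                 columns=[f"Feature_{i}" for i in range(1, 26)])
# Plant the quality signal in Feature_1/Feature_2 (illustrative only)
y = np.where(X["Feature_1"] + 0.5 * X["Feature_2"]
             + rng.normal(0, 0.5, n) > 2.2, "NOK", "OK")
print(pd.Series(y).value_counts())  # strong OK/NOK imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# class_weight='balanced' compensates for the scarce NOK class
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Feature importance points back at the process variables worth improving
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5).round(3))
```

Note the stratified split: with few NOK samples, an unstratified split can leave the test set with almost no defects, making the metrics meaningless.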
| Error | Cause | Solution |
|---|---|---|
| Class imbalance warning | Few NOK samples | Use class_weight='balanced' |
| Overfitting | Complex model | Increase min_samples_leaf, reduce max_depth |
Key Takeaways
- Class imbalance is common in quality data
- Random Forest handles non-linear relationships
- Feature importance guides process improvement
- Precision-recall tradeoff affects business decisions
- Cross-validation ensures robust evaluation
Breakaway Torque Quality Control
Statistical Process Control is fundamental to manufacturing quality. Control charts detect process shifts, while capability indices quantify how well a process meets specifications. This lab applies these classic methods to breakaway torque measurements.
- Create and interpret control charts
- Calculate process capability indices (Cp, Cpk)
- Apply Western Electric run rules
- Identify special cause variation
- Make process improvement recommendations
| Column Name | Type | Description | Units | Range |
|---|---|---|---|---|
| Sample | Integer | Sample identifier | N/A | 1-219 |
| Torque | Float | Measured breakaway torque | Nm | Varies |
| Specification | String | Pass/Fail status | Categorical | Pass, Fail |
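The control-chart and capability calculations can be sketched on synthetic torque data as follows. The mean, spread, and the 10.5-13.5 Nm specification limits are assumed for illustration; also note a textbook individuals chart estimates sigma from the average moving range, while the overall standard deviation is used here as a simpler approximation:

```python
import numpy as np
import pandas as pd

# Synthetic breakaway torque readings standing in for breakaway.csv
rng = np.random.default_rng(3)
df = pd.DataFrame({"Sample": np.arange(1, 220),
                   "Torque": rng.normal(12.0, 0.4, 219)})  # assumed Nm

# Individuals chart limits at mean ± 3 sigma
mean = df["Torque"].mean()
sigma = df["Torque"].std(ddof=1)
ucl, lcl = mean + 3 * sigma, mean - 3 * sigma
out = df[(df["Torque"] > ucl) | (df["Torque"] < lcl)]
print(f"Mean: {mean:.2f} Nm, UCL: {ucl:.2f}, LCL: {lcl:.2f}")
print(f"Out-of-control points: {len(out)}")

# Capability against assumed specification limits of 10.5-13.5 Nm
usl, lsl = 13.5, 10.5
cp = (usl - lsl) / (6 * sigma)
cpk = min(usl - mean, mean - lsl) / (3 * sigma)
print(f"Cp: {cp:.2f}, Cpk: {cpk:.2f}")

# Western Electric run rule: 8 consecutive points on one side of the mean
side = np.sign(df["Torque"] - mean)
run_len = side.groupby((side != side.shift()).cumsum()).transform("size")
print(f"Points inside runs of 8+: {(run_len >= 8).sum()}")
```

Cp measures spread against the spec width alone, while Cpk also penalizes off-center processes, which is why Cpk ≤ Cp always holds.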
| Error | Cause | Solution |
|---|---|---|
| Negative Cpk | Mean outside spec limits | Process centering required |
| Control limits too wide | High variation | Investigate assignable causes |
Key Takeaways
- Control charts detect process shifts early
- Cpk measures process capability vs specifications
- Run rules identify non-random patterns
- SPC enables proactive quality management
- Data-driven decisions improve consistency
Capstone Project
Multi-Dataset Integration Challenge
You are a data analyst at a production facility. Management has requested a comprehensive quality dashboard that integrates vibration monitoring, temperature control, and tightening quality data.
- Load at least 3 datasets from the lab exercises
- Create a unified analysis combining multiple data sources
- Build at least one predictive model
- Generate visualizations for executive presentation
- Document findings and recommendations
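A hypothetical skeleton for the loading step might look like this; the file names are assumed to match the dataset table at the top of the manual and may need adjusting to your environment:

```python
import pandas as pd

# Load each lab's dataset and build a data-quality summary as a first
# dashboard panel (file names assumed; adjust paths as needed).
sources = {
    "vibration": "conditionMonitoring.csv",
    "process": "processData.csv",
    "breakaway": "breakaway.csv",
}

summaries = []
for name, path in sources.items():
    try:
        d = pd.read_csv(path)
        summaries.append({"dataset": name, "rows": len(d),
                          "columns": d.shape[1],
                          "missing_cells": int(d.isna().sum().sum())})
    except FileNotFoundError:
        # Keep going so one missing file does not sink the whole report
        summaries.append({"dataset": name, "rows": None,
                          "columns": None, "missing_cells": None})

print(pd.DataFrame(summaries))
```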
| Criterion | Weight | Description |
|---|---|---|
| Data Quality | 20% | Proper handling of missing values, outliers, data types |
| Analysis Depth | 30% | Meaningful insights from each dataset |
| Integration | 25% | Connections between different data sources |
| Presentation | 25% | Clear visualizations and documentation |