Advanced Data Cleaning Techniques with Pandas

Introduction

Scraped data is inherently messy. Inconsistent formats, missing values, duplicates, and encoding issues are the norm, not the exception. This guide covers the advanced Pandas techniques I use to transform raw scraped data into clean, analysis-ready datasets.

The Data Cleaning Pipeline

Every cleaning project follows a similar structure. Create a reusable DataCleaningPipeline class that stores the original dataframe shape for comparison, maintains a cleaning log for transparency, and executes steps in order: remove duplicates, handle missing values, standardize text, parse dates, parse numbers, and validate data.

Handling Duplicates Intelligently

Not all duplicates are created equal. Sometimes you want exact duplicates removed, other times you need fuzzy matching. Use fuzzy string matching with a configurable threshold (e.g., 85% similarity) to identify near-duplicate rows that might have slight variations in spelling or formatting.

Smart Missing Value Handling

Missing values require context-aware handling. Different strategies work for different columns:

Mean/Median: For numeric columns with normal distributions
Mode: For categorical columns
Forward/Backward fill: For time series data
Zero: When missing means zero (e.g., counts)
Unknown string: For text fields where missing is meaningful
Drop: When missing data invalidates the entire row

Text Standardization

Scraped text often needs extensive cleaning:

Normalize unicode characters to their ASCII equivalents
Remove extra whitespace and control characters
Standardize quotes and dashes to consistent characters
Extract structured data like emails and phone numbers using regex
Apply consistent case formatting (title, upper, lower, sentence)

Parsing Complex Data Types

Scraped data often contains strings that should be numbers or dates:

Price Parsing

Handle various price formats including US format (1,234.56) and European format (1.234,56). Remove currency symbols, detect the format automatically, and convert to float. Return NaN for unparseable values.

Date Parsing

Attempt multiple date formats in sequence: ISO format (YYYY-MM-DD), various regional formats (DD/MM/YYYY, MM/DD/YYYY), and human-readable formats (December 1, 2024). Return NaT (Not a Time) for failures.

Data Validation

After cleaning, validate your data. Check that required columns are present, verify data types match expectations, ensure values fall within acceptable ranges, and validate relationships between columns (e.g., end date after start date).

Conclusion

Data cleaning is often the most time-consuming part of any data project, but it is also the most important. Clean data leads to accurate analysis and reliable insights. Invest time in building reusable cleaning pipelines—they will pay dividends across all your projects.