Data Quality and Validation: Because Garbage In Equals Garbage Out
Let me tell you about the time I spent three weeks debugging a machine learning model that was giving completely wrong predictions. Turns out, the CSV file I was using had customer ages like “35.0000000001” instead of “35” because of a floating-point precision error in the export. The model was treating those tiny decimal differences as meaningful, and the whole analysis was garbage.
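A check like the following would have caught it in minutes. This is just a sketch, assuming the ages live in a DataFrame column called 'age' - adjust the names to your own data:

import pandas as pd

# Flag values that have a fractional part when they should be whole numbers
suspicious = df[df['age'].notna() & (df['age'] % 1 != 0)]
print(f"{len(suspicious)} ages with fractional parts")

# If the fractions are just export noise, round and store as a proper integer type
df['age'] = df['age'].round().astype('Int64')  # nullable integer keeps missing values intact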
That’s when I learned the golden rule of data engineering: garbage in, garbage out. It doesn’t matter how sophisticated your analysis is or how clean your code looks - if your input data is wrong, your results will be wrong. Period.
Data quality isn’t glamorous work, but it’s absolutely critical. You’re going to spend more time cleaning and validating data than you spend on the “fun” parts of data engineering. Embrace it, because this is where you add the most value.
The Reality of Real-World Data
Academic datasets are clean and perfect. Real-world data is a disaster. Here’s what you’ll actually encounter:
- Missing values everywhere - Empty cells, “NULL” strings, spaces, zeros that should be missing
- Inconsistent formats - “USA”, “US”, “United States”, “america” all referring to the same country
- Data type confusion - Numbers stored as text, dates as random strings, booleans as “Y”/“N”
- Duplicates with slight variations - “John Smith” and “Jon Smith” are probably the same person
- Outliers and impossible values - People with ages of 150 or negative salaries
- Encoding issues - Special characters showing up as question marks or weird symbols
The sooner you accept that all data is messy, the better equipped you’ll be to handle it.
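To make a few of those concrete, here's a tiny sketch of what this looks like in practice: a toy frame with fake "missing" markers and numbers stored as text, and a first pass at turning them into values pandas can actually work with (the column names are made up for illustration):

import pandas as pd
import numpy as np

# A toy frame with the usual suspects: fake nulls, inconsistent countries, numbers as text
df = pd.DataFrame({
    'country': ['USA', 'US', 'United States', 'america'],
    'age': ['35', 'NULL', ' ', '42'],
    'salary': ['52,000', '61000', '', 'N/A']
})

# Turn the common "pretend missing" markers into real NaN so pandas can see them
df = df.replace(['NULL', 'N/A', '', ' '], np.nan)

# Now the numeric columns can be converted honestly
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['salary'] = pd.to_numeric(df['salary'].str.replace(',', '', regex=False), errors='coerce')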
Inspecting Data Quality: Your Detective Tools
Before you do anything else with a dataset, you need to understand what you’re working with. Pandas gives you powerful tools to inspect data quality:
import pandas as pd
import numpy as np

def data_quality_report(df):
    """Generate a comprehensive data quality report"""

    print("=== DATA QUALITY REPORT ===")
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print()

    # Basic info about the dataset
    print("=== COLUMN OVERVIEW ===")
    df.info()  # info() prints directly; wrapping it in print() just adds a stray "None"
    print()

    # Missing data analysis
    print("=== MISSING DATA ===")
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df)) * 100

    missing_report = pd.DataFrame({
        'Missing Count': missing_data,
        'Missing Percentage': missing_pct.round(2)
    })
    print(missing_report[missing_report['Missing Count'] > 0])
    print()

    # Duplicate analysis
    print("=== DUPLICATES ===")
    total_duplicates = df.duplicated().sum()
    print(f"Total duplicate rows: {total_duplicates}")
    print(f"Percentage duplicated: {(total_duplicates/len(df)*100):.2f}%")
    print()

    # Data type issues
    print("=== POTENTIAL DATA TYPE ISSUES ===")
    for col in df.columns:
        if df[col].dtype == 'object':  # Text columns
            # Check if numeric values are stored as text
            try:
                numeric_conversion = pd.to_numeric(df[col], errors='coerce')
                non_null_original = df[col].notna().sum()
                non_null_converted = numeric_conversion.notna().sum()

                if non_null_converted > non_null_original * 0.8:  # 80% convertible
                    print(f"{col}: Appears to contain numeric data stored as text")
            except (ValueError, TypeError):
                pass
    print()

    # Statistical summary for numeric columns
    print("=== NUMERIC COLUMN STATISTICS ===")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(df[numeric_cols].describe())

# Run the quality report on your data
# data_quality_report(your_dataframe)
This function gives you a comprehensive overview of your data quality issues. Run it on every dataset before you start processing.
Handling Missing Data: The Art of Educated Guessing
Missing data is everywhere, and how you handle it can make or break your analysis. There’s no one-size-fits-all solution, but here are the common strategies:
Strategy 1: Remove Missing Data
# Remove rows with any missing values
df_clean = df.dropna()

# Remove rows where specific important columns are missing
df_clean = df.dropna(subset=['customer_id', 'purchase_amount'])

# Remove columns that are mostly empty
threshold = int(len(df) * 0.5)  # Keep columns with at least 50% non-missing values (thresh must be an int)
df_clean = df.dropna(thresh=threshold, axis=1)
Use this when: You have plenty of data and the missing values are random. Don’t use this if missing data follows a pattern or if you can’t afford to lose rows.
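One quick way to sanity-check the "missing at random" assumption before you drop anything is to see whether missingness lines up with another field. A rough sketch, assuming you have something like a region column to group by:

# Compare the missing-value rate of purchase_amount across regions
missing_by_region = df['purchase_amount'].isna().groupby(df['region']).mean()
print(missing_by_region.sort_values(ascending=False))

# If one group is missing far more than the others, the gaps probably aren't random -
# dropping those rows would quietly bias the analysis toward the other groups.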
Strategy 2: Fill Missing Values with Sensible Defaults
# Fill with statistical measures
df['age'] = df['age'].fillna(df['age'].median())         # Numbers: use median
df['income'] = df['income'].fillna(df['income'].mean())  # Or mean if normally distributed

# Fill with mode (most common value)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Fill with business logic
df['status'] = df['status'].fillna('Unknown')
df['last_login'] = df['last_login'].fillna(pd.Timestamp('1900-01-01'))  # Sentinel value

# Forward fill or backward fill (for time series)
df['price'] = df['price'].ffill()  # Use last known value; fillna(method='ffill') is deprecated
Strategy 3: Impute Based on Other Columns
# Fill based on groups
df['salary'] = df.groupby(['department', 'role'])['salary'].transform(
    lambda x: x.fillna(x.median())
)

# More complex imputation
def impute_age_by_job(row):
    if pd.isna(row['age']):
        job_ages = df[df['job_title'] == row['job_title']]['age']
        return job_ages.median() if not job_ages.empty else df['age'].median()
    return row['age']

df['age'] = df.apply(impute_age_by_job, axis=1)
Data Type Validation and Conversion
Pandas tries to guess data types when reading files, but it’s often wrong. Here’s how to fix it:
def validate_and_fix_types(df):
    """Clean up common data type issues"""

    df = df.copy()  # Don't modify original

    # Fix numeric columns stored as text
    numeric_cols = ['price', 'quantity', 'age', 'income']
    for col in numeric_cols:
        if col in df.columns:
            # Remove common non-numeric characters
            df[col] = df[col].astype(str).str.replace('[$,]', '', regex=True)
            df[col] = pd.to_numeric(df[col], errors='coerce')

    # Fix date columns
    date_cols = ['created_date', 'updated_date', 'birth_date']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    # Fix boolean columns (anything not in the mapping becomes NaN)
    boolean_cols = ['is_active', 'email_verified', 'premium_member']
    for col in boolean_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': True, 'N': False, 'yes': True, 'no': False,
                'true': True, 'false': False, '1': True, '0': False,
                1: True, 0: False
            })

    # Use categorical for repeated text values (saves memory)
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        if col not in date_cols:  # Don't convert dates to categorical
            unique_ratio = df[col].nunique() / len(df)
            if unique_ratio < 0.1:  # Less than 10% unique values
                df[col] = df[col].astype('category')

    return df

# Clean up your data types
df_clean = validate_and_fix_types(df)
Detecting and Handling Duplicates
Duplicate data is sneaky because real-world duplicates are rarely exact copies. Here’s how to find and fix the tricky ones:
# Exact duplicates (easy case)
exact_duplicates = df.duplicated().sum()
df_no_exact = df.drop_duplicates()

# Fuzzy duplicates (harder case)
def find_similar_records(df, columns, threshold=0.8):
    """Find records that are probably duplicates but not exactly the same"""
    from difflib import SequenceMatcher

    potential_dupes = []

    for col in columns:
        if col not in df.columns:
            continue

        # Convert to string and clean up (tolist() makes the positional comparison unambiguous)
        values = df[col].astype(str).str.lower().str.strip().tolist()

        for i, val1 in enumerate(values):
            for j, val2 in enumerate(values[i+1:], start=i+1):
                similarity = SequenceMatcher(None, val1, val2).ratio()
                if similarity > threshold:
                    potential_dupes.append({
                        'index1': i,   # positional row numbers, not index labels
                        'index2': j,
                        'column': col,
                        'value1': val1,
                        'value2': val2,
                        'similarity': similarity
                    })

    return pd.DataFrame(potential_dupes)

# Find potential name duplicates (this is an O(n^2) comparison - fine for modest tables)
name_dupes = find_similar_records(df, ['customer_name'], threshold=0.85)

# More sophisticated duplicate detection based on multiple columns
def identify_customer_duplicates(df):
    """Find customers that are probably the same person"""

    # Standardize key fields first
    df = df.copy()
    df['clean_email'] = df['email'].str.lower().str.strip()
    df['clean_phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
    df['clean_name'] = df['name'].str.lower().str.strip()

    # Flag potential duplicates
    email_dupes = df.groupby(['clean_email']).filter(lambda x: len(x) > 1)
    phone_dupes = df.groupby(['clean_phone']).filter(lambda x: len(x) > 1)

    return email_dupes, phone_dupes
Outlier Detection: Finding the Weird Stuff
Outliers can be legitimate extreme values or data entry errors. Here’s how to detect them:
def detect_outliers(df, columns=None):
    """Detect outliers using multiple methods"""

    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns

    outlier_report = {}

    for col in columns:
        if col not in df.columns:
            continue

        values = df[col].dropna()
        outlier_indices = set()

        # Method 1: IQR (Interquartile Range)
        Q1 = values.quantile(0.25)
        Q3 = values.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        iqr_outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
        outlier_indices.update(iqr_outliers)

        # Method 2: Z-score (more than 3 standard deviations from the mean)
        z_scores = np.abs((values - values.mean()) / values.std())
        # Index the non-null values, not df - the mask only covers rows without missing data
        z_outliers = values[z_scores > 3].index
        outlier_indices.update(z_outliers)

        # Method 3: Business logic (customize for your domain)
        if col == 'age':
            age_outliers = df[(df[col] < 0) | (df[col] > 120)].index
            outlier_indices.update(age_outliers)
        elif col == 'price':
            price_outliers = df[df[col] < 0].index  # Negative prices are usually wrong
            outlier_indices.update(price_outliers)

        outlier_report[col] = {
            'count': len(outlier_indices),
            'percentage': len(outlier_indices) / len(df) * 100,
            'indices': list(outlier_indices)
        }

    return outlier_report

# Find outliers
outliers = detect_outliers(df, ['age', 'income', 'purchase_amount'])

# Review outliers before deciding what to do
for col, info in outliers.items():
    print(f"\n{col}: {info['count']} outliers ({info['percentage']:.2f}%)")
    if info['count'] > 0:
        outlier_rows = df.loc[info['indices'], [col]]
        print(outlier_rows.head())
Data Consistency Validation
Inconsistent data formats are everywhere. Here’s how to standardize common fields:
def standardize_data_formats(df):
    """Standardize common data format issues"""

    df = df.copy()

    # Standardize country names
    country_mapping = {
        'usa': 'United States',
        'us': 'United States',
        'america': 'United States',
        'uk': 'United Kingdom',
        'britain': 'United Kingdom'
    }

    if 'country' in df.columns:
        df['country'] = df['country'].str.lower().str.strip()
        df['country'] = df['country'].replace(country_mapping)
        df['country'] = df['country'].str.title()

    # Standardize phone numbers
    if 'phone' in df.columns:
        # Remove all non-digits
        df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
        # Format as XXX-XXX-XXXX for US numbers
        us_phone_mask = df['phone'].str.len() == 10
        df.loc[us_phone_mask, 'phone'] = df.loc[us_phone_mask, 'phone'].str.replace(
            r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True
        )

    # Standardize email addresses
    if 'email' in df.columns:
        df['email'] = df['email'].str.lower().str.strip()
        # Flag potentially invalid emails
        df['email_valid'] = df['email'].str.contains(
            r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
            regex=True, na=False
        )

    # Standardize names
    name_cols = ['first_name', 'last_name', 'name']
    for col in name_cols:
        if col in df.columns:
            df[col] = df[col].str.strip().str.title()
            # Remove extra spaces
            df[col] = df[col].str.replace(r'\s+', ' ', regex=True)

    return df

# Standardize your data
df_standard = standardize_data_formats(df)
Cross-Field Validation: Checking Business Logic
Sometimes individual fields look fine, but combinations don’t make sense:
def validate_business_rules(df):
    """Check for business logic violations"""

    validation_errors = []

    # Check date logic
    if 'birth_date' in df.columns and 'age' in df.columns:
        calculated_age = (pd.Timestamp.now() - pd.to_datetime(df['birth_date'])).dt.days // 365
        age_mismatch = abs(calculated_age - df['age']) > 1

        if age_mismatch.any():
            validation_errors.append({
                'rule': 'age_birth_date_mismatch',
                'count': age_mismatch.sum(),
                'indices': df[age_mismatch].index.tolist()
            })

    # Check financial logic
    if all(col in df.columns for col in ['quantity', 'unit_price', 'total_amount']):
        expected_total = df['quantity'] * df['unit_price']
        total_mismatch = abs(expected_total - df['total_amount']) > 0.01

        if total_mismatch.any():
            validation_errors.append({
                'rule': 'total_calculation_error',
                'count': total_mismatch.sum(),
                'indices': df[total_mismatch].index.tolist()
            })

    # Check shipping date logic
    if all(col in df.columns for col in ['order_date', 'ship_date']):
        ship_before_order = pd.to_datetime(df['ship_date']) < pd.to_datetime(df['order_date'])

        if ship_before_order.any():
            validation_errors.append({
                'rule': 'ship_date_before_order_date',
                'count': ship_before_order.sum(),
                'indices': df[ship_before_order].index.tolist()
            })

    return validation_errors

# Check business rules
rule_violations = validate_business_rules(df)
for error in rule_violations:
    print(f"Rule violation: {error['rule']}")
    print(f"Count: {error['count']}")
    print("Sample problematic rows:")
    print(df.loc[error['indices'][:5]])  # Show first 5 problematic rows
    print()
50 print()
Building a Complete Data Validation Pipeline
Here’s how to put it all together into a reusable validation framework:
class DataValidator:
    """Comprehensive data validation class"""

    def __init__(self, df):
        self.df = df.copy()
        self.original_shape = df.shape
        self.validation_log = []

    def log_step(self, step_name, before_count, after_count, description):
        """Log validation steps"""
        self.validation_log.append({
            'step': step_name,
            'before_rows': before_count,
            'after_rows': after_count,
            'rows_affected': before_count - after_count,
            'description': description
        })

    def remove_empty_rows(self, required_columns=None):
        """Remove rows that are completely or mostly empty"""
        before_count = len(self.df)

        if required_columns:
            self.df = self.df.dropna(subset=required_columns)
        else:
            # Keep rows with at least 20% non-missing values (i.e. drop rows more than 80% empty)
            threshold = int(len(self.df.columns) * 0.2)
            self.df = self.df.dropna(thresh=threshold)

        after_count = len(self.df)
        self.log_step('remove_empty_rows', before_count, after_count,
                      "Removed rows missing required data")

        return self

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows"""
        before_count = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        after_count = len(self.df)

        self.log_step('remove_duplicates', before_count, after_count,
                      "Removed exact duplicate rows")

        return self

    def fix_data_types(self, type_mapping=None):
        """Fix common data type issues"""
        before_count = len(self.df)

        if type_mapping:
            for col, dtype in type_mapping.items():
                if col in self.df.columns:
                    if dtype == 'datetime':
                        self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
                    elif dtype == 'numeric':
                        self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                    else:
                        self.df[col] = self.df[col].astype(dtype)

            # Remove rows where type conversion failed (only check the converted columns)
            converted_cols = [col for col in type_mapping if col in self.df.columns]
            self.df = self.df.dropna(subset=converted_cols)

        after_count = len(self.df)

        self.log_step('fix_data_types', before_count, after_count,
                      "Fixed data types and removed invalid conversions")

        return self

    def remove_outliers(self, columns, method='iqr'):
        """Remove outliers from specified columns"""
        before_count = len(self.df)

        for col in columns:
            if col not in self.df.columns:
                continue

            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                self.df = self.df[
                    (self.df[col] >= lower_bound) & (self.df[col] <= upper_bound)
                ]

        after_count = len(self.df)
        self.log_step('remove_outliers', before_count, after_count,
                      f"Removed outliers from {columns}")

        return self

    def get_cleaned_data(self):
        """Return the cleaned DataFrame"""
        return self.df

    def get_validation_report(self):
        """Return a summary of all validation steps"""
        report = pd.DataFrame(self.validation_log)

        print("=== DATA VALIDATION REPORT ===")
        print(f"Original shape: {self.original_shape}")
        print(f"Final shape: {self.df.shape}")
        print(f"Total rows removed: {self.original_shape[0] - self.df.shape[0]}")
        print(f"Data retention: {self.df.shape[0]/self.original_shape[0]*100:.1f}%")
        print("\nStep-by-step breakdown:")
        print(report)

        return report

# Use the validation pipeline
validator = DataValidator(raw_data)
clean_data = (validator
              .remove_empty_rows(required_columns=['customer_id', 'amount'])
              .remove_duplicates()
              .fix_data_types({
                  'order_date': 'datetime',
                  'amount': 'numeric',
                  'age': 'numeric'
              })
              .remove_outliers(['amount', 'age'])
              .get_cleaned_data())

# Get the validation report
validation_report = validator.get_validation_report()
Creating Data Quality Checks That Run Automatically
The best validation happens automatically as part of your pipeline:
def create_data_quality_tests(df, config):
    """Run automated data quality tests"""

    tests_passed = []
    tests_failed = []

    # Test 1: Required columns exist
    required_cols = config.get('required_columns', [])
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        tests_failed.append(f"Missing required columns: {missing_cols}")
    else:
        tests_passed.append("All required columns present")

    # Test 2: No excessive missing data
    max_missing_pct = config.get('max_missing_percentage', 50)
    for col in df.columns:
        missing_pct = (df[col].isnull().sum() / len(df)) * 100
        if missing_pct > max_missing_pct:
            tests_failed.append(f"{col} has {missing_pct:.1f}% missing data (max: {max_missing_pct}%)")

    # Test 3: Data type validation
    expected_types = config.get('expected_types', {})
    for col, expected_type in expected_types.items():
        if col in df.columns:
            actual_type = str(df[col].dtype)
            if expected_type not in actual_type:
                tests_failed.append(f"{col} has type {actual_type}, expected {expected_type}")

    # Test 4: Value range validation
    value_ranges = config.get('value_ranges', {})
    for col, (min_val, max_val) in value_ranges.items():
        if col in df.columns:
            out_of_range = (df[col] < min_val) | (df[col] > max_val)
            if out_of_range.any():
                count = out_of_range.sum()
                tests_failed.append(f"{col} has {count} values outside range [{min_val}, {max_val}]")

    # Test 5: Uniqueness validation
    unique_columns = config.get('unique_columns', [])
    for col in unique_columns:
        if col in df.columns:
            duplicates = df[col].duplicated().sum()
            if duplicates > 0:
                tests_failed.append(f"{col} has {duplicates} duplicate values (should be unique)")

    return {
        'passed': tests_passed,
        'failed': tests_failed,
        'success': len(tests_failed) == 0
    }

# Configuration for your data quality tests
quality_config = {
    'required_columns': ['customer_id', 'order_date', 'amount'],
    'max_missing_percentage': 10,
    'expected_types': {
        'customer_id': 'object',
        'amount': 'float',
        'order_date': 'datetime'
    },
    'value_ranges': {
        'amount': (0, 10000),
        'age': (0, 120)
    },
    'unique_columns': ['transaction_id']
}

# Run the tests
test_results = create_data_quality_tests(df, quality_config)

if test_results['success']:
    print("✅ All data quality tests passed!")
else:
    print("❌ Data quality issues found:")
    for issue in test_results['failed']:
        print(f"  - {issue}")
The Big Picture: Making Data Quality Part of Your Process
Here’s what I’ve learned after years of dealing with messy data: data quality isn’t a one-time thing you do at the beginning of a project. It’s an ongoing process that needs to be built into every step of your pipeline.
Always validate early and often:
- Check data quality when you first receive it
- Validate after each transformation step
- Test your output before sending it downstream
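A lightweight way to bake this in is a small assertion helper you call between pipeline steps. Here's a rough sketch - the column names, thresholds, and the commented-out calls are placeholders, not a real pipeline:

def assert_expectations(df, step_name, required_columns=(), max_missing_pct=10):
    """Fail fast if a pipeline step produced something unexpected."""
    missing_cols = [c for c in required_columns if c not in df.columns]
    assert not missing_cols, f"{step_name}: missing columns {missing_cols}"
    assert len(df) > 0, f"{step_name}: produced an empty DataFrame"

    for col in required_columns:
        pct = df[col].isna().mean() * 100
        assert pct <= max_missing_pct, (
            f"{step_name}: {col} is {pct:.1f}% missing (limit {max_missing_pct}%)"
        )

# Call it after every transformation, for example:
# orders = load_orders()
# assert_expectations(orders, 'load_orders', required_columns=['customer_id', 'amount'])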
Document your assumptions:
- What do you expect the data to look like?
- What business rules should it follow?
- What are acceptable ranges for each field?
Build reusable validation functions:
- Don’t write the same data cleaning code over and over
- Create libraries of validation functions you can reuse
- Make your validations configurable for different datasets
Monitor data quality over time:
- Set up alerts when data quality drops below thresholds
- Track quality metrics in dashboards
- Have a process for investigating quality issues
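Here's a rough sketch of what the alerting piece could look like: compute a few cheap metrics on every new batch and compare them against thresholds you've agreed on. The metric names and threshold values below are made up for illustration:

def check_quality_thresholds(df, thresholds):
    """Return human-readable alerts for any metric that breaches its threshold."""
    metrics = {
        'row_count': len(df),
        'missing_pct': df.isna().mean().mean() * 100,   # average missing rate across columns
        'duplicate_pct': df.duplicated().mean() * 100,
    }

    alerts = []
    if metrics['row_count'] < thresholds.get('min_rows', 0):
        alerts.append(f"Row count {metrics['row_count']} below minimum {thresholds['min_rows']}")
    if metrics['missing_pct'] > thresholds.get('max_missing_pct', 100):
        alerts.append(f"Missing data at {metrics['missing_pct']:.1f}%")
    if metrics['duplicate_pct'] > thresholds.get('max_duplicate_pct', 100):
        alerts.append(f"Duplicates at {metrics['duplicate_pct']:.1f}%")
    return alerts

# Run this on every new batch and wire the alerts into whatever notification channel you use
alerts = check_quality_thresholds(df, {'min_rows': 1000, 'max_missing_pct': 5, 'max_duplicate_pct': 1})
for alert in alerts:
    print(f"ALERT: {alert}")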
Remember, fixing data quality issues isn’t just about making your analysis work - it’s about building trust with your stakeholders. When people know they can rely on your data, they’ll use it to make important business decisions. That’s when you know you’re adding real value as a data engineer.
Clean data is the foundation of everything else you’ll do. Master these validation techniques, and you’ll save yourself countless hours of debugging later. More importantly, you’ll deliver results that people can actually trust and act on.