Data Quality and Validation: Because Garbage In Equals Garbage Out
Let me tell you about the time I spent three weeks debugging a machine learning model that was giving completely wrong predictions. Turns out, the CSV file I was using had customer ages like “35.0000000001” instead of “35” because of a floating-point precision error in the export. The model was treating those tiny decimal differences as meaningful, and the whole analysis was garbage.
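A check like the following would have caught it in minutes. This is just a sketch, assuming the ages live in a DataFrame column called 'age' - adjust the names to your own data:

import pandas as pd

# Flag values that have a fractional part when they should be whole numbers
suspicious = df[df['age'].notna() & (df['age'] % 1 != 0)]
print(f"{len(suspicious)} ages with fractional parts")

# If the fractions are just export noise, round and store as a proper integer type
df['age'] = df['age'].round().astype('Int64')  # nullable integer keeps missing values intact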
That’s when I learned the golden rule of data engineering: garbage in, garbage out. It doesn’t matter how sophisticated your analysis is or how clean your code looks - if your input data is wrong, your results will be wrong. Period.
Data quality isn’t glamorous work, but it’s absolutely critical. You’re going to spend more time cleaning and validating data than you spend on the “fun” parts of data engineering. Embrace it, because this is where you add the most value.
The Reality of Real-World Data
Academic datasets are clean and perfect. Real-world data is a disaster. Here’s what you’ll actually encounter:
- Missing values everywhere - Empty cells, “NULL” strings, spaces, zeros that should be missing
- Inconsistent formats - “USA”, “US”, “United States”, “america” all referring to the same country
- Data type confusion - Numbers stored as text, dates as random strings, booleans as “Y”/“N”
- Duplicates with slight variations - “John Smith” and “Jon Smith” are probably the same person
- Outliers and impossible values - People with ages of 150 or negative salaries
- Encoding issues - Special characters showing up as question marks or weird symbols
The sooner you accept that all data is messy, the better equipped you’ll be to handle it.
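To make a few of those concrete, here's a tiny sketch of what this looks like in practice: a toy frame with fake "missing" markers and numbers stored as text, and a first pass at turning them into values pandas can actually work with (the column names are made up for illustration):

import pandas as pd
import numpy as np

# A toy frame with the usual suspects: fake nulls, inconsistent countries, numbers as text
df = pd.DataFrame({
    'country': ['USA', 'US', 'United States', 'america'],
    'age': ['35', 'NULL', ' ', '42'],
    'salary': ['52,000', '61000', '', 'N/A']
})

# Turn the common "pretend missing" markers into real NaN so pandas can see them
df = df.replace(['NULL', 'N/A', '', ' '], np.nan)

# Now the numeric columns can be converted honestly
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['salary'] = pd.to_numeric(df['salary'].str.replace(',', '', regex=False), errors='coerce')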
Inspecting Data Quality: Your Detective Tools
Before you do anything else with a dataset, you need to understand what you’re working with. Pandas gives you powerful tools to inspect data quality:
import pandas as pd
import numpy as np

def data_quality_report(df):
    """Generate a comprehensive data quality report"""

    print("=== DATA QUALITY REPORT ===")
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print()

    # Basic info about the dataset
    print("=== COLUMN OVERVIEW ===")
    df.info()  # info() prints directly; wrapping it in print() just adds a stray "None"
    print()

    # Missing data analysis
    print("=== MISSING DATA ===")
    missing_data = df.isnull().sum()
    missing_pct = (missing_data / len(df)) * 100

    missing_report = pd.DataFrame({
        'Missing Count': missing_data,
        'Missing Percentage': missing_pct.round(2)
    })
    print(missing_report[missing_report['Missing Count'] > 0])
    print()

    # Duplicate analysis
    print("=== DUPLICATES ===")
    total_duplicates = df.duplicated().sum()
    print(f"Total duplicate rows: {total_duplicates}")
    print(f"Percentage duplicated: {(total_duplicates/len(df)*100):.2f}%")
    print()

    # Data type issues
    print("=== POTENTIAL DATA TYPE ISSUES ===")
    for col in df.columns:
        if df[col].dtype == 'object':  # Text columns
            # Check if numeric values are stored as text
            try:
                numeric_conversion = pd.to_numeric(df[col], errors='coerce')
                non_null_original = df[col].notna().sum()
                non_null_converted = numeric_conversion.notna().sum()

                if non_null_converted > non_null_original * 0.8:  # 80% convertible
                    print(f"{col}: Appears to contain numeric data stored as text")
            except (ValueError, TypeError):
                pass
    print()

    # Statistical summary for numeric columns
    print("=== NUMERIC COLUMN STATISTICS ===")
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    if len(numeric_cols) > 0:
        print(df[numeric_cols].describe())

# Run the quality report on your data
# data_quality_report(your_dataframe)
This function gives you a comprehensive overview of your data quality issues. Run it on every dataset before you start processing.
Handling Missing Data: The Art of Educated Guessing
Missing data is everywhere, and how you handle it can make or break your analysis. There’s no one-size-fits-all solution, but here are the common strategies:
Strategy 1: Remove Missing Data
# Remove rows with any missing values
df_clean = df.dropna()

# Remove rows where specific important columns are missing
df_clean = df.dropna(subset=['customer_id', 'purchase_amount'])

# Remove columns that are mostly empty
threshold = int(len(df) * 0.5)  # Keep columns with at least 50% non-missing values (thresh must be an int)
df_clean = df.dropna(thresh=threshold, axis=1)
Use this when: You have plenty of data and the missing values are random. Don’t use this if missing data follows a pattern or if you can’t afford to lose rows.
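One quick way to sanity-check the "missing at random" assumption before you drop anything is to see whether missingness lines up with another field. A rough sketch, assuming you have something like a region column to group by:

# Compare the missing-value rate of purchase_amount across regions
missing_by_region = df['purchase_amount'].isna().groupby(df['region']).mean()
print(missing_by_region.sort_values(ascending=False))

# If one group is missing far more than the others, the gaps probably aren't random -
# dropping those rows would quietly bias the analysis toward the other groups.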
Strategy 2: Fill Missing Values with Sensible Defaults
# Fill with statistical measures
df['age'] = df['age'].fillna(df['age'].median())         # Numbers: use median
df['income'] = df['income'].fillna(df['income'].mean())  # Or mean if normally distributed

# Fill with mode (most common value)
df['category'] = df['category'].fillna(df['category'].mode()[0])

# Fill with business logic
df['status'] = df['status'].fillna('Unknown')
df['last_login'] = df['last_login'].fillna(pd.Timestamp('1900-01-01'))  # Sentinel value

# Forward fill or backward fill (for time series)
df['price'] = df['price'].ffill()  # Use last known value; fillna(method='ffill') is deprecated
Strategy 3: Impute Based on Other Columns
# Fill based on groups
df['salary'] = df.groupby(['department', 'role'])['salary'].transform(
    lambda x: x.fillna(x.median())
)

# More complex imputation
def impute_age_by_job(row):
    if pd.isna(row['age']):
        job_ages = df[df['job_title'] == row['job_title']]['age']
        return job_ages.median() if not job_ages.empty else df['age'].median()
    return row['age']

df['age'] = df.apply(impute_age_by_job, axis=1)
Data Type Validation and Conversion
Pandas tries to guess data types when reading files, but it’s often wrong. Here’s how to fix it:
def validate_and_fix_types(df):
    """Clean up common data type issues"""

    df = df.copy()  # Don't modify original

    # Fix numeric columns stored as text
    numeric_cols = ['price', 'quantity', 'age', 'income']
    for col in numeric_cols:
        if col in df.columns:
            # Remove common non-numeric characters
            df[col] = df[col].astype(str).str.replace('[$,]', '', regex=True)
            df[col] = pd.to_numeric(df[col], errors='coerce')

    # Fix date columns
    date_cols = ['created_date', 'updated_date', 'birth_date']
    for col in date_cols:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors='coerce')

    # Fix boolean columns (anything not in the mapping becomes NaN)
    boolean_cols = ['is_active', 'email_verified', 'premium_member']
    for col in boolean_cols:
        if col in df.columns:
            df[col] = df[col].map({
                'Y': True, 'N': False, 'yes': True, 'no': False,
                'true': True, 'false': False, '1': True, '0': False,
                1: True, 0: False
            })

    # Use categorical for repeated text values (saves memory)
    text_cols = df.select_dtypes(include=['object']).columns
    for col in text_cols:
        if col not in date_cols:  # Don't convert dates to categorical
            unique_ratio = df[col].nunique() / len(df)
            if unique_ratio < 0.1:  # Less than 10% unique values
                df[col] = df[col].astype('category')

    return df

# Clean up your data types
df_clean = validate_and_fix_types(df)
Detecting and Handling Duplicates
Duplicate data is sneaky because real-world duplicates are rarely exact copies. Here’s how to find and fix the tricky ones:
# Exact duplicates (easy case)
exact_duplicates = df.duplicated().sum()
df_no_exact = df.drop_duplicates()

# Fuzzy duplicates (harder case)
def find_similar_records(df, columns, threshold=0.8):
    """Find records that are probably duplicates but not exactly the same"""
    from difflib import SequenceMatcher

    potential_dupes = []

    for col in columns:
        if col not in df.columns:
            continue

        # Convert to string and clean up (tolist() makes the positional comparison unambiguous)
        values = df[col].astype(str).str.lower().str.strip().tolist()

        for i, val1 in enumerate(values):
            for j, val2 in enumerate(values[i+1:], start=i+1):
                similarity = SequenceMatcher(None, val1, val2).ratio()
                if similarity > threshold:
                    potential_dupes.append({
                        'index1': i,   # positional row numbers, not index labels
                        'index2': j,
                        'column': col,
                        'value1': val1,
                        'value2': val2,
                        'similarity': similarity
                    })

    return pd.DataFrame(potential_dupes)

# Find potential name duplicates (this is an O(n^2) comparison - fine for modest tables)
name_dupes = find_similar_records(df, ['customer_name'], threshold=0.85)

# More sophisticated duplicate detection based on multiple columns
def identify_customer_duplicates(df):
    """Find customers that are probably the same person"""

    # Standardize key fields first
    df = df.copy()
    df['clean_email'] = df['email'].str.lower().str.strip()
    df['clean_phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
    df['clean_name'] = df['name'].str.lower().str.strip()

    # Flag potential duplicates
    email_dupes = df.groupby(['clean_email']).filter(lambda x: len(x) > 1)
    phone_dupes = df.groupby(['clean_phone']).filter(lambda x: len(x) > 1)

    return email_dupes, phone_dupes
Outlier Detection: Finding the Weird Stuff
Outliers can be legitimate extreme values or data entry errors. Here’s how to detect them:
def detect_outliers(df, columns=None):
    """Detect outliers using multiple methods"""

    if columns is None:
        columns = df.select_dtypes(include=[np.number]).columns

    outlier_report = {}

    for col in columns:
        if col not in df.columns:
            continue

        values = df[col].dropna()
        outlier_indices = set()

        # Method 1: IQR (Interquartile Range)
        Q1 = values.quantile(0.25)
        Q3 = values.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        iqr_outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
        outlier_indices.update(iqr_outliers)

        # Method 2: Z-score (more than 3 standard deviations from the mean)
        z_scores = np.abs((values - values.mean()) / values.std())
        # Index the non-null values, not df - the mask only covers rows without missing data
        z_outliers = values[z_scores > 3].index
        outlier_indices.update(z_outliers)

        # Method 3: Business logic (customize for your domain)
        if col == 'age':
            age_outliers = df[(df[col] < 0) | (df[col] > 120)].index
            outlier_indices.update(age_outliers)
        elif col == 'price':
            price_outliers = df[df[col] < 0].index  # Negative prices are usually wrong
            outlier_indices.update(price_outliers)

        outlier_report[col] = {
            'count': len(outlier_indices),
            'percentage': len(outlier_indices) / len(df) * 100,
            'indices': list(outlier_indices)
        }

    return outlier_report

# Find outliers
outliers = detect_outliers(df, ['age', 'income', 'purchase_amount'])

# Review outliers before deciding what to do
for col, info in outliers.items():
    print(f"\n{col}: {info['count']} outliers ({info['percentage']:.2f}%)")
    if info['count'] > 0:
        outlier_rows = df.loc[info['indices'], [col]]
        print(outlier_rows.head())
Data Consistency Validation
Inconsistent data formats are everywhere. Here’s how to standardize common fields:
def standardize_data_formats(df):
    """Standardize common data format issues"""

    df = df.copy()

    # Standardize country names
    country_mapping = {
        'usa': 'United States',
        'us': 'United States',
        'america': 'United States',
        'uk': 'United Kingdom',
        'britain': 'United Kingdom'
    }

    if 'country' in df.columns:
        df['country'] = df['country'].str.lower().str.strip()
        df['country'] = df['country'].replace(country_mapping)
        df['country'] = df['country'].str.title()

    # Standardize phone numbers
    if 'phone' in df.columns:
        # Remove all non-digits
        df['phone'] = df['phone'].str.replace(r'[^\d]', '', regex=True)
        # Format as XXX-XXX-XXXX for US numbers
        us_phone_mask = df['phone'].str.len() == 10
        df.loc[us_phone_mask, 'phone'] = df.loc[us_phone_mask, 'phone'].str.replace(
            r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True
        )

    # Standardize email addresses
    if 'email' in df.columns:
        df['email'] = df['email'].str.lower().str.strip()
        # Flag potentially invalid emails
        df['email_valid'] = df['email'].str.contains(
            r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$',
            regex=True, na=False
        )

    # Standardize names
    name_cols = ['first_name', 'last_name', 'name']
    for col in name_cols:
        if col in df.columns:
            df[col] = df[col].str.strip().str.title()
            # Remove extra spaces
            df[col] = df[col].str.replace(r'\s+', ' ', regex=True)

    return df

# Standardize your data
df_standard = standardize_data_formats(df)
Cross-Field Validation: Checking Business Logic
Sometimes individual fields look fine, but combinations don’t make sense:
def validate_business_rules(df):
    """Check for business logic violations"""

    validation_errors = []

    # Check date logic
    if 'birth_date' in df.columns and 'age' in df.columns:
        calculated_age = (pd.Timestamp.now() - pd.to_datetime(df['birth_date'])).dt.days // 365
        age_mismatch = abs(calculated_age - df['age']) > 1

        if age_mismatch.any():
            validation_errors.append({
                'rule': 'age_birth_date_mismatch',
                'count': age_mismatch.sum(),
                'indices': df[age_mismatch].index.tolist()
            })

    # Check financial logic
    if all(col in df.columns for col in ['quantity', 'unit_price', 'total_amount']):
        expected_total = df['quantity'] * df['unit_price']
        total_mismatch = abs(expected_total - df['total_amount']) > 0.01

        if total_mismatch.any():
            validation_errors.append({
                'rule': 'total_calculation_error',
                'count': total_mismatch.sum(),
                'indices': df[total_mismatch].index.tolist()
            })

    # Check shipping date logic
    if all(col in df.columns for col in ['order_date', 'ship_date']):
        ship_before_order = pd.to_datetime(df['ship_date']) < pd.to_datetime(df['order_date'])

        if ship_before_order.any():
            validation_errors.append({
                'rule': 'ship_date_before_order_date',
                'count': ship_before_order.sum(),
                'indices': df[ship_before_order].index.tolist()
            })

    return validation_errors

# Check business rules
rule_violations = validate_business_rules(df)
for error in rule_violations:
    print(f"Rule violation: {error['rule']}")
    print(f"Count: {error['count']}")
    print("Sample problematic rows:")
    print(df.loc[error['indices'][:5]])  # Show first 5 problematic rows
    print()
50 print()
Building a Complete Data Validation Pipeline
Here’s how to put it all together into a reusable validation framework:
class DataValidator:
    """Comprehensive data validation class"""

    def __init__(self, df):
        self.df = df.copy()
        self.original_shape = df.shape
        self.validation_log = []

    def log_step(self, step_name, before_count, after_count, description):
        """Log validation steps"""
        self.validation_log.append({
            'step': step_name,
            'before_rows': before_count,
            'after_rows': after_count,
            'rows_affected': before_count - after_count,
            'description': description
        })

    def remove_empty_rows(self, required_columns=None):
        """Remove rows that are completely or mostly empty"""
        before_count = len(self.df)

        if required_columns:
            self.df = self.df.dropna(subset=required_columns)
        else:
            # Keep rows with at least 20% non-missing values (i.e. drop rows more than 80% empty)
            threshold = int(len(self.df.columns) * 0.2)
            self.df = self.df.dropna(thresh=threshold)

        after_count = len(self.df)
        self.log_step('remove_empty_rows', before_count, after_count,
                      "Removed rows missing required data")

        return self

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows"""
        before_count = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        after_count = len(self.df)

        self.log_step('remove_duplicates', before_count, after_count,
                      "Removed exact duplicate rows")

        return self

    def fix_data_types(self, type_mapping=None):
        """Fix common data type issues"""
        before_count = len(self.df)

        if type_mapping:
            for col, dtype in type_mapping.items():
                if col in self.df.columns:
                    if dtype == 'datetime':
                        self.df[col] = pd.to_datetime(self.df[col], errors='coerce')
                    elif dtype == 'numeric':
                        self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
                    else:
                        self.df[col] = self.df[col].astype(dtype)

            # Remove rows where type conversion failed (only check the converted columns)
            converted_cols = [col for col in type_mapping if col in self.df.columns]
            self.df = self.df.dropna(subset=converted_cols)

        after_count = len(self.df)

        self.log_step('fix_data_types', before_count, after_count,
                      "Fixed data types and removed invalid conversions")

        return self

    def remove_outliers(self, columns, method='iqr'):
        """Remove outliers from specified columns"""
        before_count = len(self.df)

        for col in columns:
            if col not in self.df.columns:
                continue

            if method == 'iqr':
                Q1 = self.df[col].quantile(0.25)
                Q3 = self.df[col].quantile(0.75)
                IQR = Q3 - Q1
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                self.df = self.df[
                    (self.df[col] >= lower_bound) & (self.df[col] <= upper_bound)
                ]

        after_count = len(self.df)
        self.log_step('remove_outliers', before_count, after_count,
                      f"Removed outliers from {columns}")

        return self

    def get_cleaned_data(self):
        """Return the cleaned DataFrame"""
        return self.df

    def get_validation_report(self):
        """Return a summary of all validation steps"""
        report = pd.DataFrame(self.validation_log)

        print("=== DATA VALIDATION REPORT ===")
        print(f"Original shape: {self.original_shape}")
        print(f"Final shape: {self.df.shape}")
        print(f"Total rows removed: {self.original_shape[0] - self.df.shape[0]}")
        print(f"Data retention: {self.df.shape[0]/self.original_shape[0]*100:.1f}%")
        print("\nStep-by-step breakdown:")
        print(report)

        return report

# Use the validation pipeline
validator = DataValidator(raw_data)
clean_data = (validator
              .remove_empty_rows(required_columns=['customer_id', 'amount'])
              .remove_duplicates()
              .fix_data_types({
                  'order_date': 'datetime',
                  'amount': 'numeric',
                  'age': 'numeric'
              })
              .remove_outliers(['amount', 'age'])
              .get_cleaned_data())

# Get the validation report
validation_report = validator.get_validation_report()
Creating Data Quality Checks That Run Automatically
The best validation happens automatically as part of your pipeline:
def create_data_quality_tests(df, config):
    """Run automated data quality tests"""

    tests_passed = []
    tests_failed = []

    # Test 1: Required columns exist
    required_cols = config.get('required_columns', [])
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        tests_failed.append(f"Missing required columns: {missing_cols}")
    else:
        tests_passed.append("All required columns present")

    # Test 2: No excessive missing data
    max_missing_pct = config.get('max_missing_percentage', 50)
    for col in df.columns:
        missing_pct = (df[col].isnull().sum() / len(df)) * 100
        if missing_pct > max_missing_pct:
            tests_failed.append(f"{col} has {missing_pct:.1f}% missing data (max: {max_missing_pct}%)")

    # Test 3: Data type validation
    expected_types = config.get('expected_types', {})
    for col, expected_type in expected_types.items():
        if col in df.columns:
            actual_type = str(df[col].dtype)
            if expected_type not in actual_type:
                tests_failed.append(f"{col} has type {actual_type}, expected {expected_type}")

    # Test 4: Value range validation
    value_ranges = config.get('value_ranges', {})
    for col, (min_val, max_val) in value_ranges.items():
        if col in df.columns:
            out_of_range = (df[col] < min_val) | (df[col] > max_val)
            if out_of_range.any():
                count = out_of_range.sum()
                tests_failed.append(f"{col} has {count} values outside range [{min_val}, {max_val}]")

    # Test 5: Uniqueness validation
    unique_columns = config.get('unique_columns', [])
    for col in unique_columns:
        if col in df.columns:
            duplicates = df[col].duplicated().sum()
            if duplicates > 0:
                tests_failed.append(f"{col} has {duplicates} duplicate values (should be unique)")

    return {
        'passed': tests_passed,
        'failed': tests_failed,
        'success': len(tests_failed) == 0
    }

# Configuration for your data quality tests
quality_config = {
    'required_columns': ['customer_id', 'order_date', 'amount'],
    'max_missing_percentage': 10,
    'expected_types': {
        'customer_id': 'object',
        'amount': 'float',
        'order_date': 'datetime'
    },
    'value_ranges': {
        'amount': (0, 10000),
        'age': (0, 120)
    },
    'unique_columns': ['transaction_id']
}

# Run the tests
test_results = create_data_quality_tests(df, quality_config)

if test_results['success']:
    print("✅ All data quality tests passed!")
else:
    print("❌ Data quality issues found:")
    for issue in test_results['failed']:
        print(f"  - {issue}")
The Big Picture: Making Data Quality Part of Your Process
Here’s what I’ve learned after years of dealing with messy data: data quality isn’t a one-time thing you do at the beginning of a project. It’s an ongoing process that needs to be built into every step of your pipeline.
Always validate early and often:
- Check data quality when you first receive it
- Validate after each transformation step
- Test your output before sending it downstream
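A lightweight way to bake this in is a small assertion helper you call between pipeline steps. Here's a rough sketch - the column names, thresholds, and the commented-out calls are placeholders, not a real pipeline:

def assert_expectations(df, step_name, required_columns=(), max_missing_pct=10):
    """Fail fast if a pipeline step produced something unexpected."""
    missing_cols = [c for c in required_columns if c not in df.columns]
    assert not missing_cols, f"{step_name}: missing columns {missing_cols}"
    assert len(df) > 0, f"{step_name}: produced an empty DataFrame"

    for col in required_columns:
        pct = df[col].isna().mean() * 100
        assert pct <= max_missing_pct, (
            f"{step_name}: {col} is {pct:.1f}% missing (limit {max_missing_pct}%)"
        )

# Call it after every transformation, for example:
# orders = load_orders()
# assert_expectations(orders, 'load_orders', required_columns=['customer_id', 'amount'])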
Document your assumptions:
- What do you expect the data to look like?
- What business rules should it follow?
- What are acceptable ranges for each field?
Build reusable validation functions:
- Don’t write the same data cleaning code over and over
- Create libraries of validation functions you can reuse
- Make your validations configurable for different datasets
Monitor data quality over time:
- Set up alerts when data quality drops below thresholds
- Track quality metrics in dashboards
- Have a process for investigating quality issues
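Here's a rough sketch of what the alerting piece could look like: compute a few cheap metrics on every new batch and compare them against thresholds you've agreed on. The metric names and threshold values below are made up for illustration:

def check_quality_thresholds(df, thresholds):
    """Return human-readable alerts for any metric that breaches its threshold."""
    metrics = {
        'row_count': len(df),
        'missing_pct': df.isna().mean().mean() * 100,   # average missing rate across columns
        'duplicate_pct': df.duplicated().mean() * 100,
    }

    alerts = []
    if metrics['row_count'] < thresholds.get('min_rows', 0):
        alerts.append(f"Row count {metrics['row_count']} below minimum {thresholds['min_rows']}")
    if metrics['missing_pct'] > thresholds.get('max_missing_pct', 100):
        alerts.append(f"Missing data at {metrics['missing_pct']:.1f}%")
    if metrics['duplicate_pct'] > thresholds.get('max_duplicate_pct', 100):
        alerts.append(f"Duplicates at {metrics['duplicate_pct']:.1f}%")
    return alerts

# Run this on every new batch and wire the alerts into whatever notification channel you use
alerts = check_quality_thresholds(df, {'min_rows': 1000, 'max_missing_pct': 5, 'max_duplicate_pct': 1})
for alert in alerts:
    print(f"ALERT: {alert}")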
Remember, fixing data quality issues isn’t just about making your analysis work - it’s about building trust with your stakeholders. When people know they can rely on your data, they’ll use it to make important business decisions. That’s when you know you’re adding real value as a data engineer.
Clean data is the foundation of everything else you’ll do. Master these validation techniques, and you’ll save yourself countless hours of debugging later. More importantly, you’ll deliver results that people can actually trust and act on.