Docker Spark SQL - Troubleshooting Guide

🚨 Quick Emergency Fixes

🔥 “Everything is Broken” - Nuclear Option

# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build

⚡ “Just Need to Restart” - Soft Reset

# Restart just the services
docker-compose restart
# Or restart a specific service
docker-compose restart spark-master

🐳 Docker Issues

1. Container Won’t Start

Symptom: docker-compose up fails or containers exit immediately

Common Causes & Solutions:

Port Already in Use

# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port

Insufficient Memory

# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h

Volume Mount Issues

# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/

2. Cannot Connect to Services

Symptom: “Connection refused” when accessing Spark UI or Jupyter

Solutions:

Check Container Status

# See which containers are running
docker-compose ps

# Check logs for a specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres

Network Issues

# Check if services are listening
# (netstat and ping may be missing from slim images; install them or use ss/curl instead)
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres

Firewall/Security Issues

# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall

3. Out of Disk Space

Symptom: “No space left on device”

Solutions:

# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune

4. Docker Compose Version Issues

Symptom: “version not supported” or syntax errors

Solution:

# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux) - current release assets use lowercase OS names
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml

⚡ Spark Issues

1. Spark Session Creation Fails

Symptom: Cannot connect to Spark master or session creation hangs

Common Causes & Solutions:

Master Not Running

# Check if the Spark master UI is reachable
# (from inside the Jupyter container, use http://spark-master:8080 instead of localhost)
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")

Wrong Master URL

# Try different master configurations
from pyspark.sql import SparkSession

# For local development (no separate cluster)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# From inside the Jupyter container, use the Docker service name
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine, where docker-compose publishes port 7077
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()

Memory Configuration Issues

spark = (SparkSession.builder
         .appName("SmartCityIoTPipeline")
         .master("local[*]")
         .config("spark.driver.memory", "2g")  # Reduce if needed
         .config("spark.executor.memory", "1g")  # Reduce if needed
         .config("spark.driver.maxResultSize", "1g")
         .getOrCreate())

2. Out of Memory Errors

Symptom: Java heap space or GC overhead limit exceeded

Solutions:

Increase Memory Allocation

1spark.conf.set("spark.driver.memory", "4g")
2spark.conf.set("spark.executor.memory", "2g")
3spark.conf.set("spark.driver.maxResultSize", "2g")

Optimize Data Processing

# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)

Process Data in Chunks

# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month)
    # Process monthly_data
    monthly_data.unpersist()  # Free memory

3. Slow Spark Jobs

Symptom: Jobs take a very long time or appear to hang

Solutions:

Check Spark UI for Bottlenecks

  • Open http://localhost:4040 (or 4041, 4042 if multiple sessions)
  • Look at the Jobs tab for failed/slow stages
  • Check the Executors tab for resource usage (a programmatic check is sketched below)
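
If the application UI on port 4040 is not reachable from your browser (for example, because that port is not published from the Jupyter container), similar information can be pulled from the notebook. Below is a minimal sketch using PySpark's status tracker, assuming an active spark session as in the examples above:

# Hedged sketch: list active stages without the UI (assumes an active `spark` session)
tracker = spark.sparkContext.statusTracker()

for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info:  # a stage can finish between the two calls
        print(f"Stage {stage_id} ({info.name}): "
              f"{info.numCompletedTasks}/{info.numTasks} tasks complete, "
              f"{info.numFailedTasks} failed, {info.numActiveTasks} running")

If nothing prints, no stage is currently active, which often means the job is waiting for resources; check the master UI at http://localhost:8080 for registered workers.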

Optimize Partitioning

# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)

Avoid Expensive Operations

# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
all_data = df.collect()  # BAD: loads all data to driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only a sample

Optimize Joins

# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

4. DataFrame Operations Fail

Symptom: AnalysisException or column not found errors

Solutions:

Check Schema and Column Names

# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])

Handle Null Values

# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})

Fix Data Type Issues

# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
    F.when(F.col("string_col").rlike("^[0-9.]+$"),
           F.col("string_col").cast("double")).otherwise(0))

🗄️ Database Connection Issues

1. Cannot Connect to PostgreSQL

Symptom: Connection refused or authentication failed

Solutions:

Check PostgreSQL Status

# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from host
psql -h localhost -p 5432 -U postgres -d smartcity

From Jupyter/Spark Container

# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")

Spark JDBC Connection

# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()

2. JDBC Driver Issues

Symptom: ClassNotFoundException: org.postgresql.Driver

Solutions:

Add JDBC Driver to Spark

spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()

Download Driver Manually

# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar

📊 Data Loading Issues

1. File Not Found Errors

Symptom: FileNotFoundException or path does not exist

Solutions:

Check File Paths

import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)

Volume Mount Issues

# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks

2. Schema Inference Problems

Symptom: Wrong data types or parsing errors

Solutions:

Explicit Schema Definition

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))

Handle Different Date Formats

# Try different timestamp formats
df = df.withColumn("timestamp",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))

3. Large File Loading Issues

Symptom: Out of memory when loading large files

Solutions:

Process Files in Chunks

# Spark already reads large CSVs split across partitions;
# shrink the partition size instead of loading everything at once
def process_large_csv(file_path, max_partition_bytes=64 * 1024 * 1024):
    # Smaller input partitions (default 128 MB) reduce per-task memory pressure
    spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
    return spark.read.csv(file_path, header=True, inferSchema=True)

# Or split large files manually
# split -l 100000 large_file.csv chunk_

Optimize File Format

# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")

🔧 Environment Setup Issues

1. Python Package Conflicts

Symptom: ImportError or version conflicts

Solutions:

Check Package Versions

import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")

Rebuild Jupyter Container

# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d

Manual Package Installation

# Install packages in the running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild

2. Jupyter Notebook Issues

Symptom: Kernel won’t start or crashes frequently

Solutions:

Restart Jupyter Kernel

  • In Jupyter: Kernel → Restart & Clear Output

Check Jupyter Logs

docker-compose logs jupyter

Increase Memory Limits

# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G

Clear Jupyter Cache

# Remove Jupyter config and cache (recreated on restart)
docker-compose exec jupyter rm -rf ~/.jupyter/
docker-compose restart jupyter

🚀 Performance Optimization Tips

1. Spark Configuration Tuning

# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size

# Memory tuning: set spark.memory.fraction (which replaces the deprecated
# spark.executor.memoryFraction) via SparkSession.builder.config(), not at runtime

2. Data Processing Best Practices

# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning
df.select("col1", "col2").filter("col1 > 100")  # Better than df.filter().select()

3. Memory Management

# Unpersist DataFrames when done
df.unpersist()

# Clear all cached data periodically
spark.catalog.clearCache()

# List registered tables (check individual caching with spark.catalog.isCached)
print(f"Tables: {spark.catalog.listTables()}")

🐞 Debugging Strategies

1. Enable Debug Logging

# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)

2. Inspect Data at Each Step

# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)

3. Use Explain Plans

# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")

4. Sample Data for Testing

# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)

📋 Health Check Commands

Quick System Check Script

#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -s -o /dev/null -w "%{http_code}" http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"

Python Health Check

def health_check():
    """Run comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()

🆘 When All Else Fails

Complete Environment Reset

# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py

Get Help

  1. Check GitHub Issues: Look for similar problems in the project repository
  2. Stack Overflow: Search for Spark/Docker specific errors
  3. Spark Documentation: https://spark.apache.org/docs/latest/
  4. Docker Documentation: https://docs.docker.com/

Collect Diagnostic Information

# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt

📚 Additional Resources

Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄