Docker Spark SQL - Troubleshooting Guide
🚨 Quick Emergency Fixes
🔥 “Everything is Broken” - Nuclear Option
# Stop everything and restart fresh
docker-compose down -v --remove-orphans
docker system prune -f
docker volume prune -f
docker-compose up -d --build
⚡ “Just Need to Restart” - Soft Reset
# Restart just the services
docker-compose restart

# Or restart a specific service
docker-compose restart spark-master
🐳 Docker Issues
1. Container Won’t Start
Symptom: docker-compose up fails or containers exit immediately
Common Causes & Solutions:
Port Already in Use
# Check what's using the port
lsof -i :8080  # For Spark UI
lsof -i :5432  # For PostgreSQL
lsof -i :8888  # For Jupyter

# Kill the process using the port
sudo kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "8081:8080"  # Use different host port
Insufficient Memory
# Check Docker resource allocation
docker system info | grep -i memory

# Increase Docker memory limit (Docker Desktop):
# Settings → Resources → Memory → Increase to 8GB+

# For Linux, check available memory
free -h
Volume Mount Issues
# Check if directories exist
ls -la data/
ls -la notebooks/

# Create missing directories
mkdir -p data/raw data/processed data/features
mkdir -p notebooks config sql

# Fix permissions
sudo chown -R $USER:$USER data/ notebooks/ config/
chmod -R 755 data/ notebooks/ config/
2. Cannot Connect to Services
Symptom: “Connection refused” when accessing Spark UI or Jupyter
Solutions:
Check Container Status
# See which containers are running
docker-compose ps

# Check logs for a specific service
docker-compose logs spark-master
docker-compose logs jupyter
docker-compose logs postgres
Network Issues
# Check if services are listening
docker-compose exec spark-master netstat -tlnp | grep 8080
docker-compose exec postgres netstat -tlnp | grep 5432

# Test connectivity between containers
docker-compose exec jupyter ping spark-master
docker-compose exec jupyter ping postgres
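If ping or netstat are not installed in the images, a plain TCP probe from a notebook cell works just as well; a sketch using only the standard library (host names assume the default docker-compose service names):

# Probe the other containers over TCP using the docker-compose service names
import socket

for host, port in [("spark-master", 7077), ("spark-master", 8080), ("postgres", 5432)]:
    try:
        socket.create_connection((host, port), timeout=3).close()
        print(f"{host}:{port} reachable")
    except OSError as e:
        print(f"{host}:{port} NOT reachable ({e})")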
Firewall/Security Issues
# Disable firewall temporarily (Linux)
sudo ufw disable

# For macOS, check System Preferences → Security & Privacy

# For Windows, check Windows Defender Firewall
3. Out of Disk Space
Symptom: “No space left on device”
Solutions:
# Check disk usage
df -h
docker system df

# Clean up Docker resources
docker system prune -a --volumes
docker builder prune -a

# Remove unused images
docker image prune -a

# Clean up old containers
docker container prune
4. Docker Compose Version Issues
Symptom: “version not supported” or syntax errors
Solution:
# Check Docker Compose version
docker-compose --version

# Update Docker Compose (Linux)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

# For older versions, use version 3.7 instead of 3.8 in docker-compose.yml
⚡ Spark Issues
1. Spark Session Creation Fails
Symptom: Cannot connect to Spark master, or session creation hangs
Common Causes & Solutions:
Master Not Running
# Check if the Spark master web UI is reachable
import requests

try:
    response = requests.get("http://localhost:8080", timeout=5)
    print("Spark master is running")
except requests.exceptions.RequestException:
    print("Cannot reach Spark master")
Wrong Master URL
# Try different master configurations
from pyspark.sql import SparkSession

# For local development (Spark runs inside the same process)
spark = SparkSession.builder.master("local[*]").getOrCreate()

# For the Docker cluster, from inside the Jupyter container (use the service name)
spark = SparkSession.builder.master("spark://spark-master:7077").getOrCreate()

# From the host machine, if port 7077 is published in docker-compose.yml
spark = SparkSession.builder.master("spark://localhost:7077").getOrCreate()
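To confirm which master a session actually connected to (rather than guessing from the builder call), the SparkContext exposes a few attributes; a quick sketch assuming an existing spark session:

# Inspect the running session
print("Master URL:", spark.sparkContext.master)
print("App name:  ", spark.sparkContext.appName)
print("Web UI:    ", spark.sparkContext.uiWebUrl)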
Memory Configuration Issues
spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .master("local[*]")
    .config("spark.driver.memory", "2g")       # Reduce if needed
    .config("spark.executor.memory", "1g")     # Reduce if needed
    .config("spark.driver.maxResultSize", "1g")
    .getOrCreate())
2. Out of Memory Errors
Symptom: Java heap space or GC overhead limit exceeded
Solutions:
Increase Memory Allocation
Driver and executor memory are fixed when the JVM starts, so setting them with spark.conf.set() on a running session has no effect. Stop the session and set them when it is built:

spark.stop()
spark = (SparkSession.builder
    .config("spark.driver.memory", "4g")
    .config("spark.executor.memory", "2g")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate())
Optimize Data Processing
# Use sampling for large datasets
sample_df = large_df.sample(0.1, seed=42)

# Cache frequently used DataFrames
df.cache()
df.count()  # Trigger caching

# Repartition data
df = df.repartition(4)  # Fewer partitions for small datasets

# Use coalesce to reduce partitions
df = df.coalesce(2)
Process Data in Chunks
# Process data month by month
for month in range(1, 13):
    monthly_data = df.filter(F.month("timestamp") == month).cache()
    # ... process monthly_data ...
    monthly_data.unpersist()  # Free the cached data before the next month
3. Slow Spark Jobs
Symptom: Jobs take a very long time or appear to hang
Solutions:
Check Spark UI for Bottlenecks
- Open http://localhost:4040 (or 4041, 4042 if multiple sessions)
- Look at the Jobs tab for failed/slow stages
- Check Executors tab for resource usage
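If the UI itself is unreachable (for example, the port is not published), part of the same information is available programmatically via the status tracker; a minimal sketch assuming an active spark session:

# Peek at active jobs and stages without opening the Spark UI
tracker = spark.sparkContext.statusTracker()

for job_id in tracker.getActiveJobsIds():
    job = tracker.getJobInfo(job_id)
    print(f"Job {job_id}: status={job.status}, stages={job.stageIds}")

for stage_id in tracker.getActiveStageIds():
    stage = tracker.getStageInfo(stage_id)
    if stage:
        print(f"Stage {stage_id}: {stage.numActiveTasks} active / {stage.numTasks} total tasks")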
Optimize Partitioning
# Check current partitions
print(f"Partitions: {df.rdd.getNumPartitions()}")

# Optimal partitions = 2-3x number of cores
optimal_partitions = spark.sparkContext.defaultParallelism * 2
df = df.repartition(optimal_partitions)
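Uneven partition sizes (skew) are another common cause of slow stages; counting records per partition makes this visible. A rough sketch (it materializes data, so sample first on large DataFrames):

# Count records per partition to spot skew (sampled to keep it cheap)
sizes = df.sample(0.01, seed=42).rdd.glom().map(len).collect()
if sizes:
    print(f"Partitions: {len(sizes)}, min: {min(sizes)}, max: {max(sizes)} records (sampled)")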
Avoid Expensive Operations
# Avoid repeated .count() calls
count = df.count()
print(f"Records: {count}")

# Use .cache() for DataFrames used multiple times
df.cache()

# Avoid .collect() on large datasets
# Instead of:
all_data = df.collect()  # BAD: loads all data to driver

# Use:
sample_data = df.limit(1000).collect()  # GOOD: only a sample
Optimize Joins
# Broadcast small DataFrames
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")

# Use appropriate join strategies
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
4. DataFrame Operations Fail
Symptom: AnalysisException or column not found errors
Solutions:
Check Schema and Column Names
# Print schema to see exact column names
df.printSchema()

# Show column names
print(df.columns)

# Check for case sensitivity
df.select([F.col(c) for c in df.columns if 'timestamp' in c.lower()])
Handle Null Values
# Check for nulls before operations
df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]).show()

# Drop nulls before joins
df_clean = df.na.drop(subset=['key_column'])

# Fill nulls with defaults
df_filled = df.na.fill({'numeric_col': 0, 'string_col': 'unknown'})
Fix Data Type Issues
# Cast columns to correct types
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
df = df.withColumn("numeric_col", F.col("numeric_col").cast("double"))

# Handle string/numeric conversion errors
df = df.withColumn("safe_numeric",
    F.when(F.col("string_col").rlike("^[0-9.]+$"),
           F.col("string_col").cast("double")).otherwise(0))
🗄️ Database Connection Issues
1. Cannot Connect to PostgreSQL
Symptom: Connection refused or authentication failed
Solutions:
Check PostgreSQL Status
# Check if PostgreSQL container is running
docker-compose ps postgres

# Check PostgreSQL logs
docker-compose logs postgres

# Test connection from the host
psql -h localhost -p 5432 -U postgres -d smartcity
From Jupyter/Spark Container
# Test database connection
import psycopg2

try:
    conn = psycopg2.connect(
        host="postgres",  # Use the container name, not localhost
        port=5432,
        user="postgres",
        password="password",
        database="smartcity"
    )
    print("Database connection successful")
    conn.close()
except Exception as e:
    print(f"Connection failed: {e}")
Spark JDBC Connection
# Correct JDBC URL for Docker
jdbc_url = "jdbc:postgresql://postgres:5432/smartcity"

# Test Spark database connection
test_df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "(SELECT 1 as test) as test_table") \
    .option("user", "postgres") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

test_df.show()
2. JDBC Driver Issues
Symptom: ClassNotFoundException: org.postgresql.Driver
Solutions:
Add JDBC Driver to Spark
spark = SparkSession.builder \
    .appName("SmartCityIoTPipeline") \
    .config("spark.jars.packages", "org.postgresql:postgresql:42.5.0") \
    .getOrCreate()
Download Driver Manually
# Download PostgreSQL JDBC driver
cd /opt/bitnami/spark/jars/
wget https://jdbc.postgresql.org/download/postgresql-42.5.0.jar
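If the image's jars directory is not writable, the downloaded jar can instead be handed to Spark when the session is built via spark.jars; a sketch, where the jar path is just an example location:

# Point Spark at a locally downloaded PostgreSQL driver jar (example path)
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("SmartCityIoTPipeline")
    .config("spark.jars", "/home/jovyan/work/jars/postgresql-42.5.0.jar")  # example path
    .getOrCreate())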
📊 Data Loading Issues
1. File Not Found Errors
Symptom: FileNotFoundException or path does not exist
Solutions:
Check File Paths
import os

# Check if file exists
data_file = "data/raw/traffic_sensors.csv"
print(f"File exists: {os.path.exists(data_file)}")

# List directory contents
print(os.listdir("data/raw/"))

# Use absolute paths if needed
abs_path = os.path.abspath("data/raw/traffic_sensors.csv")
df = spark.read.csv(abs_path, header=True, inferSchema=True)
Volume Mount Issues
# Check if volumes are mounted correctly
docker-compose exec jupyter ls -la /home/jovyan/work/data/

# Verify volume mounts in docker-compose.yml
volumes:
  - ./data:/home/jovyan/work/data
  - ./notebooks:/home/jovyan/work/notebooks
2. Schema Inference Problems
Symptom: Wrong data types or parsing errors
Solutions:
Explicit Schema Definition
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define explicit schema
schema = StructType([
    StructField("sensor_id", StringType(), False),
    StructField("timestamp", StringType(), False),  # Read as string first
    StructField("vehicle_count", IntegerType(), True),
    StructField("avg_speed", DoubleType(), True)
])

df = spark.read.csv("data/raw/traffic_sensors.csv",
                    header=True, schema=schema)

# Then convert timestamp
df = df.withColumn("timestamp", F.to_timestamp("timestamp"))
Handle Different Date Formats
# Try different timestamp formats
df = df.withColumn("timestamp",
    F.coalesce(
        F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"),
        F.to_timestamp("timestamp", "MM/dd/yyyy HH:mm:ss"),
        F.to_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    ))
3. Large File Loading Issues
Symptom: Out of memory when loading large files
Solutions:
Process Files in Chunks
# Spark already splits large CSVs into partitions, so the usual memory problem is not
# the read itself but schema inference (an extra full pass) and collecting results.
# Note: maxRecordsPerFile is a write option and does not limit reads.
def process_large_csv(file_path, schema):
    # Supplying an explicit schema avoids the inferSchema pass over the whole file
    return spark.read.csv(file_path, header=True, schema=schema)

# Or split large files manually before loading:
# split -l 100000 large_file.csv chunk_
Optimize File Format
# Convert to Parquet for better performance
df.write.mode("overwrite").parquet("data/processed/traffic_optimized.parquet")

# Read Parquet instead of CSV
df = spark.read.parquet("data/processed/traffic_optimized.parquet")
🔧 Environment Setup Issues
1. Python Package Conflicts
Symptom: ImportError or version conflicts
Solutions:
Check Package Versions
import sys
print(f"Python version: {sys.version}")

import pyspark
print(f"PySpark version: {pyspark.__version__}")

import pandas
print(f"Pandas version: {pandas.__version__}")
Rebuild Jupyter Container
# Rebuild with latest packages
docker-compose down
docker-compose build --no-cache jupyter
docker-compose up -d
Manual Package Installation
# Install packages in the running container
docker-compose exec jupyter pip install package_name

# Or add to requirements.txt and rebuild
2. Jupyter Notebook Issues
Symptom: Kernel won’t start or crashes frequently
Solutions:
Restart Jupyter Kernel
- In Jupyter: Kernel → Restart & Clear Output
Check Jupyter Logs
docker-compose logs jupyter
Increase Memory Limits
# In docker-compose.yml
jupyter:
  # ... other config
  deploy:
    resources:
      limits:
        memory: 4G
Clear Jupyter Cache
# Remove Jupyter settings and cache (note: this also wipes any custom Jupyter configuration)
docker-compose exec jupyter rm -rf ~/.jupyter/
docker-compose restart jupyter
🚀 Performance Optimization Tips
1. Spark Configuration Tuning
# Optimal Spark configuration for development
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Memory optimization
# Note: memory settings such as spark.memory.fraction and driver/executor memory
# must be set when the session is created; they cannot be changed at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")  # Adjust based on data size
2. Data Processing Best Practices
# Cache DataFrames used multiple times
df.cache()
df.count()  # Trigger caching

# Use appropriate file formats
# CSV (slowest) → JSON → Parquet (fastest)

# Partition data for better performance
df.write.partitionBy("year", "month").parquet("partitioned_data")

# Use column pruning: select only the columns you actually need
df.select("col1", "col2").filter("col1 > 100")
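To sanity-check the format comparison above on your own data, timing a full scan of the same dataset in both formats gives a rough (unscientific) picture; the paths below are the ones used earlier in this guide:

# Rough timing comparison: CSV vs Parquet full scan of the same data
import time

def timed_count(make_df, label):
    start = time.perf_counter()
    n = make_df().count()
    print(f"{label}: {n} rows in {time.perf_counter() - start:.1f}s")

timed_count(lambda: spark.read.csv("data/raw/traffic_sensors.csv", header=True, inferSchema=True), "CSV")
timed_count(lambda: spark.read.parquet("data/processed/traffic_optimized.parquet"), "Parquet")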
3. Memory Management
# Unpersist DataFrames when done
df.unpersist()

# Clear all cached tables/DataFrames
spark.catalog.clearCache()

# Inspect what is registered in the catalog (spark.catalog.isCached(name) reports cache status)
print(f"Catalog tables: {spark.catalog.listTables()}")
🐞 Debugging Strategies
1. Enable Debug Logging
# Set log level for debugging
spark.sparkContext.setLogLevel("DEBUG")  # Very verbose
spark.sparkContext.setLogLevel("INFO")   # Moderate
spark.sparkContext.setLogLevel("WARN")   # Minimal (default)
2. Inspect Data at Each Step
# Check DataFrame at each transformation
print(f"Step 1 - Rows: {df1.count()}, Columns: {len(df1.columns)}")
df1.show(5)

df2 = df1.filter(F.col("value") > 0)
print(f"Step 2 - Rows: {df2.count()}, Columns: {len(df2.columns)}")
df2.show(5)
3. Use Explain Plans
# See execution plan
df.explain(True)

# Check for expensive operations
df.explain("cost")
4. Sample Data for Testing
# Use small samples for development
sample_df = large_df.sample(0.01, seed=42)  # 1% sample

# Limit rows for testing
test_df = df.limit(1000)
📋 Health Check Commands
Quick System Check Script
#!/bin/bash
echo "🔍 Smart City IoT Pipeline Health Check"
echo "======================================"

echo "📋 Docker Status:"
docker --version
docker-compose --version

echo "🐳 Container Status:"
docker-compose ps

echo "💾 Disk Usage:"
df -h
docker system df

echo "🧠 Memory Usage:"
free -h

echo "🌐 Network Connectivity:"
curl -sf -o /dev/null http://localhost:8080 && echo " ✅ Spark UI accessible" || echo " ❌ Spark UI not accessible"
curl -sf -o /dev/null http://localhost:8888 && echo " ✅ Jupyter accessible" || echo " ❌ Jupyter not accessible"

echo "🗄️ Database Status:"
docker-compose exec -T postgres pg_isready -U postgres && echo " ✅ PostgreSQL ready" || echo " ❌ PostgreSQL not ready"

echo "📁 Data Files:"
ls -la data/raw/ 2>/dev/null && echo " ✅ Raw data found" || echo " ❌ Raw data missing"
Python Health Check
def health_check():
    """Run a comprehensive health check"""
    checks = {
        "spark_session": False,
        "database_connection": False,
        "data_files": False,
        "memory_usage": False
    }

    # Check Spark session
    try:
        spark.sparkContext.statusTracker()
        checks["spark_session"] = True
        print("✅ Spark session healthy")
    except Exception:
        print("❌ Spark session issues")

    # Check database
    try:
        test_df = spark.read.format("jdbc") \
            .option("url", "jdbc:postgresql://postgres:5432/smartcity") \
            .option("dbtable", "(SELECT 1) as test") \
            .option("user", "postgres") \
            .option("password", "password") \
            .option("driver", "org.postgresql.Driver") \
            .load()
        test_df.count()
        checks["database_connection"] = True
        print("✅ Database connection healthy")
    except Exception as e:
        print(f"❌ Database issues: {e}")

    # Check data files
    try:
        import os
        required_files = [
            "data/raw/traffic_sensors.csv",
            "data/raw/air_quality.json",
            "data/raw/weather_data.parquet"
        ]

        missing_files = [f for f in required_files if not os.path.exists(f)]
        if not missing_files:
            checks["data_files"] = True
            print("✅ All data files present")
        else:
            print(f"❌ Missing files: {missing_files}")
    except Exception as e:
        print(f"❌ File check failed: {e}")

    # Check memory usage
    try:
        import psutil
        memory_percent = psutil.virtual_memory().percent
        if memory_percent < 80:
            checks["memory_usage"] = True
            print(f"✅ Memory usage OK: {memory_percent:.1f}%")
        else:
            print(f"⚠️ High memory usage: {memory_percent:.1f}%")
    except Exception:
        print("❓ Cannot check memory usage")

    overall_health = sum(checks.values()) / len(checks) * 100
    print(f"\n📊 Overall System Health: {overall_health:.1f}%")

    return checks

# Run health check
health_status = health_check()
🆘 When All Else Fails
Complete Environment Reset
# Nuclear option - complete reset
docker-compose down -v --remove-orphans
docker system prune -a --volumes
docker builder prune -a

# Remove all project data (CAUTION!)
rm -rf data/processed/* data/features/*

# Rebuild everything
docker-compose build --no-cache
docker-compose up -d

# Regenerate sample data
python scripts/generate_data.py
Get Help
- Check GitHub Issues: Look for similar problems in the project repository
- Stack Overflow: Search for Spark/Docker specific errors
- Spark Documentation: https://spark.apache.org/docs/latest/
- Docker Documentation: https://docs.docker.com/
Collect Diagnostic Information
# Gather system information for help requests
echo "System Information:" > diagnostic_info.txt
uname -a >> diagnostic_info.txt
docker --version >> diagnostic_info.txt
docker-compose --version >> diagnostic_info.txt
python --version >> diagnostic_info.txt

echo "Container Status:" >> diagnostic_info.txt
docker-compose ps >> diagnostic_info.txt

echo "Container Logs:" >> diagnostic_info.txt
docker-compose logs --tail=50 >> diagnostic_info.txt

echo "Disk Usage:" >> diagnostic_info.txt
df -h >> diagnostic_info.txt
docker system df >> diagnostic_info.txt
📚 Additional Resources
- Spark Tuning Guide: https://spark.apache.org/docs/latest/tuning.html
- Docker Best Practices: https://docs.docker.com/develop/best-practices/
- PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/
- PostgreSQL Docker Guide: https://hub.docker.com/_/postgres
Remember: Most issues are environment-related. When in doubt, restart containers and check logs! 🔄