Your First Real Pipeline: Building a Timecard Processor
Alright, so you’ve read about data pipelines and you’re thinking “this sounds cool, but show me some actual code!” Well, buckle up because we’re about to dive into a real pipeline that you can run on your own computer. It’s simple, practical, and does exactly what every data pipeline does: takes messy data and makes it useful.
The Problem We’re Solving
Let’s say you’re working at a small company and every week, the HR department gives you a CSV file with employee timecard data. But here’s the thing - they need this data processed and converted to JSON format so their payroll system can read it. Plus, they want the gross pay calculated for each employee.
Sound familiar? This is exactly the kind of real-world problem data engineers solve every day.
The Data We’re Working With
Here’s our input CSV file (`timecard_data.csv`):
```csv
name,hours_worked,hourly_rate
Alice,40,25
Bob,35,20
Charlie,30,30
```
Simple enough, right? Three employees, their hours worked, and their hourly rates. But the payroll system needs this data in JSON format with the gross pay calculated. That’s where our pipeline comes in.
The Pipeline Code
Here’s the complete pipeline code from the PipelineOne repository:
```python
import csv
import json

# Define the input CSV file and output JSON file
csv_file = 'timecard_data.csv'
json_file = 'timecard_data.json'

# Read the CSV file and calculate the gross pay for each employee
timecard_data = []
with open(csv_file, mode='r') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        name = row['name']
        hours_worked = float(row['hours_worked'])
        hourly_rate = float(row['hourly_rate'])
        gross_pay = hours_worked * hourly_rate
        timecard_data.append({
            'name': name,
            'hours_worked': hours_worked,
            'hourly_rate': hourly_rate,
            'gross_pay': gross_pay
        })

# Write the data to a JSON file
with open(json_file, mode='w') as file:
    json.dump(timecard_data, file, indent=4)

print("Data has been successfully written to", json_file)
```
Let’s Break This Down Step by Step
Step 1: The Imports and Setup
```python
import csv
import json

csv_file = 'timecard_data.csv'
json_file = 'timecard_data.json'
```
First, we import the libraries we need. Python’s `csv` module helps us read CSV files easily, and `json` helps us write JSON files. Then we define our input and output file names. This is good practice - if you need to change file names later, you only have to do it in one place.
Step 2: Extract and Transform (The Heart of the Pipeline)
```python
timecard_data = []
with open(csv_file, mode='r') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        name = row['name']
        hours_worked = float(row['hours_worked'])
        hourly_rate = float(row['hourly_rate'])
        gross_pay = hours_worked * hourly_rate
        timecard_data.append({
            'name': name,
            'hours_worked': hours_worked,
            'hourly_rate': hourly_rate,
            'gross_pay': gross_pay
        })
```
This is where the magic happens! Let’s trace through what happens for each employee:
- Extract: We read each row from the CSV file
- Transform: We convert the string values to floats (because math needs numbers, not text)
- Calculate: We compute the gross pay by multiplying hours by rate
- Structure: We create a dictionary with all the employee data, including the calculated gross pay
For Alice, this transforms:
- Input: `"Alice", "40", "25"`
- Output: `{"name": "Alice", "hours_worked": 40.0, "hourly_rate": 25.0, "gross_pay": 1000.0}`
Step 3: Load (Save the Results)
```python
with open(json_file, mode='w') as file:
    json.dump(timecard_data, file, indent=4)

print("Data has been successfully written to", json_file)
```
Finally, we save our processed data to a JSON file. The `indent=4` makes it pretty and readable. Always be nice to the humans who might need to look at your output files!
The Final Output
When you run this pipeline, you get this beautiful JSON file (`timecard_data.json`):
```json
[
    {
        "name": "Alice",
        "hours_worked": 40.0,
        "hourly_rate": 25.0,
        "gross_pay": 1000.0
    },
    {
        "name": "Bob",
        "hours_worked": 35.0,
        "hourly_rate": 20.0,
        "gross_pay": 700.0
    },
    {
        "name": "Charlie",
        "hours_worked": 30.0,
        "hourly_rate": 30.0,
        "gross_pay": 900.0
    }
]
```
Look at that! We’ve successfully:
- ✅ Extracted data from a CSV file
- ✅ Transformed it by calculating gross pay
- ✅ Loaded it into a JSON format the payroll system can use
Why This Pipeline Works
This might seem simple, but it demonstrates all the core principles of good data engineering:
- Clear Input/Output: We know exactly what goes in and what comes out
- Data Transformation: We’re not just moving data; we’re adding value by calculating gross pay
- Format Conversion: We’re solving a real integration problem (CSV to JSON)
- Readable Code: Another developer can understand this in 30 seconds
- Reliable: It does the same thing every time you run it
Try It Yourself
Want to run this pipeline? Here’s what you need to do:
- Clone the repository: `git clone https://github.com/ZCW-Summer25/PipelineOne.git`
- Navigate to the directory: `cd PipelineOne`
- Run the pipeline: `python pipe1.py`
- Check the output: `cat timecard_data.json`
Making It Even Better
This pipeline is great for learning, but in the real world, you’d want to add:
- Error handling: What if the CSV file doesn’t exist?
- Data validation: What if someone enters negative hours?
- Logging: Track when the pipeline runs and if it succeeds
- Configuration: Make it easy to change input/output file names
- Testing: Ensure it works with different data
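To make those improvements concrete, here’s one way you might sketch a hardened version of the pipeline. This is just an illustration, not code from the PipelineOne repo - the `process_timecard` function name, the logging setup, and the validation rules (skip rows with negative or non-numeric values) are all our own choices:

```python
import csv
import json
import logging
import sys

# Basic logging so we can see when the pipeline runs and what it skipped
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def process_timecard(csv_file, json_file):
    """Read timecard rows, compute gross pay, write JSON.

    Returns the number of valid rows processed. Bad rows are
    logged and skipped instead of crashing the whole run.
    """
    timecard_data = []
    try:
        with open(csv_file, mode='r', newline='') as f:
            # Data starts on line 2 (line 1 is the header)
            for line_num, row in enumerate(csv.DictReader(f), start=2):
                try:
                    hours_worked = float(row['hours_worked'])
                    hourly_rate = float(row['hourly_rate'])
                except (KeyError, ValueError) as exc:
                    logging.warning("Skipping line %d: %s", line_num, exc)
                    continue
                if hours_worked < 0 or hourly_rate < 0:
                    logging.warning("Skipping line %d: negative value", line_num)
                    continue
                timecard_data.append({
                    'name': row['name'],
                    'hours_worked': hours_worked,
                    'hourly_rate': hourly_rate,
                    'gross_pay': hours_worked * hourly_rate,
                })
    except FileNotFoundError:
        logging.error("Input file not found: %s", csv_file)
        sys.exit(1)

    with open(json_file, mode='w') as f:
        json.dump(timecard_data, f, indent=4)
    logging.info("Wrote %d records to %s", len(timecard_data), json_file)
    return len(timecard_data)
```

Because the file names are now parameters, the same function works for any input/output pair - you could wire it up to `argparse` or a config file without touching the processing logic.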
The Big Picture
This little 20-line script is doing exactly what billion-dollar companies do with their data - just at a smaller scale. You’re extracting data from one system, transforming it to add value, and loading it into another system. That’s the essence of data engineering!
Next time someone asks you “what’s a data pipeline?” you can show them this code and say “it’s like this, but bigger.” Because honestly, that’s exactly what it is.
Ready to build something more complex? Let’s keep going!