Your First Real Pipeline: Building a Timecard Processor
Alright, so you’ve read about data pipelines and you’re thinking “this sounds cool, but show me some actual code!” Well, buckle up because we’re about to dive into a real pipeline that you can run on your own computer. It’s simple, practical, and does exactly what every data pipeline does: takes messy data and makes it useful.
The Problem We’re Solving
Let’s say you’re working at a small company and every week, the HR department gives you a CSV file with employee timecard data. But here’s the thing - they need this data processed and converted to JSON format so their payroll system can read it. Plus, they want the gross pay calculated for each employee.
Sound familiar? This is exactly the kind of real-world problem data engineers solve every day.
The Data We’re Working With
Here’s our input CSV file (`timecard_data.csv`):
```csv
name,hours_worked,hourly_rate
Alice,40,25
Bob,35,20
Charlie,30,30
```
Simple enough, right? Three employees, their hours worked, and their hourly rates. But the payroll system needs this data in JSON format with the gross pay calculated. That’s where our pipeline comes in.
The Pipeline Code
Here’s the complete pipeline code from the PipelineOne repository:
```python
import csv
import json

# Define the input CSV file and output JSON file
csv_file = 'timecard_data.csv'
json_file = 'timecard_data.json'

# Read the CSV file and calculate the gross pay for each employee
timecard_data = []
with open(csv_file, mode='r') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        name = row['name']
        hours_worked = float(row['hours_worked'])
        hourly_rate = float(row['hourly_rate'])
        gross_pay = hours_worked * hourly_rate
        timecard_data.append({
            'name': name,
            'hours_worked': hours_worked,
            'hourly_rate': hourly_rate,
            'gross_pay': gross_pay
        })

# Write the data to a JSON file
with open(json_file, mode='w') as file:
    json.dump(timecard_data, file, indent=4)

print("Data has been successfully written to", json_file)
```
Let’s Break This Down Step by Step
Step 1: The Imports and Setup
```python
import csv
import json

csv_file = 'timecard_data.csv'
json_file = 'timecard_data.json'
```
First, we import the libraries we need. Python’s `csv` module helps us read CSV files easily, and `json` helps us write JSON files. Then we define our input and output file names. This is good practice - if you need to change file names later, you only have to do it in one place.
Step 2: Extract and Transform (The Heart of the Pipeline)
```python
timecard_data = []
with open(csv_file, mode='r') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        name = row['name']
        hours_worked = float(row['hours_worked'])
        hourly_rate = float(row['hourly_rate'])
        gross_pay = hours_worked * hourly_rate
        timecard_data.append({
            'name': name,
            'hours_worked': hours_worked,
            'hourly_rate': hourly_rate,
            'gross_pay': gross_pay
        })
```
This is where the magic happens! Let’s trace through what happens for each employee:
- Extract: We read each row from the CSV file
- Transform: We convert the string values to floats (because math needs numbers, not text)
- Calculate: We compute the gross pay by multiplying hours by rate
- Structure: We create a dictionary with all the employee data, including the calculated gross pay
For Alice, this transforms:
- Input: `"Alice", "40", "25"`
- Output: `{"name": "Alice", "hours_worked": 40.0, "hourly_rate": 25.0, "gross_pay": 1000.0}`
Step 3: Load (Save the Results)
```python
with open(json_file, mode='w') as file:
    json.dump(timecard_data, file, indent=4)

print("Data has been successfully written to", json_file)
```
Finally, we save our processed data to a JSON file. The `indent=4` makes it pretty and readable. Always be nice to the humans who might need to look at your output files!
The Final Output
When you run this pipeline, you get this beautiful JSON file (`timecard_data.json`):
```json
[
    {
        "name": "Alice",
        "hours_worked": 40.0,
        "hourly_rate": 25.0,
        "gross_pay": 1000.0
    },
    {
        "name": "Bob",
        "hours_worked": 35.0,
        "hourly_rate": 20.0,
        "gross_pay": 700.0
    },
    {
        "name": "Charlie",
        "hours_worked": 30.0,
        "hourly_rate": 30.0,
        "gross_pay": 900.0
    }
]
```
Look at that! We’ve successfully:
- ✅ Extracted data from a CSV file
- ✅ Transformed it by calculating gross pay
- ✅ Loaded it into a JSON format the payroll system can use
Why This Pipeline Works
This might seem simple, but it demonstrates all the core principles of good data engineering:
- Clear Input/Output: We know exactly what goes in and what comes out
- Data Transformation: We’re not just moving data; we’re adding value by calculating gross pay
- Format Conversion: We’re solving a real integration problem (CSV to JSON)
- Readable Code: Another developer can understand this in 30 seconds
- Reliable: It does the same thing every time you run it
Try It Yourself
Want to run this pipeline? Here’s what you need to do:
- Clone the repository: `git clone https://github.com/ZCW-Summer25/PipelineOne.git`
- Navigate to the directory: `cd PipelineOne`
- Run the pipeline: `python pipe1.py`
- Check the output: `cat timecard_data.json`
Making It Even Better
This pipeline is great for learning, but in the real world, you’d want to add:
- Error handling: What if the CSV file doesn’t exist?
- Data validation: What if someone enters negative hours?
- Logging: Track when the pipeline runs and if it succeeds
- Configuration: Make it easy to change input/output file names
- Testing: Ensure it works with different data
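To make those improvements concrete, here’s one way you might sketch a hardened version of the pipeline. This is just an illustration, not code from the PipelineOne repo - the `process_timecard` function name, the logging setup, and the validation rules (skip rows with negative or non-numeric values) are all our own choices:

```python
import csv
import json
import logging
import sys

# Basic logging so we can see when the pipeline runs and what it skipped
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def process_timecard(csv_file, json_file):
    """Read timecard rows, compute gross pay, write JSON.

    Returns the number of valid rows processed. Bad rows are
    logged and skipped instead of crashing the whole run.
    """
    timecard_data = []
    try:
        with open(csv_file, mode='r', newline='') as f:
            # Data starts on line 2 (line 1 is the header)
            for line_num, row in enumerate(csv.DictReader(f), start=2):
                try:
                    hours_worked = float(row['hours_worked'])
                    hourly_rate = float(row['hourly_rate'])
                except (KeyError, ValueError) as exc:
                    logging.warning("Skipping line %d: %s", line_num, exc)
                    continue
                if hours_worked < 0 or hourly_rate < 0:
                    logging.warning("Skipping line %d: negative value", line_num)
                    continue
                timecard_data.append({
                    'name': row['name'],
                    'hours_worked': hours_worked,
                    'hourly_rate': hourly_rate,
                    'gross_pay': hours_worked * hourly_rate,
                })
    except FileNotFoundError:
        logging.error("Input file not found: %s", csv_file)
        sys.exit(1)

    with open(json_file, mode='w') as f:
        json.dump(timecard_data, f, indent=4)
    logging.info("Wrote %d records to %s", len(timecard_data), json_file)
    return len(timecard_data)
```

Because the file names are now parameters, the same function works for any input/output pair - you could wire it up to `argparse` or a config file without touching the processing logic.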
The Big Picture
This little 20-line script is doing exactly what billion-dollar companies do with their data - just at a smaller scale. You’re extracting data from one system, transforming it to add value, and loading it into another system. That’s the essence of data engineering!
Next time someone asks you “what’s a data pipeline?” you can show them this code and say “it’s like this, but bigger.” Because honestly, that’s exactly what it is.
Ready to build something more complex? Let’s keep going!