CSV & JSON Handling in Python

Python provides powerful built-in libraries for working with CSV and JSON data, two of the most common data exchange formats. The csv module handles comma-separated values files, while the json module deals with JavaScript Object Notation files and strings.

CSV Handling

Reading CSV Files

The csv module provides functions to read from and write to CSV files.

import csv

# Reading a CSV file using csv.reader
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)
    # Read the header row first so the loop below sees only data rows
    header = next(csv_reader)
    print(f"Column names: {header}")
    
    # Process each row
    for row in csv_reader:
        print(row)  # row is a list of strings

# Reading a CSV file using csv.DictReader (maps rows to dictionaries)
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        print(row)  # row is a dictionary keyed by the column names from the header row
        print(f"Name: {row['Name']}, Age: {row['Age']}")  # assumes 'Name' and 'Age' columns

Writing CSV Files

You can write data to CSV files using csv.writer or csv.DictWriter.

import csv

# Writing to a CSV file using csv.writer
data = [
    ['Name', 'Age', 'City'],  # Header row
    ['Alice', '30', 'New York'],
    ['Bob', '25', 'Los Angeles'],
    ['Charlie', '35', 'Chicago']
]

with open('output.csv', 'w', newline='') as file:
    csv_writer = csv.writer(file)
    # Write all rows at once
    csv_writer.writerows(data)
    
    # Or write row by row
    # for row in data:
    #     csv_writer.writerow(row)

# Writing to a CSV file using csv.DictWriter
data = [
    {'Name': 'Alice', 'Age': 30, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 25, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

with open('output_dict.csv', 'w', newline='') as file:
    fieldnames = ['Name', 'Age', 'City']
    csv_writer = csv.DictWriter(file, fieldnames=fieldnames)
    
    # Write the header row
    csv_writer.writeheader()
    
    # Write all rows at once
    csv_writer.writerows(data)
    
    # Or write row by row
    # for row in data:
    #     csv_writer.writerow(row)

CSV Dialects and Formatting

CSV files can have different delimiters, quote characters, and other formatting conventions. The csv module allows you to customize these through dialects or format parameters.

import csv

# Reading a TSV (tab-separated values) file
with open('data.tsv', 'r', newline='') as file:
    tsv_reader = csv.reader(file, delimiter='\t')
    for row in tsv_reader:
        print(row)

# Creating a custom dialect
csv.register_dialect('pipes', delimiter='|', quoting=csv.QUOTE_NONE, escapechar='\\')

# Using the custom dialect
with open('data_pipes.csv', 'r', newline='') as file:
    pipes_reader = csv.reader(file, dialect='pipes')
    for row in pipes_reader:
        print(row)

# Customizing format parameters directly
with open('custom.csv', 'w', newline='') as file:
    csv_writer = csv.writer(
        file,
        delimiter=',',
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        escapechar='\\',
        lineterminator='\n'
    )
    csv_writer.writerow(['Field with, comma', 'Normal field', 'Field with "quotes"'])
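When the delimiter is not known ahead of time, the csv module's Sniffer class can often detect it from a sample of the data. A minimal sketch (detection is heuristic and can fail on ambiguous input, so real code should be prepared to catch csv.Error):

```python
import csv
import io

# A sample whose delimiter we pretend not to know
sample = "Name;Age;City\nAlice;30;New York\nBob;25;Los Angeles\n"

# Sniffer inspects the sample text and guesses a dialect
dialect = csv.Sniffer().sniff(sample)
print(f"Detected delimiter: {dialect.delimiter!r}")

# Parse the data with the detected dialect
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[0])
```

Sniffer also exposes has_header(sample), which guesses whether the first row is a header.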

Working with CSV in Pandas

For more advanced CSV handling, the Pandas library offers a powerful alternative with its DataFrame class.

import pandas as pd

# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head())  # Display the first 5 rows

# Selecting specific columns
subset = df[['Name', 'Age']]

# Filtering rows
filtered = df[df['Age'] > 30]

# Working with the data
average_age = df['Age'].mean()
print(f"Average age: {average_age:.2f}")

# Writing a DataFrame to a CSV file
filtered.to_csv('filtered_data.csv', index=False)

JSON Handling

Reading and Parsing JSON

The json module provides functions to parse JSON strings into Python objects and vice versa.

import json

# Reading from a JSON file
with open('data.json', 'r') as file:
    data = json.load(file)
    print(data)  # Python dictionary or list

# Parsing a JSON string
json_string = '{"name": "Alice", "age": 30, "city": "New York"}'
parsed_data = json.loads(json_string)
print(parsed_data['name'])  # Accessing dictionary keys
print(parsed_data.get('country', 'Unknown'))  # Using get() with a default value

Writing JSON

You can convert Python objects to JSON strings or write them directly to files.

import json

# Converting Python object to JSON string
data = {
    'name': 'Alice',
    'age': 30,
    'is_student': False,
    'courses': ['Python', 'Data Science', 'Web Development'],
    'address': {
        'city': 'New York',
        'zip': '10001'
    }
}

json_string = json.dumps(data)
print(json_string)

# Pretty printing with indentation
pretty_json = json.dumps(data, indent=4, sort_keys=True)
print(pretty_json)

# Writing to a JSON file
with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)

Custom JSON Encoding and Decoding

For objects that are not directly serializable to JSON, you can create custom encoders and decoders.

import json
from datetime import datetime

# Custom JSON Encoder
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        elif hasattr(obj, 'to_json'):
            return obj.to_json()
        else:
            return super().default(obj)

# Class with a to_json method
class Person:
    def __init__(self, name, age, birth_date):
        self.name = name
        self.age = age
        self.birth_date = birth_date
        
    def to_json(self):
        return {
            'name': self.name,
            'age': self.age,
            'birth_date': self.birth_date
        }

# Using the custom encoder
data = {
    'current_time': datetime.now(),
    'person': Person('Alice', 30, datetime(1992, 5, 12))
}

json_string = json.dumps(data, cls=CustomEncoder, indent=4)
print(json_string)

# Custom JSON Decoder function
def decode_datetime(dct):
    for key, value in dct.items():
        if key.endswith('_date') and isinstance(value, str):
            try:
                dct[key] = datetime.fromisoformat(value)
            except ValueError:
                pass  # Not a valid datetime string
    return dct

# Using the custom decoder
json_data = '{"name": "Bob", "birth_date": "1995-08-20T00:00:00"}'
parsed_data = json.loads(json_data, object_hook=decode_datetime)
print(parsed_data)
print(type(parsed_data['birth_date']))
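For one-off serialization, a full encoder subclass can be overkill: json.dumps also accepts a default callable that is invoked for any object json cannot serialize natively. Passing str coerces such values to their string form (a quick sketch; unlike the custom encoder above, this loses the distinction between types):

```python
import json
from datetime import datetime

data = {'event': 'signup', 'timestamp': datetime(2024, 1, 15, 9, 30)}

# default=str is called only for objects json can't serialize natively
json_string = json.dumps(data, default=str)
print(json_string)
```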

JSON with Pandas

Pandas provides convenient methods for working with JSON data as well.

import pandas as pd
import json

# Reading JSON into a DataFrame
df = pd.read_json('data.json')
print(df.head())

# Reading from a JSON string
json_string = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df = pd.read_json(json_string)
print(df)

# Reading nested JSON
nested_json = '''
{
    "people": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25}
    ],
    "location": {
        "city": "New York",
        "country": "USA"
    }
}
'''
df = pd.json_normalize(json.loads(nested_json), record_path=['people'])
print(df)

# Converting DataFrame to JSON
json_result = df.to_json(orient='records')
print(json_result)

# Writing DataFrame to a JSON file
df.to_json('output_pandas.json', orient='records', indent=4)

Common Tasks and Patterns

Converting Between CSV and JSON

You can convert data between CSV and JSON formats using both standard libraries and Pandas.

import csv
import json
import pandas as pd

# CSV to JSON using standard libraries
def csv_to_json(csv_file, json_file):
    data = []
    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            data.append(row)
    
    with open(json_file, 'w') as file:
        json.dump(data, file, indent=4)

# JSON to CSV using standard libraries
def json_to_csv(json_file, csv_file):
    with open(json_file, 'r') as file:
        data = json.load(file)
    
    # Assume data is a list of dictionaries with the same keys
    if data and isinstance(data, list):
        fieldnames = data[0].keys()
        
        with open(csv_file, 'w', newline='') as file:
            csv_writer = csv.DictWriter(file, fieldnames=fieldnames)
            csv_writer.writeheader()
            csv_writer.writerows(data)

# CSV to JSON using Pandas
def csv_to_json_pandas(csv_file, json_file):
    df = pd.read_csv(csv_file)
    df.to_json(json_file, orient='records', indent=4)

# JSON to CSV using Pandas
def json_to_csv_pandas(json_file, csv_file):
    df = pd.read_json(json_file)
    df.to_csv(csv_file, index=False)

Data Validation and Cleaning

When working with CSV and JSON data, you often need to validate and clean the data before processing it.

import pandas as pd
import numpy as np
import json
from datetime import datetime

# Data cleaning with Pandas
def clean_csv_data(csv_file):
    df = pd.read_csv(csv_file)
    
    # Handle missing values
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Name'] = df['Name'].fillna('Unknown')
    
    # Convert data types
    df['Age'] = df['Age'].astype(int)
    
    # Remove duplicates
    df = df.drop_duplicates()
    
    # Custom transformations
    df['Name'] = df['Name'].str.title()  # Capitalize names
    
    return df

# JSON data validation
def validate_json_data(json_data):
    required_fields = ['name', 'email', 'age']
    errors = []
    
    # Check if it's a valid JSON
    try:
        if isinstance(json_data, str):
            data = json.loads(json_data)
        else:
            data = json_data
    except json.JSONDecodeError:
        return ["Invalid JSON format"]
    
    # Validate each record
    if isinstance(data, list):
        for i, record in enumerate(data):
            # Check required fields
            for field in required_fields:
                if field not in record:
                    errors.append(f"Record {i}: Missing required field '{field}'")
            
            # Validate field types
            if 'age' in record and not isinstance(record['age'], (int, float)):
                errors.append(f"Record {i}: 'age' must be a number")
            
            if 'email' in record and '@' not in record['email']:
                errors.append(f"Record {i}: Invalid email format")
    else:
        errors.append("JSON data should be a list of records")
    
    return errors

Working with Large Files

When dealing with large CSV or JSON files, memory-efficient approaches are necessary.

import csv
import json
import pandas as pd

# Process a large CSV file without loading it all into memory
def process_large_csv(csv_file):
    total_rows = 0
    age_sum = 0
    
    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            total_rows += 1
            age_sum += int(row.get('Age') or 0)  # treat missing or empty Age as 0
            
            # Process one row at a time
            if total_rows % 10000 == 0:
                print(f"Processed {total_rows} rows...")
    
    if total_rows > 0:
        avg_age = age_sum / total_rows
        print(f"Average age: {avg_age:.2f}")
    
    return total_rows

# Process a large CSV file with Pandas in chunks
def process_csv_in_chunks(csv_file, chunk_size=10000):
    chunk_iter = pd.read_csv(csv_file, chunksize=chunk_size)
    
    total_rows = 0
    age_sum = 0
    
    for chunk in chunk_iter:
        total_rows += len(chunk)
        age_sum += chunk['Age'].sum()
        
        # Process each chunk
        print(f"Processed chunk, total rows so far: {total_rows}")
    
    if total_rows > 0:
        avg_age = age_sum / total_rows
        print(f"Average age: {avg_age:.2f}")
    
    return total_rows

# Process a large JSON file line by line (JSON Lines format)
def process_jsonl_file(jsonl_file):
    total_records = 0
    
    with open(jsonl_file, 'r') as file:
        for line in file:
            # Parse each line as a separate JSON object
            record = json.loads(line)
            
            # Process the record
            total_records += 1
            
            if total_records % 10000 == 0:
                print(f"Processed {total_records} records...")
    
    return total_records
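Writing JSON Lines is just as simple as reading it: serialize each record and append a newline, with no surrounding array. A sketch using an in-memory buffer in place of a real file:

```python
import json
import io

records = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25},
]

# In a real program this would be open('output.jsonl', 'w')
buffer = io.StringIO()
for record in records:
    # One JSON object per line
    buffer.write(json.dumps(record) + '\n')

lines = buffer.getvalue().splitlines()
print(lines)
```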

Best Practices

CSV Best Practices

  • Always use the newline='' parameter when opening CSV files
  • Use csv.DictReader and csv.DictWriter for better code readability
  • Define explicit field names for DictWriter
  • Create custom dialects for non-standard CSV formats
  • Use Pandas for complex CSV manipulations
  • Process large files in chunks to avoid memory issues
  • Always validate and clean data before processing
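Several of these practices combine naturally. For instance, DictWriter's restval and extrasaction parameters make writing tolerant of records with missing or extra keys, which pairs well with explicit field names (a sketch with an in-memory buffer standing in for a file):

```python
import csv
import io

fieldnames = ['Name', 'Age', 'City']
records = [
    {'Name': 'Alice', 'Age': 30, 'City': 'New York'},
    {'Name': 'Bob'},                                                  # missing Age and City
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago', 'Extra': 'x'},  # unexpected key
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldnames,
    restval='N/A',            # value written for missing keys
    extrasaction='ignore',    # silently drop keys not in fieldnames
)
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Without extrasaction='ignore', the 'Extra' key would raise a ValueError.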

JSON Best Practices

  • Use pretty-printing (indent parameter) for human-readable JSON
  • Create custom encoders for non-serializable types
  • Use json.loads() and json.dumps() for strings, json.load() and json.dump() for files
  • Validate JSON data structure before processing
  • For large datasets, consider using JSON Lines format (one JSON object per line)
  • Use json.tool module for command-line JSON formatting
  • Create schema validators for complex JSON structures
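As a sketch of the last point, a minimal type-based validator can be driven by a dict mapping field names to expected types (USER_SCHEMA here is a hypothetical schema; production code typically uses a dedicated library such as jsonschema):

```python
import json

# Hypothetical schema: field name -> expected Python type
USER_SCHEMA = {'name': str, 'age': int, 'email': str}

def validate_against_schema(record, schema):
    """Return a list of error messages; an empty list means the record is valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"Missing field '{field}'")
        elif not isinstance(record[field], expected_type):
            errors.append(f"Field '{field}' should be {expected_type.__name__}")
    return errors

record = json.loads('{"name": "Alice", "age": "thirty"}')
print(validate_against_schema(record, USER_SCHEMA))
```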

Security Considerations

  • Validate all input data, especially when reading from external sources
  • Be cautious with the eval() function when parsing data
  • Use json.loads() instead of eval() for JSON parsing
  • Be aware of potential CSV injection attacks in web applications
  • Sanitize data before writing to CSV, especially fields that might contain delimiters or quotes
  • When using JSON from untrusted sources, consider limiting the allowed object size
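As a sketch of the sanitization point: spreadsheet applications may interpret cells beginning with =, +, - or @ as formulas, and a common mitigation is to prefix such values with a single quote before writing. This is only one possible policy (it also affects legitimate values like negative numbers), so the right choice depends on who consumes the file:

```python
import csv
import io

FORMULA_PREFIXES = ('=', '+', '-', '@')

def sanitize_cell(value):
    """Neutralize values a spreadsheet might treat as a formula."""
    text = str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text
    return text

rows = [['Name', 'Note'], ['Alice', '=HYPERLINK("http://evil.example")']]

buffer = io.StringIO()  # stands in for a real output file
writer = csv.writer(buffer)
for row in rows:
    writer.writerow([sanitize_cell(cell) for cell in row])
print(buffer.getvalue())
```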

Practice Exercises

Try These:

  1. Create a CSV file with sample student data and read it using both csv.reader and csv.DictReader.
  2. Convert a CSV file to JSON format and back, ensuring the data integrity is maintained.
  3. Create a function that can merge multiple CSV files with the same structure into a single file.
  4. Build a JSON validator for a specific schema, like a user profile or a product catalog.
  5. Create a program that reads a large CSV file in chunks and performs statistical analysis on the data.
  6. Build a custom encoder and decoder for a JSON format that includes datetime objects and custom classes.