CSV & JSON Handling in Python
Python provides powerful built-in libraries for working with CSV and JSON data, two of the most common data exchange formats. The `csv` module handles comma-separated values files, while the `json` module deals with JavaScript Object Notation files and strings.
CSV Handling
Reading CSV Files
The `csv` module provides functions to read from and write to CSV files.
```python
import csv

# Reading a CSV file using csv.reader
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.reader(file)

    # Skip the header row
    header = next(csv_reader)
    print(f"Column names: {header}")

    # Process each row
    for row in csv_reader:
        print(row)  # row is a list of strings

# Reading a CSV file using csv.DictReader (maps rows to dictionaries)
with open('data.csv', 'r', newline='') as file:
    csv_reader = csv.DictReader(file)
    for row in csv_reader:
        print(row)  # row is a dictionary with column names as keys
        print(f"Name: {row['name']}, Age: {row['age']}")
```
Writing CSV Files
You can write data to CSV files using `csv.writer` or `csv.DictWriter`.
```python
import csv

# Writing to a CSV file using csv.writer
data = [
    ['Name', 'Age', 'City'],  # Header row
    ['Alice', '30', 'New York'],
    ['Bob', '25', 'Los Angeles'],
    ['Charlie', '35', 'Chicago']
]

with open('output.csv', 'w', newline='') as file:
    csv_writer = csv.writer(file)

    # Write all rows at once
    csv_writer.writerows(data)

    # Or write row by row
    # for row in data:
    #     csv_writer.writerow(row)

# Writing to a CSV file using csv.DictWriter
data = [
    {'Name': 'Alice', 'Age': 30, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 25, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

with open('output_dict.csv', 'w', newline='') as file:
    fieldnames = ['Name', 'Age', 'City']
    csv_writer = csv.DictWriter(file, fieldnames=fieldnames)

    # Write the header row
    csv_writer.writeheader()

    # Write all rows at once
    csv_writer.writerows(data)

    # Or write row by row
    # for row in data:
    #     csv_writer.writerow(row)
```
CSV Dialects and Formatting
CSV files can have different delimiters, quote characters, and other formatting conventions.
The `csv` module allows you to customize these through dialects or format parameters.
```python
import csv

# Reading a TSV (tab-separated values) file
with open('data.tsv', 'r', newline='') as file:
    tsv_reader = csv.reader(file, delimiter='\t')
    for row in tsv_reader:
        print(row)

# Creating a custom dialect
csv.register_dialect('pipes', delimiter='|', quoting=csv.QUOTE_NONE, escapechar='\\')

# Using the custom dialect
with open('data_pipes.csv', 'r', newline='') as file:
    pipes_reader = csv.reader(file, dialect='pipes')
    for row in pipes_reader:
        print(row)

# Customizing format parameters directly
with open('custom.csv', 'w', newline='') as file:
    csv_writer = csv.writer(
        file,
        delimiter=',',
        quotechar='"',
        quoting=csv.QUOTE_MINIMAL,
        escapechar='\\',
        lineterminator='\n'
    )
    csv_writer.writerow(['Field with, comma', 'Normal field', 'Field with "quotes"'])
```
Working with CSV in Pandas
For more advanced CSV handling, the Pandas library offers a powerful alternative with its `DataFrame` class.
```python
import pandas as pd

# Reading a CSV file into a DataFrame
df = pd.read_csv('data.csv')
print(df.head())  # Display the first 5 rows

# Selecting specific columns
subset = df[['Name', 'Age']]

# Filtering rows
filtered = df[df['Age'] > 30]

# Working with the data
average_age = df['Age'].mean()
print(f"Average age: {average_age:.2f}")

# Writing a DataFrame to a CSV file
filtered.to_csv('filtered_data.csv', index=False)
```
JSON Handling
Reading and Parsing JSON
The `json` module provides functions to parse JSON strings into Python objects and vice versa.
```python
import json

# Reading from a JSON file
with open('data.json', 'r') as file:
    data = json.load(file)
    print(data)  # Python dictionary or list

# Parsing a JSON string
json_string = '{"name": "Alice", "age": 30, "city": "New York"}'
parsed_data = json.loads(json_string)
print(parsed_data['name'])  # Accessing dictionary keys
print(parsed_data.get('country', 'Unknown'))  # Using get() with a default value
```
Writing JSON
You can convert Python objects to JSON strings or write them directly to files.
```python
import json

# Converting Python object to JSON string
data = {
    'name': 'Alice',
    'age': 30,
    'is_student': False,
    'courses': ['Python', 'Data Science', 'Web Development'],
    'address': {
        'city': 'New York',
        'zip': '10001'
    }
}

json_string = json.dumps(data)
print(json_string)

# Pretty printing with indentation
pretty_json = json.dumps(data, indent=4, sort_keys=True)
print(pretty_json)

# Writing to a JSON file
with open('output.json', 'w') as file:
    json.dump(data, file, indent=4)
```
Custom JSON Encoding and Decoding
For objects that are not directly serializable to JSON, you can create custom encoders and decoders.
```python
import json
from datetime import datetime

# Custom JSON Encoder
class CustomEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime):
            return obj.isoformat()
        elif hasattr(obj, 'to_json'):
            return obj.to_json()
        else:
            return super().default(obj)

# Class with a to_json method
class Person:
    def __init__(self, name, age, birth_date):
        self.name = name
        self.age = age
        self.birth_date = birth_date

    def to_json(self):
        return {
            'name': self.name,
            'age': self.age,
            'birth_date': self.birth_date
        }

# Using the custom encoder
data = {
    'current_time': datetime.now(),
    'person': Person('Alice', 30, datetime(1992, 5, 12))
}

json_string = json.dumps(data, cls=CustomEncoder, indent=4)
print(json_string)

# Custom JSON Decoder function
def decode_datetime(dct):
    for key, value in dct.items():
        if key.endswith('_date') and isinstance(value, str):
            try:
                dct[key] = datetime.fromisoformat(value)
            except ValueError:
                pass  # Not a valid datetime string
    return dct

# Using the custom decoder
json_data = '{"name": "Bob", "birth_date": "1995-08-20T00:00:00"}'
parsed_data = json.loads(json_data, object_hook=decode_datetime)
print(parsed_data)
print(type(parsed_data['birth_date']))
```
JSON with Pandas
Pandas provides convenient methods for working with JSON data as well.
```python
import json
import pandas as pd
from io import StringIO

# Reading JSON into a DataFrame
df = pd.read_json('data.json')
print(df.head())

# Reading from a JSON string (recent pandas versions expect a file-like object)
json_string = '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'
df = pd.read_json(StringIO(json_string))
print(df)

# Reading nested JSON
nested_json = '''
{
    "people": [
        {"name": "Alice", "age": 30},
        {"name": "Bob", "age": 25}
    ],
    "location": {
        "city": "New York",
        "country": "USA"
    }
}
'''
df = pd.json_normalize(json.loads(nested_json), record_path=['people'])
print(df)

# Converting DataFrame to JSON
json_result = df.to_json(orient='records')
print(json_result)

# Writing DataFrame to a JSON file
df.to_json('output_pandas.json', orient='records', indent=4)
```
Common Tasks and Patterns
Converting Between CSV and JSON
You can convert data between CSV and JSON formats using both standard libraries and Pandas.
```python
import csv
import json
import pandas as pd

# CSV to JSON using standard libraries
def csv_to_json(csv_file, json_file):
    data = []
    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            data.append(row)

    with open(json_file, 'w') as file:
        json.dump(data, file, indent=4)

# JSON to CSV using standard libraries
def json_to_csv(json_file, csv_file):
    with open(json_file, 'r') as file:
        data = json.load(file)

    # Assume data is a list of dictionaries with the same keys
    if data and isinstance(data, list):
        fieldnames = data[0].keys()
        with open(csv_file, 'w', newline='') as file:
            csv_writer = csv.DictWriter(file, fieldnames=fieldnames)
            csv_writer.writeheader()
            csv_writer.writerows(data)

# CSV to JSON using Pandas
def csv_to_json_pandas(csv_file, json_file):
    df = pd.read_csv(csv_file)
    df.to_json(json_file, orient='records', indent=4)

# JSON to CSV using Pandas
def json_to_csv_pandas(json_file, csv_file):
    df = pd.read_json(json_file)
    df.to_csv(csv_file, index=False)
```
Data Validation and Cleaning
When working with CSV and JSON data, you often need to validate and clean the data before processing it.
```python
import json
import pandas as pd

# Data cleaning with Pandas
def clean_csv_data(csv_file):
    df = pd.read_csv(csv_file)

    # Handle missing values
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Name'] = df['Name'].fillna('Unknown')

    # Convert data types
    df['Age'] = df['Age'].astype(int)

    # Remove duplicates
    df = df.drop_duplicates()

    # Custom transformations
    df['Name'] = df['Name'].str.title()  # Capitalize names

    return df

# JSON data validation
def validate_json_data(json_data):
    required_fields = ['name', 'email', 'age']
    errors = []

    # Check that it's valid JSON
    try:
        if isinstance(json_data, str):
            data = json.loads(json_data)
        else:
            data = json_data
    except json.JSONDecodeError:
        return ["Invalid JSON format"]

    # Validate each record
    if isinstance(data, list):
        for i, record in enumerate(data):
            # Check required fields
            for field in required_fields:
                if field not in record:
                    errors.append(f"Record {i}: Missing required field '{field}'")

            # Validate field types
            if 'age' in record and not isinstance(record['age'], (int, float)):
                errors.append(f"Record {i}: 'age' must be a number")

            if 'email' in record and '@' not in record['email']:
                errors.append(f"Record {i}: Invalid email format")
    else:
        errors.append("JSON data should be a list of records")

    return errors
```
Working with Large Files
When dealing with large CSV or JSON files, memory-efficient approaches are necessary.
```python
import csv
import json
import pandas as pd

# Process a large CSV file without loading it all into memory
def process_large_csv(csv_file):
    total_rows = 0
    age_sum = 0

    with open(csv_file, 'r', newline='') as file:
        csv_reader = csv.DictReader(file)
        for row in csv_reader:
            total_rows += 1
            age_sum += int(row.get('Age', 0))

            # Process one row at a time
            if total_rows % 10000 == 0:
                print(f"Processed {total_rows} rows...")

    if total_rows > 0:
        avg_age = age_sum / total_rows
        print(f"Average age: {avg_age:.2f}")

    return total_rows

# Process a large CSV file with Pandas in chunks
def process_csv_in_chunks(csv_file, chunk_size=10000):
    chunk_iter = pd.read_csv(csv_file, chunksize=chunk_size)

    total_rows = 0
    age_sum = 0

    for chunk in chunk_iter:
        total_rows += len(chunk)
        age_sum += chunk['Age'].sum()

        # Process each chunk
        print(f"Processed chunk, total rows so far: {total_rows}")

    if total_rows > 0:
        avg_age = age_sum / total_rows
        print(f"Average age: {avg_age:.2f}")

    return total_rows

# Process a large JSON file line by line (JSON Lines format)
def process_jsonl_file(jsonl_file):
    total_records = 0

    with open(jsonl_file, 'r') as file:
        for line in file:
            # Parse each line as a separate JSON object
            record = json.loads(line)

            # Process the record
            total_records += 1

            if total_records % 10000 == 0:
                print(f"Processed {total_records} records...")

    return total_records
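The JSON Lines reading pattern above assumes the file was written with one object per line. As a complementary minimal sketch (the helper names `write_jsonl` and `generate_records` and the output file name are illustrative), the standard `json` module can produce such a file incrementally, so the full dataset never has to sit in memory:

```python
import json

def write_jsonl(records, jsonl_file):
    # Write one JSON object per line; accepts any iterable,
    # including generators, so records can be produced lazily
    with open(jsonl_file, 'w') as file:
        for record in records:
            file.write(json.dumps(record) + '\n')

# Usage: a generator keeps memory usage flat regardless of record count
def generate_records(n):
    for i in range(n):
        yield {'id': i, 'name': f'user_{i}'}

write_jsonl(generate_records(100000), 'large_output.jsonl')
```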
Best Practices
CSV Best Practices
- Always use the `newline=''` parameter when opening CSV files
- Use `csv.DictReader` and `csv.DictWriter` for better code readability
- Define explicit field names for `DictWriter`
- Create custom dialects for non-standard CSV formats
- Use Pandas for complex CSV manipulations
- Process large files in chunks to avoid memory issues
- Always validate and clean data before processing
JSON Best Practices
- Use pretty-printing (the `indent` parameter) for human-readable JSON
- Create custom encoders for non-serializable types
- Use `json.loads()` and `json.dumps()` for strings, `json.load()` and `json.dump()` for files
- Validate JSON data structure before processing
- For large datasets, consider using JSON Lines format (one JSON object per line)
- Use the `json.tool` module for command-line JSON formatting
- Create schema validators for complex JSON structures (see the sketch after this list)
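As a concrete example of the last two points: the standard library's `json.tool` can pretty-print a file from the command line with `python -m json.tool data.json`, while schema validation is usually delegated to a dedicated library. Below is a minimal sketch using the third-party `jsonschema` package (an assumption: it is not part of the standard library and would need to be installed, e.g. with `pip install jsonschema`); the schema, field names, and sample record are illustrative.

```python
import json
from jsonschema import validate  # third-party: pip install jsonschema
from jsonschema.exceptions import ValidationError

# A simple schema for a user profile record (illustrative)
user_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "age": {"type": "number", "minimum": 0},
    },
    "required": ["name", "email", "age"],
}

record = json.loads('{"name": "Alice", "email": "alice@example.com", "age": 30}')

try:
    validate(instance=record, schema=user_schema)
    print("Record is valid")
except ValidationError as e:
    print(f"Validation failed: {e.message}")
```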
Security Considerations
- Validate all input data, especially when reading from external sources
- Be cautious with the `eval()` function when parsing data
- Use `json.loads()` instead of `eval()` for JSON parsing
- Be aware of potential CSV injection attacks in web applications
- Sanitize data before writing to CSV, especially fields that might contain delimiters or quotes (a sketch follows this list)
- When using JSON from untrusted sources, consider limiting the allowed object size
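The following is a minimal sketch of the last three points: it neutralizes spreadsheet formula injection by prefixing dangerous leading characters with a single quote (a common mitigation, not a complete defense), and it caps the size of an untrusted JSON payload before parsing. The prefix list, the size threshold, and the helper names are illustrative assumptions.

```python
import csv
import json

# Leading characters that spreadsheet applications may interpret as formulas
FORMULA_PREFIXES = ('=', '+', '-', '@', '\t', '\r')

def sanitize_cell(value):
    # Prefix potentially dangerous cells so spreadsheets treat them as text
    text = str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text
    return text

def write_sanitized_csv(rows, csv_file):
    # The csv module itself handles embedded delimiters and quotes via quoting
    with open(csv_file, 'w', newline='') as file:
        writer = csv.writer(file, quoting=csv.QUOTE_MINIMAL)
        for row in rows:
            writer.writerow([sanitize_cell(cell) for cell in row])

def parse_untrusted_json(payload, max_bytes=1_000_000):
    # Reject oversized payloads before handing them to the parser
    if len(payload.encode('utf-8')) > max_bytes:
        raise ValueError("JSON payload exceeds the allowed size")
    return json.loads(payload)

write_sanitized_csv([['=SUM(A1:A9)', 'safe value']], 'report.csv')
print(parse_untrusted_json('{"name": "Alice"}'))
```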
Practice Exercises
Try These:
- Create a CSV file with sample student data and read it using both `csv.reader` and `csv.DictReader`.
- Convert a CSV file to JSON format and back, ensuring data integrity is maintained.
- Create a function that can merge multiple CSV files with the same structure into a single file.
- Build a JSON validator for a specific schema, like a user profile or a product catalog.
- Create a program that reads a large CSV file in chunks and performs statistical analysis on the data.
- Build a custom encoder and decoder for a JSON format that includes datetime objects and custom classes.