Data Cleaning and Processing Projects
20 free Data Cleaning and Processing projects with source code and tutorials.
Automated Data Quality Monitoring
Build a monitoring system that tracks data freshness, completeness, and schema drift.
Feature Engineering for ML
Create predictive features from raw data: encoding, binning, polynomial terms, and interaction terms.
Data Warehouse Schema Design
Design star and snowflake schemas with fact tables, dimension tables, and slowly changing dimensions.
Real-Time Data Stream Processing
Process streaming data with Apache Kafka consumers and windowed aggregations.
Data Pipeline Testing Framework
Test data pipelines with input fixtures, output assertions, and regression checks.
Image Metadata Extraction
Extract and catalog EXIF metadata from images for organization and processing.
Database Data Migration Scripts
Migrate data between databases with schema mapping, validation, and rollback support.
Log File Processing Pipeline
Parse and analyze server logs with regex patterns, aggregation, and anomaly detection.
Data Anonymization & Masking
Implement PII detection, data masking, k-anonymity, and differential privacy techniques.
Geospatial Data Processing
Clean GPS coordinates, perform geocoding, and run spatial joins with GeoPandas.
Time-Series Data Preprocessing
Handle missing timestamps and apply resampling, interpolation, and seasonal decomposition.
Data Profiling & EDA Automation
Auto-generate data profiles with statistics, distributions, correlations, and reports.
Web Data Extraction & Cleaning
Scrape, parse, and clean web data with HTML tag removal and structured extraction.
JSON & XML Data Transformation
Parse, transform, and flatten nested JSON/XML structures into tabular format.
Data Deduplication Engine
Build a fuzzy matching deduplication engine using Levenshtein distance and blocking.
Text Data Cleaning & Normalization
Clean text data with regex, Unicode normalization, stopword removal, and lemmatization.
CSV & Excel Processing Automation
Automate spreadsheet processing with openpyxl, the csv module, and batch transformations.
Data Validation with Great Expectations
Implement data quality checks with expectations, validation suites, and data docs.
ETL Pipeline with Apache Airflow
Build automated ETL workflows with DAGs, task dependencies, and scheduling in Airflow.
Pandas Data Cleaning Masterclass
Clean messy datasets by handling nulls, duplicates, outliers, and data type corrections.
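As a rough illustration of the four steps named in the last entry (data type corrections, nulls, duplicates, outliers), here is a minimal sketch. It is not taken from the project itself: it uses only the Python standard library instead of pandas so that it runs anywhere, and the field names `name` and `amount`, the helpers `to_float` and `clean_rows`, the sample records, and the 1.5 x IQR outlier rule are all assumptions chosen for the example.

```python
import math
from statistics import quantiles


def to_float(value):
    """Type correction: parse a value as a finite float, or return None."""
    try:
        number = float(str(value).strip())
    except ValueError:
        return None
    return number if math.isfinite(number) else None


def clean_rows(rows):
    """Clean a list of dict records with hypothetical 'name' and 'amount' fields."""
    # 1. Type corrections: normalize names, convert amounts to float.
    typed = []
    for row in rows:
        name = str(row.get("name") or "").strip().lower() or None
        typed.append({"name": name, "amount": to_float(row.get("amount"))})

    # 2. Nulls: drop records with a missing name or amount.
    complete = [r for r in typed if r["name"] is not None and r["amount"] is not None]

    # 3. Duplicates: keep the first occurrence of each (name, amount) pair.
    seen = set()
    unique = []
    for r in complete:
        key = (r["name"], r["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 4. Outliers: drop amounts outside 1.5 x IQR (skipped for tiny inputs).
    if len(unique) < 4:
        return unique
    q1, _, q3 = quantiles([r["amount"] for r in unique], n=4, method="inclusive")
    spread = 1.5 * (q3 - q1)
    return [r for r in unique if q1 - spread <= r["amount"] <= q3 + spread]


# Made-up sample records, for illustration only.
sample = [
    {"name": " Alice ", "amount": "10.5"},
    {"name": "alice", "amount": "10.5"},    # duplicate once normalized
    {"name": "Bob", "amount": "n/a"},       # unparseable amount
    {"name": None, "amount": "12"},         # missing name
    {"name": "Carol", "amount": "11"},
    {"name": "Dan", "amount": "9.5"},
    {"name": "Eve", "amount": "12"},
    {"name": "Frank", "amount": "10"},
    {"name": "Mallory", "amount": "1000"},  # extreme value
]
cleaned = clean_rows(sample)
print([r["name"] for r in cleaned])
```

The order of the steps matters in this sketch: types are fixed first so that null checks and duplicate detection compare normalized values, and outliers are judged last so that bad or repeated records do not skew the quartiles.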