Data Cleaning and Processing Projects

20 free Data Cleaning and Processing projects with source code and tutorials.

Data Cleaning | Dec 28, 2025

Automated Data Quality Monitoring

Build a monitoring system that tracks data freshness, completeness, and schema drift.

Python, Data Quality, Monitoring, Alerting
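As a rough illustration of the three checks this project names (freshness, completeness, and schema drift), here is a minimal standard-library sketch. It is not the project's code: the function `check_quality`, its parameters, and the row layout (a list of dicts with a timezone-aware timestamp field) are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def check_quality(rows, expected_schema, timestamp_field, max_age, now=None):
    """Report freshness, completeness, and schema drift for a batch of dict rows.

    Hypothetical helper: `expected_schema` maps column name -> Python type.
    """
    now = now or datetime.now(timezone.utc)

    # Freshness: the newest timestamp must be within max_age of "now".
    timestamps = [r[timestamp_field] for r in rows if r.get(timestamp_field) is not None]
    newest = max(timestamps, default=None)
    is_fresh = newest is not None and (now - newest) <= max_age

    # Completeness: fraction of rows with a non-null value, per expected column.
    total = len(rows)
    completeness = {
        col: (sum(r.get(col) is not None for r in rows) / total) if total else 0.0
        for col in expected_schema
    }

    # Schema drift: columns that disappeared, appeared, or changed type.
    seen = set().union(*(r.keys() for r in rows))
    missing = sorted(set(expected_schema) - seen)
    unexpected = sorted(seen - set(expected_schema))
    type_mismatches = sorted(
        col
        for col, typ in expected_schema.items()
        if any(r.get(col) is not None and not isinstance(r[col], typ) for r in rows)
    )

    return {
        "is_fresh": is_fresh,
        "completeness": completeness,
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        "type_mismatches": type_mismatches,
    }


# Illustrative data only; the column names are made up for this sketch.
NOW = datetime(2025, 12, 28, 12, 0, tzinfo=timezone.utc)
SCHEMA = {"id": int, "email": str, "updated_at": datetime}
ROWS = [
    {"id": 1, "email": "a@example.com", "updated_at": NOW - timedelta(hours=1)},
    {"id": 2, "email": None, "updated_at": NOW - timedelta(hours=2), "plan": "pro"},
    {"id": "3", "email": "c@example.com", "updated_at": NOW - timedelta(hours=30)},
]
REPORT = check_quality(ROWS, SCHEMA, "updated_at", timedelta(hours=24), now=NOW)
```

A real monitor would presumably run checks like these on a schedule and send an alert when a threshold is crossed; the thresholds themselves depend on the dataset.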
Data Cleaning | Dec 23, 2025

Feature Engineering for ML

Create predictive features from raw data using encoding, binning, polynomial terms, and interaction terms.

Python, Feature Engineering, ML, Preprocessing

Data Cleaning | Dec 19, 2025

Data Warehouse Schema Design

Design star and snowflake schemas with fact tables, dimensions, and slowly changing dimensions.

SQL, Data Warehouse, Star Schema, Dimensional

Data Cleaning | Dec 15, 2025

Real-Time Data Stream Processing

Process streaming data with Apache Kafka consumers and windowed aggregations.

Python, Kafka, Streaming, Real-time

Data Cleaning | Dec 11, 2025

Data Pipeline Testing Framework

Test data pipelines with input fixtures, output assertions, and regression checks.

Python, Testing, Pipeline, Data Quality

Data Cleaning | Dec 7, 2025

Image Metadata Extraction

Extract and catalog EXIF metadata from images for organization and processing.

Python, EXIF, Metadata, Image Processing

Data Cleaning | Dec 3, 2025

Database Data Migration Scripts

Migrate data between databases with schema mapping, validation, and rollback support.

Python, Migration, SQL, ETL

Data Cleaning | Nov 29, 2025

Log File Processing Pipeline

Parse and analyze server logs with regex patterns, aggregation, and anomaly detection.

Python, Log Analysis, Regex, Pipeline
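To give a feel for the regex parsing, aggregation, and anomaly detection this project describes, here is a small sketch under stated assumptions: the input is taken to be in Common Log Format, and a z-score on per-IP request counts stands in for anomaly detection. The names `LOG_PATTERN`, `parse_line`, and `summarize` are hypothetical, not taken from the project.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Assumed input: Common Log Format, e.g.
# 10.0.0.1 - - [29/Nov/2025:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarize(lines, z_threshold=2.0):
    """Aggregate status codes and flag IPs with unusually many requests."""
    records = [rec for rec in map(parse_line, lines) if rec is not None]
    status_counts = Counter(rec["status"] for rec in records)
    per_ip = Counter(rec["ip"] for rec in records)

    # Simple z-score rule: flag IPs whose request count sits far above the mean.
    counts = list(per_ip.values())
    mu = mean(counts) if counts else 0.0
    sigma = pstdev(counts) if counts else 0.0
    anomalies = sorted(
        ip for ip, n in per_ip.items() if sigma > 0 and (n - mu) / sigma > z_threshold
    )
    return {
        "parsed": len(records),
        "status_counts": dict(status_counts),
        "anomalous_ips": anomalies,
    }


# Illustrative lines only: four quiet clients, one noisy one, one malformed line.
SAMPLE = [
    f'10.0.0.{i} - - [29/Nov/2025:10:00:0{i} +0000] "GET /index.html HTTP/1.1" 200 512'
    for i in range(1, 5)
]
SAMPLE += ['10.0.0.9 - - [29/Nov/2025:10:00:05 +0000] "POST /login HTTP/1.1" 401 128'] * 6
SAMPLE.append("malformed line")
SUMMARY = summarize(SAMPLE, z_threshold=1.5)
```

Lines that do not match the pattern are skipped rather than raising, which keeps a single malformed entry from stopping a batch run. Whether that is the right policy depends on the pipeline.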
Data Cleaning | Nov 25, 2025

Data Anonymization & Masking

Implement PII detection, data masking, k-anonymity, and differential privacy techniques.

Python, Privacy, Anonymization, PII
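Two of the techniques listed for this project, data masking and k-anonymity, can be sketched in a few lines. This is an illustrative example rather than the project's implementation: the email regex is deliberately simplistic, and `mask_emails`, `is_k_anonymous`, and the sample column names are assumptions. PII detection beyond emails and differential privacy are not covered here.

```python
import re
from collections import Counter

# Simplistic email pattern, for illustration only; real PII detection needs more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text):
    """Replace the local part of each email address, keeping the domain."""
    return EMAIL_RE.sub(lambda m: "***@" + m.group(0).split("@", 1)[1], text)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical records with generalized quasi-identifiers.
PEOPLE = [
    {"age_band": "30-39", "zip3": "941", "diagnosis": "A"},
    {"age_band": "30-39", "zip3": "941", "diagnosis": "B"},
    {"age_band": "40-49", "zip3": "100", "diagnosis": "C"},
]
MASKED = mask_emails("Contact jane.doe@example.com today")
```

In this sample the third record is alone in its group, so the table is not 2-anonymous until that record is generalized further or suppressed.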
Data Cleaning | Nov 21, 2025

Geospatial Data Processing

Clean GPS coordinates, run geocoding, and perform spatial joins with GeoPandas.

Python, GeoPandas, Geospatial, Mapping

Data Cleaning | Nov 17, 2025

Time-Series Data Preprocessing

Handle missing timestamps and apply resampling, interpolation, and seasonal decomposition.

Python, Time-Series, Preprocessing, Pandas

Data Cleaning | Nov 13, 2025

Data Profiling & EDA Automation

Auto-generate data profiles with statistics, distributions, correlations, and reports.

Python, Pandas Profiling, EDA, Statistics

Data Cleaning | Nov 9, 2025

Web Data Extraction & Cleaning

Scrape, parse, and clean web data with HTML tag removal and structured extraction.

Python, BeautifulSoup, Web Scraping, Cleaning

Data Cleaning | Nov 5, 2025

JSON & XML Data Transformation

Parse, transform, and flatten nested JSON/XML structures into tabular format.

Python, JSON, XML, Transform
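The flattening step this project describes can be sketched for the JSON case as follows. This is a minimal example, not the project's code: the function name `flatten`, the dotted-key convention, and the choice to index list items by position are assumptions, and XML is not handled.

```python
import json


def flatten(obj, parent_key="", sep="."):
    """Flatten nested dicts and lists into one dict with dotted keys.

    List items are keyed by position. Empty dicts and lists produce no
    columns in this sketch.
    """
    items = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
            items.update(flatten(value, full_key, sep))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            full_key = f"{parent_key}{sep}{index}" if parent_key else str(index)
            items.update(flatten(value, full_key, sep))
    else:
        # Scalar leaf: store it under the accumulated key.
        items[parent_key] = obj
    return items


# Illustrative record; each flattened dict can serve as one table row.
RECORD = json.loads('{"id": 7, "user": {"name": "Ada", "tags": ["a", "b"]}}')
FLAT = flatten(RECORD)
# FLAT == {"id": 7, "user.name": "Ada", "user.tags.0": "a", "user.tags.1": "b"}
```

Keying list items by position keeps the output strictly tabular, at the cost of a variable number of columns. Exploding lists into separate rows is a common alternative.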
Data Cleaning | Nov 1, 2025

Data Deduplication Engine

Build a fuzzy matching deduplication engine using Levenshtein distance and blocking.

Python, Deduplication, Fuzzy Matching, Data Quality
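For readers unfamiliar with the two ideas named in this project, here is a compact sketch of Levenshtein distance combined with blocking. It is a simplified illustration under assumptions: the blocking key (the first character of the normalized name), the distance threshold, and the function names are all choices made for this example, not the project's.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings, using a two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def find_duplicates(names, max_distance=2):
    """Return pairs of likely duplicates, comparing only names within a block."""
    # Blocking: group by first character so we avoid comparing every pair.
    blocks = defaultdict(list)
    for name in names:
        normalized = name.strip().lower()
        if normalized:
            blocks[normalized[0]].append(normalized)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            if levenshtein(left, right) <= max_distance:
                pairs.append((left, right))
    return pairs


# Illustrative names only.
NAMES = ["Jon Smith", "John Smith", "Jane Doe", "Maria Garcia"]
PAIRS = find_duplicates(NAMES)
```

Blocking trades recall for speed: two records whose names differ in the first character land in different blocks and are never compared, so the blocking key deserves care on real data.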
Data Cleaning | Oct 28, 2025

Text Data Cleaning & Normalization

Clean text data with regex, Unicode normalization, stopword removal, and lemmatization.

Python, NLP, Regex, Text Processing
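Three of the steps named in this project (regex cleanup, Unicode normalization, and stopword removal) fit in a short standard-library sketch. The tiny `STOPWORDS` set and the function `clean_text` are assumptions for illustration. Lemmatization is left out because it needs an NLP library.

```python
import re
import unicodedata

# Tiny illustrative stopword list; real projects use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def clean_text(text):
    """Normalize, strip markup and punctuation, and drop stopwords."""
    # NFKC folds compatibility characters (e.g. ligatures) and composes accents.
    text = unicodedata.normalize("NFKC", text).lower()
    # Remove HTML-like tags, then any remaining punctuation.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # Splitting on whitespace also collapses repeated spaces.
    return [token for token in text.split() if token not in STOPWORDS]


# "\ufb01" is the "fi" ligature; "e\u0301" is "e" plus a combining acute accent.
TOKENS = clean_text("The <b>\ufb01nal</b> report is in   Cafe\u0301!")
# TOKENS == ["final", "report", "caf\u00e9"]
```

NFKC is one of several normalization forms. It is used here because it also folds compatibility characters, which may or may not be wanted for a given corpus.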
Data Cleaning | Oct 23, 2025

CSV & Excel Processing Automation

Automate spreadsheet processing with openpyxl, the csv module, and batch transformations.

Python, CSV, Excel, Automation

Data Cleaning | Oct 19, 2025

Data Validation with Great Expectations

Implement data quality checks with expectations, validation suites, and data docs.

Python, Great Expectations, Validation, Quality

Data Cleaning | Oct 15, 2025

ETL Pipeline with Apache Airflow

Build automated ETL workflows with DAGs, task dependencies, and scheduling in Airflow.

Python, Airflow, ETL, Pipeline

Data Cleaning | Oct 11, 2025

Pandas Data Cleaning Masterclass

Clean messy datasets by handling nulls, duplicates, outliers, and data type corrections.

Python, Pandas, Data Cleaning, EDA