Data Engineering

Multi-Tenant Data Pipeline System

Tenant-isolated, Airflow-driven ingestion system with automated LLM-based failure classification.

Apache Airflow | PostgreSQL | Groq API | ETL/ELT Pipelines | Docker
// PROBLEM

Managing data ingestion for multiple clients requires strict tenant-level isolation, scalable data partitioning, and robust error handling to prevent manual debugging bottlenecks.

// APPROACH

Built a multi-tenant ETL system on Apache Airflow and PostgreSQL, driven by isolated, per-tenant JSON DAG configurations. Integrated the Groq API to parse system logs and classify pipeline failures automatically, and enforced a two-phase execution flow so that AI-suggested fixes are validated before they are applied.
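To make the JSON-driven configuration idea concrete, here is a minimal sketch of how per-tenant pipeline specs might be derived from a single JSON document. It is an illustration under assumptions, not the project's actual code: the field names (`tenant_id`, `schedule`, `source_table`), the `ingest_` and `tenant_` naming prefixes, and the `TenantPipeline` class are all hypothetical. Airflow itself is deliberately left out so the sketch runs with the standard library alone.

```python
import json
from dataclasses import dataclass

# Hypothetical tenant config; every field name here is an illustrative assumption.
TENANT_CONFIG_JSON = """
[
  {"tenant_id": "acme", "schedule": "@daily", "source_table": "orders"},
  {"tenant_id": "globex", "schedule": "@hourly", "source_table": "events"}
]
"""


@dataclass(frozen=True)
class TenantPipeline:
    """Everything one tenant's pipeline needs, derived only from its config entry."""
    tenant_id: str
    dag_id: str        # one DAG per tenant, so runs never share state
    schema: str        # one database schema per tenant (assumed isolation scheme)
    schedule: str
    source_table: str


def load_tenant_pipelines(raw_json: str) -> dict[str, TenantPipeline]:
    """Parse the JSON config and build one isolated pipeline spec per tenant."""
    pipelines: dict[str, TenantPipeline] = {}
    for entry in json.loads(raw_json):
        tenant_id = entry["tenant_id"]
        # Reject IDs that would be unsafe to embed in a DAG ID or schema name.
        if not tenant_id.isidentifier():
            raise ValueError(f"invalid tenant_id: {tenant_id!r}")
        if tenant_id in pipelines:
            raise ValueError(f"duplicate tenant_id: {tenant_id!r}")
        pipelines[tenant_id] = TenantPipeline(
            tenant_id=tenant_id,
            dag_id=f"ingest_{tenant_id}",
            schema=f"tenant_{tenant_id}",
            schedule=entry["schedule"],
            source_table=entry["source_table"],
        )
    return pipelines


pipelines = load_tenant_pipelines(TENANT_CONFIG_JSON)
for pipeline in pipelines.values():
    print(pipeline.dag_id, pipeline.schema, pipeline.schedule)
```

Under this scheme, onboarding a tenant means adding one JSON entry rather than writing new pipeline code. In an Airflow deployment, each spec would typically be turned into a DAG object inside a DAG file; that step is omitted here.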

// OUTCOME

Processed over 100K records per day and cut new-tenant onboarding time to under 5 minutes. Reduced the mean time to resolve pipeline failures from roughly 30 minutes to under 10 minutes, with failure classification accuracy of 85% or higher.

Key Technical Highlights

JSON-driven DAG configurations enable strict tenant-level isolation without code changes

Groq API integration automatically parses system logs and classifies pipeline failures

Two-phase execution flow validates AI-suggested fixes before applying them

Processes 100K+ records per day with tenant-isolated partitioning

New-tenant onboarding reduced to under 5 minutes

Mean pipeline resolution time cut from ~30 min to under 10 min

85%+ accuracy rate in automated failure classification
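The two-phase flow mentioned in the highlights can be sketched as follows. The page does not spell out what each phase does, so this is one plausible reading: phase one classifies the failure from its log, and phase two checks the suggested fix against an allow-list and a confidence threshold before anything is applied. The categories, fix names, threshold, and the keyword-based `stub_classifier` (a stand-in for the Groq API call, which is not shown) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical failure categories and the only fix permitted for each.
ALLOWED_FIXES = {
    "transient_network": "retry_task",
    "schema_mismatch": "quarantine_batch",
    "auth_expired": "refresh_credentials",
}


@dataclass(frozen=True)
class Suggestion:
    category: str
    fix: str
    confidence: float


def stub_classifier(log_text: str) -> Suggestion:
    """Stand-in for the LLM call: simple keyword rules, for illustration only."""
    lowered = log_text.lower()
    if "timeout" in lowered or "connection reset" in lowered:
        return Suggestion("transient_network", "retry_task", 0.9)
    if "column" in lowered and "does not exist" in lowered:
        return Suggestion("schema_mismatch", "quarantine_batch", 0.8)
    # Simulates an unsafe or unrecognized suggestion from the model.
    return Suggestion("unknown", "drop_table", 0.3)


def validate(suggestion: Suggestion, min_confidence: float = 0.7) -> bool:
    """Phase two: accept only allow-listed fixes with sufficient confidence."""
    expected_fix = ALLOWED_FIXES.get(suggestion.category)
    return expected_fix == suggestion.fix and suggestion.confidence >= min_confidence


def handle_failure(
    log_text: str,
    classify: Callable[[str], Suggestion] = stub_classifier,
) -> str:
    suggestion = classify(log_text)  # phase one: classify the failure
    if validate(suggestion):         # phase two: validate before applying
        return f"apply:{suggestion.fix}"
    return "escalate:manual_review"


print(handle_failure("psycopg2.OperationalError: connection timeout"))
print(handle_failure("Segmentation fault in worker process"))
```

The first call prints `apply:retry_task` and the second prints `escalate:manual_review`. Keeping validation separate from classification means a wrong or unexpected model output falls back to manual review instead of triggering an action.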

Kumar Priyam | Data Engineering & Full-Stack Developer