Lead Data Engineer

KGEN

Software Engineering, Data Science

Bengaluru, Karnataka, India

Posted on Apr 16, 2026

Role Overview

We are looking for a Lead Data Engineer to own and evolve the data infrastructure across the organisation. You will work directly with the Director of Data to set technical direction, mentor data engineers, and build production-grade systems across AWS, Python, and cloud-native data services.

This role combines hands-on engineering with strategic leadership — you will drive architectural decisions for ASR evaluation pipelines, blockchain data ingestion, API integrations, and data platform evolution while building and guiding the data engineering function.

What You Will Work On

Technical Leadership & Architecture

  • Define the technical roadmap for data infrastructure, including pipeline architecture patterns, tooling standards, and cloud data platform evolution
  • Lead architectural reviews and design decisions for new data systems and integrations
  • Establish engineering best practices: CI/CD for data pipelines, testing frameworks, code review standards, monitoring and observability patterns
  • Own the technical strategy for scaling data infrastructure to support 10x growth in data volume and downstream consumers

Team Leadership & Mentorship

  • Mentor and upskill data engineers; conduct code reviews and pair-programming sessions, and provide hands-on technical guidance
  • Define hiring criteria and lead technical interviews for data engineering roles
  • Foster a culture of ownership, quality, and continuous improvement within the data team
  • Collaborate cross-functionally with ML engineers, backend engineers, and product teams to align data infrastructure with business objectives
Hands-On Engineering (60-70% of time)

  • Benchmark pipeline — own and evolve the multi-provider ASR transcription system; architect audio preprocessing workflows, chunking logic, retry/error handling, and metrics computation (WER, CER, BERTScore, PIER, DER, CS Precision/Recall) — a WER sketch follows this list
  • AWS data lake — architect and manage the KGen data lake: design Athena query optimisation strategies, manage Glue crawlers and cataloguing, lead Apache Hudi table management, implement Lake Formation column-level permissions, and define S3 lifecycle policies
  • ETL and ingestion — design and build scalable data ingestion frameworks from Google Forms, Twitch API, on-chain blockchain events (Aptos, BSC, Ethereum, Polygon), and third-party gaming analytics APIs into DynamoDB and PostgreSQL
  • Airflow orchestration — architect DAG patterns, establish monitoring and alerting standards, debug complex pipeline failures, and optimise resource utilisation — see the DAG sketch after this list
  • Cloud data transfers — design and manage large-scale S3-to-Google Drive transfers (rclone), cross-region data movement strategies, and vendor data sharing infrastructure
  • Infrastructure and access management — own AWS IAM strategy, Lake Formation policies, and S3 bucket security; manage data engineer access controls; troubleshoot Superset permissions and connectivity issues
  • QC and annotation tooling — extend the FastAPI-backed audio QC portal; architect data validation frameworks and quality-check automation across egocentric video and audio datasets
  • Schema design & governance — lead the development of the Universal Data Schema (UDS) for audio, image, and code modalities in the Humyn Labs dataset marketplace; establish data governance and schema evolution practices
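
As a flavour of the metrics work named above, here is a minimal WER illustration using word-level Levenshtein alignment; the function and example strings are illustrative, not the pipeline's actual interfaces:

```python
# Word error rate: edit distance between reference and hypothesis word
# sequences, normalised by reference length. Illustrative sketch only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.17
```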
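
And a hedged sketch of the kind of Airflow DAG pattern involved, assuming Airflow 2.4+; the DAG id, schedule, and task callables are placeholders, not the actual orchestration code:

```python
# Retry-aware two-step ingestion DAG sketch (assumes Airflow 2.4+).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # placeholder: pull a batch from a source API

def load():
    pass  # placeholder: write the batch to the lake

with DAG(
    dag_id="example_ingestion",  # illustrative name
    start_date=datetime(2026, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```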

You Should Have

Required

  • 7+ years in data engineering, with 2+ years in a technical lead or senior individual contributor role with mentorship responsibilities
  • Proven leadership experience — either as a formal team lead or as a senior engineer who has mentored junior/mid-level engineers and driven technical direction
  • Deep Python expertise — async patterns, subprocess management, API clients, distributed data processing, testing frameworks, and production debugging
  • Advanced AWS proficiency — Athena, Glue, S3, DynamoDB, Lake Formation, IAM — with architectural decision-making experience (not just hands-on execution)
  • Apache Hudi or Delta Lake production experience — schema evolution, partition strategies, upserts, compaction, time travel queries (a Hudi upsert sketch follows this list)
  • Strong SQL skills — query optimisation, indexing strategies, execution plan analysis for large-scale analytical workloads
  • Airflow expertise — DAG design patterns, custom operators, monitoring, resource management, and troubleshooting complex dependencies
  • System design thinking — ability to architect end-to-end data systems, evaluate trade-offs, and document technical decisions
  • Communication skills — able to articulate technical concepts to non-technical stakeholders and write clear design documents
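
For the Hudi requirement, a minimal upsert sketch via PySpark, assuming a Spark session with the Hudi bundle on the classpath; the table name, keys, and S3 path are illustrative placeholders:

```python
# Apache Hudi upsert via PySpark: "append" save mode plus the upsert
# write operation merges rows by record key, resolving duplicates on
# the precombine field. All names and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

df = spark.createDataFrame(
    [("evt-1", "2026-04-16T00:00:00Z", 42)],
    ["event_id", "updated_at", "value"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/events")
)
```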

Strong Plus

  • Experience designing and scaling audio/media data pipelines (format conversion, metadata extraction, chunking, quality checks)
  • Blockchain data engineering experience (on-chain events, wallet transactions, DEX swaps, indexing strategies)
  • Large-scale file transfer and cloud-to-cloud sync pipelines (rclone, AWS DataSync, multi-cloud strategies)
  • Infrastructure-as-code experience (Terraform, CloudFormation)
  • Data quality frameworks and observability tools (Great Expectations, Monte Carlo, dbt)
  • Experience building internal data platforms or self-service analytics tools