Lead Data Engineer

KGEN

Software Engineering, Data Science

Bengaluru, Karnataka, India

Posted on Apr 16, 2026

Role Overview

We are looking for a Lead Data Engineer to own and evolve the data infrastructure across the organisation. You will work directly with the Director of Data to set technical direction, mentor data engineers, and build production-grade systems across AWS, Python, and cloud-native data services.

This role combines hands-on engineering with strategic leadership — you will drive architectural decisions for ASR evaluation pipelines, blockchain data ingestion, API integrations, and data platform evolution while building and guiding the data engineering function.

What You Will Work On

Technical Leadership & Architecture

  • Define the technical roadmap for data infrastructure, including pipeline architecture patterns, tooling standards, and cloud data platform evolution
  • Lead architectural reviews and design decisions for new data systems and integrations
  • Establish engineering best practices: CI/CD for data pipelines, testing frameworks, code review standards, monitoring and observability patterns
  • Own the technical strategy for scaling data infrastructure to support 10x growth in data volume and downstream consumers

Team Leadership & Mentorship

  • Mentor and upskill data engineers; conduct code reviews and pair-programming sessions, and provide hands-on technical guidance
  • Define hiring criteria and lead technical interviews for data engineering roles
  • Foster a culture of ownership, quality, and continuous improvement within the data team
  • Collaborate cross-functionally with ML engineers, backend engineers, and product teams to align data infrastructure with business objectives
Hands-On Engineering (60-70% of time)

  • Benchmark pipeline — own and evolve the multi-provider ASR transcription system; architect audio preprocessing workflows, chunking logic, retry/error handling, and metrics computation (WER, CER, BERTScore, PIER, DER, CS Precision/Recall) — a WER sketch follows this list
  • AWS data lake — architect and manage the KGen data lake: design Athena query optimisation strategies, manage Glue crawlers and cataloguing, lead Apache Hudi table management, implement Lake Formation column-level permissions, and define S3 lifecycle policies
  • ETL and ingestion — design and build scalable data ingestion frameworks from Google Forms, Twitch API, on-chain blockchain events (Aptos, BSC, Ethereum, Polygon), and third-party gaming analytics APIs into DynamoDB and PostgreSQL
  • Airflow orchestration — architect DAG patterns, establish monitoring and alerting standards, debug complex pipeline failures, and optimise resource utilisation — see the DAG sketch after this list
  • Cloud data transfers — design and manage large-scale S3-to-Google Drive transfers (rclone), cross-region data movement strategies, and vendor data sharing infrastructure
  • Infrastructure and access management — own AWS IAM strategy, Lake Formation policies, and S3 bucket security; manage data engineer access controls; troubleshoot Superset permissions and connectivity issues
  • QC and annotation tooling — extend the FastAPI-backed audio QC portal; architect data validation frameworks and quality-check automation across egocentric video and audio datasets
  • Schema design & governance — lead the development of the Universal Data Schema (UDS) for audio, image, and code modalities in the Humyn Labs dataset marketplace; establish data governance and schema evolution practices
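
As a flavour of the metrics work named above, here is a minimal WER illustration using word-level Levenshtein alignment; the function and example strings are illustrative, not the pipeline's actual interfaces:

```python
# Word error rate: edit distance between reference and hypothesis word
# sequences, normalised by reference length. Illustrative sketch only.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.17
```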
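
And a hedged sketch of the kind of Airflow DAG pattern involved, assuming Airflow 2.4+; the DAG id, schedule, and task callables are placeholders, not the actual orchestration code:

```python
# Retry-aware two-step ingestion DAG sketch (assumes Airflow 2.4+).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass  # placeholder: pull a batch from a source API

def load():
    pass  # placeholder: write the batch to the lake

with DAG(
    dag_id="example_ingestion",  # illustrative name
    start_date=datetime(2026, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```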

You Should Have

Required

  • 7+ years in data engineering, with 2+ years in a technical lead or senior individual contributor role with mentorship responsibilities
  • Proven leadership experience — either as a formal team lead or as a senior engineer who has mentored junior/mid-level engineers and driven technical direction
  • Deep Python expertise — async patterns, subprocess management, API clients, distributed data processing, testing frameworks, and production debugging
  • Advanced AWS proficiency — Athena, Glue, S3, DynamoDB, Lake Formation, IAM — with architectural decision-making experience (not just hands-on execution)
  • Apache Hudi or Delta Lake production experience — schema evolution, partition strategies, upserts, compaction, time travel queries (a Hudi upsert sketch follows this list)
  • Strong SQL skills — query optimisation, indexing strategies, execution plan analysis for large-scale analytical workloads
  • Airflow expertise — DAG design patterns, custom operators, monitoring, resource management, and troubleshooting complex dependencies
  • System design thinking — ability to architect end-to-end data systems, evaluate trade-offs, and document technical decisions
  • Communication skills — able to articulate technical concepts to non-technical stakeholders and write clear design documents
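
For the Hudi requirement, a minimal upsert sketch via PySpark, assuming a Spark session with the Hudi bundle on the classpath; the table name, keys, and S3 path are illustrative placeholders:

```python
# Apache Hudi upsert via PySpark: "append" save mode plus the upsert
# write operation merges rows by record key, resolving duplicates on
# the precombine field. All names and paths below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()

df = spark.createDataFrame(
    [("evt-1", "2026-04-16T00:00:00Z", 42)],
    ["event_id", "updated_at", "value"],
)

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/lake/events")
)
```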

Strong Plus

  • Experience designing and scaling audio/media data pipelines (format conversion, metadata extraction, chunking, quality checks)
  • Blockchain data engineering experience (on-chain events, wallet transactions, DEX swaps, indexing strategies)
  • Large-scale file transfer and cloud-to-cloud sync pipelines (rclone, AWS DataSync, multi-cloud strategies)
  • Infrastructure-as-code experience (Terraform, CloudFormation)
  • Data quality frameworks and observability tools (Great Expectations, Monte Carlo, dbt)
  • Experience building internal data platforms or self-service analytics tools