Senior Data Engineer

KGEN

Data Science

Bengaluru, Karnataka, India

Posted on Apr 24, 2026

Role Overview

We're looking for a Senior Data Engineer to own, extend, and scale the data infrastructure across our organisation. You'll work directly with the Head of Data as a key individual contributor, building and maintaining production-grade pipelines in Python on AWS and other cloud-native data services.

The primary focus is data engineering — designing and managing robust pipelines for multimodal and egocentric datasets — with some exposure to data science workflows. Onboarding will be phased, so you'll ramp up progressively rather than being expected to own everything from Day 1.

What You Will Work On

Core responsibilities:

  • Multimodal & egocentric data pipelines — build, maintain, and scale ingestion and processing pipelines for large-scale multimodal datasets (audio, video, image, code); own chunking logic, format conversion, metadata extraction, retry/error handling, and quality validation
  • AWS data lake — manage and extend the data lake: Athena query optimisation, Glue crawlers and cataloguing, Apache Hudi table management, Lake Formation column-level permissions, and S3 lifecycle policies
  • ETL and ingestion — build and maintain data ingestion pipelines from Google Forms, the Twitch API, on-chain events (Aptos, BSC, Ethereum, Polygon), and third-party APIs into DynamoDB and PostgreSQL
  • Airflow DAG management — author, debug, optimise, and monitor Airflow DAGs for scheduled processing and pipeline orchestration (see the illustrative sketch after this list)
  • Infrastructure and access management — maintain AWS IAM, Lake Formation, and S3 bucket policies; manage data engineer access controls; troubleshoot Superset permissions and connectivity

Additional scope:

  • Cloud data transfers — manage large-scale S3-to-Google Drive transfers with rclone (see the sketch after this list), cross-region data movement, and vendor data sharing infrastructure
  • QC and annotation tooling — support the FastAPI-backed QC portal used by annotation workers; extend data validation and quality-check scripts across egocentric video and audio datasets
  • Schema design — contribute to the Universal Data Schema (UDS) for audio, image, and code modalities in the Humyn Labs dataset marketplace
  • Data science support — collaborate with researchers on data preparation, feature extraction, and exploratory analysis workflows

What You Should Have

Required:

  • 4+ years in a data engineering role with end-to-end pipeline ownership
  • Strong Python — async patterns, API clients, subprocess management, and data processing at scale
  • Hands-on production experience with AWS: Athena, Glue, S3, DynamoDB, and Lake Formation
  • Apache Hudi or Delta Lake experience — schema evolution, partition strategies, and table management
  • Strong SQL — able to write and optimise complex analytical queries across large datasets
  • Experience with Airflow or an equivalent workflow orchestrator — DAG authoring, debugging, and performance optimisation
  • Solid understanding of data modelling and lakehouse/warehouse design concepts
  • Familiarity with pipeline observability — alerting, logging, and data quality monitoring (e.g. Great Expectations, CloudWatch)
  • Working knowledge of Git and software engineering best practices
  • Basic Docker / containerisation experience

Good to have:

  • Experience working with multimodal or egocentric datasets (video, audio, image at scale)
  • Familiarity with blockchain data structures — on-chain events, wallet transactions, DEX swaps
  • Experience with rclone, large-scale file transfer, or cloud-to-cloud sync pipelines
  • Exposure to data science workflows — feature extraction, exploratory analysis, or working alongside ML/research teams