Senior Data Engineer

KGEN

Data Science

Bengaluru, Karnataka, India

Posted on Apr 24, 2026

Role Overview

We're looking for a Senior Data Engineer to own, extend, and scale the data infrastructure across our organisation. You'll work directly with the Head of Data as a key individual contributor, building and maintaining production-grade pipelines in Python on AWS and other cloud-native data services.

The primary focus is data engineering — designing and managing robust pipelines for multimodal and egocentric datasets — with some exposure to data science workflows. Onboarding will be phased, so you'll ramp up progressively rather than being expected to own everything from Day 1.

What You Will Work On

Core responsibilities:

  • Multimodal & egocentric data pipelines — build, maintain, and scale ingestion and processing pipelines for large-scale multimodal datasets (audio, video, image, code); own chunking logic, format conversion, metadata extraction, retry/error handling, and quality validation
  • AWS data lake — manage and extend the data lake: Athena query optimisation, Glue crawlers and cataloguing, Apache Hudi table management, Lake Formation column-level permissions, and S3 lifecycle policies
  • ETL and ingestion — build and maintain data ingestion pipelines from Google Forms, the Twitch API, on-chain events (Aptos, BSC, Ethereum, Polygon), and third-party APIs into DynamoDB and PostgreSQL
  • Airflow DAG management — author, debug, optimise, and monitor Airflow DAGs for scheduled processing and pipeline orchestration (see the illustrative sketch after this list)
  • Infrastructure and access management — maintain AWS IAM, Lake Formation, and S3 bucket policies; manage data engineer access controls; troubleshoot Superset permissions and connectivity

Additional scope:

  • Cloud data transfers — manage large-scale S3-to-Google Drive transfers with rclone (see the sketch after this list), cross-region data movement, and vendor data sharing infrastructure
  • QC and annotation tooling — support the FastAPI-backed QC portal used by annotation workers; extend data validation and quality-check scripts across egocentric video and audio datasets
  • Schema design — contribute to the Universal Data Schema (UDS) for audio, image, and code modalities in the Humyn Labs dataset marketplace
  • Data science support — collaborate with researchers on data preparation, feature extraction, and exploratory analysis workflows

What You Should Have

Required:

  • 4+ years in a data engineering role with end-to-end pipeline ownership
  • Strong Python — async patterns, API clients, subprocess management, and data processing at scale
  • Hands-on production experience with AWS: Athena, Glue, S3, DynamoDB, and Lake Formation
  • Apache Hudi or Delta Lake experience — schema evolution, partition strategies, and table management
  • Strong SQL — able to write and optimise complex analytical queries across large datasets
  • Experience with Airflow or an equivalent workflow orchestrator — DAG authoring, debugging, and performance optimisation
  • Solid understanding of data modelling and lakehouse/warehouse design concepts
  • Familiarity with pipeline observability — alerting, logging, and data quality monitoring (e.g. Great Expectations, CloudWatch)
  • Working knowledge of Git and software engineering best practices
  • Basic Docker / containerisation experience

Good to have:

  • Experience working with multimodal or egocentric datasets (video, audio, image at scale)
  • Familiarity with blockchain data structures — on-chain events, wallet transactions, DEX swaps
  • Experience with rclone, large-scale file transfer, or cloud-to-cloud sync pipelines
  • Exposure to data science workflows — feature extraction, exploratory analysis, or working alongside ML/research teams