Posted by
Banashree
Talent Acquisition Manager at Open Network for Digital Commerce (ONDC)
Last Active: 01 December 2025
Posted in
IT & Systems
Job Code
1631563

About the Role:
- We are looking for an exceptional Data Architect to help design, scale, and optimize the next generation of India's open-commerce data infrastructure. At ONDC, we process 10 TB of data daily, maintain 1 billion+ live entities, and handle 300K+ requests per second - operating at true internet scale.
- You will work at the heart of this system, shaping data platforms that address all three Vs of big data: volume, velocity, and variety.
Key Responsibilities:
Architect & Evolve Modern Data Platforms:
- Design, optimize, and scale data lakehouse architectures across multi-cloud environments (AWS, GCP, Azure), supporting both ad-hoc analytics and batch workloads.
Medallion & Lakehouse Design:
- Implement and evolve bronze-silver-gold (medallion) data pipelines using columnar formats (Parquet, ORC) and open table formats such as Apache Iceberg, ensuring cost-efficient, schema-evolution-friendly storage.
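The medallion flow above can be sketched conceptually in plain Python (standing in for Spark/Iceberg jobs; the field names and business key are illustrative assumptions, not ONDC's actual schema):

```python
# Conceptual bronze -> silver -> gold medallion flow.
# Plain Python stands in for Spark/Iceberg jobs; fields are illustrative.

def bronze_to_silver(raw_events):
    """Clean and deduplicate raw (bronze) records into silver rows."""
    seen, silver = set(), []
    for e in raw_events:
        if e.get("order_id") is None:       # drop malformed rows
            continue
        if e["order_id"] in seen:           # deduplicate on the business key
            continue
        seen.add(e["order_id"])
        silver.append({"order_id": e["order_id"],
                       "amount": float(e.get("amount", 0)),
                       "city": (e.get("city") or "unknown").lower()})
    return silver

def silver_to_gold(silver):
    """Aggregate silver rows into a gold, analytics-ready summary."""
    gold = {}
    for row in silver:
        agg = gold.setdefault(row["city"], {"orders": 0, "revenue": 0.0})
        agg["orders"] += 1
        agg["revenue"] += row["amount"]
    return gold

raw = [{"order_id": 1, "amount": "250", "city": "Pune"},
       {"order_id": 1, "amount": "250", "city": "Pune"},   # duplicate
       {"order_id": None, "amount": "99"},                 # malformed
       {"order_id": 2, "amount": "100", "city": "pune"}]
print(silver_to_gold(bronze_to_silver(raw)))
```

Each layer only ever reads from the layer below it, which is what makes medallion storage cheap to reprocess when a schema evolves.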
Data Discovery & Schema Governance:
- Build and maintain central schema repositories and data discovery services leveraging technologies like AWS Glue, Hive Metastore, Apache Iceberg, and Delta Lake.
Streaming Architecture:
- Architect and deploy real-time data streaming frameworks using Kafka or Pulsar, ensuring low-latency data flow, schema validation, and replayability.
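The two properties named above, schema validation and replayability, can be illustrated with a toy in-memory log (a stand-in for a Kafka/Pulsar topic partition; this is not either system's API, and the schema is an invented example):

```python
# Toy stand-in for a validated, replayable stream (not Kafka's API).
# Shows schema enforcement at produce time and replay from an offset.

SCHEMA = {"order_id": int, "amount": float}    # illustrative schema

class ValidatedLog:
    def __init__(self):
        self._log = []                          # append-only, like a partition

    def produce(self, event):
        for field, ftype in SCHEMA.items():
            if not isinstance(event.get(field), ftype):
                raise ValueError(f"schema violation on {field!r}")
        self._log.append(event)
        return len(self._log) - 1               # offset of the new record

    def replay(self, from_offset=0):
        """Re-read events from a given offset (replayability)."""
        return self._log[from_offset:]

log = ValidatedLog()
log.produce({"order_id": 1, "amount": 99.0})
log.produce({"order_id": 2, "amount": 50.0})
print(log.replay(from_offset=1))
```

In a real deployment the same roles are played by a schema registry at the producer edge and consumer offset management on the broker side.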
Performance & Cost Optimization:
- Identify and implement cost-saving measures across data pipelines - data compression, columnar optimization, storage tiering, and query performance tuning.
Data Orchestration & Workflow Automation:
- Build and maintain orchestration frameworks using Airflow, Dagster, or Argo Workflows, ensuring observability, failure recovery, and lineage tracking.
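The core orchestration idea, running tasks in dependency order, can be sketched as a tiny DAG runner (a toy illustration of the pattern behind Airflow/Dagster, not their APIs; task names are hypothetical):

```python
# Minimal sketch of DAG-style orchestration: each task runs only after
# all of its upstream dependencies have completed.

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> list of upstream names.
    Returns the order in which tasks actually ran."""
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for up in deps.get(name, []):
            run(up)                    # ensure upstreams finish first
        tasks[name]()
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order

log = []
tasks = {"extract": lambda: log.append("extract"),
         "transform": lambda: log.append("transform"),
         "load": lambda: log.append("load")}
deps = {"transform": ["extract"], "load": ["transform"]}
print(run_dag(tasks, deps))
```

Production orchestrators layer retries, scheduling, observability, and lineage metadata on top of exactly this dependency-resolution core.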
OLAP Systems & Query Acceleration:
- Design and tune analytical workloads using Snowflake, Redshift, ClickHouse, or Druid, supporting large-scale aggregation and real-time exploration.
DevOps & Infrastructure as Code:
- Collaborate with DevOps teams to define reproducible infrastructure using Terraform / CloudFormation / Pulumi. Deploy, monitor, and optimize Kubernetes-based data services.
Natural Language Querying:
- Contribute to frameworks enabling natural-language analytics, integrating LLM-powered question-answering systems over structured data.
Required Skills & Experience:
- 8+ years of experience in data architecture, platform engineering, or distributed systems.
- Proven expertise in Spark, Snowflake, Redshift, and SQL-based data modeling.
- Hands-on experience with streaming (Kafka/Pulsar) and batch processing frameworks.
- Deep understanding of cloud-native and cloud-agnostic architectures.
- Practical experience implementing lakehouse / medallion models.
- Strong grasp of data lineage, cataloging, governance, and schema evolution.
- Exposure to columnar formats (Parquet/ORC) and query engines (Presto/Trino/DuckDB).
- Familiarity with Kubernetes, Docker, and microservices-based data deployment.
- Excellent problem-solving, documentation, and cross-team collaboration skills.
Preferred Qualifications:
- Experience in designing central schema and data discovery layers at scale.
- Prior exposure to large-scale public data networks (e.g., e-commerce, fintech, telecom).
- Understanding of AI/LLM-driven data access or semantic query layers.
- Contributions to open-source data infrastructure projects (nice to have).
Why Join Us:
- Shape the data backbone of India's digital commerce revolution.
- Work on massive-scale systems (10 TB/day, 1B+ entities, 300K RPS).
- Collaborate with leading data engineers, architects, and policymakers.
- Innovate at the intersection of data engineering, AI, and open networks.