
7+ Years

Remote

Full Time
As the AI Systems Architect, you’ll own the end-to-end design and delivery of production-grade agentic and Generative AI systems. This is a highly hands-on role requiring deep architectural insight, coding proficiency, and an obsession with performance, scalability, and reliability. You’ll architect secure, cost-efficient AI platforms on AWS, guide developers through complex debugging and optimization, and ensure all systems are observable, governed, and production-ready.
Required Qualifications
Education: Bachelor’s/Master’s from a top-tier institute (IIT/Tier-1) in Computer Science, AI, or related field.
Experience: 7–10 years in software/AI engineering, including 4+ years in GenAI application development and 2+ years architecting agentic AI systems.
Expert in Python 3.11+ (asyncio, typing, packaging, profiling, pytest).
Hands-on experience with Semantic Kernel, LangGraph, AutoGen, or CrewAI.
Proven delivery of GenAI/RAG systems on AWS Bedrock or equivalent vector-based platforms (OpenSearch Serverless, Pinecone, Redis).
Deep understanding of the AWS ecosystem: EKS, Bedrock, S3, SQS/SNS, RDS, ElastiCache, Secrets Manager, IAM/Okta, Kong API Gateway.

Key Responsibilities
Architect Production AI Systems: Design robust end-to-end architectures for agentic systems (planning, reasoning, tool-calling), GenAI/RAG pipelines, and evaluation workflows. Produce detailed design documents including flow/UML/sequence diagrams and AWS deployment topologies. Ensure architectures support advanced LLM training and inference workflows, incorporating distributed strategies for scalability.
Optimize for Cost & Performance: Model throughput, latency, concurrency, autoscaling, CPU/GPU sizing, and vector index performance to ensure scalable, efficient deployments. Include optimization for multi-node GPU clusters and distributed training efficiency to reduce compute overhead.
Lead Debugging & Stability Efforts: Conduct deep-dive debugging, fix critical defects, and resolve production incidents; pair-program with developers to improve code quality and performance. Apply MLOps-driven stability practices, leveraging configuration management and automated recovery for high availability.
Standardize Agentic Frameworks: Build reference implementations using Semantic Kernel (preferred), LangGraph, AutoGen, or CrewAI with strong schema validation, grounding, and memory management.
Implement Observability & Monitoring: Set up distributed tracing, metrics, and logging via OpenTelemetry and Datadog. Standardize dashboards, alerts, and incident response workflows.
Govern Evaluation & Rollouts: Build test and evaluation frameworks—golden sets, A/B experiments, regression suites, and controlled rollouts—to ensure consistent quality across releases.
Establish Engineering Standards: Create reusable SDKs, connectors, CI/CD templates, and architecture review checklists to promote consistency across teams.
Cross-Functional Leadership: Collaborate with product, data, and SRE teams on capacity planning, DR strategies, and post-incident RCA reviews. Mentor engineers to strengthen design and reliability practices.

Notice Period: Immediate to 15 Days