
7+ Years

Remote

Full Time
As the AI Systems Architect, you’ll own the end-to-end design and delivery of production-grade agentic and Generative AI systems. This is a highly hands-on role requiring deep architectural insight, coding proficiency, and an obsession with performance, scalability, and reliability. You’ll architect secure, cost-efficient AI platforms on AWS, guide developers through complex debugging and optimization, and ensure all systems are observable, governed, and production-ready.
Required Qualifications
Education: Bachelor’s/Master’s from a top-tier institute (IIT/Tier-1) in Computer Science, AI, or related field.
Experience: 7–10 years in software/AI engineering, including 4+ years in GenAI application development and 2+ years architecting agentic AI systems.
Expert in Python 3.11+ (asyncio, typing, packaging, profiling, pytest).
Hands-on experience with Semantic Kernel, LangGraph, AutoGen, or CrewAI.
Proven delivery of GenAI/RAG systems on AWS Bedrock or equivalent vector-based platforms (OpenSearch Serverless, Pinecone, Redis).
Deep understanding of the AWS ecosystem: EKS, Bedrock, S3, SQS/SNS, RDS, ElastiCache, Secrets Manager, IAM/Okta, Kong API Gateway.

Key Responsibilities
Architect Production AI Systems: Design robust end-to-end architectures for agentic systems (planning, reasoning, tool-calling), GenAI/RAG pipelines, and evaluation workflows. Produce detailed design documents including flow/UML/sequence diagrams and AWS deployment topologies. Ensure architectures support advanced LLM training and inference workflows, incorporating distributed strategies for scalability.
Optimize for Cost & Performance: Model throughput, latency, concurrency, autoscaling, CPU/GPU sizing, and vector index performance to ensure scalable, efficient deployments. Include optimization for multi-node GPU clusters and distributed training efficiency to reduce compute overhead.
Lead Debugging & Stability Efforts: Conduct deep-dive debugging, fix critical defects, and resolve production incidents; pair-program with developers to improve code quality and performance. Apply MLOps-driven stability practices, leveraging configuration management and automated recovery for high availability.
Standardize Agentic Frameworks: Build reference implementations using Semantic Kernel (preferred), LangGraph, AutoGen, or CrewAI with strong schema validation, grounding, and memory management.
Implement Observability & Monitoring: Set up distributed tracing, metrics, and logging via OpenTelemetry and Datadog. Standardize dashboards, alerts, and incident response workflows.
Govern Evaluation & Rollouts: Build test and evaluation frameworks—golden sets, A/B experiments, regression suites, and controlled rollouts—to ensure consistent quality across releases.
Establish Engineering Standards: Create reusable SDKs, connectors, CI/CD templates, and architecture review checklists to promote consistency across teams.
Cross-Functional Leadership: Collaborate with product, data, and SRE teams on capacity planning, DR strategies, and post-incident RCA reviews. Mentor engineers to strengthen design and reliability practices.

Notice Period: Immediate to 15 Days