Aziz Kurbanov

Principal Platform Architect & Open Source Author


Professional Highlights

Agentic AI & MLOps Systems

Distributed Robotics & Edge Computing

Cloud Infrastructure & Platform Engineering

Kubernetes & Container Orchestration

Bare Metal & Edge Infrastructure

Infrastructure as Code & GitOps

Observability & Monitoring

AI/GPU Infrastructure

Security & Compliance

Test Automation & Quality Engineering

As the first QA hire at multiple organizations, established QA processes, test infrastructure, and automation frameworks. Led test automation across e-commerce, mobile, firmware, IoT, and ERP integrations. Created comprehensive CI/CD testing pipelines with Kubernetes test pods, Datadog instrumentation, and PagerDuty incident triggers.


Technology Stack

Cloud Platforms

AWS (EKS, EC2, Lambda, ECS Fargate, Aurora, RDS, S3, Route53, ECR, Organizations automation, Auto Scaling, CloudWatch) with expertise in multi-cloud architecture patterns. Managed infrastructure at petabyte scale with Graviton instances for energy efficiency and cost optimization. Azure (AKS, Database Services, Network Watcher, Log Analytics, Autoscaling) with Just-In-Time Access implementation and multi-account governance. GCP (GKE, Cloud Run, Cloud Build, Autoscaling) for MLOps infrastructure and serverless deployments. Multi-cloud migration experience including EKS-to-AKS migrations across thousands of clusters.

Kubernetes & Orchestration

Kubernetes cluster design, scaling, and production operations across AWS (EKS), Azure (AKS), and GCP (GKE). Helm and Kustomize for templating and customization. ArgoCD and Flux CD for declarative GitOps deployments. Argo Workflows for complex workflow orchestration. KubeEdge for edge computing at network periphery. Kopf-based operators for custom Kubernetes controllers and resource management. Istio for service mesh and advanced traffic management. Cilium CNI for eBPF-based networking with advanced L2/L3 capabilities. Multi-tenancy patterns, namespace isolation, RBAC security models, and cost optimization strategies.

Infrastructure as Code & GitOps

Terraform for infrastructure provisioning and multi-cloud deployments. Ansible for configuration management and automation. GitHub Actions for CI/CD pipelines. Everything-as-Code culture following Twelve-Factor App principles. GitOps workflows enabling reproducible, auditable cluster provisioning with complete traceability. Flux CD and ArgoCD for declarative infrastructure management. Velero for backup/restore and disaster recovery. AWS Organizations automation for multi-account governance. Reduced MTTR (Median Time To Release) by 98.96% through comprehensive automation. VPC/subnet architecture, load balancing strategies, VPN/WireGuard configurations, and network segmentation across cloud providers.

Databases & Data

PostgreSQL with pgvector for semantic search and RAG systems. MySQL with persistence layer optimization. Redis for KV optimization, memory efficiency, and distributed caching. Apache Kafka for real-time agent communication and event streaming at scale. RabbitMQ for message queuing and async processing. MinIO object store for artifact storage and S3-compatible persistence. Qdrant and Pinecone for vector databases enabling semantic search. Amazon Kinesis and Firehose for real-time data streaming. Amazon Redshift for analytics and data warehousing. Google BigQuery for distributed analytics. Parquet for columnar data storage and efficient analytics. High-volume workload management exceeding 10 petabytes in traffic with optimized data pipelines.

AI/MLOps & LLMs

GPU acceleration with CUDA and MIG (Multi-Instance GPU) configurations. Production experience managing 500+ GPU nodes with 8x H100s per node in large-scale inference clusters. LLM deployments (HuggingFace, TensorFlow, PyTorch, LM Studio) supporting Text, Audio, and Visual modalities. KV cache optimization for LLM inference performance and throughput maximization. SLMs (Small Language Models) for edge deployment. RAG (Retrieval-Augmented Generation) systems with vector databases. AI Agents and agentic orchestration with A2A (Agent-to-Agent) patterns. SageMaker for managed ML operations. Slurm and vLLM for distributed inference. Kubeflow and Metaflow for ML workflow orchestration. Custom artifact versioning tooling for reproducible ML experiments. Advanced inference optimization including quantization, dynamic batching, and scheduling.

Observability & Monitoring

Datadog for comprehensive observability across infrastructure and applications. Prometheus for metrics collection and time-series storage. Grafana for visualization and dashboard creation. ELK Stack (Elasticsearch, Logstash, Kibana) for log aggregation and analysis. Netdata for real-time system monitoring. CloudWatch for AWS-native monitoring and alerting. Distributed tracing for end-to-end visibility across microservices. SLO/SLA definition, monitoring, and enforcement. Log aggregation pipelines and alerting hierarchies. Synthetic transaction monitoring and real user monitoring (RUM). Proactive alerting and incident response workflows. Custom dashboards and runbook automation.

Security & Compliance

Lacework for cloud security and compliance automation. GuardDuty for threat detection and anomaly analysis. OWASP ZAP for automated dynamic security testing. WAF (Web Application Firewall) deployment and management. DevSecOps practices with automated security scanning throughout CI/CD. Keycloak SSO with MFA (Multi-Factor Authentication). AWS IAM, Azure AD, and Okta integration via OIDC/SAML. IAM/RBAC design and implementation. Just-In-Time Access for privileged operations. Britive for identity and access governance. CloudTrail for comprehensive audit logging and compliance tracking. WireGuard VPN for encrypted network communication. ISO 27001 certification leadership and recertification. SOC 2 Type II readiness and compliance architecture. NIST SP 800-53 control implementation and continuous monitoring. Security-first architecture patterns with zero-trust principles.

Languages & Tools

Python (Django, Flask) for backend systems and infrastructure automation. Bash for scripting and DevOps automation. Git for version control and GitOps workflows. Docker for containerization with DockerHub and ECR registries. Terraform and Ansible for infrastructure automation. Selenium and Appium for cross-browser and mobile test automation. JMeter for performance and load testing. Linux for operating systems and container hosts. ROS2 and FastDDS for robotics applications and distributed systems. eBPF and Cilium CNI for advanced networking and observability. TCP/IP, HTTP, and MQTT protocols. Wireshark for network packet analysis and debugging. Advanced understanding of the OSI model, BGP routing, and L2/L3 networking for inter-cluster communication and complex network architectures.


Open Source Projects

robotics-k8s-infra

Production-grade Kubernetes platform for cloud/edge robotics. Features KubeEdge, ROS2, ArgoCD, multi-domain DDS networking, eBPF-based Cilium CNI for UDP multicast discovery, CLI tools for edge node lifecycle management, and GitOps-first provisioning.

AQE (Agentic Quality Engineering Platform)

MLOps-driven agentic platform orchestrating distributed AI agents for web state capture, LLM-powered change detection, and RAG-enabled Playwright test generation. Achieves reproducible end-to-end test automation for complex web applications.

ESDDNS

Kopf-based Kubernetes operator for automated IPv4 DNS A record drift detection and state management using finite state machine modeling. Includes Grafana Cloud monitoring, CI/CD pipelines with GitHub Actions, and bare-metal MicroK8s cluster automation.
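
As a rough illustration of the finite state machine idea only, drift detection for a single A record might be sketched as below. This is not code from ESDDNS: the state names, event names, and helper functions are all hypothetical, and a real operator would drive the transitions from its own reconcile handlers.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states for one DNS A record; not taken from ESDDNS.
    IN_SYNC = "in_sync"
    DRIFTED = "drifted"
    UPDATING = "updating"


# Allowed transitions, keyed by (current state, event name).
TRANSITIONS = {
    (State.IN_SYNC, "ip_changed"): State.DRIFTED,
    (State.DRIFTED, "update_started"): State.UPDATING,
    (State.UPDATING, "update_succeeded"): State.IN_SYNC,
    (State.UPDATING, "update_failed"): State.DRIFTED,
}


def detect_event(state, record_ip, observed_ip):
    """Return 'ip_changed' when an in-sync record no longer matches, else None."""
    if state is State.IN_SYNC and record_ip != observed_ip:
        return "ip_changed"
    return None


def step(state, event):
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes every legal state change explicit, so an unexpected event cannot move a record into an undefined state.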

urlstatus

Fully agentic, A2A-compliant web crawler agent with Python, Flask, and modular async crawling logic. Exposes CLI and HTTP+JSON/JSON-RPC interfaces. Features Analyzer Agent (LLM-driven failure analysis) and GitHub Code Analysis Agent with seamless GitHub API integration.

fizmatmod

Physics-based framework applying classical mechanics principles (Hamiltonian and Lagrangian formulations) to software project modeling. Enables predictive analysis of project trajectories and quantifies team momentum and project friction as measurable, optimizable values.


Platform & SRE Leadership

As a platform architect and SRE systems leader, I've consistently championed infrastructure excellence and engineering velocity. My leadership philosophy centers on building self-service platform ecosystems that empower teams by eliminating toil, automating lifecycle operations, and establishing infrastructure as the foundation for organizational scaling. I maintain active technical roadmaps, anticipate future infrastructure trends, and architect platforms that scale ahead of organizational growth while leveraging best-in-class tooling and emerging technologies.

Key infrastructure initiatives include:


Recognition & Major Initiatives


Contact


Open to discussing platform engineering challenges, infrastructure architecture, AI/MLOps systems, and open source collaboration.