We are seeking an experienced Cloud Engineer to design, implement, and manage our multi-cloud infrastructure. The ideal candidate will have deep expertise in cloud platforms, container orchestration, infrastructure automation, CI/CD pipelines, and observability solutions, ensuring scalable, reliable, and cost-effective cloud operations across multiple cloud providers.
Principal Duties and Responsibilities
Cloud Infrastructure Management
- Design, deploy, and manage multi-cloud infrastructure across Google Cloud Platform (GCP), Amazon Web Services (AWS), Azure, and Oracle Cloud Infrastructure (OCI)
- Architect and implement highly available, fault-tolerant, and scalable cloud solutions
- Manage cloud resources including compute instances and networking components
- Design and implement disaster recovery and business continuity plans for cloud workloads
- Migrate on-premises applications and services to cloud environments with minimal disruption
- Optimize cloud resource utilization and implement auto-scaling policies
- Maintain comprehensive documentation of cloud architectures, configurations, and runbooks
Kubernetes & Container Orchestration
- Design, deploy, and manage production-grade Kubernetes clusters across multiple cloud providers
- Implement and maintain container orchestration strategies for microservices architectures
- Configure and manage Kubernetes resource objects
- Manage Kubernetes cluster upgrades, scaling, and performance optimization
- Troubleshoot complex container and orchestration issues in production environments
- Implement multi-cluster and multi-region Kubernetes deployments for high availability
Service Mesh & Advanced Networking
- Design, deploy, and manage Istio service mesh for microservices communication and observability
- Configure Istio traffic management, including virtual services, destination rules, and gateways
- Implement advanced traffic routing (canary deployments, A/B testing, traffic splitting) using Istio
- Deploy and manage Istio observability components (telemetry, distributed tracing, service graphs)
- Implement circuit breaking, retries, timeouts, and fault injection for resilience testing
- Configure Istio ingress and egress gateways for external traffic management
- Monitor and optimize service mesh performance and resource utilization
- Implement multi-cluster service mesh architectures across different cloud providers
Reverse Proxy & Load Balancing
- Deploy, configure, and manage HAProxy for high-performance load balancing and reverse proxying
- Implement HAProxy ACLs, backend routing, health checks, and session persistence
- Design and implement Nginx as a reverse proxy for web applications and API gateways
- Configure Nginx for rate limiting and request filtering
- Implement Nginx load balancing algorithms and upstream health monitoring
- Manage Nginx Plus features for advanced traffic management and monitoring
- Optimize HAProxy and Nginx performance for high-throughput environments
Infrastructure as Code & Configuration Management
- Develop and maintain infrastructure as code using Terraform
- Create reusable, modular Terraform configurations for various cloud resources
- Implement Terraform state management and remote backends
- Design and implement configuration management solutions using Ansible
- Develop Ansible playbooks and roles for automated server provisioning and configuration
- Integrate Terraform and Ansible workflows for end-to-end infrastructure automation
- Implement infrastructure version control, code review processes, and GitOps practices
- Manage infrastructure drift detection and remediation
- Create and maintain infrastructure documentation and architecture diagrams
- Implement policy-as-code using tools like OPA (Open Policy Agent) or Sentinel
CI/CD Pipeline Management
- Design, implement, and maintain continuous integration pipelines using Jenkins and Harness
- Optimize build times and pipeline efficiency
- Integrate security scanning (SAST, DAST, container scanning) into CI/CD pipelines
- Configure Jenkins jobs, pipelines, and shared libraries for automated builds
- Configure build agents, runners, and execution environments
- Implement Harness deployment pipelines for cloud-native applications
- Integrate CI/CD pipelines with version control systems (Git, GitHub, GitLab)
- Implement continuous deployment workflows using ArgoCD for Kubernetes-based applications
- Design and implement GitOps workflows with ArgoCD for declarative application delivery
- Manage ArgoCD application definitions, sync policies, and multi-cluster deployments
- Implement progressive delivery strategies (blue-green deployments, canary releases) using ArgoCD
Message Streaming & Event-Driven Architecture
- Deploy and manage Apache Kafka clusters for real-time data streaming and event-driven architectures
- Configure Kafka topics, partitions, replication factors, and retention policies
- Implement Kafka Connect for data integration with various sources and sinks
- Monitor Kafka cluster health, performance metrics, and consumer lag
- Optimize Kafka performance for high-throughput and low-latency use cases
- Troubleshoot Kafka producer and consumer issues
Database & Proxy Management
- Deploy, configure, and manage ProxySQL for MySQL load balancing and high availability
- Implement query routing, caching, and connection pooling strategies using ProxySQL
- Optimize database performance through ProxySQL query analysis and tuning
- Implement database failover and disaster recovery using ProxySQL
- Monitor ProxySQL metrics and troubleshoot connection and performance issues
- Integrate ProxySQL with database clusters and replication topologies
- Implement database access security and audit logging through ProxySQL
Cloud Networking
- Design and implement cloud networking architectures, including VPCs, subnets, and network segmentation
- Configure and manage cloud load balancers (Application Load Balancers, Network Load Balancers, Cloud Load Balancing)
- Implement VPN connections, Direct Connect/Interconnect, and hybrid cloud networking solutions
- Implement network security controls, including security groups, network ACLs, and firewall rules
- Implement network monitoring and traffic analysis
- Troubleshoot complex networking issues across multi-cloud environments
- Design and implement private connectivity between cloud providers
Secrets Management & Security
- Configure and manage HashiCorp Vault for centralized secrets management across multi-cloud environments
- Configure Vault secrets engines (KV, database, PKI, AWS, GCP, Azure dynamic secrets)
- Manage Vault high availability clusters and disaster recovery procedures
- Implement dynamic database credentials and secret rotation strategies
- Manage Vault encryption as a service for application-level encryption
- Implement Vault agent and sidecar injectors for Kubernetes workloads
- Migrate secrets from legacy systems to Vault
Qualifications, Competency & Skills Required
Education & Experience
- Bachelor's degree or diploma in Computer Science, Information Technology, Engineering, or a related field
- Minimum of 5 years of proven experience in cloud engineering, DevOps, or platform engineering roles
- Hands-on experience managing production workloads across multiple cloud platforms
- Relevant cloud and technology certifications are highly desirable
Technical Skills
Cloud Platforms (Required)
- Google Cloud Platform (GCP): Deep expertise in Compute Engine, GKE, Cloud Storage, Cloud SQL, VPC, Cloud Functions, Cloud Run, IAM
- Amazon Web Services (AWS): Proficiency in EC2, EKS, S3, RDS, VPC, Lambda, ECS, CloudFormation, IAM
- Microsoft Azure: Experience with Virtual Machines, AKS, Blob Storage, Azure SQL, Virtual Networks, Azure Functions, ARM templates
- Oracle Cloud Infrastructure (OCI): Familiarity with Compute, OKE, Object Storage, networking, and OCI-specific services
- Multi-cloud architecture design and implementation experience
- Cloud migration strategies and execution (lift-and-shift, re-platforming, re-architecting)
Container & Orchestration (Required)
- Expert-level Kubernetes knowledge, including cluster architecture, networking, storage, and security
- Hands-on experience with managed Kubernetes services (GKE, EKS, AKS)
- Proficiency in Docker containerization, image optimization, and registry management
- Experience with Helm charts for application packaging and deployment
- Knowledge of container runtime environments (containerd, CRI-O)
Service Mesh & Microservices (Required)
- Istio: Deep expertise in Istio architecture, deployment, and operations
- Istio traffic management (virtual services, destination rules, gateways, service entries)
- Istio security features (mTLS, authorization policies, peer authentication, request authentication)
- Istio observability and telemetry configuration
- Multi-cluster and multi-mesh deployments
- Service mesh troubleshooting and performance optimization
- Understanding of sidecar proxy patterns and Envoy proxy
- Experience with other service mesh solutions (Linkerd, Consul Connect) is a plus
Reverse Proxy & Load Balancing (Required)
- HAProxy: Advanced configuration and management for load balancing and high availability
- Nginx: Expert-level configuration as a reverse proxy and API gateway
- Nginx rate limiting and performance tuning
- Nginx load balancing algorithms and upstream configurations
- Experience with Nginx modules and custom configurations
- High availability configurations using keepalived, VRRP, or similar
- Integration with Kubernetes ingress controllers (Nginx Ingress, Istio Ingress)
Infrastructure as Code (Required)
- Advanced Terraform skills for multi-cloud infrastructure provisioning
- Terraform module development, state management, and workspace strategies
- Proficiency in Ansible for configuration management and automation
- Ansible playbook development, roles, and inventory management
- Experience with version control systems (Git) and GitOps workflows
- Infrastructure testing frameworks (Terratest, Kitchen-Terraform)
CI/CD Tools (Required)
- Jenkins: Pipeline development (declarative and scripted), shared libraries, plugin management
- Harness: Deployment pipeline configuration, workflow creation, approval gates
- ArgoCD: GitOps workflows, application synchronization, multi-cluster management
- Integration of CI/CD tools with Kubernetes and cloud platforms
- Automated testing and deployment strategies
- Artifact repository management (Nexus, Artifactory, cloud-native registries)
Messaging & Streaming (Required)
- Apache Kafka architecture, cluster management, and operations
- Kafka topic design, partitioning strategies, and performance tuning
- Kafka Connect experience
- Experience with Kafka management tools (Kafka Manager, Cruise Control)
- Understanding of event-driven architectures and patterns
Database & Proxy Technologies (Required)
- ProxySQL configuration, management, and optimization
- MySQL database administration basics
- Understanding of database replication and clustering
Observability & Monitoring (Required)
- Prometheus metrics collection, PromQL, and alerting rules
- Grafana dashboard design and visualization techniques
- Log aggregation and analysis
Networking (Required)
- Deep understanding of TCP/IP, DNS, HTTP/HTTPS, and network protocols
- Cloud networking concepts (VPC, subnets, routing tables, NAT, VPN)
- Load balancing strategies and implementations
- Service discovery and DNS-based routing
- Network security and firewall configuration
- Software-defined networking (SDN) concepts
Scripting & Programming
- Proficient in scripting languages: Python, Bash
- Go or Python programming basics for tooling development
- YAML and JSON for configuration management
- Understanding of software development best practices
Secrets Management (Required)
- HashiCorp Vault: Advanced knowledge of Vault architecture, deployment, and operations
- Vault authentication methods and integration with cloud providers and Kubernetes
- Vault secrets engines (KV v1/v2, database, transit, cloud dynamic...