Description
This position is with a global leader in the design and development of advanced semiconductors, responsible for architectures and platforms at the core of some of the most innovative devices on the market. Its silicon engineering delivers high performance, energy efficiency, and intelligent integration, and the company plays a key role in advancing modern telecommunications through next-generation wireless connectivity technologies. Its solutions are integrated into billions of devices worldwide, offering an ideal professional environment for talent seeking technological impact, innovation, and growth in a global context.
Responsibilities
- Design, build, and manage cloud infrastructure (AWS + OpenStack) focused on scalability, high availability, performance, and cost efficiency.
- Implement and maintain Infrastructure as Code using Terraform, Ansible, and Kubernetes (Helm/manifests).
- Operate and scale production Kubernetes clusters while applying SRE principles (reliability, failure prevention, continuous improvement).
- Build and optimize CI/CD pipelines (e.g., Jenkins) and automate infrastructure provisioning and operations.
- Manage and scale data and streaming platforms (Kafka, NiFi, Elasticsearch, MySQL, Vertica, ZooKeeper).
- Develop and maintain observability systems (Prometheus, Grafana, ELK) and lead incident management (RCA, postmortems).
- Create automation solutions, including AI-assisted workflows and LLM-based operational agents.
- Maintain runbooks, improve operational processes, and reduce technical debt.
Minimum qualifications
- 3+ years of experience in infrastructure management, Linux administration, and production systems (AWS and/or OpenStack).
- Hands-on experience managing Kubernetes in production environments.
- Strong experience with Infrastructure as Code (Terraform, Ansible, or equivalent).
- Proficiency in Python and/or Go.
- Experience with CI/CD tools (e.g., Jenkins) and modern DevOps practices.
- Knowledge of distributed systems, microservices architectures, and reliability engineering.
- Experience with data and streaming platforms (Kafka, Elasticsearch, etc.).
- Experience with monitoring and observability tools (Prometheus, Grafana, ELK).
- Ability to maintain critical systems, optimize performance, and reduce technical debt.
Preferred qualifications
- Experience integrating AI/LLMs into operational workflows.
- Familiarity with agent-based automation and AI-driven operations.
- Experience building automated runbooks and operational bots.
- Advanced experience in capacity planning and multi-region disaster recovery.
- Background in large-scale SaaS or highly distributed environments.
- Experience improving alert quality and signal-to-noise ratio in monitoring systems.
Location: Santa Fé, Tijuana