Sr/Staff Software Engineer, Fleet Reliability and Performance
Central Denmark Region, Denmark
Posted on Tuesday, September 5, 2023
About The Role
We build the foundation for Uber’s fleet of hundreds of thousands of hosts and VMs by ensuring they run reliably and are configured optimally for the container platforms that use them. We monitor for and detect a broad range of reliability and quality problems through codified processes and automatically drive remediation.
We run generically across bare metal hosts and VMs, and across our own on-prem data centers and multiple cloud vendors, and we collaborate closely to develop integrations that ensure effective and automated management of the many hosts and VMs.
Internally, we integrate with Uber’s stateful and stateless container scheduling platforms to orchestrate host operations safely and efficiently. We use this to remediate bad hosts and to apply fleet-wide upgrades such as rolling out a new kernel.
We own the base OS image and the kernel deployed on the fleet, and we handle fleet-wide kernel upgrades and configuration. We provide high-fidelity host and container metrics to ensure secure and optimal performance for the workloads on the hosts.
Our team is a healthy combination of junior and senior engineers with a broad range of experience across the industry. We value ideas over hierarchy, always improving, getting things done through code, and having a measurable impact on the business.
What You Will Do
You will draw on your software engineering, systems engineering, hardware/Linux OS/kernel knowledge, cloud knowledge, and infrastructure systems experience to investigate and solve ambiguous problems in our production fleet, while also contributing to planning, new systems design, and improvement of existing systems to enable even greater efficiency and insight.
- Contribute to the planning, design, architecture, and building of systems, tooling, and observability in support of production server fleet reliability and cloud expansion efforts.
- Actively drive collaboration across multiple teams to create alignment and progress.
- Implement solutions in Go with a strong focus on clean, readable code with unit and integration test coverage.
- Perform low-level debugging of host-level issues and generalize their detection.
- Take active part in code change peer-reviews to ensure quality and knowledge sharing across the team.
- Contribute to engineering culture in terms of quality, monitoring and on-call practices.
- Own part of the team’s charter and, through that, help set the longer-term direction for the team.
Qualifications
- 5+ years of experience
- BS, MS, or PhD degree in computer science or a similar technical field of study, or equivalent practical experience
- Background in multiple programming languages, e.g., C/C++ and Go
- Strong hands-on experience investigating and debugging performance problems on Linux
- An inherent aim to collaborate, both within the team and across orgs
- Excellent written and verbal communication skills, and the ability to write detailed design documents and postmortems
- A belief that your team can accomplish more together than as separate individuals
- Attention to detail, particularly around software engineering fundamentals, testing methodologies, and quality
- Strong understanding of Linux kernel internals, e.g., ability to read and understand kernel code.
- Experience in kernel and hardware performance evaluation, tuning, and debugging.
- An understanding of server hardware at scale: data center network fundamentals, OS imaging, provisioning, distribution, and configuration deployment.
- Experience with cloud and migration to cloud is a plus.
- Experience with large distributed systems.
- Experience with containerization software such as Kubernetes, Docker, Mesos.
- Comfortable working with on-prem and cloud-based infrastructure (AWS, GCP).