Senior Site Reliability Engineer, Supply
Company: Mithril
Location: San Francisco
Posted on: April 1, 2026
Job Description:
Mithril is actively seeking talented candidates at the Senior to
Principal level, with leveling determined based on experience and
demonstrated expertise. We welcome individuals who bring deep
technical knowledge, strategic thinking, and a track record of
impact, and we tailor roles to align with each candidate’s unique
strengths and career trajectory.

About Mithril

At Mithril, we are
transforming the way AI companies access compute power. Our mission
is to orchestrate the world’s compute capacity, making it easier to
use and optimized for AI workloads. We're building a new type of
public cloud—one designed specifically for AI, where accessing
high-performance compute is as simple and reliable as flipping a
switch. Spun out of a Stanford PhD lab just over a year ago,
Mithril has already gained the backing of top-tier investors like
Sequoia, Lightspeed, Jeff Dean, Eric Schmidt, and others. With $80M
in funding, we’ve secured customers and are generating revenue.
Today’s machine learning infrastructure is overly complex.
Engineers are forced to push through challenges in hardware and
capacity, when their focus should be on solving big problems and
driving innovation. Our Omnicloud Platform changes that. By
abstracting away hardware management and providing seamless access
to compute resources, we enable engineers to focus on what really
matters—building transformative AI solutions. We're not just making
compute accessible; we're making it flexible, scalable, and
tailored for the unique demands of cutting-edge AI. Our
infrastructure marketplace combines state-of-the-art hardware with
Mithril’s Omnicloud Platform and software strengths, enabling AI
companies to innovate faster and more efficiently.

Working at Mithril

As an engineer at Mithril, you’ll be at the
forefront of building the infrastructure that powers the future of
AI. Your role is critical—not just in scaling our systems, but in
ensuring they are reliable and secure at every level. You will help
Mithril build and operate solutions that harvest compute resources
worldwide and make them available to global customers across every
aspect of the emerging AI/ML ecosystem – from high-performance
clusters to foundation models to finetuning to inferencing to
mixture of experts to agentic application workflows. You will solve
world-class technical challenges in enabling cutting-edge AI
workloads to leverage humanity's knowledge, with planet-scale
computing. You will help our customers succeed on every front in
using the power of AI to make the world a better place. If you're
motivated by the challenge of scaling and securing the core
infrastructure behind AI and thrive in a fast-paced, high-impact
environment, Mithril is the place for you. Here, autonomy,
ownership, and high-quality engineering are paramount. You will be
part of a collaborative team pushing the boundaries of technology,
and be a dependable partner to our customers and suppliers. We
value qualities like a can-do attitude, clear communication,
meticulous engineering, and relentless innovation. Join us and be
part of something transformative!

Joining Mithril

As a key member of our
Supply engineering team, you’ll enable the sustainable, reliable
growth of Mithril’s compute supply. As a dedicated technical
representative, you’ll oversee day-to-day technical operations
across our supply-customer fleet and manage compute partner
relationships both before and after acquisition. You’ll apply
cutting-edge techniques and tooling with a focus on managing the
stability of advanced GPU services and optimizing monitoring,
alerting, and incident response frameworks, guided by Service Level
Indicators (SLIs) and Objectives (SLOs). Collaboration is at the
heart of this role—you’ll work closely with internal product teams
and external Mithril partners. Participation in an on-call rotation
will be essential to maintain service reliability. If you’re
passionate about engineering cloud-based datacenter reliability and
thrive in a dynamic environment where innovation and stability go
hand in hand, Mithril offers a unique opportunity to drive
impactful change and shape the future of our infrastructure.

Responsibilities
- Design, deploy, and manage scalable, secure, and highly available
Kubernetes clusters in both cloud and on-premises environments
- Execute, refine, and create Ansible playbooks to perform routine
maintenance, load testing, and system burn-in operations across
Mithril’s fleet
- Deploy and oversee monitoring systems, such as Grafana, to
proactively detect issues and anomalies in our supplier environment
- Establish and uphold service level objectives (SLOs) and service
level indicators (SLIs) to gauge and maintain system reliability
- Lead or participate in incident response and root cause analysis
- Provide regular updates on machine operability, swiftly notifying
internal and external partners of disruptions to maintain system
availability and supplier confidence
- Serve as the primary liaison with suppliers, maintaining a regular
meeting cadence to communicate Mithril’s requirements and address
supplier inquiries
- Coordinate cross-functional supply-related initiatives, ensuring
all stakeholders are informed, aligned, and prepared for upcoming
changes or maintenance events

Requirements
- Proven experience
deploying, scaling, and maintaining production-grade Kubernetes
clusters across both cloud and on-prem environments
- Bachelor’s degree in Computer Science, Computer Engineering, or a
related field, or equivalent professional experience
- Experience with Linux systems administration and command-line
interfaces
- Ability to create technical documentation and technical specs
- Scripting and automation skills (Python, Bash, or similar)
- Understanding of key infrastructure metrics (CPU, memory, network
utilization, error rates)
- Understanding of data center operations: disaster recovery,
maintenance schedules, capacity planning
- Strong written and verbal communication skills, with the ability
to translate technical concepts for various audiences
- Project management experience and ability to handle multiple
priorities
- Demonstrated problem-solving and analytical thinking skills
- Experience leading or participating in incident response and root
cause analysis

Nice to have
- Familiarity with GPU/CPU cluster management and
optimization
- Proficiency with Git or similar version control systems
- Experience with monitoring and observability tools such as
Prometheus or Grafana
- Experience in technical training or presenting technical content
- Prior experience as a Site Reliability Engineer (SRE) in the AI/ML
domain is highly desirable
- Familiarity with the challenges of scaling large-scale
infrastructure
- Familiarity with hardware lifecycle management (RMA)
- Experience in technical customer-facing or vendor-facing roles

Benefits
- Health, dental, and vision coverage for you and your dependents
- 401k Plan with 4% company match
- 21 days of PTO and 14 company holidays, including 2 floating
holidays

Salary Range Information

In consideration of market
analysis and various pertinent factors, the salary range for this
role is set between $170,000 and $230,000. However, adjustments
beyond this range could be warranted for candidates whose
qualifications differ substantially from those described in
the job description.

In-Office Requirement

At Mithril, we take our
work extremely seriously, though not always ourselves. We recognize
that we are striving to achieve something substantial—an
all-too-rare and elusive counterfactual contribution. Our work is
not easy, so we seek out any lever that can accelerate our progress
and increase the likelihood of realizing our full ambitions.
Working collaboratively in person is one such lever. Our
headquarters is in Palo Alto (next to Caltrain on University Ave.),
and we recently opened a new office in San Francisco (Financial
District/SoMa) for our teammates based there. We expect team
members to primarily work from their local office (Palo Alto or San
Francisco), with everyone gathering at HQ one day a week while our
team remains small and cross-team collaboration is critical. This
approach is built on trust. We take our mission seriously and are
committed to fostering an environment where you can make impactful
decisions and drive success. We also understand that life can
present challenges, and if extenuating circumstances arise, we’re
here to support you. Ultimately, we believe this guidance helps us
be as effective as possible while maintaining the spirit of
teamwork and flexibility.

Equal Opportunity Employer

Mithril
maintains a strict commitment to Equal Opportunity employment
practices. All applicants are evaluated without regard to race,
color, religion, creed, national origin, age, sex, gender, marital
status, sexual orientation and identity, genetic information,
veteran status, citizenship, or any other factors prohibited by
local, state, or federal law. We emphasize that candidates need not
fulfill every expectation listed to be eligible for this position.
Our objective is to cultivate a diverse team encompassing a
spectrum of backgrounds, experiences, and skill sets.