Platform Specialist – Reliability

Houston | $220,000 – $240,000 + Bonus

Ncounter is supporting a global quantitative investment manager whose research and trading platforms run on large-scale, self-operated infrastructure. This position sits across compute, orchestration, observability, automation, and platform engineering, focusing on the reliability, performance, and operability of critical services.

You’ll work within a highly distributed environment, partnering with software and infrastructure teams to improve resilience, reduce operational overhead, and build the tooling and automation that keep the platform performing at scale.

Key Responsibilities

Improve reliability, resilience, and performance across platform services.
Operate and maintain self-managed Kubernetes environments.
Build automation and internal tooling using Python or Go.
Design observability through metrics, dashboards, alerting, and telemetry.
Support incident response, capacity planning, and service reliability initiatives.
Drive operational improvements through SLOs and reliability engineering practices.

Experience Required

Experience operating large-scale compute environments such as HPC, batch, grid, SLURM, Kubernetes batch, or similar platforms.
Strong Kubernetes administration experience, including etcd, cluster upgrades, API server tuning, CNI technologies (Calico/Cilium), storage, and ingress.
Linux troubleshooting expertise across RHEL, Rocky Linux, or Ubuntu environments.
Strong observability experience with Prometheus, Grafana, PromQL, and Alertmanager.
Experience with enterprise-scale time-series platforms such as VictoriaMetrics, Thanos, Cortex, or Mimir would be highly advantageous.
Infrastructure-as-Code, CI/CD, and automation experience.
Programming or scripting capability using Python or Go.
A reliability-first mindset, with experience of SLOs, incident management, and capacity planning.

This role would suit an engineer who has operated the infrastructure layers that managed services typically abstract away, and who sees reliability as something to be measured, engineered, and continuously improved.
