Platform Specialist – Reliability
Houston | $220,000 – $240,000 + Bonus
Ncounter is supporting a global quantitative investment manager whose research and trading platforms run on large-scale, self-operated infrastructure. This position sits across compute, orchestration, observability, automation, and platform engineering, focusing on the reliability, performance, and operability of critical services.
You’ll work within a highly distributed environment, partnering with software and infrastructure teams to improve resilience, reduce operational overhead, and build the tooling and automation that keep the platform performing at scale.
Key Responsibilities
- Improve reliability, resilience, and performance across platform services.
- Operate and maintain self-managed Kubernetes environments.
- Build automation and internal tooling using Python or Go.
- Design observability through metrics, dashboards, alerting, and telemetry.
- Support incident response, capacity planning, and service reliability initiatives.
- Drive operational improvements through SLOs and reliability engineering practices.
Experience Required
- Experience operating large-scale compute environments, such as HPC, batch, grid, SLURM, Kubernetes batch, or similar platforms.
- Strong Kubernetes administration experience, including etcd, cluster upgrades, API server tuning, CNI technologies (Calico/Cilium), storage, and ingress.
- Linux troubleshooting expertise across RHEL, Rocky Linux, or Ubuntu environments.
- Strong observability experience with Prometheus, Grafana, PromQL, and Alertmanager.
- Experience with enterprise-scale time-series platforms such as VictoriaMetrics, Thanos, Cortex, or Mimir would be highly advantageous.
- Infrastructure-as-Code, CI/CD, and automation experience.
- Programming or scripting capability using Python or Go.
- A reliability-first mindset, with experience of SLOs, incident management, and capacity planning.
This role would suit an engineer who has operated the infrastructure layers that managed services typically abstract away, and who sees reliability as something to be measured, engineered, and continuously improved.


