Senior Site Reliability Engineer (Australian Project)

2026-04-30

Job Description

Operate and improve monitoring, alerting and dashboards to quickly detect and respond to production incidents.
Join the on‑call roster and be ready to handle and coordinate production issues during assigned shifts.
Lead or contribute to incident response and post‑incident reviews, driving long‑term reliability improvements.
Define and maintain SLOs/SLIs and error budgets for key services; continuously improve system health visibility.
Partner with development teams to ensure production‑ready services (observability, deployment, rollback, performance).
Work with engineering and QA/Automation teams to embed observability into CI/CD pipelines and maintain an accurate service catalogue and service scorecards for key services.
Automate recurring operational tasks and support issues; build self‑healing workflows where appropriate.
Collaborate with infrastructure/platform teams to operate auto‑scaling, highly available cloud infrastructure (e.g. AWS/Azure/GCP) using Infrastructure as Code.
Apply FinOps thinking to optimise telemetry and platform costs (sampling, retention, storage strategies).

Must have:

5+ years as an SRE, DevOps Engineer or Software Engineer with strong production operations responsibilities.
Hands‑on experience running workloads on at least one public cloud (AWS, Azure or GCP).
Solid skills in scripting/programming (e.g. Python, Bash, Go) and working with Linux.
Experience with observability tools (e.g. Prometheus/Grafana, ELK/APM, Datadog, New Relic or similar).
Experience with containers and orchestration (Docker, Kubernetes) and CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions or similar).
Understanding of high‑availability and resilience patterns (load balancing, auto‑scaling, blue‑green/canary, rollback).
Willingness to participate in on‑call and occasional out‑of‑hours incident / release support.
Willingness to contribute to system resilience through practices such as chaos engineering and resilience testing, in close collaboration with security/risk teams where appropriate.
Ability to leverage AI/ML‑assisted diagnostics and automation tools where appropriate to improve incident response and support workflows.
Good English communication skills for daily work with international stakeholders.

Nice to have:

Experience with OpenTelemetry or similar observability frameworks.
Experience with self‑healing / event‑driven automation, AIOps or low‑code automation.
Fintech / financial services background, or experience in regulated environments.
DevSecOps exposure and relevant cloud/SRE certifications.

Benefits:

Attractive and competitive performance-based compensation package
Generous year-end 13th-month bonus
Loyalty and annual dedication rewards
Full gross salary paid during probation
12 annual leave days, 11 public holidays, 1 Christmas day off and 5 sick leave days
Flexible check-in time, 1 remote working day per week, and the freedom to work from any of our offices in Da Nang, Hue, or Ha Noi
Onsite opportunity in Australia
Comprehensive healthcare package and annual health check-ups
Team-building allowance, annual company trips, and a gathering party every Thursday for a fun and connected workplace
Sports & hobby clubs with football, badminton, biking, running, chess, or music band groups
Continuous learning & development with exclusive technical & soft skills training, English classes, and technical clubs
Financial aid for marriage, newborns, and bereavement to support you through every stage of life