Senior Site Reliability Engineer (Australian Project)

2026-04-30

Job Description

  • Operate and improve monitoring, alerting and dashboards to quickly detect and respond to production incidents.
  • Join the on‑call roster and be ready to handle and coordinate production issues during assigned shifts.
  • Lead or contribute to incident response and post‑incident reviews, driving long‑term reliability improvements.
  • Define and maintain SLOs/SLIs and error budgets for key services; continuously improve system health visibility.
  • Partner with development teams to ensure production‑ready services (observability, deployment, rollback, performance).
  • Work with engineering and QA/Automation teams to embed observability into CI/CD pipelines and maintain an accurate service catalogue and service scorecards for key services.
  • Automate recurring operational tasks and support issues; build self‑healing workflows where appropriate.
  • Collaborate with infrastructure/platform teams to operate auto‑scaling, highly available cloud infrastructure (e.g. AWS/Azure/GCP) using Infrastructure as Code.
  • Apply FinOps thinking to optimise telemetry and platform costs (sampling, retention, storage strategies).

Requirements

Must have:

  • 5+ years as SRE, DevOps or Software Engineer with strong production operations responsibilities.
  • Hands‑on experience running workloads on at least one public cloud (AWS, Azure or GCP).
  • Solid skills in scripting/programming (e.g. Python, Bash, Go) and working with Linux.
  • Experience with observability tools (e.g. Prometheus/Grafana, ELK/APM, Datadog, New Relic or similar).
  • Experience with containers and orchestration (Docker, Kubernetes) and CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions or similar).
  • Understanding of high‑availability and resilience patterns (load balancing, auto‑scaling, blue‑green/canary, rollback).
  • Willingness to participate in on‑call and occasional out‑of‑hours incident/release support.
  • Ability to contribute to system resilience through practices such as chaos engineering, resilience testing and close collaboration with security/risk teams where appropriate.
  • Ability to leverage AI/ML‑assisted diagnostics and automation tools where appropriate to improve incident response and support workflows.
  • Good English communication skills for daily work with international stakeholders.

Nice to have:

  • Experience with OpenTelemetry or similar observability frameworks.
  • Experience with self‑healing / event‑driven automation, AIOps or low‑code automation.
  • Fintech / financial services background, or experience in regulated environments.
  • DevSecOps exposure and relevant cloud/SRE certifications.

Soft skills:

  • Strong ownership, problem‑solving and collaboration in cross‑functional teams.
  • Clear, structured communication; calm under pressure during incidents.
  • Curious, continuous‑learning mindset, open to new tools and practices.

What we offer

  • Attractive and competitive performance-based compensation package
  • Generous year-end 13th-month bonus
  • Loyalty and annual dedication rewards
  • Full gross salary paid during probation
  • 12 annual leave days, 11 public holidays, 1 day off for Christmas and 5 sick leave days
  • Flexible check-in time, one remote work day per week, and the freedom to work from any of our offices in Da Nang, Hue, or Ha Noi
  • Onsite opportunity in Australia
  • Comprehensive healthcare package and annual health check-ups
  • Team-building allowance, annual company trips, and a Gathering Party every Thursday for a fun and connected workplace
  • Sports & hobby clubs with football, badminton, biking, running, chess, or music band groups
  • Continuous learning & development with exclusive technical & soft skills training, English classes, and technical clubs
  • Financial aid for marriage, newborns, and bereavement to support you through every stage of life
