Senior Site Reliability Engineer (Australian Project)
2026-04-30
Job Description
- Operate and improve monitoring, alerting and dashboards to quickly detect and respond to production incidents.
- Join the on‑call roster and be ready to handle and coordinate production issues during assigned shifts.
- Lead or contribute to incident response and post‑incident reviews, driving long‑term reliability improvements.
- Define and maintain SLOs/SLIs and error budgets for key services; continuously improve system health visibility.
- Partner with development teams to ensure production‑ready services (observability, deployment, rollback, performance).
- Work with engineering and QA/Automation teams to embed observability into CI/CD pipelines and maintain an accurate service catalogue and service scorecards for key services.
- Automate recurring operational tasks and support issues; build self‑healing workflows where appropriate.
- Collaborate with infrastructure/platform teams to operate auto‑scaling, highly available cloud infrastructure (e.g. AWS/Azure/GCP) using Infrastructure as Code.
- Apply FinOps thinking to optimise telemetry and platform costs (sampling, retention, storage strategies).
Requirements
Must have:
- 5+ years as SRE, DevOps or Software Engineer with strong production operations responsibilities.
- Hands‑on experience running workloads on at least one public cloud (AWS, Azure or GCP).
- Solid skills in scripting/programming (e.g. Python, Bash, Go) and working with Linux.
- Experience with observability tools (e.g. Prometheus/Grafana, ELK/APM, Datadog, New Relic or similar).
- Experience with containers and orchestration (Docker, Kubernetes) and CI/CD pipelines (Jenkins, GitLab CI, GitHub Actions or similar).
- Understanding of high‑availability and resilience patterns (load balancing, auto‑scaling, blue‑green/canary, rollback).
- Willingness to participate in on‑call and occasional out‑of‑hours incident / release support.
- Ability to contribute to system resilience through practices such as chaos engineering, resilience testing and close collaboration with security/risk teams where appropriate.
- Ability to use AI/ML‑assisted diagnostics and automation tools where appropriate to improve incident response and support workflows.
- Good English communication skills for daily work with international stakeholders.
Nice to have:
- Experience with OpenTelemetry or similar observability frameworks.
- Experience with self‑healing / event‑driven automation, AIOps or low‑code automation.
- Fintech / financial services background, or experience in regulated environments.
- DevSecOps exposure and relevant cloud/SRE certifications.
Soft skills:
- Strong ownership, problem‑solving and collaboration in cross‑functional teams.
- Clear, structured communication; calm under pressure during incidents.
- Curious, continuous‑learning mindset, open to new tools and practices.
What we offer
- Attractive and competitive performance-based compensation package
- Generous year-end 13th-month bonus
- Loyalty and annual dedication rewards
- Full gross salary paid during probation
- 12 annual leave days, 11 public holidays, 1 day off for Christmas, and 5 sick leave days
- Flexible check-in time, one remote work day per week, and the freedom to work from any of our offices in Da Nang, Hue, or Ha Noi
- Onsite opportunity in Australia
- Comprehensive healthcare package and annual health check-ups
- Team-building allowance, annual company trips, and a gathering party every Thursday for a fun and connected workplace
- Sports and hobby clubs, including football, badminton, biking, running, chess, and music band groups
- Continuous learning & development with exclusive technical & soft skills training, English classes, and technical clubs
- Financial aid for marriage, newborns, and bereavement to support you through every stage of life