Senior/Expert Engineer, Site Reliability (SRE)
Job Description
- Dive deep into the development lines to learn and understand the mechanism of every application component, and promote product scalability, stability and performance.
- Set up, manage and maintain product, middleware and big-data applications and services.
- Perform regular and ad-hoc server-side deployments, performance fine-tuning and troubleshooting.
- Design and develop automations for workflows.
- Manage capacity and resources.
- Run full-chain stress tests to improve application performance and remove redundancy.
- Prepare routine operation documentation.
Job Requirements
- Bachelor’s or higher degree in Computer Science, Engineering, Information Systems or related fields.
- Minimum 2 years of working experience in Site Reliability Engineering roles.
- Extensive hands-on knowledge of Linux operating systems (Ubuntu, CentOS, etc.).
- Knowledge of computer networking (TCP/IP, DNS, etc.) and operating systems.
- Hands-on experience with at least one of the following programming languages: Bash, Python, Go.
- Strong analytical and problem-solving skills, with the ability to thrive in difficult and stressful situations.
- Passion for the work and a strong sense of responsibility.
- Fast learner and a good team player.
- Detail-oriented, cautious and prudent.
The skills below are optional but preferred:
- Experience with automation tools such as Ansible and Jenkins.
- Experience with monitoring tools such as Prometheus, Zabbix and Grafana.
- Experience with load balancing tools such as LVS, Nginx, OpenResty or HAProxy.
- Experience with container technology such as Docker and Kubernetes.
- Experience with Kafka and Codis.