Operations Engineer - Platform Operations

As an Operations Engineer on the Platform Operations team at Splunk, you will do the operational work that keeps Splunk's cloud platform reliable, available, and performant, and you will collaborate with engineering teams to improve operational efficiency. Splunk provides a leading platform for monitoring, searching, analyzing, and visualizing machine-generated data.
Key Responsibilities:
- Provide operational support for Splunk's cloud platform, including troubleshooting and resolving incidents
- Participate in on-call rotations to ensure 24/7 platform availability
- Develop and maintain automation tools to improve operational efficiency
- Collaborate with engineering teams to implement and improve monitoring solutions
- Create and maintain documentation for operational procedures
- Analyze and resolve complex technical issues across the platform
- Contribute to continuous improvement initiatives for platform reliability
Requirements:
- Experience with Linux/Unix systems administration
- Knowledge of cloud platforms (AWS, GCP, Azure)
- Understanding of networking concepts and protocols
- Familiarity with monitoring tools and observability practices
- Experience with scripting languages (Python, Bash, etc.)
- Strong problem-solving and troubleshooting skills
- Ability to work in a fast-paced, collaborative environment
- Experience with configuration management tools is a plus
- Knowledge of containerization technologies (Docker, Kubernetes) is beneficial