Prometheus & Grafana Engineer
Job Description:
Environment:
The provider uses Prometheus to monitor cloud-native systems such as Kubernetes. Kubernetes components expose their metrics in the Prometheus format natively, and Prometheus is the de facto monitoring standard across the cloud-native ecosystem. Grafana visualizes the data and makes it available in dashboards.
We are looking for an experienced SME/L3 Engineer with a deep understanding of Grafana and Prometheus to join our team. In this role, you will be responsible for maintaining, optimizing, and advancing our monitoring and observability systems. Your expertise will be critical in ensuring the reliability, performance, and scalability of our infrastructure. You will own the overall health, availability, and configuration of our Grafana and Prometheus solutions.
Key responsibilities:
- Grafana and Prometheus Administration:
  - Configure, maintain, and scale Grafana and Prometheus instances.
  - Develop and implement custom dashboards for monitoring key metrics.
  - Troubleshoot issues, ensure data accuracy, and optimize query performance.
- Monitoring and Alerting:
  - Design and manage alerting rules for proactive issue identification and resolution.
  - Continuously improve and expand monitoring coverage to meet evolving needs.
  - Collaborate with teams to define alert thresholds and escalation procedures.
- Data Analysis and Visualization:
  - Analyze metrics data to identify performance bottlenecks and areas for improvement.
  - Create meaningful visualizations and reports to provide insights for stakeholders.
  - Contribute to the enhancement of data retention and archiving strategies.
- Scaling and Optimization:
  - Collaborate with the infrastructure team to ensure seamless integration and scalability of Grafana and Prometheus.
  - Fine-tune configurations to achieve optimal resource utilization and performance.
Qualifications:
- Proven experience as an L3 Engineer specializing in Grafana and Prometheus administration.
- Proficiency in creating custom Grafana dashboards and queries.
- Strong understanding of monitoring best practices, alerting, and data analysis.
- Knowledge of time-series databases and storage strategies.
- Scripting and automation skills for efficient system management.