
Smart home solutions for a connected future
Plume is a leader in smart home services, providing advanced internet service management tools that enhance customer experiences for over 50 million homes globally. Headquartered in Palo Alto, California, Plume partners with major ISPs like Comcast and Vodafone to deliver its innovative solutions.
Plume offers competitive salaries, equity options, flexible remote work policies, and generous PTO to support work-life balance.
Plume fosters a culture of innovation and agility, encouraging employees to experiment and implement new ideas in the rapidly evolving smart home indu...

Plume • Ljubljana, Slovenia; Wroclaw, Poland
Plume is seeking a Lead Site Reliability Engineer to oversee the reliability and performance of its service delivery platform. You'll be responsible for ensuring system stability and scalability while collaborating with cross-functional teams. This role requires strong leadership and technical expertise in SRE practices.
You have a strong background in Site Reliability Engineering, with experience in managing large-scale systems and ensuring their reliability and performance. You understand the importance of monitoring, incident response, and capacity planning, and you have a proven track record of implementing best practices in these areas. Your leadership skills enable you to guide and mentor a team, fostering a culture of collaboration and continuous improvement. You are comfortable working with cross-functional teams, bridging the gap between development and operations to ensure seamless service delivery.
Experience with cloud platforms and infrastructure as code is a plus, as is familiarity with CI/CD pipelines and automation tools. You are proactive in identifying potential issues and implementing solutions before they impact users. Your analytical mindset allows you to dive deep into system metrics and logs, extracting insights that drive performance enhancements.
In this role, you will lead the Site Reliability Engineering team at Plume, focusing on maintaining the reliability and performance of our service delivery platform. You will develop and implement monitoring and alerting systems to proactively identify and resolve issues. Collaborating closely with development teams, you will ensure that new features are designed with reliability in mind, conducting thorough reviews and providing feedback on architecture and implementation.
You will also be responsible for incident management, leading post-mortem analyses to identify root causes and prevent future occurrences. Your expertise will guide the team in capacity planning and scaling strategies, ensuring that our systems can handle increasing loads as we expand our services globally. You will foster a culture of learning and improvement, encouraging team members to share knowledge and grow their skills.
At Plume, we offer a dynamic work environment where innovation is at the forefront. You will have the opportunity to work with cutting-edge technology and contribute to a platform that impacts millions of users worldwide. We value collaboration and encourage you to bring your ideas to the table. Our team is committed to professional development, providing resources and support for continuous learning. Join us in shaping the future of connected spaces and making a difference in people's lives.