Reliability Enablers

著者: Ash Patel & Sebastian Vietz
  • サマリー

  • Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

    read.srepath.com
    Ash P
    続きを読む 一部表示

あらすじ・解説

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.

read.srepath.com
Ash P
エピソード
  • #63 - Does "Big Observability" Neglect Mobile?
    2024/11/12

    Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he’s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short.

    * Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.

    * Specialization in Mobile Observability: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.

    * Gap in Mobile Observability: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.

    * Motivation for User-Centric Tools: Leaving “big observability” to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.

    * Mobile's Role as a Brand Touchpoint: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.

    * Challenges in Measuring Mobile Reliability: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.

    * Observability Over-Focused on Backend Systems: Andrew points out that “big observability” has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.

    * Lack of Mobile Engineering Leadership in Observability: Reflecting on a former Meta product manager’s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.

    * OpenTelemetry’s Limitations in Mobile: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don’t align with traditional time-based observability.

    * SREs as Connective Tissue: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences—a critical factor in retaining app users.

    * Amazon’s Operational Readiness Review: Drawing from his experience at AWS, Andrew values Amazon’s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.

    * Shifting Focus to “Answerability” in Observability: For Andrew, the goal of observability should evolve toward “answerability,” where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    続きを読む 一部表示
    29 分
  • #62 - Early Youtube SRE shares Modern Reliability Strategy
    2024/11/05
    Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that engineers face every day.Here’s a slightly deeper dive into the concepts we discussed:* Career and Evolution in Tech: Andrew shares his journey through various roles, from early SRE at Youtube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive experience in infrastructure through three distinct eras of the internet. He emphasized the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.* Building Prodvana and the Future of SRE: As CEO of startup, Prodvana, they're focused on an "intelligent delivery system" designed to simplify production management for engineers, addressing cognitive overload. They highlight SRE as a field facing new demands due to AI, discussing insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishing it from "web three" hype, and affirming that while AI will transform SRE, it will not eliminate it.* Challenges of Migration and Integration: Reflecting on experiences at YouTube post-acquisition by Google, the speaker discusses the challenges of migrating YouTube’s infrastructure onto Google’s proprietary, non-thread-safe systems. This required extensive adaptation and “glue code,” offering insights into the intricacies and sometimes rigid culture of Google’s engineering approach at that time.* SRE’s Shift Toward Reliability as a Core Feature: The speaker describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. They emphasize that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.* Organizational Culture and Leadership Influence: Leadership’s role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. The speaker advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response.* Outcome-Focused Work Over Titles: Emphasis on assembling the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from focusing on outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.* Engineers as Problem Solvers: Engineers, especially natural ones, generally resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic engineers like Steve Jobs valued versatility over predefined roles.* Culture as Core Values: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox's infrastructure culture emphasized being a “force multiplier” to sustain product velocity, an approach that ensured values were integrated into every decision.* Balancing SRE and Platform Priorities: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs. Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.* Strategic Trade-Offs in Smaller Orgs: In smaller companies with limited resources, leaders often face challenges balancing cost, reliability, and other objectives within single roles. It's advised to sequence these priorities rather than burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.* DevOps as a Philosophy: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.* Focus Investments for Long-Term Gains: Strategic technology investments, even if they might temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this ...
    続きを読む 一部表示
    36 分
  • #61 Scott Moore on SRE, Performance Engineering, and More
    2024/10/22



    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
    続きを読む 一部表示
    38 分

Reliability Enablersに寄せられたリスナーの声

カスタマーレビュー:以下のタブを選択することで、他のサイトのレビューをご覧になれます。