This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
Niall Murphy has been working in Internet infrastructure for twenty years. He is a company founder, a published author, a photographer, and holds degrees in Computer Science & Mathematics and Poetry Studies.
Dave Rensin is a Google SRE Director, previous O’Reilly author, and serial entrepreneur. He holds a degree in Statistics.
Kent Kawahara is a Program Manager for Google's Site Reliability Engineering team focused on Google Cloud Platform customers and is based in Sunnyvale, CA. In previous Google roles, he managed technical and design teams to develop advertising support tools and worked with large advertisers and agencies on strategic advertising initiatives. Prior to Google, he worked in Product Management, Software QA, and Professional Services at two successful telecommunications startups. He holds a BS in Electrical Engineering and Computer Science from the University of California at Berkeley.
Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it.
Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response.
Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis.
The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether for a first job or a valued promotion.
What you will learn
Monitor for approaching catastrophic failure
Alert your team to an outage emergency
Dissect your incident response strategies
Test automation tools and build your own software
Predict bottlenecks and fight for user experience
Eliminate the competition in an SRE interview
Who this book is for
Real-World SRE is aimed at software developers facing a website crisis, or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed at interview will also find this invaluable.
Site reliability engineering (SRE) is being touted as the most competent paradigm in establishing and ensuring next-generation high-quality software solutions.
This book starts by introducing you to the SRE paradigm and covers the need for highly reliable IT platforms and infrastructures. As you make your way through the next set of chapters, you will learn to develop microservices using Spring Boot and make use of RESTful frameworks. You will also learn about GitHub for deployment, containerization, and Docker containers. Practical Site Reliability Engineering teaches you to set up and sustain containerized cloud environments, and also covers architectural and design patterns and reliability implementation techniques such as reactive programming, and languages such as Ballerina and Rust. In the concluding chapters, you will become well versed in service mesh solutions such as Istio and Linkerd, and understand service resilience test practices, API gateways, and edge/fog computing.
By the end of this book, you will have gained experience working with SRE concepts and be able to deliver highly reliable apps and services.
What you will learn
Understand how to achieve your SRE goals
Grasp Docker-enabled containerization concepts
Leverage enterprise DevOps capabilities and Microservices architecture (MSA)
Get to grips with the service mesh concept and frameworks such as Istio and Linkerd
Discover best practices for performance and resiliency
Follow software reliability prediction approaches and enable patterns
Understand Kubernetes for container and cloud orchestration
Explore the end-to-end software engineering process for the containerized world
Who this book is for
Practical Site Reliability Engineering helps software developers, IT professionals, DevOps engineers, performance specialists, and system engineers understand how the emerging domain of SRE comes in handy in automating and accelerating the process of designing, developing, debugging, and deploying highly reliable applications and services.
Thim Gurunatha, an expert in design phase gate review and testing and a veteran technical problem solver, now brings to his readers a lifetime of experience in designing robust and reliable processes. In his new book, "Systems Engineering Standards -- The State of the Art," Thim systematically tackles fundamental and esoteric problems that plague manufacturing and systems engineers today. Thim understands that while modern technologies, including computing technologies, have greatly aided today's engineers, they have also revealed gaps, cracks and chinks which were not apparent before. With this new book, Thim's mission is to close all the little gaps on the way to developing perfect processes.
Coming in to fill a critical void, Thim's new book teaches engineers to make the process of statistical process control (SPC) more efficient. Even the most seasoned engineers will learn how to make the design of experiments less expensive, reduce testing time and increase the accuracy of reliability predictions. The author lucidly articulates that the survival of companies in the future may depend on the implementation of breakthrough strategies in problem solving. In such an environment, understanding and promoting the use of statistical tools becomes a management issue rather than an operator problem.
Used effectively, statistical methods greatly reduce problem-solving time. Because of the abundance of statistical tools, however, it is important to know which tools to use when -- and which tools not to use. Thim's direct-to-action book helps systems engineers pick the 'best of the best' tools for each application and assists its users in applying these tools, saving them millions of dollars. Surely readers can recession-proof their careers with the wisdom in this brand new book!
In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE’s day-to-day work, building and operating large distributed computing systems
Management: Explore Google's best practices for training, communication, and meetings that your organization can use
SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that’s allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.
Listen as engineers and other leaders in the field discuss:
Different ways of implementing SRE and SRE principles in a wide variety of settings
How SRE relates to other approaches such as DevOps
Specialties on the cutting edge that will soon be commonplace in SRE
Best practices and technologies that make practicing SRE easier
The important but rarely explored human side of SRE
David N. Blank-Edelman is the book’s curator and editor.
You’ll learn how to use Amazon Web Services (AWS) to build a private Windows domain, complete with Active Directory, enterprise email, instant messaging, IP telephony, automated management, and other services. By the end of the book, you’ll have a fully functioning IT infrastructure you can operate for less than $300 per month.
Learn about Virtual Private Cloud (VPC) and other AWS tools you’ll use
Create a Windows domain and set up a DNS management system
Install Active Directory and a Windows Primary Domain Controller
Use Microsoft Exchange to set up an enterprise email service
Import existing Windows Server-based virtual machines into your VPC
Set up an enterprise-class chat/IM service, using the XMPP protocol
Install and configure a VoIP PBX telephony system with Asterisk and FreePBX
Keep your network running smoothly with automated backup and restore, intrusion detection, and fault alerting
Authors Kelsey Hightower, Brendan Burns, and Joe Beda, who’ve worked on Kubernetes at Google and other organizations, explain how this system fits into the lifecycle of a distributed application. You will learn how to use tools and APIs to automate scalable distributed systems, whether it is for online services, machine-learning applications, or a cluster of Raspberry Pi computers.
Explore the distributed system challenges that Kubernetes addresses
Dive into containerized application development, using containers such as Docker
Create and run containers on Kubernetes, using the Docker image format and container runtime
Explore specialized objects essential for running applications in production
Reliably roll out new software versions without downtime or errors
Get examples of how to develop and deploy real-world applications in Kubernetes