The Site Reliability Workbook: Practical Ways to Implement SRE


In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.

This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.

Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.

You’ll learn:

  • How to run reliable services in environments you don’t completely control—like cloud
  • Practical applications of how to create, monitor, and run your services via Service Level Objectives
  • How to convert existing ops teams to SRE—including how to dig out of operational overload
  • Methods for starting SRE from either greenfield or brownfield

About the authors

Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.

Niall Murphy has been working in Internet infrastructure for twenty years. He is a company founder, a published author, a photographer, and holds degrees in Computer Science & Mathematics and Poetry Studies.

Dave Rensin is a Google SRE Director, previous O’Reilly author, and serial entrepreneur. He holds a degree in Statistics.

Kent Kawahara is a Program Manager for Google's Site Reliability Engineering team focused on Google Cloud Platform customers and is based in Sunnyvale, CA. In previous Google roles, he managed technical and design teams to develop advertising support tools and worked with large advertisers and agencies on strategic advertising initiatives. Prior to Google, he worked in Product Management, Software QA, and Professional Services at two successful telecommunications startups. He holds a BS in Electrical Engineering and Computer Science from the University of California at Berkeley.


Additional Information

"O'Reilly Media, Inc."
Published on
Jul 25, 2018
Computers / System Administration / General
Content Protection
This content is DRM free.
Read Aloud
Available on Android devices
Eligible for Family Library

This hands-on survival manual will give you the tools to confidently prepare for and respond to a system outage.

Key Features

  • Proven methods for keeping your website running
  • A survival guide for incident response
  • Written by an ex-Google SRE expert

Book Description

Real-World SRE is the go-to survival guide for the software developer in the middle of catastrophic website failure. Site Reliability Engineering (SRE) has emerged on the frontline as businesses strive to maximize uptime. This book is a step-by-step framework to follow when your website is down and the countdown is on to fix it.

Nat Welch has battle-hardened experience in reliability engineering at some of the biggest outage-sensitive companies on the internet. Arm yourself with his tried-and-tested methods for monitoring modern web services, setting up alerts, and evaluating your incident response.

Real-World SRE goes beyond just reacting to disaster—uncover the tools and strategies needed to safely test and release software, plan for long-term growth, and foresee future bottlenecks. Real-World SRE gives you the capability to set up your own robust plan of action to see you through a company-wide website crisis.

The final chapter of Real-World SRE is dedicated to acing SRE interviews, whether you are after a first job or a valued promotion.

What you will learn

  • Monitor for approaching catastrophic failure
  • Alert your team to an outage emergency
  • Dissect your incident response strategies
  • Test automation tools and build your own software
  • Predict bottlenecks and fight for user experience
  • Eliminate the competition in an SRE interview

Who this book is for

Real-World SRE is aimed at software developers facing a website crisis, or who want to improve the reliability of their company's software. Newcomers to Site Reliability Engineering looking to succeed at interview will also find this invaluable.

Create, deploy, and manage applications at scale using SRE principles

Key Features

  • Build and run highly available, scalable, and secure software
  • Explore abstract SRE in a simplified and streamlined way
  • Enhance the reliability of cloud environments through SRE enhancements

Book Description

Site reliability engineering (SRE) is being touted as the most competent paradigm for establishing and ensuring next-generation, high-quality software solutions.

This book starts by introducing you to the SRE paradigm and covers the need for highly reliable IT platforms and infrastructures. As you make your way through the next set of chapters, you will learn to develop microservices using Spring Boot and make use of RESTful frameworks. You will also learn about GitHub for deployment, containerization, and Docker containers. Practical Site Reliability Engineering teaches you to set up and sustain containerized cloud environments, and also covers architectural and design patterns and reliability implementation techniques such as reactive programming, and languages such as Ballerina and Rust. In the concluding chapters, you will become well-versed in service mesh solutions such as Istio and Linkerd, and understand service resilience test practices, API gateways, and edge/fog computing.

By the end of this book, you will have gained experience in working with SRE concepts and be able to deliver highly reliable apps and services.

What you will learn

  • Understand how to achieve your SRE goals
  • Grasp Docker-enabled containerization concepts
  • Leverage enterprise DevOps capabilities and Microservices architecture (MSA)
  • Get to grips with the service mesh concept and frameworks such as Istio and Linkerd
  • Discover best practices for performance and resiliency
  • Follow software reliability prediction approaches and enable patterns
  • Understand Kubernetes for container and cloud orchestration
  • Explore the end-to-end software engineering process for the containerized world

Who this book is for

Practical Site Reliability Engineering helps software developers, IT professionals, DevOps engineers, performance specialists, and system engineers understand how the emerging domain of SRE comes in handy for automating and accelerating the process of designing, developing, debugging, and deploying highly reliable applications and services.

Every manufacturing or systems engineer has grappled with questions like these: "How can we reduce the cost of testing our process or product? How do we know if our development process is robust? Where do the gaps lie in our manufacturing or testing process? How do we build a reliable, robust process that all stakeholders can count on?" Around these questions has risen a veritable industry of solutions, manufacturing standards, statistical methods and more. And yet, designing for reliability remains a little-understood and much-feared proposition.

Now, design phase gate review and testing expert and veteran technical problem solver Thim Gurunatha brings to his readers a lifetime of experience in designing robust and reliable processes. In his new book, "Systems Engineering Standards -- The State of the Art," Thim systematically tackles fundamental and esoteric problems that plague manufacturing and systems engineers today. Thim understands that while modern technologies, including computing technologies, have greatly aided today's engineers, they have also revealed gaps, cracks and chinks which were not apparent before. With this new book, Thim's mission is to close all the little gaps towards developing perfect processes.

Coming in to fill a critical void, Thim's new book teaches engineers to make the process of statistical process control (SPC) more efficient. Even the most seasoned engineers will learn how to make the design of experiments less expensive, reduce testing time and increase the accuracy of reliability predictions. The author lucidly articulates that the survival of companies in the future may depend on the implementation of breakthrough strategies in problem solving. In such an environment, understanding and promoting the use of statistical tools becomes a management issue rather than an operator problem.

Used effectively, statistical methods greatly reduce problem-solving time. Because of the abundance of statistical tools, however, it is important to know which tools to use when -- and which tools not to use. Thim's direct-to-action book helps systems engineers pick the 'best of the best' tools for each application and assists its users in applying these tools, saving them millions of dollars. Surely readers can recession-proof their careers with the wisdom in this brand-new book!

What once seemed nearly impossible has turned into reality. The number of available Internet addresses is now nearly exhausted, due mostly to the explosion of commercial websites and entries from an expanding number of countries. This growing shortage has effectively put the Internet community--and some of its most brilliant engineers--on alert for the last decade.

Their solution was to create IPv6, a new Internet standard which will ultimately replace the current and antiquated IPv4. As the new backbone of the Internet, this new protocol would fix the most difficult problems that the Internet faces today--scalability and management. And even though IPv6's implementation has met with some resistance over the past few years, all signs are now pointing to its gradual worldwide adoption in the very near future. Sooner or later, all network administrators will need to understand IPv6, and now is a good time to get started.

IPv6 Network Administration offers administrators the complete inside info on IPv6. This book reveals the many benefits as well as the potential downsides of this next-generation protocol. It also shows readers exactly how to set up and administer an IPv6 network.

A must-have for network administrators everywhere, IPv6 Network Administration delivers an even-handed approach to what will be the most fundamental change to the Internet since its inception. Some of the other IPv6 assets that are covered include:

  • routing
  • integrated auto-configuration
  • quality-of-services (QoS)
  • enhanced mobility
  • end-to-end security

IPv6 Network Administration explains what works, what doesn't, and most of all, what's practical when considering upgrading networks from the current protocol to IPv6.