In this collection of essays and articles, key members of Google’s Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You’ll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.
This book is divided into four sections:
Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland’s peering hub. He is the author or coauthor of a number of technical papers and/or books, including "IPv6 Network Administration" for O’Reilly, and a number of RFCs. He is currently cowriting a history of the Internet in Ireland, and is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.
Betsy Beyer is a Technical Writer for Google Site Reliability Engineering in NYC. She has previously written documentation for Google Datacenters and Hardware Operations teams. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University.
Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google’s advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He’s also a licensed professional engineer.
Jennifer Petoff is a Program Manager for Google’s Site Reliability Engineering team and based in Dublin, Ireland. She has managed large global projects across wide-ranging domains including scientific research, engineering, human resources, and advertising operations. Jennifer joined Google after spending eight years in the chemical industry. She holds a PhD in Chemistry from Stanford University and a BS in Chemistry and a BA in Psychology from the University of Rochester.
SRE is a large and rich topic to discuss. Google led the way with Site Reliability Engineering, the wildly successful O’Reilly book that described Google’s creation of the discipline and the implementation that’s allowed them to operate at a planetary scale. Inspired by that earlier work, this book explores a very different part of the SRE space. The more than two dozen chapters in Seeking SRE bring you into some of the important conversations going on in the SRE world right now.
Listen as engineers and other leaders in the field discuss:Different ways of implementing SRE and SRE principles in a wide variety of settingsHow SRE relates to other approaches such as DevOpsSpecialties on the cutting edge that will soon be commonplace in SREBest practices and technologies that make practicing SRE easierThe important but rarely explored human side of SRE
David N. Blank-Edelman is the book’s curator and editor.
The added chapters contain (1) a crisp condensation of all the propositions asserted in the original book, including Brooks' central argument in The Mythical Man-Month: that large programming projects suffer management problems different from small ones due to the division of labor; that the conceptual integrity of the product is therefore critical; and that it is difficult but possible to achieve this unity; (2) Brooks' view of these propositions a generation later; (3) a reprint of his classic 1986 paper "No Silver Bullet"; and (4) today's thoughts on the 1986 assertion, "There will be no silver bullet within ten years."
Effective design is at the heart of everything from software development to engineering to architecture. But what do we really know about the design process? What leads to effective, elegant designs? The Design of Design addresses these questions.
These new essays by Fred Brooks contain extraordinary insights for designers in every discipline. Brooks pinpoints constants inherent in all design projects and uncovers processes and patterns likely to lead to excellence. Drawing on conversations with dozens of exceptional designers, as well as his own experiences in several design domains, Brooks observes that bold design decisions lead to better outcomes.
The author tracks the evolution of the design process, treats collaborative and distributed design, and illuminates what makes a truly great designer. He examines the nuts and bolts of design processes, including budget constraints of many kinds, aesthetics, design empiricism, and tools, and grounds this discussion in his own real-world examples—case studies ranging from home construction to IBM’s Operating System/360. Throughout, Brooks reveals keys to success that every designer, design project manager, and design researcher should know.
Through carefully analyzed case studies from each of these disciplines, it demonstrates how to apply these concepts to tackle practical system design problems. To support the focus on design, the text identifies and explains abstractions that have proven successful in practice such as remote procedure call, client/service organization, file systems, data integrity, consistency, and authenticated messages. Most computer systems are built using a handful of such abstractions. The text describes how these abstractions are implemented, demonstrates how they are used in different systems, and prepares the reader to apply them in future designs.
The book is recommended for junior and senior undergraduate students in Operating Systems, Distributed Systems, Distributed Operating Systems and/or Computer Systems Design courses; and professional computer systems designers.Features: Concepts of computer system design guided by fundamental principles.Cross-cutting approach that identifies abstractions common to networking, operating systems, transaction systems, distributed systems, architecture, and software engineering.Case studies that make the abstractions real: naming (DNS and the URL); file systems (the UNIX file system); clients and services (NFS); virtualization (virtual machines); scheduling (disk arms); security (TLS).Numerous pseudocode fragments that provide concrete examples of abstract concepts.Extensive support. The authors and MIT OpenCourseWare provide on-line, free of charge, open educational resources, including additional chapters, course syllabi, board layouts and slides, lecture videos, and an archive of lecture schedules, class assignments, and design projects.
In this collection of over 80 columns, Eric Brechner's alter ego pulls no punches with his candid commentary and best practice solutions to the issues that irk him the most. He dissects the development process, examines tough team issues, and critiques how the software business is run, with the added touch of clever humor and sardonic wit. His ideas aren't always popular (not that he cares), but they do stimulate discussion and imagination needed to drive software excellence.
Get the unvarnished truth on how to:Improve software quality and value—from design to security Realistically manage project schedules, risks, and specs Trim the fat from common development inefficiencies Apply process improvement methods—without being an inflexible fanatic Drive your own successful, satisfying career Don't be a dictator—develop and manage a thriving team!
Companion Web site includes:Agile process documents Checklists, templates, and other resources
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:How to run reliable services in environments you don’t completely control—like cloudPractical applications of how to create, monitor, and run your services via Service Level ObjectivesHow to convert existing ops teams to SRE—including how to dig out of operational overloadMethods for starting SRE from either greenfield or brownfield
Scaling isn’t just about handling more users; it’s also about managing risk and ensuring availability. Author Lee Atchison provides basic techniques for building applications that can handle huge quantities of traffic, data, and demand without affecting the quality your customers expect.
In five parts, this book explores:Availability: learn techniques for building highly available applications, and for tracking and improving availability going forwardRisk management: identify, mitigate, and manage risks in your application, test your recovery/disaster plans, and build out systems that contain fewer risksServices and microservices: understand the value of services for building complicated applications that need to operate at higher scaleScaling applications: assign services to specific teams, label the criticalness of each service, and devise failure scenarios and recovery plansCloud services: understand the structure of cloud-based services, resource allocation, and service distribution
Throughout the book, author Peter McGowan provides a combination of theoretical information and practical coding examples to help you learn the Automate object model. With this CloudForms feature, you can create auto-scalable cloud applications, eliminate manual decisions and operations when provisioning virtual machines and cloud instances, and manage your complete virtual machine lifecycle.
In six parts, this book helps you:Learn the objects and concepts for developing automation scripts with CloudForms AutomateCustomize the steps and workflows involved in provisioning virtual machinesCreate and use service catalogs, items, dialogs, objects, bundles, and hierarchiesUse CloudForm’s updated workflow to retire and delete virtual machines and servicesOrchestrate and coordinate with external services as part of a workflowExplore distributed automation processing as well as argument passing and handling