
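Google's new book, Building Secure & Reliable Systems, focuses on how Google brings the SRE approach to security and on the role security plays in developing and operating software products.
Google's earlier SRE books covered SRE best practices but did not address the connection between reliability and security. The e-book edition of the new title runs to more than 500 pages and describes in detail many of the failures that have affected Google's internal systems and products, such as YouTube. Importantly, it also shows how Google's site reliability engineering and security teams work together to protect Google systems, from Android to Chrome, Gmail, Search, and Google Cloud.
The book opens with a question: "Can a system be considered truly reliable if it isn't fundamentally secure? And can it be considered secure if it is unreliable?"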
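Google's security team recently released the book through O'Reilly. Readers can buy the print edition or download the e-book for free. By sharing this much experience in building security foundations, the team is actively helping the security industry move forward.
To give its enormous user base more stable and dependable services, Google earlier formed a dedicated team for that work: Site Reliability Engineers (SREs). SREs are practitioners of DevOps whose main responsibilities include building, deploying, monitoring, and maintaining software systems. Members of this team wrote the book together with Google's security engineers.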
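Key Content
For building secure and reliable systems, the book covers:
- Design strategies
- Practical recommendations for coding, testing, and debugging
- Recommendations for preventing, responding to, and recovering from incidents
- Cultural best practices for cross-team collaboration
The book has 5 parts and 21 chapters in total.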
Part I. Introductory Material (2 chapters)
The Intersection of Security and Reliability
Understanding Adversaries
Part II. Designing Systems (8 chapters)
Case Study: Safe Proxies
Design Tradeoffs
Design for Least Privilege
Design for Understandability
Design for a Changing Landscape
Design for Resilience
Design for Recovery
Mitigating Denial-of-Service Attacks
Part III. Implementing Systems (5 chapters)
Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA
Writing Code
Testing Code
Deploying Code
Investigating Systems
Part IV. Maintaining Systems (3 chapters)
Disaster Planning
Crisis Management
Recovery and Aftermath
Part V. Organization and Culture (3 chapters)
Case Study: Chrome Security Team
Understanding Roles and Responsibilities
Building a Culture of Security and Reliability
Part I. Introductory Material
The Intersection of Security and Reliability
On Passwords and Power Drills
Reliability Versus Security: Design Considerations
Confidentiality, Integrity, Availability
Confidentiality
Integrity
Availability
Reliability and Security: Commonalities
Invisibility
Assessment
Simplicity
Evolution
Resilience
From Design to Production
Investigating Systems and Logging
Crisis Response
Recovery
Conclusion
Understanding Adversaries
Attacker Motivations
Attacker Profiles
Hobbyists
Vulnerability Researchers
Governments and Law Enforcement
Activists
Criminal Actors
Automation and Artificial Intelligence
Insiders
Attacker Methods
Threat Intelligence
Cyber Kill Chains
Tactics, Techniques, and Procedures
Risk Assessment Considerations
Conclusion
Part II. Designing Systems
Case Study: Safe Proxies
Safe Proxies in Production Environments
Google Tool Proxy
Conclusion
Design Tradeoffs
Design Objectives and Requirements
Feature Requirements
Nonfunctional Requirements
Features Versus Emergent Properties
Example: Google Design Document
Balancing Requirements
Example: Payment Processing
Managing Tensions and Aligning Goals
Example: Microservices and the Google Web Application Framework
Aligning Emergent-Property Requirements
Initial Velocity Versus Sustained Velocity
Conclusion
Design for Least Privilege
Concepts and Terminology
Least Privilege
Zero Trust Networking
Zero Touch
Classifying Access Based on Risk
Best Practices
Small Functional APIs
Breakglass
Auditing
Testing and Least Privilege
Diagnosing Access Denials
Graceful Failure and Breakglass Mechanisms
Worked Example: Configuration Distribution
POSIX API via OpenSSH
Software Update API
Custom OpenSSH ForceCommand
Custom HTTP Receiver (Sidecar)
Custom HTTP Receiver (In-Process)
Tradeoffs
A Policy Framework for Authentication and Authorization Decisions
Using Advanced Authorization Controls
Investing in a Widely Used Authorization Framework
Avoiding Potential Pitfalls
Advanced Controls
Multi-Party Authorization (MPA)
Three-Factor Authorization (3FA)
Business Justifications
Temporary Access
Proxies
Tradeoffs and Tensions
Increased Security Complexity
Impact on Collaboration and Company Culture
Quality Data and Systems That Impact Security
Impact on User Productivity
Impact on Developer Complexity
Conclusion
Design for Understandability
Why Is Understandability Important?
System Invariants
Analyzing Invariants
Mental Models
Designing Understandable Systems
Complexity Versus Understandability
Breaking Down Complexity
Centralized Responsibility for Security and Reliability Requirements
System Architecture
Understandable Interface Specifications
Understandable Identities, Authentication, and Access Control
Security Boundaries
Software Design
Using Application Frameworks for Service-Wide Requirements
Understanding Complex Data Flows
Considering API Usability
Conclusion
Design for a Changing Landscape
Types of Security Changes
Designing Your Change
Architecture Decisions to Make Changes Easier
Keep Dependencies Up to Date and Rebuild Frequently
Release Frequently Using Automated Testing
Use Containers
Use Microservices
Different Changes: Different Speeds, Different Timelines
Short-Term Change: Zero-Day Vulnerability
Medium-Term Change: Improvement to Security Posture
Long-Term Change: External Demand
Complications: When Plans Change
Example: Growing Scope—Heartbleed
Conclusion
Design for Resilience
Design Principles for Resilience
Defense in Depth
The Trojan Horse
Google App Engine Analysis
Controlling Degradation
Differentiate Costs of Failures
Deploy Response Mechanisms
Automate Responsibly
Controlling the Blast Radius
Role Separation
Location Separation
Time Separation
Failure Domains and Redundancies
Failure Domains
Component Types
Controlling Redundancies
Continuous Validation
Validation Focus Areas
Validation in Practice
Practical Advice: Where to Begin
Conclusion
Design for Recovery
What Are We Recovering From?
Random Errors
Accidental Errors
Software Errors
Malicious Actions
Design Principles for Recovery
Design to Go as Quickly as Possible (Guarded by Policy)
Limit Your Dependencies on External Notions of Time
Rollbacks Represent a Tradeoff Between Security and Reliability
Use an Explicit Revocation Mechanism
Know Your Intended State, Down to the Bytes
Design for Testing and Continuous Validation
Emergency Access
Access Controls
Communications
Responder Habits
Unexpected Benefits
Conclusion
Mitigating Denial-of-Service Attacks
Strategies for Attack and Defense
Attacker’s Strategy
Defender’s Strategy
Designing for Defense
Defendable Architecture
Defendable Services
Mitigating Attacks
Monitoring and Alerting
Graceful Degradation
A DoS Mitigation System
Strategic Response
Dealing with Self-Inflicted Attacks
User Behavior
Client Retry Behavior
Conclusion
Part III. Implementing Systems
Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA
Background on Publicly Trusted Certificate Authorities
Why Did We Need a Publicly Trusted CA?
The Build or Buy Decision
Design, Implementation, and Maintenance Considerations
Programming Language Choice
Complexity Versus Understandability
Securing Third-Party and Open Source Components
Testing
Resiliency for the CA Key Material
Data Validation
Conclusion
Writing Code
Frameworks to Enforce Security and Reliability
Benefits of Using Frameworks
Example: Framework for RPC Backends
Common Security Vulnerabilities
SQL Injection Vulnerabilities: TrustedSqlString
Preventing XSS: SafeHtml
Lessons for Evaluating and Building Frameworks
Simple, Safe, Reliable Libraries for Common Tasks
Rollout Strategy
Simplicity Leads to Secure and Reliable Code
Avoid Multilevel Nesting
Eliminate YAGNI Smells
Repay Technical Debt
Refactoring
Security and Reliability by Default
Choose the Right Tools
Use Strong Types
Sanitize Your Code
Conclusion
Testing Code
Unit Testing
Writing Effective Unit Tests
When to Write Unit Tests
How Unit Testing Affects Code
Integration Testing
Writing Effective Integration Tests
Dynamic Program Analysis
Fuzz Testing
How Fuzz Engines Work
Writing Effective Fuzz Drivers
An Example Fuzzer
Continuous Fuzzing
Static Program Analysis
Automated Code Inspection Tools
Integration of Static Analysis in the Developer Workflow
Abstract Interpretation
Formal Methods
Conclusion
Deploying Code
Concepts and Terminology
Threat Model
Best Practices
Require Code Reviews
Rely on Automation
Verify Artifacts, Not Just People
Treat Configuration as Code
Securing Against the Threat Model
Advanced Mitigation Strategies
Binary Provenance
Provenance-Based Deployment Policies
Verifiable Builds
Deployment Choke Points
Post-Deployment Verification
Practical Advice
Take It One Step at a Time
Provide Actionable Error Messages
Ensure Unambiguous Provenance
Create Unambiguous Policies
Include a Deployment Breakglass
Securing Against the Threat Model, Revisited
Conclusion
Investigating Systems
From Debugging to Investigation
Example: Temporary Files
Debugging Techniques
What to Do When You’re Stuck
Collaborative Debugging: A Way to Teach
How Security Investigations and Debugging Differ
Collect Appropriate and Useful Logs
Design Your Logging to Be Immutable
Take Privacy into Consideration
Determine Which Security Logs to Retain
Budget for Logging
Robust, Secure Debugging Access
Reliability
Security
Conclusion
Part IV. Maintaining Systems
Disaster Planning
Defining “Disaster”
Dynamic Disaster Response Strategies
Disaster Risk Analysis
Setting Up an Incident Response Team
Identify Team Members and Roles
Establish a Team Charter
Establish Severity and Priority Models
Define Operating Parameters for Engaging the IR Team
Develop Response Plans
Create Detailed Playbooks
Ensure Access and Update Mechanisms Are in Place
Prestaging Systems and People Before an Incident
Configuring Systems
Training
Processes and Procedures
Testing Systems and Response Plans
Auditing Automated Systems
Conducting Nonintrusive Tabletops
Testing Response in Production Environments
Red Team Testing
Evaluating Responses
Google Examples
Test with Global Impact
DiRT Exercise Testing Emergency Access
Industry-Wide Vulnerabilities
Conclusion
Crisis Management
Is It a Crisis or Not?
Triaging the Incident
Compromises Versus Bugs
Taking Command of Your Incident
The First Step: Don’t Panic!
Beginning Your Response
Establishing Your Incident Team
Operational Security
Trading Good OpSec for the Greater Good
The Investigative Process
Keeping Control of the Incident
Parallelizing the Incident
Handovers
Morale
Communications
Misunderstandings
Hedging
Meetings
Keeping the Right People Informed with the Right Levels of Detail
Putting It All Together
Triage
Declaring an Incident
Communications and Operational Security
Beginning the Incident
Handover
Handing Back the Incident
Preparing Communications and Remediation
Closure
Conclusion
Recovery and Aftermath
Recovery Logistics
Recovery Timeline
Planning the Recovery
Scoping the Recovery
Recovery Considerations
Recovery Checklists
Initiating the Recovery
Isolating Assets (Quarantine)
System Rebuilds and Software Upgrades
Data Sanitization
Recovery Data
Credential and Secret Rotation
After the Recovery
Postmortems
Examples
Compromised Cloud Instances
Large-Scale Phishing Attack
Targeted Attack Requiring Complex Recovery
Conclusion
Part V. Organization and Culture
Case Study: Chrome Security Team
Background and Team Evolution
Security Is a Team Responsibility
Help Users Safely Navigate the Web
Speed Matters
Design for Defense in Depth
Be Transparent and Engage the Community
Conclusion
Understanding Roles and Responsibilities
Who Is Responsible for Security and Reliability?
The Roles of Specialists
Understanding Security Expertise
Certifications and Academia
Integrating Security into the Organization
Embedding Security Specialists and Security Teams
Example: Embedding Security at Google
Special Teams: Blue and Red Teams
External Researchers
Conclusion
Building a Culture of Security and Reliability
Defining a Healthy Security and Reliability Culture
Culture of Security and Reliability by Default
Culture of Review
Culture of Awareness
Culture of Yes
Culture of Inevitably
Culture of Sustainability
Changing Culture Through Good Practice
Align Project Goals and Participant Incentives
Reduce Fear with Risk-Reduction Mechanisms
Make Safety Nets the Norm
Increase Productivity and Usability
Overcommunicate and Be Transparent
Build Empathy
Convincing Leadership
Understand the Decision-Making Process
Build a Case for Change
Pick Your Battles
Escalations and Problem Resolution
Conclusion
Appendix. A Disaster Risk Assessment Matrix
Download link