设计 Pastebin.com (或者 Bit.ly) 注意: 为了避免重复,当前文档会直接链接到系统设计主题的相关区域,请参考链接内容以获得综合的讨论点、权衡和替代方案。 设计 Bit.ly - 是一个类似的问题,区别是 pastebin 需要存储的是 paste 的内容,而不是原始的未短化的 url。 第一步:概述用例和约束 收集这个问题的需求和范畴。 问相关问题来明确用例和约束。 讨论一些假设。 因为没有面试官来明确这些问题,所以我们自己将定义一些用例和约束。
注意: 为了避免重复,当前文档会直接链接到系统设计主题的相关区域,请参考链接内容以获得综合的讨论点、权衡和替代方案。
设计 Bit.ly - 是一个类似的问题,区别是 pastebin 需要存储的是 paste 的内容,而不是原始的未短化的 url。
收集这个问题的需求和范畴。
问相关问题来明确用例和约束。
讨论一些假设。
因为没有面试官来明确这些问题,所以我们自己将定义一些用例和约束。
向面试官说明你是否应该粗略计算一下使用情况。
shortlink - 7 bytesexpiration_length_in_minutes - 4 bytescreated_at - 5 bytespaste_path - 255 bytes简单的转换指南:
概述一个包括所有重要的组件的高层次设计

深入每一个核心组件的细节
我们可以用一个 关系型数据库作为一个大的哈希表,用来把生成的 url 映射到一个包含 paste 文件的文件服务器和路径上。
为了避免托管一个文件服务器,我们可以用一个托管的对象存储,比如 Amazon 的 S3 或者NoSQL 文档类型存储。
作为一个大的哈希表的关系型数据库的替代方案,我们可以用NoSQL 键值存储。我们需要讨论选择 SQL 或 NoSQL 之间的权衡。下面的讨论是使用关系型数据库方法。
pastes 表里面向面试官阐明你需要写多少代码
pastes 表可以有如下结构:
shortlink char(7) NOT NULL expiration_length_in_minutes int NOT NULL created_at datetime NOT NULL paste_path varchar(255) NOT NULL PRIMARY KEY(shortlink)
我们将在 shortlink 字段和 created_at 字段上创建一个数据库索引,用来提高查询的速度(避免因为扫描全表导致的长时间查询)并将数据保存在内存中,从内存里面顺序读取 1MB 的数据需要大概 250 微秒,而从 SSD 上读取则需要花费 4 倍的时间,从硬盘上则需要花费 80 倍的时间。 1
为了生成唯一的 url,我们可以:
[a-zA-Z0-9] 是比较合适的+ 和 - 字符串而产生一些问题def base_encode(num, base=62): digits = [] while num > 0 remainder = modulo(num, base) digits.push(remainder) num = divide(num, base) digits = digits.reverse
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
我们将会用一个公开的 REST 风格接口:
$ curl -X POST --data '{"expiration_length_in_minutes":"60", \"paste_contents":"Hello World!"}' https://pastebin.com/api/v1/paste
Response:
{ "shortlink": "foobar" }
用于内部通信,我们可以用 RPC。
REST API:
curl https://pastebin.com/api/v1/paste?shortlink=foobar
Response:
{ "paste_contents": "Hello World", "created_at": "YYYY-MM-DD HH:MM:SS", "expiration_length_in_minutes": "60" }
因为实时分析不是必须的,所以我们可以简单的 MapReduce Web Server 的日志,用来生成点击次数。
class HitCounts(MRJob): def extract_url(self, line): """Extract the generated url from the log line.""" ... def extract_year_month(self, line): """Return the year and month portions of the timestamp.""" ... def mapper(self, _, line): """Parse each log line, extract and transform relevant lines. Emit key value pairs of the form: (2016-01, url0), 1 (2016-01, url0), 1 (2016-01, url1), 1 """ url = self.extract_url(line) period = self.extract_year_month(line) yield (period, url), 1 def reducer(self, key, values): """Sum values for each key. (2016-01, url0), 2 (2016-01, url1), 1 """ yield key, sum(values)
为了删除过期的 pastes,我们可以直接搜索 SQL 数据库 中所有的过期时间比当前时间更早的记录,
所有过期的记录将从这张表里面删除(或者将其标记为过期)。
给定约束条件,识别和解决瓶颈。

重要提示: 不要简单的从最初的设计直接跳到最终的设计
说明您将迭代地执行这样的操作:1)Benchmark/Load 测试,2)Profile 出瓶颈,3)在评估替代方案和权衡时解决瓶颈,4)重复前面,可以参考在 AWS 上设计一个可以支持百万用户的系统这个用来解决如何迭代地扩展初始设计的例子。
重要的是讨论在初始设计中可能遇到的瓶颈,以及如何解决每个瓶颈。比如,在多个 Web 服务器 上添加 负载平衡器 可以解决哪些问题? CDN 解决哪些问题?Master-Slave Replicas 解决哪些问题? 替代方案是什么和怎么对每一个替代方案进行权衡比较?
我们将介绍一些组件来完成设计,并解决可伸缩性问题。内部的负载平衡器并不能减少杂乱。
为了避免重复的讨论, 参考以下系统设计主题获取主要讨论要点、权衡和替代方案:
分析存储数据库 可以用比如 Amazon Redshift 或者 Google BigQuery 这样的数据仓库解决方案。
一个像 Amazon S3 这样的 对象存储,可以轻松处理每月 12.7 GB 的新内容约束。
要处理 平均 每秒 40 读请求(峰值更高),其中热点内容的流量应该由 内存缓存 处理,而不是数据库。内存缓存 对于处理分布不均匀的流量和流量峰值也很有用。只要副本没有陷入复制写的泥潭,SQL Read Replicas 应该能够处理缓存丢失。
对于单个 SQL Write Master-Slave,平均 每秒 4paste 写入 (峰值更高) 应该是可以做到的。否则,我们需要使用额外的 SQL 扩展模式:
我们还应该考虑将一些数据移动到 NoSQL 数据库。
是否更深入探讨额外主题,取决于问题的范围和面试剩余的时间。
参考安全。
Note: This document links directly to relevant areas found in the system design topics to avoid duplication. Refer to the linked content for general talking points, tradeoffs, and alternatives.
Design Bit.ly - is a similar question, except pastebin requires storing the paste contents instead of the original unshortened url.
Gather requirements and scope the problem.
Ask questions to clarify use cases and constraints.
Discuss assumptions.
Without an interviewer to address clarifying questions, we'll define some use cases and constraints.
Clarify with your interviewer if you should run back-of-the-envelope usage calculations.
shortlink - 7 bytesexpiration_length_in_minutes - 4 bytescreated_at - 5 bytespaste_path - 255 bytesHandy conversion guide:
Outline a high level design with all important components.

Dive into details for each core component.
We could use a relational database as a large hash table, mapping the generated url to a file server and path containing the paste file.
Instead of managing a file server, we could use a managed Object Store such as Amazon S3 or a NoSQL document store.
An alternative to a relational database acting as a large hash table, we could use a NoSQL key-value store. We should discuss the tradeoffs between choosing SQL or NoSQL. The following discussion uses the relational database approach.
pastes tableClarify with your interviewer how much code you are expected to write.
The pastes table could have the following structure:
shortlink char(7) NOT NULL expiration_length_in_minutes int NOT NULL created_at datetime NOT NULL paste_path varchar(255) NOT NULL PRIMARY KEY(shortlink)
Setting the primary key to be based on the shortlink column creates an index that the database uses to enforce uniqueness. We'll create an additional index on created_at to speed up lookups (log-time instead of scanning the entire table) and to keep the data in memory. Reading 1 MB sequentially from memory takes about 250 microseconds, while reading from SSD takes 4x and from disk takes 80x longer.1
To generate the unique url, we could:
[a-zA-Z0-9] which works well for urls, eliminating the need for escaping special characters+ and / charactersdef base_encode(num, base=62): digits = [] while num > 0 remainder = modulo(num, base) digits.push(remainder) num = divide(num, base) digits = digits.reverse
url = base_encode(md5(ip_address+timestamp))[:URL_LENGTH]
We'll use a public REST API:
$ curl -X POST --data '{ "expiration_length_in_minutes": "60", \ "paste_contents": "Hello World!" }' https://pastebin.com/api/v1/paste
Response:
{ "shortlink": "foobar" }
For internal communications, we could use Remote Procedure Calls.
REST API:
$ curl https://pastebin.com/api/v1/paste?shortlink=foobar
Response:
{ "paste_contents": "Hello World" "created_at": "YYYY-MM-DD HH:MM:SS" "expiration_length_in_minutes": "60" }
Since realtime analytics are not a requirement, we could simply MapReduce the Web Server logs to generate hit counts.
Clarify with your interviewer how much code you are expected to write.
class HitCounts(MRJob): def extract_url(self, line): """Extract the generated url from the log line.""" ... def extract_year_month(self, line): """Return the year and month portions of the timestamp.""" ... def mapper(self, _, line): """Parse each log line, extract and transform relevant lines. Emit key value pairs of the form: (2016-01, url0), 1 (2016-01, url0), 1 (2016-01, url1), 1 """ url = self.extract_url(line) period = self.extract_year_month(line) yield (period, url), 1 def reducer(self, key, values): """Sum values for each key. (2016-01, url0), 2 (2016-01, url1), 1 """ yield key, sum(values)
To delete expired pastes, we could just scan the SQL Database for all entries whose expiration timestamp are older than the current timestamp. All expired entries would then be deleted (or marked as expired) from the table.
Identify and address bottlenecks, given the constraints.

Important: Do not simply jump right into the final design from the initial design!
State you would do this iteratively: 1) Benchmark/Load Test, 2) Profile for bottlenecks 3) address bottlenecks while evaluating alternatives and trade-offs, and 4) repeat. See Design a system that scales to millions of users on AWS as a sample on how to iteratively scale the initial design.
It's important to discuss what bottlenecks you might encounter with the initial design and how you might address each of them. For example, what issues are addressed by adding a Load Balancer with multiple Web Servers? CDN? Master-Slave Replicas? What are the alternatives and Trade-Offs for each?
We'll introduce some components to complete the design and to address scalability issues. Internal load balancers are not shown to reduce clutter.
To avoid repeating discussions, refer to the following system design topics for main talking points, tradeoffs, and alternatives:
The Analytics Database could use a data warehousing solution such as Amazon Redshift or Google BigQuery.
An Object Store such as Amazon S3 can comfortably handle the constraint of 12.7 GB of new content per month.
To address the 40 average read requests per second (higher at peak), traffic for popular content should be handled by the Memory Cache instead of the database. The Memory Cache is also useful for handling the unevenly distributed traffic and traffic spikes. The SQL Read Replicas should be able to handle the cache misses, as long as the replicas are not bogged down with replicating writes.
4 average paste writes per second (with higher at peak) should be do-able for a single SQL Write Master-Slave. Otherwise, we'll need to employ additional SQL scaling patterns:
We should also consider moving some data to a NoSQL Database.
Additional topics to dive into, depending on the problem scope and time remaining.
Refer to the security section.
See Latency numbers every programmer should know.