This post was first created by CodingMrWang (author: @Zexian Wang). Please keep the original link if you want to repost it.
Requirements
Functional
- Users should be able to upload or paste their data and get a unique URL to access it.
- Users will only be able to upload text.
- Data and links will expire after a specific timespan automatically.
Non-Functional
- Highly reliable (No data loss)
- Highly available
- Low latency
Estimation
Traffic
- New pastes: 1M/day
- Paste reads: 5M/day
Data size
- 10 KB per paste
QPS
- Write: 1M / 86,400 ≈ 12/s
- Read: 5M / 86,400 ≈ 58/s
Throughput
- Write: 12/s * 10 KB ≈ 120 KB/s
- Read: 58/s * 10 KB ≈ 600 KB/s
Storage
- 1M * 10KB = 10GB/day
- If we keep pastes for 10 years: 10 GB/day * 365 * 10 ≈ 36.5 TB in total
Memory
- Cache 20% of daily read requests: 5M * 0.2 * 10 KB = 10 GB (see the back-of-envelope sketch below)
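The numbers above are easy to sanity-check with a short script (same inputs as the estimates, just recomputed and rounded):

```python
# Back-of-envelope numbers from the estimates above.
WRITES_PER_DAY = 1_000_000
READS_PER_DAY = 5_000_000
PASTE_SIZE_KB = 10
SECONDS_PER_DAY = 86_400

write_qps = WRITES_PER_DAY / SECONDS_PER_DAY        # ~12 writes/s
read_qps = READS_PER_DAY / SECONDS_PER_DAY          # ~58 reads/s

write_kb_per_s = write_qps * PASTE_SIZE_KB          # ~120 KB/s
read_kb_per_s = read_qps * PASTE_SIZE_KB            # ~580 KB/s (~600 KB/s)

storage_gb_per_day = WRITES_PER_DAY * PASTE_SIZE_KB / 1_000_000   # ~10 GB/day
storage_tb_10_years = storage_gb_per_day * 365 * 10 / 1_000       # ~36.5 TB

cache_gb = READS_PER_DAY * 0.2 * PASTE_SIZE_KB / 1_000_000        # ~10 GB

print(f"write QPS ~{write_qps:.0f}/s, read QPS ~{read_qps:.0f}/s")
print(f"write ~{write_kb_per_s:.0f} KB/s, read ~{read_kb_per_s:.0f} KB/s")
print(f"storage ~{storage_gb_per_day:.0f} GB/day, ~{storage_tb_10_years:.1f} TB over 10 years")
print(f"cache (20% of daily reads) ~{cache_gb:.0f} GB")
```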
Service
- PasteBinService
API
- addPaste(api_dev_key, data, paste_name): URL(String)
- getPaste(api_dev_key, api_paste_key)
- deletePaste(api_dev_key, api_paste_key)
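A minimal in-memory sketch of these three calls, just to make their shapes concrete. The dicts stand in for the object store and metadata DB introduced below; the base URL, default TTL, and collision-retry loop are assumptions for illustration.

```python
import secrets
import string
import time

# In-memory stand-ins for the object store and metadata DB in the real design.
_content_store: dict[str, str] = {}
_metadata_store: dict[str, dict] = {}

KEY_ALPHABET = string.ascii_letters + string.digits
BASE_URL = "https://paste.example.com/"   # hypothetical domain

def _random_key(length: int = 6) -> str:
    return "".join(secrets.choice(KEY_ALPHABET) for _ in range(length))

def add_paste(api_dev_key: str, data: str, paste_name: str = "",
              ttl_seconds: int = 30 * 24 * 3600) -> str:
    """Store the paste and return its unique URL."""
    key = _random_key()
    while key in _metadata_store:          # retry on the rare collision
        key = _random_key()
    _content_store[key] = data
    _metadata_store[key] = {
        "pasteId": key,
        "pasteName": paste_name,
        "userId": api_dev_key,
        "creationDate": time.time(),
        "expirationDate": time.time() + ttl_seconds,
    }
    return BASE_URL + key

def get_paste(api_dev_key: str, api_paste_key: str) -> str:
    """Return the paste content, or raise if missing or expired."""
    meta = _metadata_store.get(api_paste_key)
    if meta is None or meta["expirationDate"] < time.time():
        raise KeyError("paste not found or expired")
    return _content_store[api_paste_key]

def delete_paste(api_dev_key: str, api_paste_key: str) -> None:
    """Remove both the content and the metadata for the paste."""
    _content_store.pop(api_paste_key, None)
    _metadata_store.pop(api_paste_key, None)
```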
Data Storage
- We need to save billions of records.
- Each metadata record is small (less than 1 KB)
- Each paste object can be larger (up to around 1000 KB), though the average is about 10 KB per the estimation above
- No relationships between records
- Read-heavy
So we can use a NoSQL store for the metadata and a distributed file system or object store (such as S3) for the paste content.
Schema
| Paste |
| --- |
| pasteId |
| expirationDate |
| userId |
| creationDate |
User details can live in a separate User table.
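For illustration, the metadata record could look like the following. Field names follow the table above; the types and epoch-seconds convention are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PasteMetadata:
    """Metadata row for a paste; the content itself lives in the object store."""
    pasteId: str          # six-character key, also used as the URL suffix
    userId: str
    creationDate: int     # epoch seconds
    expirationDate: int   # epoch seconds; used by the cleanup job
```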
Workflow
Write Request:
- User submits the paste data
- Load balancer -> server
- Server -> generates a random key and saves the content into an S3 bucket
- Server -> saves the metadata record into the Metadata table
- Server -> returns the unique URL to the user (see the sketch after this list)
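The post only names "an S3 bucket" and "the Metadata table", so below is one possible mapping of the write path onto boto3. The bucket and table names, the base URL, and the choice of DynamoDB as the NoSQL metadata store are assumptions for illustration; the key is assumed to come from the key generation service described next.

```python
import time
import boto3

# Assumed resource names for illustration only.
s3 = boto3.client("s3")
metadata_table = boto3.resource("dynamodb").Table("PasteMetadata")

CONTENT_BUCKET = "pastebin-content"
BASE_URL = "https://paste.example.com/"

def handle_write(user_id: str, data: str, key: str, ttl_seconds: int) -> str:
    """Write path: store content in S3, metadata in the NoSQL table, return the URL."""
    # 1. Save the paste content under the generated key.
    s3.put_object(Bucket=CONTENT_BUCKET, Key=key, Body=data.encode("utf-8"))
    # 2. Save the metadata record.
    now = int(time.time())
    metadata_table.put_item(Item={
        "pasteId": key,
        "userId": user_id,
        "creationDate": now,
        "expirationDate": now + ttl_seconds,
    })
    # 3. Return the unique URL to the user.
    return BASE_URL + key
```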
Key generation
It is possible that we generate a key that is already in use, so we would have to check the database for the key before inserting each record. Another solution is a Key Generation Service (KGS) that generates random six-letter strings ahead of time and stores them in a key-value DB. When we store a new paste, we simply take one key from that DB and delete it so it can never be handed out again. This keeps the write path simple and fast. We can also keep a batch of keys in memory to increase speed, and replicate the key DB to avoid a single point of failure.
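A rough sketch of such a Key Generation Service, with in-memory sets standing in for the key-value DB's unused/issued tables and a small deque as the in-memory batch:

```python
import secrets
import string
from collections import deque

KEY_ALPHABET = string.ascii_letters + string.digits

class KeyGenerationService:
    """Pre-generates unique six-character keys so the write path never has to
    check the metadata DB for collisions."""

    def __init__(self, batch_size: int = 1000, memory_batch: int = 100):
        self._unused = set()     # stand-in for the "unused keys" KV table
        self._issued = set()     # stand-in for the "issued keys" KV table
        self._cached = deque()   # keys kept in application memory for speed
        self._batch_size = batch_size
        self._memory_batch = memory_batch

    def _refill(self) -> None:
        # Generate keys until the unused pool is full, skipping any key that
        # has already been issued, so handed-out keys can never repeat.
        while len(self._unused) < self._batch_size:
            key = "".join(secrets.choice(KEY_ALPHABET) for _ in range(6))
            if key not in self._issued:
                self._unused.add(key)

    def take_key(self) -> str:
        """Hand out one key and move it to the issued set so it is never reused."""
        if not self._cached:
            self._refill()
            for _ in range(min(self._memory_batch, len(self._unused))):
                self._cached.append(self._unused.pop())
        key = self._cached.popleft()
        self._issued.add(key)
        return key
```

With this in place, the write path simply calls `kgs.take_key()` instead of generating and collision-checking a key itself.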
Diagram
Load balancing
- Load balancer between client and server
- Load balancer between server and database
- Load balancer between server and cache
Problem with round-robin LB: it assumes all servers have the same compute power, and if some pastes are much more popular, the servers serving them end up with a higher load. We can instead use least-connections load balancing: the load balancer tracks how busy each server is and routes each new request to the least busy one.
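A toy illustration of the least-connections idea (server names are made up; in practice a load balancer such as Nginx or HAProxy implements this policy for you):

```python
class LeastConnectionLoadBalancer:
    """Route each request to the server with the fewest active connections,
    instead of rotating blindly like round robin."""

    def __init__(self, servers: list[str]):
        self._active = {server: 0 for server in servers}

    def acquire(self) -> str:
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server: str) -> None:
        self._active[server] -= 1


lb = LeastConnectionLoadBalancer(["app-1", "app-2", "app-3"])
server = lb.acquire()   # busier servers are skipped automatically
# ... proxy the request to `server` ...
lb.release(server)
```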
DB Clean
Run a job periodically to delete expired links. We can also use lazy deletion: if a link turns out to be expired when a user requests it, delete it at that point.
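A sketch of both strategies, again using in-memory dicts as stand-ins for the metadata table and the content store:

```python
import time

def purge_expired(metadata_store: dict, content_store: dict) -> int:
    """Periodic cleanup job: delete every paste whose expirationDate has passed."""
    now = time.time()
    expired = [key for key, meta in metadata_store.items()
               if meta["expirationDate"] < now]
    for key in expired:
        metadata_store.pop(key, None)
        content_store.pop(key, None)
    return len(expired)

def get_paste_lazy(key: str, metadata_store: dict, content_store: dict):
    """Lazy deletion: if an expired paste is requested, delete it on the spot."""
    meta = metadata_store.get(key)
    if meta is None:
        return None
    if meta["expirationDate"] < time.time():
        metadata_store.pop(key, None)
        content_store.pop(key, None)
        return None
    return content_store.get(key)
```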