Design Yelp

Proximity Server

Posted by CodingMrWang on February 5, 2022

This post was first created by CodingMrWang (author: @Zexian Wang). Please keep the original link if you want to repost it.

What is a Proximity Server

Proximity servers are used to discover nearby attractions such as restaurants, shops, and other places of interest.

Requirements

Functional requirements

  • Users should be able to add/delete/update places.
  • Given their location (longitude/latitude), users should be able to find all nearby places within a given radius.
  • Users should be able to add feedback/review about a place.

Non-functional requirements

  • Low latency
  • Scalability

Estimation

Traffic

500M DAU, each searching three times per day; 100K restaurant updates per day.

Data size

793B of metadata per restaurant (locationId 8B, name 256B, latitude 8B, longitude 8B, description 512B, category 1B)

QPS

Read: 500M * 3 / 86400 ≈ 17,360/s
Write: 100K / 86400 ≈ 1.2/s

Storage

500M restaurants in total: 500M * 1KB = 500GB

API

search(user_id, search_terms, user_location, radius_filter, maximum_results_to_return, category_filter, sort, page_token)

Returns: a JSON document containing information about a list of businesses matching the search query. Each result entry has the business name, address, category, rating, and thumbnail.
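
A minimal sketch of this endpoint's behavior in Python, run against a tiny in-memory dataset. The Business dataclass, PLACES list, and haversine helper are illustrative stand-ins for the real data store and geo index, not part of the original design; text matching and pagination are omitted.

```python
import math
from dataclasses import dataclass, asdict

@dataclass
class Business:
    name: str
    address: str
    category: str
    rating: float
    latitude: float
    longitude: float

# Stand-in for the real data store / geo index.
PLACES = [
    Business("Joe's Pizza", "7 Carmine St", "restaurant", 4.5, 40.7306, -74.0027),
    Business("Blue Bottle", "54 Mint Plaza", "cafe", 4.3, 37.7826, -122.4077),
]

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def search(user_id, search_terms, user_location, radius_filter,
           maximum_results_to_return, category_filter=None,
           sort="rating", page_token=None):
    """Return businesses within radius_filter km of user_location."""
    lat, lng = user_location
    hits = [b for b in PLACES
            if haversine_km(lat, lng, b.latitude, b.longitude) <= radius_filter]
    if category_filter:
        hits = [b for b in hits if b.category == category_filter]
    hits.sort(key=lambda b: b.rating, reverse=True)  # only rating sort in this sketch
    return {"businesses": [asdict(b) for b in hits[:maximum_results_to_return]]}

print(search(user_id=1, search_terms="pizza", user_location=(40.73, -74.0),
             radius_filter=5, maximum_results_to_return=10))
```

In the real system, the candidate lookup would go through the geo index discussed below instead of scanning a flat list.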

Diagram

Database

This will be a read-heavy but write-light application: users search for restaurants every day, but restaurant info is updated far less often.

  • Naive solution: Store all data in a relational database and use a query like SELECT * FROM places WHERE latitude BETWEEN X-D AND X+D AND longitude BETWEEN Y-D AND Y+D. This query is very inefficient: each condition retrieves a large number of records, and intersecting the two result lists is slow.

  • Grids: We can divide the whole map into smaller grids to group locations into smaller sets. Each grid stores all the places within its range of longitude and latitude. We can use a NoSQL store like DynamoDB/Cassandra, using a region identifier such as zip code as the hash key and gridId as the range key to build the index.

  • Keep index in memory: We can keep our index in a hash table where the key is the grid number and the value is the list of places contained in that grid (a sketch of this index follows the list). If we have 20 million grids, we need a four-byte number to uniquely identify each grid: (4 * 20M) + (8 * 500M) ≈ 4GB. However, this solution can be slow when some grids hold a lot of places, since places may not be uniformly distributed among grids.

  • Dynamic-size grids: We can limit each grid to 500 places. Use a tree to store the grids: if a grid ends up with more than 500 places, split it into multiple children.
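
Before moving on to the QuadTree, here is a small sketch of the fixed-grid in-memory index from the bullets above. The grid size, the grid_index dictionary, and the helper names are assumptions made for illustration; a real search would still filter the returned candidates by exact distance.

```python
from collections import defaultdict

GRID_SIZE_DEG = 0.01  # roughly 1 km per cell near the equator (assumed)

def grid_id(lat, lng):
    """Map a point to the grid cell that contains it."""
    return (int(lat // GRID_SIZE_DEG), int(lng // GRID_SIZE_DEG))

# key: grid number, value: list of (place_id, lat, lng) in that grid
grid_index = defaultdict(list)

def add_place(place_id, lat, lng):
    grid_index[grid_id(lat, lng)].append((place_id, lat, lng))

def nearby_candidates(lat, lng):
    """Collect places in the cell containing (lat, lng) and its 8 neighbors;
    the caller then filters these candidates by exact distance."""
    row, col = grid_id(lat, lng)
    out = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out.extend(grid_index[(row + dr, col + dc)])
    return out

add_place("p1", 1.3521, 103.8198)  # two places a few hundred meters apart
add_place("p2", 1.3525, 103.8201)
print(nearby_candidates(1.3522, 103.8199))
```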

How do we build a QuadTree? We start with the root, break it down into four child nodes, and distribute locations among them. We keep repeating this process with each child node until no node is left with more than 500 locations.

How do we find the grid for a given location? We start at the root node and search downward to find the required node. At each step, we check whether the current node has children; if it does, we move to the child node that contains our desired location and repeat the process. If the node has no children, it is our desired node.

How do we find the neighboring grids of a given grid? We can connect all leaf nodes with a doubly linked list, then iterate forward and backward among neighboring leaf nodes to find the desired locations.

How much memory is needed to store the QuadTree? 24B per location * 500M locations = 12GB, and 500M / 500 = 1M grids. This can easily fit into a modern-day server.
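
The build and search steps above can be sketched as a compact QuadTree, assuming each node keeps a bounding box and splits once it exceeds its capacity. The 500-place limit is lowered to 4 here so the tiny demo actually splits, and the doubly linked list between leaves is omitted.

```python
CAPACITY = 4  # stands in for the 500-place limit in the text

class QuadTree:
    def __init__(self, min_lat, min_lng, max_lat, max_lng):
        self.bounds = (min_lat, min_lng, max_lat, max_lng)
        self.places = []      # (place_id, lat, lng) tuples, kept only in leaves
        self.children = None  # four child QuadTree nodes once this node splits

    def _contains(self, lat, lng):
        min_lat, min_lng, max_lat, max_lng = self.bounds
        return min_lat <= lat < max_lat and min_lng <= lng < max_lng

    def insert(self, place_id, lat, lng):
        if not self._contains(lat, lng):
            return False
        if self.children is None:
            self.places.append((place_id, lat, lng))
            if len(self.places) > CAPACITY:
                self._split()
            return True
        return any(c.insert(place_id, lat, lng) for c in self.children)

    def _split(self):
        """Break this leaf into four children and redistribute its places."""
        min_lat, min_lng, max_lat, max_lng = self.bounds
        mid_lat, mid_lng = (min_lat + max_lat) / 2, (min_lng + max_lng) / 2
        self.children = [
            QuadTree(min_lat, min_lng, mid_lat, mid_lng),
            QuadTree(min_lat, mid_lng, mid_lat, max_lng),
            QuadTree(mid_lat, min_lng, max_lat, mid_lng),
            QuadTree(mid_lat, mid_lng, max_lat, max_lng),
        ]
        for place in self.places:
            any(c.insert(*place) for c in self.children)
        self.places = []

    def query(self, min_lat, min_lng, max_lat, max_lng):
        """Return all places whose coordinates fall inside the given box."""
        b_min_lat, b_min_lng, b_max_lat, b_max_lng = self.bounds
        if b_max_lat < min_lat or b_min_lat > max_lat or \
           b_max_lng < min_lng or b_min_lng > max_lng:
            return []  # this node does not overlap the query box
        if self.children is None:
            return [p for p in self.places
                    if min_lat <= p[1] <= max_lat and min_lng <= p[2] <= max_lng]
        return [p for c in self.children
                  for p in c.query(min_lat, min_lng, max_lat, max_lng)]

tree = QuadTree(-90, -180, 90, 180)
for i in range(10):
    tree.insert(f"place{i}", 1.30 + i * 0.01, 103.80 + i * 0.01)
print(tree.query(1.29, 103.79, 1.345, 103.845))
```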

Data Partitioning

Shard the QuadTree across multiple machines:

  • Shard based on regions: may cause hot partitions; a load balancer can help spread the load.
  • Shard based on locationId: when we search, we need to query all servers and aggregate the partial results on a centralized server (see the sketch below).
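
A rough sketch of the scatter-gather path for the locationId-based sharding option: each shard answers a range query over its own slice of the data, and an aggregator merges the partial results. The Shard class here is a simple stand-in for a real QuadTree server.

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    def __init__(self, places):
        self.places = places  # (place_id, lat, lng) tuples owned by this shard

    def query(self, min_lat, min_lng, max_lat, max_lng):
        # Stand-in for a QuadTree range query on this shard.
        return [p for p in self.places
                if min_lat <= p[1] <= max_lat and min_lng <= p[2] <= max_lng]

def scatter_gather(shards, box, top_k=10):
    """Query every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: s.query(*box), shards)
    merged = [p for partial in partials for p in partial]
    return merged[:top_k]

shards = [Shard([("a", 1.30, 103.80)]), Shard([("b", 1.31, 103.81)])]
print(scatter_gather(shards, (1.29, 103.79, 1.35, 103.85)))
```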

Replication and fault tolerance

Keep multiple replicas of each QuadTree server. If all replicas of a QuadTree are lost, rebuild it by scanning all places in the database.

Cache

Save hot places' info in a cache; an LRU eviction policy works well. A small sketch follows.
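
A minimal LRU cache sketch for hot place metadata; fetch_place_from_db is a hypothetical slow path standing in for the real database read.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

def fetch_place_from_db(place_id):
    # Hypothetical stand-in for the database read of place metadata.
    return {"id": place_id, "name": f"Place {place_id}"}

cache = LRUCache(capacity=2)

def get_place(place_id):
    place = cache.get(place_id)
    if place is None:
        place = fetch_place_from_db(place_id)
        cache.put(place_id, place)
    return place

print(get_place("p1"), get_place("p2"), get_place("p1"))
```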

Another solution

GeoHash
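
GeoHash encodes a latitude/longitude pair into a short base32 string by interleaving longitude and latitude bits, so nearby points typically share a common prefix and "find places near me" becomes a prefix query on an indexed geohash column. Below is a minimal encoding sketch of the standard GeoHash algorithm; the precision value is an illustrative choice.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lng, precision=7):
    """Encode (lat, lng) into a geohash string of `precision` characters."""
    lat_range, lng_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, code = [], []
    even = True  # GeoHash starts with a longitude bit
    while len(code) < precision:
        rng, val = (lng_range, lng) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits.append(0)
            rng[1] = mid   # keep the lower half of the interval
        even = not even
        if len(bits) == 5:  # every 5 bits become one base32 character
            code.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(code)

# Two points a few hundred meters apart usually share a long common prefix.
print(geohash(1.3521, 103.8198))
print(geohash(1.3530, 103.8200))
```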