Types of DataBase for System Design Interview
In a system design interview, how good your design is and how well your system can scale depends very much on the database which you have used. Functional requirements of a system can be fulfilled by using any database. Nonfunctional requirements are impacted by the choice of database. In this article, we will see various types of database for system design, which could be used in various design scenarios.
Database Selection Depends on?
Let’s imagine a system needs to handle a specific query pattern, a certain kind of data structure(data can be ordered or unordered), or a certain scale to handle. There are various sets of databases designed to meet various needs of this nature. So, based on your choice of a database would impact how well your design could scale up to the requirements that are given as part of your non-functional requirement.
In this article, we will discuss a few typical use cases of databases in various contexts that you might come across in your system design interview.
Types of database
Let’s see different types of database for system design.
Caching in terms of hardware refers to the use of specialized memory components to store frequently accessed data closer to the processor or other components, thereby improving overall system performance.
Whichever system you design, you would definitely need caching. There are several use cases for caching. For instance, let’s you are querying a database and you don’t want to query a database a lot of time, then you could cache a value in the cache.
Alternatively, if you are making a remote call to a different service that is having a high latency, then you might want to cache the response of that system locally at your end. And there could be lots of other use cases for caching.
How does caching work?
Let’s say you have a key and a value. The key normally is the where clause condition in a query or whatever your query param or your request param when you are making an API call and the value is the response that you expect from the other system. So, all these values can be stored as key-value pairs.
Commonly used caching solutions are Redis, etcd, Memcached, and Hazelcast.
Let’s look at some of the file storage options. Assume you are designing a system like Amazon, where a buyer would upload images and videos of the product. Or some system like Netflix which has movies and you need a storage that supports videos.
So, whenever you want to design a system that has a requirement to store videos/images, there we can use something called Blog Storage(which refers to the mechanism or infrastructure used to store and manage blog content). Blog Storage is not really a database. A database is fundamentally intended for items that can be queried. An image or video file is not something you typically inquire on. You will just server it(file) as it is. For that requirement, we use Blog storage.
One of the example of Blog storage is Amazon S3. So, whenever there is a need for a system design that requires storing images/videos, you could use Amazon S3 as your data store.
Content Delivery Network
CDN stands for Content Delivery Network. It is a geographically distributed network of servers and data centers that work together to deliver web content to users based on their geographic location.
When a user requests content from a website, the request is routed to the nearest server in the CDN instead of the website’s origin server. A CDN reduces latency and network congestion, resulting in faster content delivery.
Along with Amazon S3 (blog storage), you might need to use CDN(Content delivery network). CDN is generally used to distribute the content (video/image) geographically across different locations. For instance, you have a product image stored in Amazon S3 as a primary data source. There are a lot of people across the globe trying to access a product image. So, you might want to distribute that image to various servers across the globe. So, individual people can query them in a much faster way compared to querying in Amazon S3 which is located in a couple of locations.
Let’s see an example of how we can implement Amazon S3 + CDN for blog storage.
Text Search Engine
Let’s say you are building some product like Amazon (or Netflix). Where you need the text searching capability on various products. The seller has uploaded a product with some title and description and you would like to display this product based on some search query. The search would be based on the text of the title and description.
A common implementation of the text search engine is provided by Elastic Search and Apache Solr. Both of them are built on top of Apache Lucene(a powerful open-source search library).
Let’s say a user is searching for cookeis(typo). If the application does not display products related to cookies, then it might be a bad user experience right. So, you need your database to be able to figure out that the user might be looking for cookies.
How does the database identify? Well, cookeis can be converted to cookies by changing 2 characters. So the edit distance is 2.
You can provide your database with the level of fuzziness it supports.
Whenever you need text searching capability you can use Elastic Seach or Apache Solr. One important thing to note is these are not databases. These are search engines. So, the difference between them is whenever you write something in a Database, the Database gives you a guarantee that data would be lost. But these data stores (Solr and Elastic Search) don’t give you such a guarantee. They claim that they give a good amount of redundancy and availability, but potentially the data could be lost.
So, your primary data store should be some database and along with that, you could load the data in text search engines(Elastic Search or Solr) to provide the text searching capability.
Time Series Databases
Let’s say we have a matrix format data to be stored in our database. Consider you are building a system similar to Graphite, Grafana, or Prometheus which is basically an application metrics tracking system. Suppose that the use case that you are given is a lot of applications are pushing metrics related to their throughput, CPU utilization, latency, etc. And you want a system to support this functionality. So, in such scenarios, we can use Time Series Database.
Time Series Databases are designed to store and analyze data that changes over time. Simply, we can say they are regular databases with a strong focus on capturing and organizing data that is time-stamped. You can use this database when you are required to handle the data that is collected at regular intervals, such as Stock prices, sensor readings, or weather monitoring. A time series database provides optimized storage and indexing mechanisms so that it can help you with fast retrieval and analysis of data based on time. It can also be used for prediction models by analyzing historical data.
You can think of a time series database as an extension of the relational databases but with not all the functionality and certain additional functionality. Regular relational databases have the ability to update and query a lot of records. But when you are building a metrics monitoring kind of system, you would never do random updates.
You would always do a sequential update in append-only mode. Let’s say you put an entry at time T1, the next entry would be at time T2, which is greater than time T1.
Also, the read queries you do, are kind of bulk-read queries with a time range. You might query a few last minutes of data or a few hours of data. But you don’t do random read or random update. Time series databases are optimized for this kind of query pattern and input pattern.