If you think about it, data is at the heart of all the processing done by applications running on cloud infrastructure. The underlying data and its management are the reason so much blood, sweat, and tears go into building any IT system.
Let us begin with the absolute basics. Answer the questions below to warm up to thinking about data's many aspects.
What is data?
How is it stored?
How is it communicated?
How large is the data?
What are the types of data we deal with?
How is the data retrieved?
How is the data kept secure?
Whenever I discuss data with technical teams, I think of it in terms of persistence. But persistence is such a vague word! Even the source code of an application, the OS, the kernel, the BIOS, and assembly instructions have to persist somewhere, right? In a way, even the programs that we write are “data.”
Cloud-native system performance depends on data in so many ways - the way the data itself is stored, the location where it is stored, the format and protocol of its transmission, etc. Even the business-oriented aspects play a crucial role in classifying the data and imposing appropriate security restrictions, which add compute overhead of their own.
As far as data at rest is concerned, it is stored mainly in three ways - block, file, and object storage. Databases are built on top of one of these and would not strictly need a separate mention; however, for the sake of this post, we discuss databases in greater depth as well.
This is part of the series “Cloud-Native System Performance” where I cover some of my approaches to enhancing a system deployed on the cloud. If you haven’t read them yet, please find the links to them below:
Cloud-Native System Performance - Better Approaches To Storage & Memory (This post)
I have compiled this series into a FREE eBook with more details and deeper insights. Link below! (PDF)
The purpose of this section is to provide you with a quick primer on the types of storage so that we understand the context better.
The first of these, block storage: think of it as an external hard drive, but unformatted. Data is stored in fixed-size blocks, and it is possible to attach this type of storage to multiple compute nodes in the network, making the data available for simultaneous access.
Block-level storage stores data in raw format. However, it is possible to format a block store so that it can hold files or host databases.
Also known as file-based storage, it offers storage formatted with a specific filesystem or file protocol (NTFS, NFS, etc.). File-based storage is intuitive for users because it organizes data in a tree structure.
The tree structure contains files and folders. Folders may contain further folders as well as files. This gives every file a unique path, which makes it easy to identify. File storage also works well with applications that need read and write access to files.
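A minimal sketch of the point above, using Python's standard library: in a file store, the folders above a file combine into one unique path that identifies it. The folder and file names here are purely illustrative.

```python
from pathlib import Path
import tempfile

# Build a small tree-structured store in a temporary directory.
root = Path(tempfile.mkdtemp())
(root / "reports" / "2023").mkdir(parents=True)
(root / "reports" / "2023" / "q1.csv").write_text("revenue,100\n")

# The full path uniquely identifies the file within the store,
# and applications can read and write through it directly.
unique_path = root / "reports" / "2023" / "q1.csv"
print(unique_path.read_text())
```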
A relatively newer type of storage, the object store is highly scalable and efficient. Data is stored in the form of objects along with associated metadata. Object storage is best suited to large volumes of data written once and read often; frequent in-place changes are a poor fit, since updating an object typically means rewriting it as a whole.
The metadata associated with an object provides richer information about it than other storage types offer. This opens up the possibility of managing data much more efficiently by reducing direct reads of the objects themselves.
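To illustrate the metadata point, here is a toy in-memory object store. It is not any particular provider's API - the function names are invented for this sketch - but it shows how a metadata-only read (like an HTTP HEAD) avoids fetching a potentially large object body.

```python
# Toy object store: each object is a body plus a metadata dictionary.
object_store = {}

def put_object(key, body, metadata):
    object_store[key] = {"body": body, "metadata": metadata}

def head_object(key):
    # Like an HTTP HEAD request: return only the metadata, never the body.
    return object_store[key]["metadata"]

put_object(
    "videos/intro.mp4",
    body=b"\x00" * 1024,  # stand-in for a large blob
    metadata={"content-type": "video/mp4", "duration-sec": 90},
)

# Consumers can filter or route on metadata without a full object read.
print(head_object("videos/intro.mp4")["duration-sec"])
```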
A lot of literature is available on the internet to understand the key differences between these storage types. Even the major cloud providers offer multiple services as far as storage is concerned, and they have documented a lot of aspects with use cases in mind. It is really helpful to go through them as well.
Since we are dealing with cloud-native infrastructure, I will limit this section to the two main memory types - RAM and cache.
Applications and the data they process are usually copied into memory first. RAM is a short-term, non-persistent type of memory used by applications for faster IO on data. All calculations and logic operations happen here before results are written to any database or storage location.
Applications use many variables to support their logic. These variables are references to locations within this memory. You must have heard about heap memory, garbage collectors, etc. - these terms describe the allocation and deallocation of memory within RAM. We will look at these in greater detail later in the post.
Databases play a major role when it comes to persisting data in a structured format. Think of structured data in the form of tables, rows, and columns. Database engines enable much quicker access to this data than reading it from a plain file.
Relational databases provide a much-needed baseline for application logic to query data in a pre-programmed format, using SQL - a language for creating, reading, updating, and deleting (CRUD) data. Non-relational databases are a newer addition; instead of tables, they organize data into collections, documents, and attributes. Both have their own pros and cons.
When it comes to performance, applications reading and writing data to and from the database face several bottlenecks. In the next section, we address a few of them.
IO operations increase the response time of the system.
Slower disk IO can be attributed to multiple factors - availability of connections in the pool, access patterns, and even the underlying hardware. This in itself is a very common issue faced by many teams.
Merely identifying this is a great achievement. When you look at the big picture, the cause of a performance problem could be almost anything, so having proof that disk IO is the bottleneck is a win in itself.
Once this is confirmed, there are various ways described here that can help resolve this issue.
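A minimal sketch of how to gather that proof: time each phase of a request separately so that slow storage shows up in the numbers. The two "fake" functions below are stand-ins I invented for a real disk read and real in-memory processing.

```python
import time

def timed(fn, *args):
    # Run fn and return its result together with the elapsed seconds.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def fake_disk_read():          # stand-in for a real disk/database read
    time.sleep(0.05)
    return b"rows"

def fake_compute(data):        # stand-in for in-memory processing
    return len(data)

_, io_seconds = timed(fake_disk_read)
_, cpu_seconds = timed(fake_compute, b"rows")

# If the IO phase dominates, that is your evidence of a disk bottleneck.
print(f"io={io_seconds:.3f}s compute={cpu_seconds:.3f}s")
```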
Unindexed, random access to commonly read data from a large table results in performance lag.
Whenever you execute a simple read query on an unindexed table, the database scans the entire table row by row to find the rows matching the WHERE clause. This process is expensive in both time and resources.
Indexes maintain a sorted lookup structure (typically a B-tree) over one or more columns of a table. The database can use the index to pinpoint the specific rows it needs, avoiding a full scan of the table.
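A small sketch of the effect using SQLite (chosen only because it ships with Python): the query plan for the same filter changes from a full table scan to an index search once an index exists on the filtered column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"cust{i % 100}") for i in range(10_000)])

# Without an index: the plan reports a full scan of the table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchall()

conn.execute("CREATE INDEX idx_customer ON orders(customer)")

# With an index: the plan reports a search using the index instead.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'cust7'"
).fetchall()

print(plan_before)
print(plan_after)
```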
The system is unable to serve multiple parallel requests due to a bottleneck at the database connection layer.
A database transaction is a fairly expensive task for an application to perform. The request first opens a connection to the database, the connection is authenticated, the transaction takes place, and the connection is closed. Of course, this is a simplified view of what a bare-minimum database transaction involves.
Sizing the connection pool to an optimum value is the key: small enough to avoid contention at the database, but large enough that requests rarely wait for a free connection.
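A minimal connection pool sketch, assuming a SQLite database purely for illustration: a fixed set of connections is created once and reused, so each request skips the connect-and-authenticate cost, and the pool size caps how many connections hit the database at once.

```python
import queue
import sqlite3

class ConnectionPool:
    """A fixed-size pool of pre-opened connections."""

    def __init__(self, size, factory):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5.0):
        # Blocks if all connections are in use - this is the cap.
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=4, factory=lambda: sqlite3.connect(":memory:"))

conn = pool.acquire()
result = conn.execute("SELECT 1").fetchone()[0]
pool.release(conn)          # the connection goes back for reuse
print(result)
```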
The application performs complex database queries, which results in slower responses.
This is partly related to the database schema. If effort is put into designing the schema to align with the needs of the application, this problem can be addressed.
A good strategy is to identify all the data that may be required to process a particular request - as far as possible while the session is still active - and query it once so it can be loaded into memory or cached. The idea is to avoid repeated database hits while serving immediate requests.
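A sketch of that strategy as a request-scoped cache, again using SQLite for illustration; the schema and the user-preferences scenario are invented for the example. One query loads everything the session will need, and later lookups come from memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prefs (user_id INTEGER, key TEXT, value TEXT)")
conn.executemany("INSERT INTO prefs VALUES (?, ?, ?)",
                 [(42, "theme", "dark"), (42, "lang", "en")])

class SessionCache:
    def __init__(self, conn, user_id):
        # One up-front query loads all data the session is likely to need.
        rows = conn.execute(
            "SELECT key, value FROM prefs WHERE user_id = ?", (user_id,)
        ).fetchall()
        self._data = dict(rows)

    def get(self, key):
        return self._data[key]  # served from memory, no further database hit

session = SessionCache(conn, user_id=42)
print(session.get("theme"), session.get("lang"))
```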
The application fails with a heap out-of-memory error.
Applications and programs use main memory to store variables and other data during runtime. Memory locations are allocated to hold these variables, and references to those locations are what the program uses to execute its tasks.
It is important to clean up these memory locations, or to let the garbage collector free the space for other threads and processes. Consider using weak references over strong references.
If a program is facing heap memory issues, one way to address them is to increase the size of main memory. However, main memory is relatively expensive.
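A small sketch of the weak-versus-strong reference distinction in Python: an object held only by a weak reference can be reclaimed, so weak references do not pin memory the way ordinary (strong) references do. (The immediate reclamation here relies on CPython's reference counting.)

```python
import weakref

class LargeBuffer:
    """Stand-in for an object that occupies significant heap memory."""
    def __init__(self):
        self.data = bytearray(1024 * 1024)

strong = LargeBuffer()
weak = weakref.ref(strong)

assert weak() is strong   # target still alive via the strong reference
del strong                # drop the only strong reference
# In CPython the refcount hits zero and the buffer is freed immediately;
# the weak reference did not keep it alive.
print(weak())
```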
I have highlighted above some of the data management issues that affect end-to-end system performance. Most of the resolutions described also hold true in the cloud-native context. In the next section, we will look at some features cloud providers offer for better approaches to data management.
Read more challenges in the FREE eBook available below: (PDF)
Top 5 Cloud-Native Approaches
Selecting the Right Storage
Cloud storage comes in all shapes and sizes. Providers don't just offer object, block, and file storage options; they add features that let you tailor each one to your application's needs.
Cloud providers offer scale, reliability, security, network connectivity, and much more, and they take care of the pain of hosting. Since these storage options are cloud-native, they are very well integrated with the provider's ecosystem of services.
You can easily configure a block storage volume of the desired size and let multiple applications, hosted on various platforms - private or public - take advantage of it.
Similarly, object stores are very user- and developer-friendly. The services are ready to use, with no setup time required.
Take advantage of the various purpose-built databases readily provided by cloud vendors. As established earlier, selecting the right database is crucial for application performance; using a traditional relational database for an IoT or blockchain-based application is usually the wrong choice.
Major cloud providers offer databases as a service, with a catalog of databases tuned for specific domains and use cases. They even publish reference architectures and case studies that consolidate industry patterns and best practices.
Compute over Storage
Wherever possible, prefer in-memory compute over database transactions. Holding data in memory for a few extra milliseconds and processing it there is often cheaper than the resources consumed by the corresponding database transactions.
Also, where possible, write data back to the database in batches. This depends on how the application is developed and requires conscious effort on the developer's side.
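The batching idea can be sketched as follows, using SQLite for illustration: rows accumulate in application memory and are flushed in one transaction with `executemany()`, instead of one round trip per row. The metrics table and the flush threshold are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts INTEGER, value REAL)")

batch = []                       # rows accumulated in application memory

def flush():
    with conn:                   # a single transaction for the whole batch
        conn.executemany("INSERT INTO metrics VALUES (?, ?)", batch)
    batch.clear()

def record(ts, value, flush_at=100):
    batch.append((ts, value))
    if len(batch) >= flush_at:   # write out once the batch is full
        flush()

for ts in range(250):
    record(ts, ts * 0.5)
flush()                          # write any remaining rows

count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(count)
```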
In databases, consistency defines whether immediate reads are up to date with the latest written information. Applications that perform many IO tasks against a database tend to use locking mechanisms to make sure concurrent writes happen in a consistent way.
However, when such frequently changing data is read, especially where multiple readers are allowed, the data may or may not be consistent with the latest write. To serve many read requests, the database is often replicated to maintain a real-time or near-real-time copy of the primary, whose purpose is to serve all the reads.
In such cases, if strong consistency is desired, it adds performance lag: read operations have to wait until write operations are replicated to all copies of the database.
Identify the cases where eventual consistency can be tolerated, and drop strong consistency there. This improves performance; cloud-native solutions provide this ability, and eventual consistency often helps with cost optimization as well.
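A toy sketch of the replica scenario described above (the class and its methods are invented for illustration): writes go to the primary and are copied to a replica later, so a read from the replica immediately after a write may return stale data. That staleness window is what eventual consistency tolerates.

```python
class ReplicatedStore:
    """Primary/replica pair with explicit, delayed replication."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def write(self, key, value):
        self.primary[key] = value   # replica is not updated yet

    def replicate(self):
        # In a real system this happens asynchronously, after some lag.
        self.replica.update(self.primary)

store = ReplicatedStore()
store.write("balance", 100)

stale = store.replica.get("balance")   # read before replication: stale (None)
store.replicate()
fresh = store.replica.get("balance")   # read after replication: up to date
print(stale, fresh)
```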
Partitioning a storage volume or a database offers a clear performance advantage. Think of it as a high-level classification of the data itself when it is stored in large volumes.
When a query is run against such volumes, chances are that the entire database or storage volume is scanned to fetch the requested data. Partitioning reduces the scope of this scan, improving the response time. A good partitioning strategy is defined based on the stored data itself.
Cloud platforms provide various services for partitioning data. As opposed to traditional setups, once configured, all partitioning needs are handled by the platform itself. So consider partitioning your data.
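A toy sketch of partition pruning: rows are split by a partition key (here, month - an invented example), so a query touches only the one partition it needs instead of scanning every row in the store.

```python
from collections import defaultdict

partitions = defaultdict(list)   # partition key (month) -> rows

def insert(row):
    partitions[row["month"]].append(row)

def query(month, predicate):
    # Only the matching partition is scanned - the scope of the scan
    # shrinks from the whole store to one partition.
    return [r for r in partitions[month] if predicate(r)]

for i in range(1000):
    insert({"month": f"2023-{(i % 12) + 1:02d}", "amount": i})

march = query("2023-03", lambda r: r["amount"] > 500)
print(len(partitions), "partitions;", len(march), "matching rows in 2023-03")
```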
In this blog post we looked at a few challenges and cloud-native approaches to improving performance from a database and memory management perspective. In the next post we will look at how networking can be improved to enhance overall cloud-native system performance.