Buried low in the software stack of most applications is a data engine, an embedded key-value store that sorts and indexes data. Until now, data engines—sometimes called storage engines—have received little focus, doing their thing behind the scenes, beneath the application and above the storage.
A data engine usually handles basic operations of storage management, most notably to create, read, update, and delete (CRUD) data. In addition, the data engine needs to efficiently provide an interface for sequential reads of data and atomic updates of several keys at the same time.
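To make that interface concrete, here is a minimal in-memory sketch of the operations just listed: CRUD, an ordered range scan, and an all-or-nothing batch of updates. The class name `ToyEngine` and its method names are hypothetical, chosen for illustration only. They are not the API of RocksDB or any other real engine, and a production engine would also persist data and handle concurrency.

```python
import bisect


class ToyEngine:
    """Hypothetical in-memory key-value engine (illustration only)."""

    def __init__(self):
        self._keys = []   # keys kept in sorted order for range scans
        self._data = {}   # key -> value

    def put(self, key, value):
        # Create a new key or update an existing one.
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        # Read a single key; returns None if the key is absent.
        return self._data.get(key)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._keys.pop(bisect.bisect_left(self._keys, key))

    def scan(self, start, end):
        # Sequential read: yield (key, value) for start <= key < end, in key order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]

    def write_batch(self, ops):
        # Multi-key update: validate every operation first, then apply them all,
        # so a malformed batch changes nothing.
        for op in ops:
            kind = op[0]
            valid = (kind == "put" and len(op) == 3) or (kind == "delete" and len(op) == 2)
            if not valid:
                raise ValueError(f"invalid operation: {op!r}")
        for op in ops:
            if op[0] == "put":
                self.put(op[1], op[2])
            else:
                self.delete(op[1])


engine = ToyEngine()
engine.put("b", "2")
engine.put("a", "1")
engine.write_batch([("put", "c", "3"), ("delete", "b")])
remaining = list(engine.scan("a", "z"))
```

In this sketch the batch is atomic only in the sense that validation happens before any change is applied. Real engines typically get the same guarantee by writing the whole batch to a log in one step.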
Organizations increasingly use data engines to process live data on the fly, while it is in transit. In this kind of implementation, popular data engines such as RocksDB are playing an increasingly important role in managing metadata-intensive workloads and in preventing metadata access bottlenecks that may degrade the performance of the entire system.
While metadata volumes seemingly consume a small portion of resources relative to the data, the impact of even the slightest bottleneck on the end user experience becomes uncomfortably evident, underscoring the need for sub-millisecond performance. This challenge is particularly salient when dealing with modern, metadata-intensive workloads such as IoT and advanced analytics.
The data structures within a data engine generally fall into one of two categories, either B-tree or LSM tree. Knowing the application usage pattern will suggest which type of data structure is optimal for the performance profile you seek. From there, you can determine the best way to optimize metadata performance when applications grow to web scale.
B-tree pros and cons
B-trees are fully sorted by the user-given key. Hence B-trees are well suited for workloads with plenty of reads and seeks, few writes, and data small enough to fit in DRAM. B-trees are a good choice for small, general-purpose databases.
However, B-trees have significant write performance issues for several reasons. These include the extra space overhead required for dealing with fragmentation, the write amplification caused by keeping the data sorted on each write, and the locks required to execute concurrent writes, which significantly impact the overall performance and scalability of the system.
LSM tree pros and cons
LSM trees are at the core of many data and storage platforms that need write-intensive throughput. These include applications that have many new inserts and updates to keys or write logs—something that puts pressure on write transactions both in memory and when memory or cache is flushed to disk.
An LSM tree is a partially sorted structure. Each level of the LSM tree is a sorted array of data. The uppermost level is held in memory and is usually based on B-tree-like structures. The other levels are sorted arrays of data that usually reside in slower persistent storage. Eventually a background process, known as compaction, takes data from a higher level and merges it with a lower level.
LSM trees have a write advantage over B-trees because writes are applied in memory, while a transaction log (a write-ahead log, or WAL) protects the data as it waits to be flushed from memory to persistent storage. Speed and efficiency increase because LSM uses an append-only write process that allows rapid sequential writes without the fragmentation challenges that B-trees are subject to. Inserts and updates can be made much faster, while the files on disk are organized and reorganized continuously by a background compaction process that reduces the size of the files needed to store the data.
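The write path described above can be sketched in a few lines. The following is a toy model, not how RocksDB or any real engine is implemented. The class name `ToyLSM`, the `memtable_limit` parameter, and the use of Python lists in place of on-disk files are all assumptions made for illustration. It shows the log being appended before the in-memory update, the memtable being flushed as a sorted run, and compaction merging runs so that newer values replace older ones.

```python
import bisect


class ToyLSM:
    """Toy LSM tree: a memtable plus sorted runs (illustration only)."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.wal = []        # append-only log protecting unflushed writes
        self.memtable = {}   # newest data, held in memory
        self.runs = []       # sorted runs standing in for files, newest first

    def put(self, key, value):
        self.wal.append((key, value))   # log first, then update memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted run; nothing is updated in place.
        if not self.memtable:
            return
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []   # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:   # newest run first, so the latest value wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, letting newer values overwrite older ones.
        merged = {}
        for run in reversed(self.runs):   # oldest first
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = ToyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")   # second entry triggers a flush
db.put("a", "9")
db.put("c", "3")   # triggers another flush; "a" now exists in two runs
runs_before = len(db.runs)
db.compact()
```

After compaction the two runs collapse into one and the stale value of "a" is gone, which is the sense in which compaction reduces the space needed on disk.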
LSM has its own disadvantages though. For example, read performance can be poor if data is accessed in small, random chunks. This is because the data is spread out and finding the desired data quickly can be difficult if the configuration is not optimized. There are ways to mitigate this with the use of indexes, bloom filters, and other tuning for file sizes, block sizes, memory usage, and other tunable options—presuming that developer organizations have the know-how to effectively handle these tasks.
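Of the mitigations mentioned, the bloom filter is the easiest to show in isolation. The sketch below is a generic textbook bloom filter, not the filter format used by any particular engine. The class name, the default sizes, and the use of SHA-256 as the hash function are assumptions chosen for readability rather than speed. The useful property is that a negative answer is definite, so a lookup can skip a file that cannot contain the key.

```python
import hashlib


class BloomFilter:
    """Generic bloom filter sketch (sizes and hash choice are illustrative)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was definitely never added.
        # True means the key may be present (false positives are possible).
        return all(self.bits[pos] for pos in self._positions(key))


file_filter = BloomFilter()
file_filter.add("user:1")
```

A read path would consult `might_contain` before opening a file and skip the file whenever the answer is False.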
Performance tuning for key-value stores
The three core performance factors in a key-value store are write amplification, read amplification, and space amplification. Each has significant implications for the application’s eventual performance, stability, and efficiency characteristics. Keep in mind that performance tuning for a key-value store is a living challenge that constantly morphs and evolves as the application utilization, infrastructure, and requirements change over time.
Write amplification is defined as the ratio of the total number of bytes written to storage to the number of bytes in the logical write operation requested by the application. As the data is moved, copied, and sorted within the internal levels, it is rewritten again and again, or amplified. Write amplification varies based on source data size, number of levels, size of the memtable, amount of overwrites, and other factors.
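Expressed as a ratio, write amplification is the number of bytes physically written to storage divided by the number of bytes the application asked to write. The short calculation below uses made-up byte counts for the log, the flush, and compaction. The function name and the counter names are hypothetical and do not correspond to statistics exposed by any real engine.

```python
def write_amplification(app_bytes, storage_bytes):
    """Bytes physically written divided by bytes the application wrote."""
    if app_bytes <= 0:
        raise ValueError("app_bytes must be positive")
    return storage_bytes / app_bytes


# Hypothetical counters for one workload (illustrative numbers only).
app_bytes = 1_000
storage_writes = {
    "wal": 1_000,          # each write is logged once
    "flush": 1_000,        # then written once when the memtable is flushed
    "compaction": 3_000,   # then rewritten as it moves through the levels
}
waf = write_amplification(app_bytes, sum(storage_writes.values()))
```

With these numbers every logical byte is written five times, giving a write amplification factor of 5.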
Read amplification is defined by the number of disk reads that an application read request causes. If a 1K data query is not satisfied by rows stored in the memtable, the read request goes to the files in persistent storage, and each file that has to be consulted adds to read amplification. The type of query (e.g., range query versus point query) and the size of the data request also affect read amplification and overall read performance. Read performance will also vary over time as application usage patterns change.
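A rough way to picture read amplification for a point query is to count how many sorted files have to be probed before the key is found. The sketch below makes two simplifying assumptions that are not from any real engine: a hit in the memtable costs no storage reads, and probing one sorted run costs exactly one storage read. The function name `point_lookup` is hypothetical.

```python
import bisect


def point_lookup(key, memtable, runs):
    """Return (value, storage_reads) for a point query.

    runs is a newest-first list of sorted (key, value) lists standing in for files.
    """
    if key in memtable:
        return memtable[key], 0   # served from memory, no storage reads
    reads = 0
    for run in runs:
        reads += 1   # assumption: one storage read per run probed
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return run[i][1], reads
    return None, reads   # a miss still pays for every run it checked


memtable = {"a": "1"}
runs = [[("b", "2")], [("c", "3")]]
hit_in_memory = point_lookup("a", memtable, runs)
hit_in_oldest_run = point_lookup("c", memtable, runs)
miss = point_lookup("z", memtable, runs)
```

In this model a key that lives only in the oldest run, or that does not exist at all, costs one read per run, which is why filters and compaction matter for read-heavy workloads.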
Space amplification is the ratio of the amount of storage or memory space consumed by the data to the actual size of the data. It is affected by the type and size of data written and updated by the application, whether compression is used, the compaction method, and the frequency of compaction.
Space amplification is affected by such factors as having a large amount of stale data that has not been garbage collected yet, experiencing a large number of inserts and updates, and the choice of compaction algorithm. Many other tuning options can affect space amplification. At the same time, teams can customize the way compression and compaction behave, or set the level depth and target size of each level, and tune when compaction occurs to help optimize data placement. All three of these amplification factors are also affected by the workload and data type, the memory and storage infrastructure, and the pattern of utilization by the application.
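The effect of stale data on space amplification can be shown with a small calculation. The sketch below treats each file as a sorted list of string key-value pairs and measures size as key length plus value length. Both choices are simplifying assumptions, and the function name is hypothetical. It compares everything that is stored against only the newest version of each key.

```python
def space_amplification(runs):
    """Stored bytes divided by live bytes.

    runs is a newest-first list of (key, value) lists with string keys and values.
    """
    stored = sum(len(k) + len(v) for run in runs for k, v in run)
    live = {}
    for run in reversed(runs):   # oldest first, so newer values overwrite older
        live.update(run)
    logical = sum(len(k) + len(v) for k, v in live.items())
    if logical == 0:
        raise ValueError("no live data")
    return stored / logical


# Key "a" was updated, so its old value still occupies space in the older run.
runs = [[("a", "99")], [("a", "11"), ("b", "22")]]
ratio = space_amplification(runs)
```

Here 9 bytes are stored for 6 bytes of live data, a ratio of 1.5, and the ratio falls back to 1.0 once compaction discards the stale value.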
Multi-dimensional tuning: Optimizing both writes and reads
In most cases, existing key-value store data structures can be tuned to be good enough for application write and read speeds, but they cannot deliver high performance for both operations. The issue can become critical when data sets get large. As metadata volumes continue to grow, they may dwarf the size of the data itself. Consequently, it doesn’t take too long before organizations reach a point where they start trading off between performance, capacity, and cost.
When performance issues arise, teams usually start by re-sharding the data. Sharding is one of those necessary evils that exacts a toll in developer time. As the number of data sets multiplies, developers must devote more time to partitioning data and distributing it among shards, instead of focusing on writing code.
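One reason re-sharding is costly can be seen with a simple placement scheme. The sketch below assumes keys are assigned to shards by hashing the key and taking the remainder modulo the shard count. That is only one possible scheme, used here for illustration, and the function names and key format are hypothetical. It measures how many keys land on a different shard when a fifth shard is added to four.

```python
import hashlib


def shard_for(key, num_shards):
    # Stable hash of the key, reduced modulo the number of shards.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards, new_shards):
    # Share of keys whose shard assignment changes after re-sharding.
    moved = sum(
        1 for key in keys
        if shard_for(key, old_shards) != shard_for(key, new_shards)
    )
    return moved / len(keys)


keys = [f"object:{i}" for i in range(1000)]
moved = fraction_moved(keys, 4, 5)
```

Under modulo placement most keys change shards when the shard count changes, so the data has to be physically redistributed. Schemes such as consistent hashing reduce that movement at the price of more complexity.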
In addition to sharding, teams often attempt database performance tuning. The good news is that fully-featured key-value stores such as RocksDB provide plenty of knobs and buttons for tuning—almost too many. The bad news is that tuning is an iterative and time-consuming process, and a fine art where skilled developers can struggle.
As noted earlier, write amplification is an important factor to tune. As the number of write operations grows, the write amplification factor (WAF) increases and I/O performance decreases, leading to degraded as well as unpredictable performance. And because data engines like RocksDB are the deepest or “lowest” part of the software stack, any I/O hang originating in this layer may trickle up the stack and cause huge delays. In the best of worlds, an application would have a write amplification factor that is as low as possible. A commonly found WAF of 30 will dramatically impact application performance compared to a more ideal WAF closer to 5.
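A back-of-the-envelope model shows why the gap between a WAF of 30 and a WAF of 5 matters. The model assumes that sustained device write bandwidth is the only bottleneck, which is a deliberate simplification, and the 500 MB/s figure is a made-up device number rather than a measurement. The function name is hypothetical.

```python
def app_write_budget(device_write_mb_s, waf):
    """Application-level write rate a device can sustain at a given WAF."""
    if waf <= 0:
        raise ValueError("waf must be positive")
    return device_write_mb_s / waf


device_write_mb_s = 500   # hypothetical sustained device write bandwidth
budget_at_waf_30 = app_write_budget(device_write_mb_s, 30)
budget_at_waf_5 = app_write_budget(device_write_mb_s, 5)
```

Under these assumptions the same device absorbs roughly 17 MB/s of application writes at a WAF of 30 and 100 MB/s at a WAF of 5, a sixfold difference before any other effect is considered.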
Of course, few applications exist in the best of worlds, and tuning amplification requires finesse, or the flexibility to perform iterative adjustments. Once tweaked, these instances may experience additional, significant performance issues if workloads or underlying systems change, prompting the need for further tuning, and perhaps an endless loop of retuning, consuming more developer time. Adding resources, while an answer, isn’t a long-term solution either.
Toward next-generation data engines
New data engines are emerging on the market that overcome some of these shortcomings in low-latency, data-intensive workloads that require significant scalability and performance, as is common with metadata. In a subsequent article, we will explore the technology behind Speedb, and its approach to adjusting the amplification factors above.
As the use of low-latency microservices architectures expands, the most important takeaway for developers is that options exist for optimizing metadata performance, by adjusting or replacing the data engine to remove previous performance and scale issues. These options not only require less direct developer intervention, but also better meet the demands of modern applications.
Hilik Yochai is chief science officer and co-founder of Speedb, the company behind the Speedb data engine, a drop-in replacement for RocksDB, and the Hive, Speedb’s open-source community where developers can interact, improve, and share knowledge and best practices on Speedb and RocksDB. Speedb’s technology helps developers evolve their hyperscale data operations with limitless scale and performance without compromising functionality, all while constantly striving to improve the usability and ease of use.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.
Copyright © 2023 IDG Communications, Inc.