At the recent SIGMOD, we presented our work on a delete-aware LSM-based system called Lethe. In this post, I would like to share a few details that did not make it into the talk, along with some thoughts based on the feedback we received during the conference.
First, this year is probably the first time that deletes were discussed as a key research challenge, in the context of database systems offering privacy through deletion. Specifically, Mike Stonebraker brought it up during the very interesting panel “The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?”. Together with our work on deletes for LSM-trees, this signals that deleting data is finally becoming a first-class citizen. Up to now, we mostly worried about how to keep all the data we collected or generated; today, we also have to worry about how to delete data efficiently when required (e.g., because of GDPR).
When studying the delete operator of modern data stores, we realized that virtually all deletes are out-of-place, even in the so-called in-place systems. For example, even when deleting from a B+ Tree or a heap file (which uses slotted pages), the most common approach is to invalidate the entry via a valid bit and leave the invalid data in place for future “cleaning”. To highlight this point, consider that practical implementations of B+ Trees in full-blown relational systems often do not even merge underflowing nodes after a delete, because they expect new inserts to grow the tree again in the future. This approach is even more pronounced in LSM-trees, where a delete is by design implemented via out-of-place invalidation. In other words, in systems up to now, deletes have been an afterthought: something we may have to do, but not something we really worry about optimizing (notable exceptions that optimize the latency of bulk deletes are “Efficient Bulk Deletes for Multi-Dimensional Clustered Tables in DB2”, “Efficient bulk deletes in relational databases”, and “Online Bulk Deletion”; however, they still do not address other aspects of deletes such as privacy and space amplification).

In addition to deletion efficiency and privacy, the now-ubiquitous use of the cloud for data management poses a space management challenge. If one owns a disk, invalidated data does not cause extra costs unless new disks have to be bought. On the other hand, if one rents storage from a cloud provider, keeping invalidated data around for long leads to unnecessary storage costs.
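To make this concrete, here is a minimal, hypothetical sketch (in Python, with made-up names) of a slotted-page-style structure where a delete only flips a valid bit; the record bytes physically remain on the page until some later cleaning pass:

```python
class SlottedPage:
    """Toy slotted page: a delete only flips the valid bit; the bytes remain."""

    def __init__(self):
        self.records = []   # raw record payloads, in insertion order
        self.valid = []     # one valid bit per slot

    def insert(self, payload: bytes) -> int:
        self.records.append(payload)
        self.valid.append(True)
        return len(self.records) - 1          # slot id

    def delete(self, slot: int) -> None:
        # "Logical" delete: the payload is still physically on the page.
        self.valid[slot] = False

    def scan(self):
        return [r for r, ok in zip(self.records, self.valid) if ok]


page = SlottedPage()
s = page.insert(b"alice|ssn=123-45-6789")
page.delete(s)
print(page.scan())       # []  -- logically gone
print(page.records[s])   # b'alice|ssn=...'  -- still recoverable until cleaned
```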
Moving back to out-of-place deletes, we observe that, essentially, the purging of the data (which we call persistent deletion) is left as a later task, to be carried out by a compaction, garbage collection, or defragmentation process, depending on the system. Leaving persistent deletion for later can significantly increase space amplification and may expose sensitive deleted data (which technically should no longer be accessible after its deletion), increasing the privacy risk. These risks and costs are even more pronounced in LSM systems, where the garbage collection process is part of the original design.
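The same flow in an LSM-style store: the logical delete is just another insert (a tombstone), and only a later compaction performs the persistent deletion by physically dropping the deleted key and its older versions. Below is a minimal toy sketch of that idea in Python, not the code of any actual engine:

```python
TOMBSTONE = object()                     # sentinel marking a deleted key

memtable = {}                            # newest data, in memory
runs = [{"k1": "v1", "k2": "v2"}]        # older, immutable on-"disk" run

def delete(key):
    memtable[key] = TOMBSTONE            # logical delete: just an insert

def get(key):
    for level in [memtable] + runs:      # search newest to oldest
        if key in level:
            v = level[key]
            return None if v is TOMBSTONE else v
    return None

def compact(newer, older, is_last_level=True):
    """Persistent deletion: merge runs; drop tombstones and the data they shadow."""
    merged = dict(older)
    merged.update(newer)                 # newer entries win
    if is_last_level:                    # no older run can still hold the key
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return merged

delete("k1")
print(get("k1"))                          # None: logically deleted, still on "disk"
runs = [compact(memtable, runs[0])]       # compaction purges k1 physically
memtable = {}
print(runs)                               # [{'k2': 'v2'}]
```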
Diving into more detail about deletes in LSM-based data stores (the LSM-tree was originally introduced by O'Neil et al. in 1996, popularized by Google's Bigtable and LevelDB as well as Facebook's RocksDB – a LevelDB fork – and has since been adopted by a variety of systems; a great survey is available by Luo and Carey), we first realized that an LSM-tree can receive four different types of deletes (illustrated in the sketch after the list):
- a point delete on the sort key of the LSM tree, which we call a primary point delete
- a range delete on the sort key, or a primary range delete
- a point delete on another attribute that is part of the value (or sometimes is the whole value), which we call a secondary point delete
- a range delete on this other secondary attribute, or a secondary range delete
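To make the four cases concrete, here is a small, hypothetical example (the schema and attribute names are illustrative): the LSM sort key is a user id, and the value carries a secondary attribute, a timestamp, which deletes may also target.

```python
# A toy key-value table sorted on the LSM sort key (user id); the value
# carries a secondary attribute (timestamp) that deletes may also target.
entries = {
    "user:17": {"timestamp": 1001, "name": "alice"},
    "user:42": {"timestamp": 1005, "name": "bob"},
    "user:99": {"timestamp": 1010, "name": "carol"},
}

# 1. Primary point delete: key == "user:42"
primary_point = {k: v for k, v in entries.items() if k != "user:42"}

# 2. Primary range delete: "user:17" <= key <= "user:42"
primary_range = {k: v for k, v in entries.items()
                 if not ("user:17" <= k <= "user:42")}

# 3. Secondary point delete: timestamp == 1005 (the attribute lives in the value)
secondary_point = {k: v for k, v in entries.items()
                   if v["timestamp"] != 1005}

# 4. Secondary range delete: timestamp < 1006, e.g., "drop everything older
#    than X" -- a common retention-style delete
secondary_range = {k: v for k, v in entries.items()
                   if not (v["timestamp"] < 1006)}
```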
When a workload has a non-negligible amount of deletes, the room for improvement is actually rather large. Note that any out-of-place system already performs the (logical) delete very quickly, simply by inserting an invalidating entry (also called a tombstone), as also discussed in a recent post by Mark Callaghan, which raised some important questions about LSM read performance in the presence of tombstones. So, our focus in Lethe is on the following challenges:
- what is the cost in terms of latency or write amplification of persisting a logical primary delete? how can we minimize that?
- what is the benefit in terms of space amplification if we persist the primary deletes faster?
- how should we physically organize the data in the presence of secondary range deletes?
- can we guarantee persistent primary or secondary deletion within a given deadline? (a toy sketch of this idea follows the list)
- how is read performance impacted by addressing these challenges?
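As one illustration of the deadline question, here is a toy sketch of how a compaction policy could prioritize the file whose oldest tombstone is closest to violating a user-specified persistence deadline. This is only meant to convey the idea; the file metadata, thresholds, and selection rule are assumptions made for the example, not Lethe's actual policy.

```python
import time

# Toy per-file metadata: insertion time of the oldest tombstone (if any) and size.
files = [
    {"name": "L1-f0", "oldest_tombstone_ts": time.time() - 3000, "bytes": 64 << 20},
    {"name": "L1-f1", "oldest_tombstone_ts": None,               "bytes": 64 << 20},
    {"name": "L1-f2", "oldest_tombstone_ts": time.time() - 9000, "bytes": 64 << 20},
]

PERSISTENCE_DEADLINE_SECS = 2 * 3600   # user-facing promise: deletes persist within 2h

def pick_file_to_compact(files, now=None):
    """Prefer the file whose oldest tombstone is closest to violating the deadline;
    fall back to the cheapest compaction if no file holds tombstones."""
    now = now or time.time()
    with_tombstones = [f for f in files if f["oldest_tombstone_ts"] is not None]
    if not with_tombstones:
        return min(files, key=lambda f: f["bytes"])
    return min(with_tombstones,
               key=lambda f: PERSISTENCE_DEADLINE_SECS - (now - f["oldest_tombstone_ts"]))

print(pick_file_to_compact(files)["name"])   # "L1-f2": its tombstone is the oldest
```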
The above questions are addressed in Lethe; however, there are several open challenges. We are now considering some additional ones, including:
- Secondary point deletes in LSM-trees. This seems to be a much harder problem because, unless we maintain a secondary index, there does not seem to be a way to avoid reading the whole tree.
- Read queries in the presence of primary range deletes. After looking at the internals of systems like RocksDB, we see that range tombstones significantly impact the read path of an LSM-tree, leading to (potentially) avoidable I/Os from storage simply to verify that an entry has been deleted (see the sketch below).
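As a rough illustration of why range tombstones burden reads, consider a point lookup that, while scanning runs from newest to oldest, must track whether any range tombstone covers the key before it can trust a value found in an older run; in a real engine this can translate into extra filter/index probes or storage I/O. A minimal, hypothetical sketch (not RocksDB code):

```python
# Each run holds point entries plus range tombstones (start, end, seqno)
# that delete every key in [start, end).
runs = [
    {   # newest run: contains a range tombstone but not the key itself
        "points": {},
        "range_tombstones": [("user:20", "user:60", 7)],
    },
    {   # older run: still physically stores the (deleted) key
        "points": {"user:42": ("v_old", 3)},
        "range_tombstones": [],
    },
]

def get(key):
    covering_seq = -1           # newest range tombstone covering `key` so far
    for run in runs:            # newest to oldest
        for start, end, seq in run["range_tombstones"]:
            if start <= key < end:
                covering_seq = max(covering_seq, seq)
        if key in run["points"]:
            value, seq = run["points"][key]
            # Even though we found the key, a newer range tombstone wins.
            return None if seq < covering_seq else value
    return None

print(get("user:42"))   # None: we had to read the older run only to discard it
```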
Stay tuned for more work on efficient and persistent deletion!
=======
Edit: added a reference to a particularly interesting recent post from Mark Callaghan's blog.