DiSC at SIGMOD 2024

Last week at #SIGMOD2024, #DiSC participated with two research papers in the main conference, two workshop papers at #DBTest, and one invited talk at the 20th edition of #DAMON.

Papon presented our (joint with Taishan Chen and Shuo Zhang) work on “CAVE: Concurrency-Aware Graph Processing on SSDs” that proposes new variations of popular graph traversal algorithms that are designed to better exploit the parallelism that modern SSDs can natively support. We apply our techniques to single-node out-of-core systems and show up to two orders of magnitude performance improvement over the state of the art. The core idea is to identify subgraphs that can be traversed in parallel (for DFS-style traversal) or “fronts of nodes” that can all be accessed in parallel (for BFS-style traversal). Throughout this work, it is crucial to carefully tune our system to use the parallelism of the device without overburdening it with too many concurrent accesses, as we have shown previously.
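
To make the "fronts of nodes" idea concrete, here is a minimal sketch of a frontier-parallel BFS that issues the storage reads of the current frontier concurrently, with a cap on in-flight requests. This is only an illustration of the general idea, not CAVE's actual implementation; fetch_adjacency and MAX_CONCURRENT_IO are hypothetical placeholders for the out-of-core read path and the per-device concurrency budget.

    # Minimal sketch of a frontier-parallel BFS (illustrative only, not CAVE itself).
    # fetch_adjacency() stands in for an out-of-core read of a node's adjacency list;
    # MAX_CONCURRENT_IO models how much concurrency the SSD can absorb without being overloaded.
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_IO = 16  # tune to the device's useful level of parallelism

    def fetch_adjacency(node, graph_on_ssd):
        # Placeholder: in an out-of-core system this would issue an SSD read.
        return graph_on_ssd[node]

    def parallel_bfs(source, graph_on_ssd):
        visited = {source}
        frontier = [source]
        with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_IO) as pool:
            while frontier:
                # Fetch the adjacency lists of the entire frontier concurrently.
                adjacency_lists = pool.map(lambda n: fetch_adjacency(n, graph_on_ssd), frontier)
                next_frontier = []
                for neighbors in adjacency_lists:
                    for v in neighbors:
                        if v not in visited:
                            visited.add(v)
                            next_frontier.append(v)
                frontier = next_frontier
        return visited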

Zichen presented our (joint with Xiao Hu) work on “NOCAP: Near-Optimal Correlation-Aware Partitioning Joins” that proposes a new disk-based row-oriented join algorithm that uses the join correlation to calculate the optimal partitioning strategy to minimize the I/O cost (taking also into account the potential asymmetry between reads and writes on the underlying SSD-based storage). The core idea of this work is to partition tuples using the join correlation so as to group together rows with a similar number of join matches. We prove that this is the optimal strategy, and we further show that it outperforms the state of the art (Dynamic Hybrid Hash Join) for any memory budget! Finally, we develop a practical algorithm that can maintain an arbitrarily small memory budget while still outperforming the state of the art. The paper has more easter eggs, like a simple skew-aware hash-based join partitioning strategy that we term “Rounded Hashing”, which avoids repartitioning and offers better join performance when skew creates partitions larger than the available buffer space.
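
As a rough illustration of the core partitioning idea (and not the actual NOCAP algorithm or its provably optimal strategy), the sketch below groups build-side keys by their number of probe-side matches, so that tuples with a similar number of matches end up in the same partition; the power-of-two bucketing rule is a made-up placeholder.

    # Illustrative sketch: partition build-side keys by join correlation,
    # i.e., by how many matches each key has on the probe side.
    # The power-of-two bucketing is a placeholder, not NOCAP's optimal strategy.
    from collections import Counter, defaultdict

    def partition_by_match_count(build_keys, probe_keys):
        matches = Counter(probe_keys)            # number of matches per join key
        partitions = defaultdict(list)
        for key in build_keys:
            bucket = matches.get(key, 0).bit_length()  # similar match counts share a bucket
            partitions[bucket].append(key)
        return partitions

    # Keys with 0, 1, 2-3, 4-7, ... matches land in separate partitions.
    parts = partition_by_match_count(build_keys=[1, 2, 3, 4],
                                     probe_keys=[2, 2, 3, 3, 3, 3, 4])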

Zichen also presented our (joint with Arpita and Subhadeep) work on “KVBench: A Key-Value Benchmarking Suite” at #DBTest, a novel synthetic workload generator that fills the gap between classical key-value workload generators and more complex real-life workloads. KVBench supports a wide range of operations, including point queries, range queries, inserts, updates, deletes, and range deletes; among these, inserts, queries, and updates can be customized with different distributions. In this paper and its artifacts, we share the benchmarking tool that we have been using in a series of papers on key-value stores.
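
For a flavor of what such a generator produces (a toy sketch, not KVBench's actual interface or implementation), one can mix operation types according to configurable proportions and draw keys from a configurable, possibly skewed, distribution:

    # Toy key-value workload generator (illustrative; not KVBench's actual API).
    import random

    def generate_workload(n_ops, mix, key_space=100_000, skew=1.3):
        # mix maps operation types to their proportion in the workload.
        ops, weights = zip(*mix.items())
        workload, next_key = [], 0
        for _ in range(n_ops):
            op = random.choices(ops, weights=weights)[0]
            if op == "insert":
                key, next_key = next_key, next_key + 1   # new keys are appended
            else:
                # Skewed key choice; swap in a uniform draw if desired.
                key = min(int(random.paretovariate(skew)), key_space - 1)
            workload.append((op, key))
        return workload

    workload = generate_workload(
        1000, {"insert": 0.5, "point_query": 0.2, "range_query": 0.1,
               "update": 0.1, "delete": 0.05, "range_delete": 0.05})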

Aneesh presented our (joint with Andy and Jinqi) work on “Benchmarking Learned and LSM Indexes on Data Sortedness” at #DBTest, where we experiment with two update-friendly learned indexes (ALEX and LIPP) and a classical write-optimized index (the LSM-tree) under different levels of data pre-sortedness during ingestion. For the last few years, we have been studying the impact of data sortedness on indexes (e.g., “Indexing for Near-Sorted Data”), and in this work, we treated the three benchmarked approaches as black-box systems and wanted to figure out how well they can adapt to sortedness. Needless to say, there is room for improvement in all systems, and a surprising new result: when comparing two approaches at different levels of sortedness, the outcome of the comparison may vary dramatically. For example, LIPP can be anywhere between 4x faster and 1.9x slower than ALEX.
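
One simple way to produce ingestion streams with varying levels of pre-sortedness (a toy sketch; the paper uses its own sortedness model and generator) is to start from a fully sorted sequence and displace a chosen fraction of entries by a bounded distance:

    # Toy near-sorted data generator (illustrative only).
    import random

    def near_sorted(n, out_of_order_frac=0.1, max_displacement=100):
        data = list(range(n))                    # fully sorted baseline
        k = int(n * out_of_order_frac)           # how many entries end up out of order
        for i in random.sample(range(n), k):
            j = min(n - 1, max(0, i + random.randint(-max_displacement, max_displacement)))
            data[i], data[j] = data[j], data[i]  # displace by a bounded distance
        return data

    stream = near_sorted(1_000_000, out_of_order_frac=0.05, max_displacement=1000)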

Lastly, during #DAMON, I presented a fresh thinking talk with the title “Effortless Locality Through On-the-fly Data Transformation”. The talk summarized our new exciting research project on software/hardware co-design, where we build an on-the-fly data transformation engine that sits between the CPU and the memory to ensure that only the ideal data layout (and data elements) are propagated through the cache hierarchy to the CPU. For relational data systems, this allows for the ideal hybrid layout to be available for each query, grossly simplifying the data system software complexity while offering ideal data locality at all times. For other data-intensive applications (e.g., managing tensors), this ensures effortless locality on selective data accesses (accessing slices or fibers of tensors). This talk is based on a research project we started with Renato Mancuso that has led to a series of papers over the past few years: “Relational Memory: Native In-Memory Accesses on Rows and Columns“, “Hardware Data Re-organization Engine for Real-Time Systems“, “Transparent Data Transformation“, “Relational Fabric: Transparent Data Transformation“, “Software-Shaped Platforms“, and “On-the-fly Data Transformation in Action“.

Interview by The Brink

A few weeks ago, I was interviewed by Jessica Colarossi for The Brink, a BU publication that delivers the latest news from Boston University research. The full piece can be found here. It covers the research we conduct in the DiSC lab, as well as the research of one of my colleagues, Masha Kamenetska, who holds a joint appointment in the departments of Chemistry and Physics; both of us were recently funded by NSF CAREER awards.

In the process of this interview, we had an interesting exchange that I would like to share. The full exchange follows:

Can you briefly explain the work you’re doing with the [NSF CAREER] award funding? 

This NSF CAREER grant will have a transformative effect on the way we build and tune data systems. We will develop a new breed of data systems that can offer near-optimal performance despite potential uncertainty in the execution setting and increase performance predictability. We now use public (and private) cloud for most of our computing. As a result, our applications face interference from other applications, leading to various forms of workload and resource unpredictability. This proposal, at its core, aims to build data systems that can address this unpredictability. 

In addition to research, this proposal will help the Data-intensive Systems and Computing Lab at the CS department (https://disc.bu.edu), which I founded in 2019, to establish a strong educational and outreach impact. The funds already support a number of graduate students, the development of new CS classes and educational material, and targeted internship programs. 

How are log-structured merge (LSM) based data systems used currently, and what problems will you and your team be addressing? 

In this project, we work on LSM-tree-based key-value stores, a class of data systems used by a wide variety of SQL and NoSQL applications supporting social media platforms, data-intensive applications like accumulating sensor or e-commerce data, and more. A very influential LSM-based system (but hardly the only one) is RocksDB, an open-source key-value store currently developed by Meta and used as part of its infrastructure. Note that several data-intensive workflows and systems outside Meta also use RocksDB. LSM-tree stands for Log-Structured Merge tree, and it offers a tunable balance between optimizing for fast reads and fast writes (updating existing or storing new data). When deploying LSM-based systems in the wild, we use information about the application to tune them better and offer the best possible performance. However, as data-intensive applications are increasingly being deployed in private and public clouds, our assumptions about the application properties (e.g., workload characteristics, access patterns, read/write ratios) and the available resources (e.g., amount of memory, quality of service of the available hardware) come with a degree of uncertainty. Because of that uncertainty, it is becoming harder to tune LSM-based data systems. In this project, we capture and model uncertainty and innovate in the tuning process to offer optimal (or close-to-optimal) performance even when the workload a system faces departs from the one initially expected. This will require a concerted effort to fuse expertise and techniques from computer and data systems, algorithmic data mining and machine learning, and low-level hardware understanding. I am fortunate to be part of a department that offers a fertile environment to cultivate those connections.

Why do you think it’s important to upgrade and optimize these data systems? 

The new breed of data systems we envision will require less human intervention. As a result, they will be more appealing to users and administrators, allowing organizations to widely deploy reliable systems to accelerate and support data-intensive scientific discovery and applications. Such ease of deployment is timely as data management is increasingly becoming an automatic service, where human experts cannot always be in the loop to hand-tune and optimize data systems.

How did you become interested in data systems and computer sciences? 

My interest in computer science dates back to when I was a grade school student. As far back as I remember, I was fascinated by computers and the creativity that came with them. When I wrote my first program in 5th grade, I felt I had just created "something out of nothing". This feeling followed me throughout college when I was really interested in building computer systems. During college, I was also drawn to data management because its principles were built on logic and helped me make sense of the world around me. Those two topics - data management and computer systems - are the foundations of the research we conduct in the lab. We want to build technological tools to manage and make sense of data - and thus the world around us.

By the end of the five-year grant, what do you hope to accomplish?

The main scientific goal of this grant is to develop the data systems tools needed to address uncertainty in the workload and the available resources. Our goal is to push the state of the art of data systems by making our findings and contributions part of practical systems. To that end, we already apply our ideas to practical systems, and in addition to our scientific publications, we make all our artifacts (code, data, scripts) public. Research, however, is only one side of it. Every grant, especially an NSF CAREER grant, also focuses on education and outreach. The way I prefer to see it is to focus on people. One of my main goals is to help develop a new generation of data systems experts, both at the Ph.D. level and the undergraduate level, who will enjoy the process of building systems (and learning in general), and they will also acquire technical and societal skills that will help them both to live a fulfilling life and tackle significant real-life problems down the line. This is what I see as the most important metric of success.

A few thoughts on small sample statistics for systems research

When doing systems research, a large chunk of making our case about the benefits of a new system or a new feature is the presentation of our experimental results.

In turn, in order to create our graphs we run experiments (e.g., based on benchmarks) and we present various metrics (e.g., latency, throughput). In some cases we face the dilemma of how many times we should run an experiment to have an authoritative average of the metric we want to report.

To consider this question more carefully, we have to think about how each point of our graph is generated. For example, consider a plot that presents the latency of 4 different (let's say TPC-H) queries. How many times should we run each query before we present numbers (e.g., the average latency)? What is the standard deviation of the runs we actually did? Is 5% a good number? Should we show the average or the median?

There is no one-size-fits-all answer for these questions. First, we should go back to some probability theory and statistics.

The act of running the experiment multiple times to find the "correct" value of the metric implies that what we measure behaves like a random variable, which means that we also make a (potentially silent) assumption that this random variable follows a specific distribution (e.g., normal). Based on such an assumption, we can now start answering the above questions. Note that even if a random variable does not follow the normal distribution, most of the basic concepts hold as long as the distribution has a single peak and is symmetric (in other words, it looks like a normal distribution). An extra benefit of a symmetric distribution with a single peak is that the average and the median coincide.

From the central limit theorem (and the law of large numbers), we know that if we "run our experiment" enough times, the average of those runs converges to the true mean and behaves approximately normally. How many runs are enough? A good rule of thumb is more than 30. In fact, if we have 30 sample values of our random variable (that follows a normal distribution), the real average, say μ, lies with probability 99% in the interval (X-0.5*S, X+0.5*S), where X is the experimental average and S is the experimental standard deviation. If we are ok with relaxing the probability to 95%, the interval becomes narrower: (X-0.37*S, X+0.37*S). Of course, the usefulness of this confidence interval depends on the relative values of X and S (which depend on the nature of the observed metric).

The discussion above assumes 30 repetitions of the experiment. What if each repetition is expensive (in terms of time, or even monetary cost)? How do these confidence intervals change with a smaller number of repetitions? To answer this question, we need to shift our attention to small-sample statistics. A good primer can be found at https://saylordotorg.github.io/text_introductory-statistics/s11-02-small-sample-estimation-of-a-p.html (published by Saylor Academy, 2012). The core result is that given a number of repetitions (i.e., a sample size) N, the average μ of the random variable in question lies with probability P in the interval (X-T(P,N)*S/sqrt(N), X+T(P,N)*S/sqrt(N)), where X is the experimental average, S is the experimental standard deviation, and T(P,N) is the critical value of the Student's t-distribution, which depends on the target probability P with which we want μ to be in the above interval and on the sample size N (i.e., the number of times we run our experiment).
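
To make this concrete, here is a minimal sketch (assuming scipy is available) that computes such an interval from a set of measured latencies, using the Student's t-distribution with N-1 degrees of freedom, as in the primer above:

    # Small-sample confidence interval for the mean of repeated measurements.
    import math
    from scipy import stats

    def confidence_interval(samples, confidence=0.95):
        n = len(samples)
        x = sum(samples) / n                                         # experimental average X
        s = math.sqrt(sum((v - x) ** 2 for v in samples) / (n - 1))  # experimental std dev S
        t = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)          # T(P, N)
        half_width = t * s / math.sqrt(n)
        return x - half_width, x + half_width

    # e.g., five repetitions of a query, latencies in ms
    low, high = confidence_interval([102.3, 98.7, 101.1, 99.4, 100.8], confidence=0.95)

For the five example latencies in the snippet, this yields roughly 100.5 ± 1.8 ms at 95% confidence.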

Now, assume that to run an experiment we need to spend a warmup budget W and an execution cost C per execution, so the total benchmarking budget is W+N*C. The benefit of increasing N is a more accurate estimate of the actual average μ; the downside is a higher overall benchmarking cost. To put some numbers on this, consider that we run the experiment N=2, N=5, or N=10 times (with N=1 we cannot even estimate S, so no interval can be formed) and that we want 95% confidence that μ lies in the confidence interval. In these cases, the confidence intervals are:

  • For N=2: (X-8.985*S, X+8.985*S)
  • For N=5: (X-1.242*S, X+1.242*S)
  • For N=10: (X-0.715*S, X+0.715*S)

The intervals with 99% confidence would be:

  • For N=2: (X-45.013*S, X+45.013*S)
  • For N=5: (X-2.059*S, X+2.059*S)
  • For N=10: (X-1.028*S, X+1.028*S)
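
These multipliers are simply T(P,N)/sqrt(N); a short snippet to reproduce (or extend) the lists above, again assuming scipy:

    # Interval half-widths as multiples of S, for a few small sample sizes.
    import math
    from scipy import stats

    def half_width_factor(n, confidence):
        # T(P, N) / sqrt(N), with N-1 degrees of freedom for N repetitions.
        return stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1) / math.sqrt(n)

    for n in (2, 5, 10, 30):
        print(n, round(half_width_factor(n, 0.95), 3), round(half_width_factor(n, 0.99), 3))

The N=30 row reproduces the 0.37*S and 0.5*S factors mentioned earlier.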

Note that in several practical use cases a 95% confidence level is considered adequate; however, the usefulness of the intervals also depends on the measured values of X and S.

In this brief note, we discussed how to quantify the impact of reducing the number of experiment repetitions on the statistics we report. Another relevant question, left for a future discussion, is how to sample from a large population (say, the latencies of short transactions) so that we do not have to pay the cost of storing/updating the metric too often.

DBrainstorming: a new column at ACM SIGMOD Record

As 2021 is about to end, I am reflecting on the first year of the nascent ACM SIGMOD Record "DBrainstorming" column. Earlier in 2021, Rada Chirkova, the SIGMOD Record Editor-in-Chief who has assembled an excellent "cast" of associate editors, reached out about creating a new column on hot topics (all sorts of thought-provoking ideas from the DB community). After a short debate on the name of the new column, we converged on "DBrainstorming" and started to solicit the first contributions. In 2021, SIGMOD Record hosted three instances of the column, in the June (on software/hardware co-design), September (on tuning using NLP), and December (on video analytics) issues.

The column aims to host more contributions with thought-provoking ideas, exciting new research directions, or radically new approaches we should be taking in the DB community. While the first three columns focus on new research directions, we also welcome contributions on teaching, industry topics, research methodology, reproducibility, and other topics of DB interest.

Stay tuned!

A new proposal for the SIGMOD reviewing process

In the data management community, we have been lucky to have motivated and conscientious leadership over the years which has led to significant improvements in the review process of our conferences, like SIGMOD, VLDB, ICDE, and EDBT.

One of the changes a few years back was the introduction of the feedback phase, during which the authors have the opportunity to see the reviews without the scores and provide comments on the main points raised in the reviews. While this is frequently a very useful resource for the review process, it can also become a bottleneck in some cases, especially if the reviewers have clearly decided to accept or reject a paper. In such cases, the feedback phase prolongs the review period by 1-2 months and requires the authors to spend a week writing a short document (à la "revision document" without the revision) that does not really impact the review process. And while most reviewers are conscientious and diligently read the feedback, in some cases the feedback is not even read.

So, should we get rid of the feedback phase altogether?

No, I agree with the previous leadership of SIGMOD that the feedback is a valuable resource. However, I have a series of small amendments to propose to make it (even) more useful.

First, ask the reviewers to try to agree on a decision as early as possible, and term those decisions "Early Accept", "Early Reject", and "Early Revision". The first attempt in that direction happened in the reviewing process of ICDE 2022 with "Early Accept" and "Early Reject". The rationale is that it is ideal to allow the process (and the researchers) to focus on the important things (doing research) as soon as possible. If one of those three decisions is made, no feedback would be required.

Second, if the reviewers cannot finalize their decision, then they can ask specific questions about the submitted paper (for example, up to 7 such questions), to which the authors have to provide concise, timely responses (e.g., 500 words per question) in order to help the reviewers reach a conclusion, and do so quickly. The rationale is that while generic feedback about, potentially, all comments might not always be important or even needed in the review process, targeted feedback that answers questions posed by the reviewers will significantly accelerate and increase the quality of the review process.

Overall, this proposal aims to (i) help have a faster turnaround in the review process, (ii) reduce review and submission fatigue, and (iii) help the authors focus on doing research.

Predeclaration Locking (aka Conservative 2PL)

A few days ago, I was discussing in class the concept of deadlocks as part of a transaction management lecture. I presented a scenario in which two very simple transactions caused a deadlock (T1 locks A, T2 locks B, and then T1 attempts to lock B while T2 attempts to lock A), and I asked the class whether they had any ideas on how to avoid such a conundrum.

Immediately one of my students came up with the idea of acquiring all locks at the beginning of the transaction. The idea, of course, works and ensures that we have no deadlocks. In fact, it is called Conservative Two-Phase Locking (or C2PL), and it was earlier termed Predeclaration Locking in the 1985 ACM TODS paper on "Locking performance in centralized databases" (thanks, Dennis, for pointing me in the right direction!).

However, the research papers that mention either term were authored in 2000 or earlier. So, why has recent research not been considering C2PL?

After constructing a few examples, the answer quickly became clear.

First, let's clarify that C2PL indeed always avoids deadlocks. Since all locks are acquired at the beginning of each transaction, it is not possible to be in a situation where a transaction has acquired a lock and waits for another one; it either has all its locks or none. Nevertheless, we do have to pay attention to the implementation. If each lock acquisition is a separate blocking step, we can still have a deadlock, so we must release all acquired locks whenever we cannot obtain the full set within a small predetermined timeout.
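
A minimal sketch of such an all-or-nothing acquisition (illustrative only, using plain mutexes): try to grab every predeclared lock without blocking and, if any of them is unavailable, release everything and retry after a short back-off, so a transaction never holds some locks while waiting for others.

    # Sketch of C2PL-style predeclared lock acquisition: grab all locks or none.
    import random
    import threading
    import time

    def acquire_all_or_nothing(locks, retry_sleep=0.001):
        # locks: the predeclared lock set; nothing is held while we wait.
        while True:
            acquired = []
            for lock in locks:
                if lock.acquire(blocking=False):     # non-blocking attempt
                    acquired.append(lock)
                else:
                    for held in acquired:            # back off: release everything
                        held.release()
                    break
            else:
                return                               # got them all
            time.sleep(retry_sleep * random.random())  # jitter before retrying

    lock_a, lock_b = threading.Lock(), threading.Lock()
    acquire_all_or_nothing([lock_a, lock_b])
    # ... run the transaction ...
    lock_a.release()
    lock_b.release()

Acquiring the predeclared locks in a globally fixed order would be another way to avoid deadlock during this acquisition step.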

With deadlock avoidance out of the way, we now discuss the drawbacks of C2PL. The most important one is that C2PL requires knowledge of all the locks a transaction might need. While this might seem possible for simple transactions that change the values of a few objects, it quickly becomes a hard task once a transaction has even the simplest if-statement. Consider the following three examples:

Transaction T1:
    Read(A)
    If (A > 10)
        Read(B); B := B + 5; Write(B)
    EndIf
    If (A <= 5)
        Read(B); B := B + 12; Write(B)
    EndIf
    Commit

Transaction T2:
    Read(A)
    If (A > 10)
        Read(B); B := B + 5; Write(B)
    EndIf
    If (A < 5)
        Read(B); B := B + 12; Write(B)
    EndIf
    Commit

Transaction T3:
    Read(A)
    If (A > 10)
        Read(B); B := B + 5; Write(B)
    EndIf
    If (A < 5)
        Read(C); C := C + 12; Write(C)
    EndIf
    Commit

In the first case, simply by observing the code, we know that the transaction can only read A and write B, so we will need, at most, a shared lock on A and an exclusive lock on B. We can acquire both at the beginning of the transaction and avoid deadlocks.

In the second case, we see that we will need a shared lock on A and, only if A>10 or A<5, an exclusive lock on B. In other words, we would not need an exclusive lock on B if 5 ≤ A ≤ 10. Getting a lock on both A and B might prove overly conservative, and it would reduce opportunities for concurrency when 5 ≤ A ≤ 10.

For the third case, we need a shared lock on A and, depending on the value of A, we might need an exclusive lock on B (if A>10), an exclusive lock on C (if A<5), or no other lock at all (if 5 ≤ A ≤ 10). Getting locks on all three of A, B, and C restricts concurrency even further.

Overall, the main problem of C2PL is that it has to make a decision at the very beginning, before it knows the outcome of each if-statement. As a result, the only safe decision it can make is to always lock every object the transaction might possibly touch. For T1, acquiring both S(A) and X(B) is fine; however, in the case of T2, acquiring X(B) might prove unnecessary depending on the value of A. Grabbing X(B) does not affect correctness, but it blocks other transactions that want to access B. Similarly, T3 has to grab S(A), X(B), and X(C). We know that one of the two exclusive locks is not needed, but we do not know which one before we read A (and hence before we acquire S(A)). To sum up, C2PL sacrifices too much concurrency to avoid deadlocks.

So, one might think that we can increase concurrency by releasing the locks as early as possible and, as a result, gradually. While this will somewhat increase concurrency, it opens new cans of worms: cascading aborts or even non-serializable transaction executions.

Consider a transaction T4 that acquired all its locks at the beginning and, after finishing updating object A, immediately releases the lock on A. If T4 later aborts and another transaction T5 has read the updated value of A, T5 would have read a value that was never committed and hence should be treated as if it never existed. To avoid this inconsistency, T5 has to be aborted as well, leading to a cascading abort. If T5 has already committed, things are even worse: T5 committed using data that should never have been used, and the result of the execution is not equivalent to any serial schedule. Note that this schedule would also be non-recoverable; a recoverable schedule is one in which transactions commit only after all transactions whose changes they read have committed.

If we focus only on the cascading abort problem, things can also get more complicated. In order to abort, T4 has to reacquire the lock of the previously unlocked object A to change it back to its older value. By attempting to reacquire the lock, we now face the possibility of having a deadlock, which was the very problem we wanted to avoid in the first place.

To avoid all these problems, we can release all the locks at the end, resulting in Strict Conservative 2PL. This has no deadlocks, no cascading aborts, and the allowed schedules are always serializable and recoverable, but it significantly reduces concurrency.

Overall, there is no perfect solution.

Simple Two-Phase Locking (gradual lock acquisition and gradual lock release) ensures conflict serializability and guarantees isolation if no transactions fail (i.e., if there are no aborts). However, it can lead to deadlocks and uncommitted reads (which may lead to cascading aborts, unrecoverable schedules, and even non-serializable executions).

Strict 2PL (gradual lock acquisition and late lock release at the very end) avoids uncommitted reads at the expense of reducing concurrency but does not avoid deadlocks.

Conservative 2PL (early lock acquisition of all locks at the beginning and gradual lock release) avoids deadlocks at the expense of reducing concurrency but does not avoid uncommitted reads (that may lead to cascading aborts and unrecoverable schedules). 

Strict Conservative 2PL (early lock acquisition of all locks at the beginning and late lock release at the very end) avoids both deadlocks and uncommitted reads but limits concurrency even further. 

Putting everything together, Conservative 2PL (Strict or not) would often acquire more locks than needed (when the transaction has even the simplest if-statements), which is probably why we haven't seen much of it in the last twenty years!

PS: Thank you, Dennis and George, for discussing in detail the guarantees of the variants of 2PL!

Virtual lab-visits at the time of the pandemic

One of the biggest challenges during the pandemic has been to continue the community building that typically happens through in-person interactions at conferences, invited talks, workshops, lab visits, etc., and that fuels research progress.

Most conferences quickly transformed into online-only events, which entailed a tremendous technical and organizational effort, and I believe everyone is thankful to the leadership of all our conferences (e.g., SIGMOD, VLDB, EDBT, ICDE, KDD) for their tireless work. While I refer here to data management conferences, I am confident that the same sentiment is shared in other communities.

In the online-only version of the conferences, however, there was one aspect that was left behind. Any research community is indeed called a community because its members know each other, and once a young researcher has started becoming a member of this community, she or he can seek advice or help from the human network that she or he has started cultivating. However, this community building is very hard to achieve without direct (ideally in-person) communication. At conferences, a seasoned advisor typically serves as the link between young students and other researchers who might help with feedback on a project or an interesting discussion, which may lead to a collaboration or even a lifelong friendship. How are young graduate students expected to start cultivating these types of relationships in the pandemic era?

While I certainly do not have a final answer in mind, I have a proposal that I have been experimenting with since Fall 2020. It is really a simple idea. For as long as traveling is virtually impossible, I try to host 3-4 virtual visitors in my lab per semester and enable direct interaction between each visitor and all the lab members. I have been experimenting with a combination of Zoom and Gather (which was also used by SIGMOD and VLDB, among other conferences). The latter not for its (admittedly not superior) video quality, but for its casual look-and-feel, which allows both my students and our visitors to relax and connect on a personal level. Typically, our visitors give a talk and spend 1-2+ hours at a virtual round table with me and the whole lab.

One could ask, do we really need such a virtual round table? Why not rely on conference talks and Q & A? I think that most young students, who only know people as authors of papers and textbooks, do not really know where to start building their network without a physical presence (e.g., at a conference). So, it is crucial that we create this platform! Essentially, I propose that each advisor invite other researchers and organize a few such virtual lab-visits per semester to ensure that our community continues to foster interaction between its existing and younger members despite the current circumstances.

Another question one could ask is, isn't this really just a seminar talk? Well, my answer is yes and no. Giving talks virtually is an amazing opportunity to present and advertise one's research, and almost all data management labs I know already host virtual seminars. However, I am arguing for something more than virtual talks. I want to highlight that we -- as a community -- should have virtual lab-visits, which would primarily focus on fostering direct personal relationships in addition to presenting recent results! We can also take this a step further and have virtual cross-lab visits, with visits back and forth between two labs, which can end with a large round-table discussion between all members of both labs.

 

Let’s talk about deletes!

At the recent SIGMOD, we presented our work on a delete-aware LSM-based system called Lethe. In this post, I would like to share a few details that did not make it into the talk, along with some thoughts based on the feedback we received during the conference.

First, this year is probably the first time that deletes were discussed as a key research challenge, as part of database systems offering privacy through deletion. Specifically, Mike Stonebraker discussed it during the very interesting panel "The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?". Along with our work on deletes for LSM-trees, this marks deleting data becoming a first-class citizen. Up to now, we mostly worried about how to keep all the data we collected or generated. Today, we also have to worry about how to efficiently delete data when we have to (e.g., because of GDPR).

When studying the delete operator of modern data stores, we realized that virtually all deletes are out-of-place, even in the so-called in-place systems. For example, even when deleting in a B+ Tree or a heap file (which uses slotted pages), the most common approach is to invalidate the entry via a valid bit and leave the invalidated data there for future "cleaning". To highlight this point, consider that practical implementations of B+ Trees in full-blown relational systems often do not even perform a merge when a node underflows due to a delete, because they expect new inserts to grow the tree in the future. Of course, this approach is even more pronounced in LSM-trees, where a delete is implemented by design via out-of-place invalidation. In other words, in systems up to now, deletes have been an afterthought: something we may have to do, but not something we worry much about optimizing (notable exceptions that optimize the latency of bulk deletes are "Efficient Bulk Deletes for Multi-Dimensional Clustered Tables in DB2", "Efficient bulk deletes in relational databases", and "Online Bulk Deletion"; however, they still do not focus on other aspects of deletes like privacy and space amplification). In addition to the efficiency of deletion and privacy, the now-ubiquitous use of the cloud for data management poses a space management challenge. If one owns a disk, invalidated data does not cause extra costs unless new disks have to be bought. On the other hand, if one rents storage from a cloud provider, keeping invalidated data around for long leads to unnecessary storage costs.

Moving back to out-of-place deletes, we observe that, essentially, the purging of the data (which we call persistent deletion) is left as a task for later, undertaken by a compaction/garbage collection/defragmentation process depending on the exact system used. Leaving persistent deletion for later may lead to a significant increase in space amplification and may expose sensitive deleted data (which technically should not be accessible after its deletion), increasing the privacy risk. These risks and costs are further pronounced in LSM-based systems, where the garbage collection process is part of the original design.

Diving into more detail about deletes on LSM-based data stores (the LSM-tree was originally introduced by O'Neil et al. in 1996, popularized by Google's Bigtable and LevelDB as well as Facebook's RocksDB, a LevelDB fork, and has since been used by a variety of systems; a great survey is available by Luo and Carey), we first realized that an LSM-tree can receive four different types of deletes:

  • a point delete on the sort key of the LSM tree, which we call a primary point delete
  • a range delete on the sort key, or a primary range delete
  • a point delete on another attribute which is part of the value (or sometimes the whole value), which we call a secondary point delete
  • a range delete on this other secondary attribute, or a secondary range delete
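
To illustrate the difference in cost between these variants, here is a toy sketch (not how RocksDB or Lethe actually implement deletes): a primary point or range delete only inserts a tombstone on the sort key, while a secondary delete cannot even identify the affected keys without inspecting the values (unless a secondary index is available).

    # Toy memtable illustrating out-of-place (tombstone-based) deletes.
    # Not the implementation of any real LSM engine.
    TOMBSTONE = object()

    class ToyMemtable:
        def __init__(self):
            self.data = {}               # sort key -> value (a dict of attributes)
            self.range_tombstones = []   # [(lo, hi)] primary range deletes

        def put(self, key, value):
            self.data[key] = value

        def primary_point_delete(self, key):
            self.data[key] = TOMBSTONE   # O(1): just insert an invalidating entry

        def primary_range_delete(self, lo, hi):
            self.range_tombstones.append((lo, hi))   # O(1): a single range tombstone

        def secondary_point_delete(self, attr, attr_value):
            # Without a secondary index, every value must be inspected to find the victims.
            for key, value in list(self.data.items()):
                if value is not TOMBSTONE and value.get(attr) == attr_value:
                    self.data[key] = TOMBSTONE

        def get(self, key):
            if any(lo <= key <= hi for lo, hi in self.range_tombstones):
                return None              # range tombstones shadow older entries
            value = self.data.get(key)
            return None if value is TOMBSTONE else value

The actual purging of the tombstoned entries is exactly the persistent deletion discussed above; it happens later, when compaction touches the affected data.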

When a workload has a non-negligible amount of deletes, the room for improvement is actually rather large. Note that any out-of-place system already performs the (logical) delete very quickly, simply by inserting an invalidating entry (also called a tombstone), as also discussed in a recent post from Mark Callaghan, which raised some important questions about LSM read performance in the presence of tombstones. So, our focus in Lethe is on the following challenges:

  • what is the cost in terms of latency or write amplification of persisting a logical primary delete? how can we minimize that?
  • what is the benefit in terms of space amplification if we persist the primary deletes faster?
  • how should we physically organize the data in the presence of secondary range deletes?
  • can we guarantee persistent primary or secondary deletion in a given deadline?
  • how is read performance impacted by addressing these challenges?

The above questions are discussed in Lethe, however, there are several open challenges. We are now considering some additional challenges, including:

  • Secondary point deletes in LSM trees. This seems to be a much harder problem because, unless we have a secondary index, there does not seem to be a way to avoid reading the whole tree.
  • Read queries in the presence of primary range deletes. After looking at the internals of systems like RocksDB, we see that range tombstones significantly impact the read path of an LSM-tree, leading to (potentially) avoidable I/Os from storage simply to verify that an entry has been deleted.

Stay tuned for more work on efficient and persistent deletion!

=======

Edit: added a reference to a particularly interesting recent blog post from Mark Callaghan's blog.

The first ever virtual SIGMOD

A few days back, I participated in the first-ever fully virtual SIGMOD, held online due to COVID-19. The conference was a great success given the tremendous change in nature that the organization (and eventually the participants) had to go through. In this post, I would like to summarize a few lessons learned from this experience.

  • First and foremost, this was a colossal effort from the conference organizers (general chairs David Maier and Rachel Pottinger, PC chairs Anhai Doan and Wang-Chiew Tan, and all other conference officers). Thanks to all for this amazing effort!
  • The virtual conference gave a wider community the opportunity to participate and immerse themselves in the data management research the SIGMOD community is conducting.
  • Having the talks captured beforehand significantly helped in terms of organization; however, it came at a cost. I got the feeling that many talks focused on providing a solid advertisement for the work rather than sharing technical details. I know that a conference talk should do both and that it is hard to find the perfect balance, but I feel that this time the scales were tipped. This is not meant to criticize the quality of the presentations, which was excellent, but rather to emphasize that for most talks I was so excited that I craved more technical details -- of course, this probably means that all these talks did a great job of advertising their work.
  • The virtual Q & A enabled wide interaction; however, in many cases it was not easy to convey a technical question. Face-to-face technical discussions can go deeper, faster! On the other hand, I think the virtual Q & A was a great success for the panels, where many questions are somewhat more high-level and what usually scares junior people (a room full of seasoned researchers) was not there, so everyone got to ask their questions in several exciting panels ("The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?" and "Startups Founded by Database Researchers"), the very important "New Researchers Symposium", and various social events, including "Women in DB: Experiences and Perspectives".
  • Last but not least, a recurring discussion during every SIGMOD is about the official SIGMOD membership. I would like to make three suggestions:
    1. To everyone interested in data management research, please make sure you are an ACM member (https://www.acm.org/membership) and a SIGMOD member (http://sigmod.org/membership/)!
    2. To future SIGMOD conference (but also VLDB) organizers, let's have these links easily accessible from the conference websites (not only during registration).
    3. To the SIGMOD leadership, let's make the registration an easier and more seamless process. Since the cost is so small ($10-$15, if memory serves), let's also make it equally easy to pay (e.g., via PayPal or other easy-to-use methods). Let's also consider different prices for parts of the world where a smaller price would make a difference!
    4. And a bonus suggestion: let's advertise and clarify the benefits of these memberships. For example, now that more and more of our papers are publicly available, many young students may not see why they officially need to be ACM/SIGMOD members; however, being part of a community that works towards free access to all of our work, and helping shape its future, is a great incentive!