Virtual lab-visits at the time of the pandemic

One of the biggest challenges during the pandemic has been to further work on the community building that typically happens through in-person live interactions in conferences, invited talks, workshops, lab visits, etc, and fuels research progress.

Most conferences quickly transformed to an online-only endeavor, which entailed a tremendous technical and organizational effort for which, I believe, everyone is thankful for the tireless work of the leadership of all our conferences (e.g., SIGMOD, VLDB, EDBT, ICDE, KDD). While here I refer to data management conferences, I am confident that the same sentiment is shared in other communities.

In the online-only version of the conferences, however, there was one aspect that was left behind. Any research community is indeed called a community because its members know each other and once a young researcher has started becoming a member of this community, she or he can seek advice or help from the human network that she or he has started cultivating. However, this community building is something very hard to achieve without direct communication (ideally in-person). In conferences, typically, a seasoned advisor often serves as the link between young students and other researchers that might help with some feedback on a project or some interesting discussion which may lead to a collaboration or even to a lifelong friendship. How are young graduate students expected to start cultivating these types of relationships in the pandemic era?

While I certainly do not have a final answer in mind, I have a proposal that I am experimenting with since Fall 2020. It is really a simple idea. For as long as traveling is virtually impossible, I am trying to have 3-4 virtual visitors in my lab per semester, and enable direct interaction between our visitor and all the lab members. I have been experimenting with a combination of Zoom and Gather (which was also used by SIGMOD and VLDB among other conferences). The latter not for its (not) superior video quality, but for its casual look-and-feel, which allows both my students and our visitors to relax and connect on a personal level. Typically, our visitors give a talk and spend 1-2+ hours at a virtual round table with me and the whole lab.

One could ask, do we really need such a virtual round table? Why not have conference talks and Q & A? I think that most young students that only know people as authors of papers and textbooks, do not really know where to start building their network without having a physical presence (e.g., in a conference). So, it is crucial that we create this platform! Essentially, I propose that each advisor invites other researchers and organize a few such virtual lab-visits per semester to ensure that our community continues to foster interaction between its existing and its younger members despite the current circumstances.

Another question one could ask is, isn’t this really just a seminar talk? Well, my answer is yes and no. Giving talks virtually is an amazing opportunity to present and advertise one’s research, and almost all data management labs I know, already host virtual seminars. However, I am arguing for something more than virtual talks. I want to highlight that we — as a community — should have virtual lab-visits, which would primarily focus on fostering direct personal relationships in addition to presenting recent results! We can also take this a step further and have virtual cross-lab visits, with visits back and forth between the two labs, which can end with a large round-table discussion between all lab members of both labs.

 

Let’s talk about deletes!

In the recent SIGMOD, we presented our work on a delete-aware LSM-based system called Lethe. In this post, I would like to share a few details that could not make it to the talk along with some thoughts based on the feedback we received during the conference.

First, this year is probably the first time that deletes were discussed as a key research challenge, as part of database systems offering privacy through deletion. Specifically, Mike Stonebraker discussed it during the very interesting panel “The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?“. Along with our work on deletes for LSM-trees, this is probably the first time that deleting data becomes a first-class citizen. Up to now, we mostly worried about how to keep all the data we collected or generated. Today, we also have to worry about how to efficiently delete them if we have to (e.g., because of GDPR).

When studying the delete operator of modern data stores, we realized that virtually all deletes are out-of-place, even in the so-called in-place systems. For example, even when deleting in a B+ Tree or a heap file (which uses slotted pages) the most common approach is to invalidate the entry via a valid bit and leave the invalid data there for future “cleaning”. To highlight this point, consider that often practical implementations of B+ Trees in full-blown relational systems do not even perform the merge after the case of underflowing due to a delete, because they expect new inserts to grow the tree in the future. Of course, this approach is further pronounced in LSM trees where a delete is implemented by design via out-of-place invalidation. In other words, in systems up to now, deletes are an afterthought, something that we may have to do, but we don’t really worry too much about optimizing it (notable exceptions about optimizing the latency of bulk deletes are “Efficient Bulk Deletes for Multi-Dimensional Clustered Tables in DB2“, “Efficient bulk deletes in relational databases” and “Online Bulk Deletion“, however, they still do not focus on other aspects of deletes like privacy and space amplification). In addition to the efficiency of deletion and privacy, the new ubiquitous use of the cloud for data management poses a space management challenge. If one owns a disk, then invalidating data does not cause extra costs unless they have to buy new disks. On the other hand, if one rents storage from a cloud provider, keeping around invalidated data for long leads to unnecessary storage costs.

Moving back to performing out-of-place deletes, we observe that essentially, the purging of the data (which we call persistent deletion) is left as a task for later which will be undertaken by a compaction/garbage collection/defragmentation process depending on the exact system used. Leaving the persistent deletion for later may lead to a significant increase of space amplification and may expose sensitive deleted data (which technically should not be accessible after their deletion) increasing the privacy risk. These risks and costs are further pronounced in LSM systems where the garbage collection process is part of the original design.

Diving into more details about deletes on LSM-based data stores (LSM was originally introduced by ONeil in 1996, popularized by Google’s Bigtable and LevelDB, as well as Facebook’s RocksDB – a LevelDB fork, and since then is being used by a variety of systems – a great survey is available by Luo and Carey), we first realized that an LSM can receive four different types of deletes:

  • a point delete on the sort key of the LSM tree, which we call a primary point delete
  • a range delete on the sort key, or a primary range delete
  • a point delete on another attribute which is part of the value (or sometimes the whole value), which we call a secondary point delete
  • a range delete on this other secondary attribute, or a secondary range delete

When a workload has a non-negligible amount of deletes, the room for improvement is actually rather large. Note that any out-of-place system is already performing the (logical) delete very quickly simply by inserting an invalidating entry (also called a tombstone), as also discussed at a recent post from Mark Callaghan, which also raised some important questions about LSM read performance in the presence of tombstones. So, our focus for our work on Lethe is on the following challenges:

  • what is the cost in terms of latency or write amplification of persisting a logical primary delete? how can we minimize that?
  • what is the benefit in terms of space amplification if we persist the primary deletes faster?
  • how we should physically organize the data in the presence of secondary range deletes?
  • can we guarantee persistent primary or secondary deletion in a given deadline?
  • how is read performance impacted by addressing these challenges?

The above questions are discussed in Lethe, however, there are several open challenges. We are now considering some additional challenges, including:

  • Secondary point deletes in LSM trees. This seems to be a much harder problem as unless we have a secondary index, there not seems to be a way to avoid reading the whole tree.
  • Read queries at the presence of primary range deletes. After looking at the internals of systems like RocksDB, we see that range tombstones are significantly impacting the read path of an LSM tree leading to (potentially) avoidable I/Os from the storage simply to verify that an entry has been deleted.

Stay tuned for more work on efficient and persistent deletion!

=======

Edit: added a reference to a particularly interesting recent blog post from Mark Callaghan‘s blog.

The first ever virtual SIGMOD

A few days back I participated in the first-ever fully virtual SIGMOD due to COVID-19. The conference was a great success given the tremendous change of nature the organization (and the participants eventually) had to go through. In this post, I would like to summarize a few lessons learned from this experience.

  • First and foremost, this was a colossal effort from the conference organizers (general chairs David Maier and Rachel Pottinger, PC chairs Anhai Doan and Wang-Chiew Tan, and all other conference officers). Thanks to all for this amazing effort!
  • The virtual conference gave the opportunity to a wider community to participate in the conference and immerse in the data management research the SIGMOD community is conducting.
  • Having the talks captured beforehand significantly helped in terms of organization, however, it came at a cost. I got the feeling that many talks focused on providing a solid advertisement for the work rather than sharing technical details. I know that a conference talk should do both and that it’s hard to find the perfect balance, but I feel that this time the scales were tipped. This is not meant to criticize the quality of the presentations, which was excellent, rather, to emphasize that for most talks I was so excited that I craved for more technical details — of course, this probably means that all these talks did a great job in advertising their work.
  • The virtual Q & A enabled wide interactions, however, in many cases, it was not easy to convey a technical question. Face-to-face technical discussions can go deeper, faster! On the other hand, I think that the virtual Q & A was a great success for the panels where many questions are somewhat more high level and what usually scared junior people (a room full of seasoned researchers) was not there, so everyone got to ask their questions in several exciting panels “The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?“, “Startups Founded by Database Researchers“, and also the very important “New Researchers Symposium“, and various social events including the “Women in DB: Experiences and Perspectives“.
  • Last but not least, a recurring discussion during every SIGMOD is about the official SIGMOD membership. I would like to make three suggestions:
    1. To everyone interested in data management research, please make sure you are an ACM member (https://www.acm.org/membership) and a SIGMOD member (http://sigmod.org/membership/)!
    2. To future SIGMOD conference (but also VLDB) organizers, let’s have these links easily accessible from the conference websites (not only during registration).
    3. To the SIGMOD leadership, let’s make the registration an easier and seamless process. Since the cost is so small (10$-15$ if memory serves well), let’s also make it equally easy to pay (e.g., via PayPal or other easy-to-use methods). Let’s also consider different prices for parts of the world that a smaller price would make a difference!
    4. And a bonus suggestion: let’s advertise and clarify the benefits of these memberships. For example, now that more and more of our papers are publicly available many young students may not see why they officially need to be ACM/SIGMOD members, however, being part of a community that works towards having free access to all of our work and help further shaping its future, is a great incentive!