Interview by The Brink

A few weeks ago, I was interviewed by Jessica Colarossi for The Brink, a BU publication that delivers the latest news from Boston University research. The full piece can be found here. It covers the research we conduct in the DiSC lab, as well as the research of one of my colleagues, Masha Kamenetska, who holds a joint appointment in the departments of Chemistry and Physics; both of us were recently funded by NSF CAREER awards.

During the interview, we had an interesting exchange that I would like to share in full:

Can you briefly explain the work you’re doing with the [NSF CAREER] award funding? 

This NSF CAREER grant will have a transformative effect on the way we build and tune data systems. We will develop a new breed of data systems that can offer near-optimal performance despite potential uncertainty in the execution setting and increase performance predictability. We now use public (and private) cloud for most of our computing. As a result, our applications face interference from other applications, leading to various forms of workload and resource unpredictability. This proposal, at its core, aims to build data systems that can address this unpredictability. 

In addition to research, this proposal will help the Data-intensive Systems and Computing Lab at the CS department (https://disc.bu.edu), which I founded in 2019, to establish a strong educational and outreach impact. The funds already support a number of graduate students, the development of new CS classes and educational material, and targeted internship programs. 

How are log-structured merge (LSM) based data systems used currently, and what problems will you and your team be addressing? 

In this project, we work on LSM-tree-based key-value stores, a class of data systems used by a wide variety of SQL and NoSQL applications supporting social media platforms, data-intensive applications like accumulating sensor or e-commerce data, and more. A very influential LSM-based system (but hardly the only one) is RocksDB, an open-source key-value store developed by Meta (formerly Facebook) and used as a core part of its infrastructure. Note that several data-intensive workflows and systems outside Meta also use RocksDB.

LSM-tree stands for Log-Structured Merge tree; it offers a tunable balance between optimizing for fast reads and fast writes (updating existing or storing new data). When deploying LSM-based systems in the wild, we use information about the application to tune them to offer the best possible performance.

However, as data-intensive applications are increasingly being deployed in private and public clouds, our assumptions about the application properties (e.g., workload characteristics, access patterns, read/write ratios) and the available resources (e.g., amount of memory, quality of service of the available hardware) come with a degree of uncertainty. That uncertainty makes it increasingly hard to tune LSM-based data systems. In this project, we capture and model uncertainty by innovating in the tuning process, so that our data systems offer optimal (or close-to-optimal) performance even when the workload they face departs from the one initially expected. This will require a concerted effort to fuse expertise and techniques from computer and data systems, algorithmic data mining and machine learning, and low-level hardware understanding. I am fortunate to be part of a department that offers a fertile environment to cultivate those connections.
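To make the read/write trade-off concrete, here is a minimal, illustrative sketch of the LSM idea in Python. This is a toy, not RocksDB's actual implementation: writes land in an in-memory buffer (the "memtable"), which is flushed as an immutable sorted run when it fills; reads check the memtable first, then scan runs from newest to oldest. The `memtable_limit` parameter stands in for the kind of tuning knob the answer above refers to.

```python
class ToyLSMStore:
    """A toy LSM-style key-value store (illustrative only)."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # tuning knob: buffer size before flush
        self.memtable = {}                    # mutable in-memory write buffer
        self.runs = []                        # immutable sorted runs, oldest first

    def put(self, key, value):
        # Writes are fast: they only touch the in-memory buffer.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort the buffer and append it as a new immutable run
        # (a real system would write this sequentially to disk).
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Reads may be slower: check the memtable, then each run,
        # newest first so that later writes shadow earlier ones.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            for k, v in run:  # real systems use binary search and Bloom filters
                if k == key:
                    return v
        return None
```

A larger `memtable_limit` favors writes (fewer flushes) at the cost of reads having to consult more buffered state, which is the essence of the tunable balance mentioned above; real systems like RocksDB expose many more such knobs (compaction policy, number of levels, filter memory, and so on).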

Why do you think it’s important to upgrade and optimize these data systems? 

The new breed of data systems we envision will require less human intervention. As a result, they will be more appealing to users and administrators, allowing organizations to widely deploy reliable systems to accelerate and support data-intensive scientific discovery and applications. Such ease of deployment is timely as data management is increasingly becoming an automatic service, where human experts cannot always be in the loop to hand-tune and optimize data systems.

How did you become interested in data systems and computer sciences? 

My interest in computer science dates back to when I was a grade school student. As far back as I can remember, I was fascinated by computers and the creativity that came with them. When I wrote my first program in 5th grade, I felt I had just created "something out of nothing". This feeling followed me throughout college, when I was really interested in building computer systems. During college, I was also drawn to data management because its principles were built on logic and helped me make sense of the world around me. Those two topics – data management and computer systems – are the foundations of the research we conduct in the lab. We want to build technological tools to manage and make sense of data – and thus the world around us.

By the end of the five-year grant, what do you hope to accomplish?

The main scientific goal of this grant is to develop the data systems tools needed to address uncertainty in the workload and the available resources. Our goal is to push the state of the art of data systems by making our findings and contributions part of practical systems. To that end, we already apply our ideas to practical systems, and in addition to our scientific publications, we make all our artifacts (code, data, scripts) public. Research, however, is only one side of it. Every grant, especially an NSF CAREER grant, also focuses on education and outreach. The way I prefer to see it is to focus on people. One of my main goals is to help develop a new generation of data systems experts, at both the Ph.D. and the undergraduate level, who will enjoy the process of building systems (and learning in general), and who will also acquire the technical and societal skills to live a fulfilling life and tackle significant real-life problems down the line. This is what I see as the most important metric of success.