What is Derived Data? (and Do You Already Have Any?)
Felix GV explains what derived data is and dives into four major use cases that fit in the derived data bucket: graphs, search, OLAP, and ML feature storage.
Felix GV joined LinkedIn's data infrastructure team in 2014, first working on Voldemort, the predecessor of Venice. Over the years, Felix participated in all phases of the development lifecycle of Venice, from requirements gathering and architecture, to implementation, testing, roll out, integration, stabilization, scaling and maintenance.
GV: My name is Felix. I am a Committer on the Venice Database Project, which was open sourced last year. I've been doing data infrastructure since 2011, the majority of which I spent at LinkedIn. You can find some of my thoughts on my blog at philosopherping.com.
I'd like to cite one of the nice quotes from Winston Churchill: the nice thing about having the modern data stack, data mesh, real-time analytics, and feature stores is you always know exactly which database to choose for the job. Of course, Winston didn't say that. As far as I know, no one ever said that. In fact, I hear people saying the opposite. Year after year, they say the database landscape is getting more confusing. There are just too many options to pick from.
Here's a bunch of data systems, most of which you're probably familiar with, or have at least heard about. What I want to do in this talk is give you a new lens through which to look at this landscape and categorize these systems, because within this bunch of systems, hiding in plain sight, are several derived data systems. I think it's useful if we understand the differences between primary data and derived data, and then which systems are optimized for each. Then we can make better choices about our architectures.