Monica Kay Royal

Subsurface LIVE 2023

Dremio’s Data Lakehouse Conference is the place to go to keep your knowledge sharp and learn the latest data lakehouse technologies and data engineering practices. With over 50 technical sessions from experts spanning the data industry, it is a great way to discover how others are using data lakehouse architectures and the lessons learned from their experiences.


2023 looks like the year for conferences to start opening their physical doors again, as this event offered a virtual option as well as three live locations in San Francisco, New York, and London. The virtual attendees experienced the festivities through the Goldcast platform, which seemed pretty standard, offering an agenda with six tracks, exhibit booths, and general and session chats. The event also included some fun bonus features like a leaderboard, photobooth, and DJ Toasty!



Only six minutes into the event, someone asked, 'Will we be able to view the conference sessions as a recorded replay?' This vibe continued throughout the event, with people asking about recordings and copies of the presentation slides. If you would like to watch the sessions on demand, sign up here. If you would like to read a summary, I hope you find this helpful!! 🤓


Since there were 50+ sessions, this is a summary of those I was able to attend and consider my favorites. The attendees seemed to enjoy the event as well.

Onto the show!


Summaries


The Year of the Data Lakehouse

Keynote


Tomer Shiran, CPO and Founder of Dremio, delivered Subsurface’s keynote. He opened by sharing that most companies are now living with conflicting forces: needing to deliver access, speed, and agility to the business while also maintaining data governance and security. Enter the rise of the data mesh, built on four pillars: data as a product, a self-serve data platform, domain ownership, and federated computational governance.



If you would like to learn more about the data mesh, check out these two books:



During this keynote, Tamas Kerekjarto, Head of Engineering, Renewables and Energy Solutions at Shell, and Deepika Duggirala, SVP Global Technology Platforms at TransUnion, shared their journeys to deliver governed self-service.



Shell is working on reshaping the energy system as it moves toward clean energy and digitalization. They are facing large datasets from multiple data sources and running 100+ models in parallel. Dremio has helped by making it easy to scale and adjust as these large datasets keep growing.


TransUnion’s goal is to let data live where it lives, grant access only to those who need it, and still make it easy enough that innovation happens across the board. Deepika joked that this is a piece of cake, though only partly a joke, since Dremio has been a big help toward this goal.



To wrap up the keynote, Tomer introduced Data Lakehouse 2.0 in the form of Dremio Arctic, which Dremio positions as the only data lakehouse management service featuring data-as-code functionality, automatic maintenance, and open standards and technology.


This is proof that data is certainly changing quickly:

Data as an Asset > Data as a Product > Data as Code


5 Use Cases for Data as Code
  • Experiment with data in transient branches

  • Ensure data quality with ETL branches

  • Reproduce models or analysis

  • Recover from mistakes

  • Troubleshoot issues
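To make the branching idea concrete, here is a tiny Python sketch of the "experiment in a transient branch" workflow. This is not the Dremio Arctic API, just a toy copy-on-write catalog with made-up table and branch names, so the branch / change / check / merge (or drop) pattern is easy to see.

```python
# Toy illustration of "data as code" branching (not the Dremio Arctic API).
# A real catalog tracks table state per branch; here we fake that with
# copy-on-write DataFrames so the workflow is visible.
import pandas as pd

class ToyCatalog:
    def __init__(self, tables: dict[str, pd.DataFrame]):
        self.branches = {"main": {name: df.copy() for name, df in tables.items()}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Branching is cheap in a real catalog; here we just copy the tables.
        self.branches[name] = {t: df.copy() for t, df in self.branches[source].items()}

    def table(self, branch: str, name: str) -> pd.DataFrame:
        return self.branches[branch][name]

    def write(self, branch: str, name: str, df: pd.DataFrame) -> None:
        self.branches[branch][name] = df

    def merge(self, source: str, target: str = "main") -> None:
        # Publish the branch's state atomically onto the target branch.
        self.branches[target] = {t: df.copy() for t, df in self.branches[source].items()}

    def drop_branch(self, name: str) -> None:
        del self.branches[name]  # recover from mistakes: just throw the branch away

catalog = ToyCatalog({"orders": pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]})})

# Experiment / run ETL on a transient branch instead of on main.
catalog.create_branch("etl_2023_03_01")
orders = catalog.table("etl_2023_03_01", "orders")
catalog.write("etl_2023_03_01", "orders", orders.assign(amount=orders["amount"] * 1.1))

# Only merge once quality checks pass; otherwise drop the branch and main is untouched.
catalog.merge("etl_2023_03_01")
print(catalog.table("main", "orders"))
```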



Putting Data Engineering & Data Science into the Next Gear Through a Modern Data Lakehouse at Shell

Natarajan Kalidoss, Head of Data Science and ML Engineering

Raja Perumalsamy, Manager Engineering


Shell Energy provides integrated energy solutions across all aspects of the energy market, and they are trying to predict long- and short-term electricity demand with higher accuracy. Due to the volatility of customer behavior, weather conditions, social events, and other factors, future demand is very difficult to predict.


The load forecasting process currently consists of clustering customers, training a machine learning model for each cluster, applying the cluster models to each meter, and tracking performance for continuous improvement.
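To picture the cluster-then-model pattern, here is a minimal scikit-learn sketch with made-up meter features and column names; Shell's actual pipeline is obviously far larger and runs 100+ models in parallel.

```python
# Minimal sketch of the "cluster customers, model per cluster" pattern.
# All column names (meter_id, avg_load, peak_load, hour, load_kwh) are illustrative.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

# 1) Cluster meters on simple usage features.
meter_profiles = pd.DataFrame({
    "meter_id":  [1, 2, 3, 4, 5, 6],
    "avg_load":  [1.2, 1.1, 5.4, 5.0, 9.8, 10.1],
    "peak_load": [2.0, 1.9, 8.0, 7.5, 14.0, 15.2],
})
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
meter_profiles["cluster"] = kmeans.fit_predict(meter_profiles[["avg_load", "peak_load"]])

# 2) Train one demand model per cluster (hour of day -> load in kWh).
history = pd.DataFrame({
    "meter_id": [1, 1, 3, 3, 5, 5] * 4,
    "hour":     [0, 12] * 12,
    "load_kwh": [0.8, 1.6, 4.1, 6.2, 8.5, 12.3] * 4,
}).merge(meter_profiles[["meter_id", "cluster"]], on="meter_id")

models = {
    cluster_id: GradientBoostingRegressor(random_state=42).fit(rows[["hour"]], rows["load_kwh"])
    for cluster_id, rows in history.groupby("cluster")
}

# 3) Apply the right cluster's model to each meter to forecast demand.
def forecast(meter_id: int, hour: int) -> float:
    cluster_id = meter_profiles.loc[meter_profiles.meter_id == meter_id, "cluster"].iloc[0]
    return float(models[cluster_id].predict(pd.DataFrame({"hour": [hour]}))[0])

print(forecast(meter_id=3, hour=12))
```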


The challenges they face are like many others': large datasets that need to be transferred from multiple data sources, 100+ models that need to be trained in parallel, and lineage that spans multiple layers.


Enter Dremio. After adding this tool to their platform, they have been able to make data ingestion, transformation, preprocessing, model training, and inference generation more efficient.



Product Analytics @ Wayfair

Siddharth Jain, Engineering Lead


Wayfair is a platform focused exclusively on the home goods market, with purpose-built technology that solves unique category challenges and strong brand recognition in North America and Western Europe.


Product Analytics is the process of gathering and transforming user-level data into insights that reveal how customers interact with products and services. It helps in understanding which apps and which specific features are actually used, which in turn helps the business make improvements. This requires quantitatively identifying those improvement opportunities, and until now the work in this area was done ad hoc with no standard set of tools.


With 30+ applications, there was a need to build analytic maturity, provide self-service dashboarding and analysis tools, standardize an interface for presentation-layer tooling, and be able to search and explore data. With their new technical solution they were able to build foundational data pipelines, develop a strong software and data foundation, curate data in a presentation layer, and drive discoverability.
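As a rough idea of what a foundational pipeline like this does, here is a tiny pandas sketch that rolls a made-up user-level event log into per-feature usage metrics for a presentation layer (the app and feature names are invented, not Wayfair's).

```python
# Hypothetical user-level event log rolled up into feature-usage metrics.
import pandas as pd

events = pd.DataFrame({
    "user_id": [101, 101, 102, 103, 103, 103],
    "app":     ["registry", "registry", "storefront", "registry", "storefront", "storefront"],
    "feature": ["create_list", "share_list", "search", "create_list", "search", "add_to_cart"],
    "ts": pd.to_datetime([
        "2023-03-01 09:00", "2023-03-01 09:05", "2023-03-01 10:00",
        "2023-03-02 11:00", "2023-03-02 11:10", "2023-03-02 11:12",
    ]),
})

# Curated presentation-layer table: usage and reach per app/feature per day.
usage = (
    events.assign(day=events["ts"].dt.date)
    .groupby(["day", "app", "feature"])
    .agg(events=("user_id", "size"), unique_users=("user_id", "nunique"))
    .reset_index()
)
print(usage)
```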


As this was a new program, one lesson learned was that teams were not completely aware of the project's goals. They therefore put on roadshows to socialize the initiative, and held weekly office hours and bi-weekly sprint demos to show progress.



Using data science for Contact Center optimization

Siddharth Garg, Sr. Analyst


A contact center is basically the way your organization interacts with your customers and is available at pretty much every company.


Issues faced by contact centers include poor customer experience, high wait times, poor agent management (such as dead air and call hangups), agent compliance issues, and agent attrition. The goal of most contact centers is to adhere to targeted service levels while also keeping staffing costs down. Let’s look at some data:


  • U.S. companies lose over $62 billion in annual revenue due to poor customer service

  • 88% of customers prefer voice calls with a live agent rather than other means

  • 77% of customers view a business more positively if they’re proactive with customer service


This last point is where we can dive a bit deeper: data science and machine learning can help by increasing proactivity and reducing the need for customers to call the service center at all. In fact, as of 2020, 66% of call center businesses were looking to invest in advanced analytics to provide better customer service.


Cloud call analytics solutions provide other benefits as well, such as customer insights, higher agent productivity, operational efficiency, and data security.


Siddharth walked through an example focused on the mortgage loan servicing function, where machine learning was used to estimate the likelihood of loan default. This informed a risk strategy that helped schedule calls to the right person at the right time through a ‘phased’ and ‘right time’ calling approach.
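Here is a hedged Python sketch of that idea: score default risk with a classifier, then phase the outreach by risk tier. The features, data, and thresholds are invented for illustration; they are not from the presentation.

```python
# Sketch of "right person, right time": score default risk, then phase outreach by tier.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Illustrative historical data: payment behavior and whether the loan defaulted.
history = pd.DataFrame({
    "missed_payments": [0, 0, 1, 2, 3, 4, 0, 5],
    "utilization":     [0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 0.1, 0.95],
    "defaulted":       [0, 0, 0, 1, 1, 1, 0, 1],
})
model = LogisticRegression().fit(history[["missed_payments", "utilization"]], history["defaulted"])

portfolio = pd.DataFrame({
    "loan_id":         [1, 2, 3],
    "missed_payments": [0, 2, 4],
    "utilization":     [0.25, 0.6, 0.9],
})
portfolio["default_risk"] = model.predict_proba(portfolio[["missed_payments", "utilization"]])[:, 1]

# Phased calling: highest-risk borrowers get proactive calls first.
def call_phase(risk: float) -> str:
    if risk >= 0.7:
        return "phase 1: proactive call this week"
    if risk >= 0.4:
        return "phase 2: call next cycle"
    return "no outbound call"

portfolio["outreach"] = portfolio["default_risk"].apply(call_phase)
print(portfolio[["loan_id", "default_risk", "outreach"]])
```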



CI/CD on the Lakehouse:

Making Data Changes and Repair Safe and Easy

Alex Merced, Developer Advocate Dremio


This was another presentation that got into Data as Code. I am enjoying this new topic, probably because I will always be an auditor at heart. Anytime you are auditing or testing something, I AM IN!


CI/CD is the practice of using automation to integrate and deploy new code safely and easily: automating the tests, audits, and publishing of an application. Applied to data, this ultimately affects business value by providing good-quality insights, which rely heavily on the quality of the data.


We learned about five use cases for Data as Code earlier; from a CI/CD context, it provides the following benefits:

  • Isolation: experiment with data without impacting other users

  • Version Control: reproduce models and dashboards from historical data

  • Governance: all changes to the data and metadata are tracked


Because of this, you can easily implement data quality checks without having to worry about disrupting the data or the users. You can check for duplicates, missing records, incorrect data, hidden data, and even referential integrity.
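As a rough illustration, here is a small pandas sketch of the kinds of checks you might run against an ETL branch before merging it to main. The table and column names are made up, and in a Dremio setup these would typically be SQL queries run against the branch rather than local DataFrames.

```python
# Minimal data quality checks of the kind you would run against an ETL branch
# before merging to main. Tables and columns are illustrative.
import pandas as pd

orders = pd.DataFrame({
    "order_id":    [1, 2, 2, 4],
    "customer_id": [10, 11, 11, 99],
    "amount":      [25.0, None, 40.0, -5.0],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

checks = {
    "no_duplicate_keys":    orders["order_id"].is_unique,
    "no_missing_amounts":   orders["amount"].notna().all(),
    "amounts_non_negative": (orders["amount"].dropna() >= 0).all(),
    # Referential integrity: every order points to a known customer.
    "valid_customer_refs":  orders["customer_id"].isin(customers["customer_id"]).all(),
}

failures = [name for name, passed in checks.items() if not passed]
if failures:
    raise SystemExit(f"Data quality checks failed: {failures}")  # block the merge in CI
print("All checks passed; safe to merge the branch.")
```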


Enter Dremio. This tool provides the most power here because you are able to automate all of these checks, balances, audits, and services.



Block-by-Block: Making web3 Data Accessible Using an Open Lakehouse Architecture

Luke Kim, Founder and CEO Spice AI


Lessons learned from a year of building Spice.xyz - a data and AI platform for web and blockchain data leveraging open lakehouse technologies, including Apache Arrow and Iceberg.


I thought this was a good lessons-learned and forward-looking view related to Dremio, so I wanted to share it.




PRQL: a modern language for transforming data

Tobias Brandt, Contributor and Developer Advocate PRQL

Aljaž M. Eržen, Compiler Developer EdgeDB


Tobias opened this session with the understanding that no one wants to learn yet another version of SQL; there are too many already. However, this one is kinda cool… but aren’t they all??

SQL is basically the lingua franca of data







But if you think about it, there are some things that could be better about SQL:

  • Its use of relational algebra without explanation

  • Patched syntax; it looks like COBOL from the 1970s

  • Not really composable and does not feel like a programming language...

  • Not really consistent

  • Too many dialects!






How do we fix this?

By creating a new language!


PRQL is interactive, with a fast development cycle and the ability to fail early. It is a simple, powerful, pipelined SQL replacement. It even has a VSCode extension and works with Python and R. And… it’s pronounced “Prequel”
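For a quick taste, here is a small example that compiles a PRQL pipeline to SQL through the Python bindings. I am assuming the prql-python package and its compile() function as documented around this time, and the employees table is just an example.

```python
# Compile a PRQL pipeline to SQL using the Python bindings.
# Assumes `pip install prql-python`; module and function names per the PRQL docs.
import prql_python as prql

prql_query = """
from employees
filter age > 30
sort age
take 10
"""

print(prql.compile(prql_query))
# Roughly: SELECT * FROM employees WHERE age > 30 ORDER BY age LIMIT 10
```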


Some of us may need to revisit our pronunciation of SQL 😂


To learn more you can visit the website, check out the GitHub page, and head over to the playground to get your hands dirty.

They even have a Discord!



Tech Roast Show


I rated this session as “hilariously awkward”. The number of times there was silence, or just straight-up odd questions and answers, was fantastic. I am a huge fan of comedy and socially awkward myself, so I was really glad to have attended virtually, especially when the guys called out my name as one of the contest winners. Why did my face go red from home?? 😳🤣


They get a 10/10 from me; they should be invited to more data events. I am thinking I should ask them to be guests on my Data Podcast for Nerds! …


One of the games they played was called Company or Cookie (fortune cookie, to be exact).







Check them out on YouTube




Data Mesh in Practice

Host: Ben Hudson

Guests:

Ugo Ciraci, Engineering and Business Unit Leader of Utility and Telco

Zhamak Dehghani, Founder and CEO Nextdata

Raja Perumalsamy, Manager Engineering Shell

Tomer Shiran, CPO and Founder of Dremio


This was a panel that discussed the practical considerations of data mesh, tips for addressing the people and cultural aspects of data mesh, and real-world lessons learned when implementing a data mesh or data mesh-like approach.


Highlights:
  • There is no single pattern or one way that works when implementing the data mesh

  • If you have management’s full support, it will be easier to onboard technical and non-technical users

  • Linking different departments together is challenging and needs to be done carefully

  • Do not go pillar-by-pillar when implementing the data mesh; take a thin slice across each pillar, implement, rinse, and repeat

  • Think of implementation as a “hello world” version of data as a product: start with one or two domains and work backwards

  • People think “build a platform and they will come,” but that is wrong. You need to find and start with the absolute minimum that people agree upon

  • Be aware that if you don’t have any quick wins, management will be asking “Why are we spending all this money?”

  • Make choices to allow you to scale up in the future, throughout the organization

  • Focus on practices first, then on the technology that will enable and make that a reality

  • At the end of the day, data mesh is both a technology and organizational solution. There is only so much you can do as a platform if the org is having an issue with things like data ownership


Lightning Round:
Advice for those wanting to build a data mesh
  • Think about the largest population of technologies that you have to build, use, share

  • Try to focus on business uses and quick wins, involve the business

  • Get started with a project and demonstrate a quick win

  • Empower developers to use the tools!

  • Sign up for Dremio Cloud 🙂



Data Mesh in Action

Marian Siwiak, Chief Data Science Officer


If you want to implement a data mesh, you have to think carefully about whether it is good for the business, as there are three main factors involved. Only when they overlap might you actually need a data mesh.


If you feel you have identified the need for a data mesh, you should then follow a roadmap to kick-start it:


  • The landscape diagram should include the process execution flow, data flow, and risk layers.

  • The stakeholders should be identified on a power vs. interest matrix

  • The right people should consist of those from both the development and governance teams

  • Data governance should include policies, standards, and have both centralized and local layers

  • Data products should always focus on business viability, not on the tech-specs!

  • Data platform can be git-based for maximum efficiency



Unboxing the Concept of Drift in ML

Supreet Kaur, Assistant Vice President


Drift is a common and inevitable phenomenon that occurs during the lifecycle of your model and is one of the leading causes of performance deterioration. Machine learning models are built on the assumption that historical data is a true reflection of the future, but in a fast-changing world that is rarely the case.


This was probably one of my favorite sessions because I find this phenomenon fascinating due to its complexity. Therefore, I will be sharing a lot of slides to give max info. 😀


Drift occurs when model performance degrades over time, detected as a drop in performance compared to the performance at training time. These changes occur for multiple reasons, including changes in the inputs, changes in the target variable, or bias in the data.
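As one concrete example of detection (my own illustration, not a slide from the talk), a common data-drift check is to compare a feature’s training distribution against recent production data with a two-sample Kolmogorov-Smirnov test; the threshold and synthetic data below are purely illustrative.

```python
# Simple data-drift check: compare training vs. production feature distributions
# with a two-sample Kolmogorov-Smirnov test. Threshold is illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)   # what the model saw at training time
prod_feature  = rng.normal(loc=0.4, scale=1.2, size=5_000)   # what it sees in production today

stat, p_value = ks_2samp(train_feature, prod_feature)
if p_value < 0.01:
    print(f"Drift detected (KS statistic={stat:.3f}); investigate inputs or consider retraining.")
else:
    print("No significant drift in this feature.")
```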


There are a few types of drift, split between model drift and data drift. Supreet says to take these terms with a grain of salt because they might not exactly match the textbook definitions.


Some techniques can be used to deal with and mitigate the effects of drift. Again, Supreet mentioned that these are complex and deserve more exploration than just a summary or slide, but they are very interesting!


Lastly, Supreet shared some best practices for dealing with drift, noting that not all of them fit every use case; it depends on the details and whether you need a reactive or proactive approach.


Conclusion


Subsurface was sublime; I learned a lot about data lakehouses, the data mesh, and Dremio as a company. My favorite fact from Gnarly was his expertise in SQL, a super critical skill that all data professionals should have in their tool belt.

 
Thank you for reading, thank you for supporting

and as always,
Happy Learning!!
