The Data Observability Summit
Hosted by Monte Carlo
Monte Carlo hosts an annual Data Observability event which spotlights some of the industry’s most prominent voices, as well as the broader community of data leaders paving the way forward for reliable data.
This was a hybrid event, with the virtual folks on the Hopin platform (I love this platform) and three in-person welcome receptions in New York, the Bay Area, and London. The event spans two days and includes 8 keynote sessions and 18 breakout tracks based on two themes, Data Leaders and Technical Architects. Both days ended with a virtual entertainment session by OrchKeystra, who brought the best of jams!! What a perfect way to end a day.
Day 1
The session began a little rocky with some technical difficulties, but honestly I haven't been to a conference that doesn't have these little hiccups; I think it oddly makes things more human.
Before the session officially kicked off, we saw a chat between Barr and some of her partners, 1 week prior to this event, talking about the right way to kick it off. There were many ideas, a lot centered around Halloween (my favorite holiday, so I was excited). I love what they settled on, “Data Engineer Remix: Emails Mad at Me! ("Somebody Mad at Me" Parody)”, this was the funniest parody I have heard recently. Check out the Halloween jam here, courtesy of Jacq Riseling and Will Robins.
Barr Moses, CEO & Co-Founder at Monte Carlo formally kicked us off by introducing Data Observability as the topic and mentions how hot it is to be a Data Engineer in tech right now. Data Observability is an organization’s ability to understand the health of the data in their systems. This makes data more important than ever but we all know that data still has its bad days. Barr shared that Unity, the cross-platform game engine, lost $100M because of bad data and Equifax assigned wrong scores to millions of customers. Goes to show, bad data spares no one.
One good news is that a new Gartner Research Report reflects growing interest in data observability 🏆
Separating Signal from the Noise: The Prediction Paradox
Nate Silver
Founder and editor-in-chief, FiveThirtyEight
Nate's book “The Signal and the Noise: Why So Many Predictions Fail–but Some Don’t” started us off with a great quote:
If the future was easy to predict, we wouldn’t need data analysis
Today, Nate shared with us the 11 Guidelines for High-Quality Data Analysis (the first three of which came from his book):
Think probabilistically
Know where you’re coming from
Try, and err
The highlight of these three items is that we live in a world where uncertainty is all around us. We all have different perspectives, so we need to think about what happens when we come together as a company or as competitors. Typically, when a crowd gets together, it is wiser than the individual. But there is a balance when learning new information: how do you weigh it against what you already know? Do you take it at face value and move on, or do you update your prior beliefs (think Bayes' Theorem)?
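For reference, since Bayes' Theorem got a shout-out: updating prior knowledge just means weighing new evidence against what you already believed,

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

where P(H) is your prior belief in a hypothesis, E is the new information, and P(H | E) is your updated (posterior) belief.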
Nate then references a great complementary book: The Wisdom of Crowds by James Surowiecki.
Three takeaways / overlapping items:
Diversity: crowds should be diverse in things such as skill sets, perspectives, etc.
Independence: if people can't provide their own point of view, they can't bring much to the table
Trust: abide by the rules, with everyone working in each other's best interest
The remaining guidelines:
Define the problem
Treat data as guilty until proven innocent
Kick the tires
Break it down
Learn the Rules of the Game (which I have now just lost, and you as well, IYKYK)
Don’t lead with your gut
Don’t be afraid to be critical
Focus on process, not results
Basically, you need to define the problem so everyone knows the goal and is set up to be successful in finding the solution. Once you find a solution, make sure that it is the right one: test it, question it, ask for a second opinion, visualize it, phone a friend, anything you need to do to make sure the solution makes sense. Know that there will be obstacles, hills, holes, and maybe some wet spots you have to leap over. Just be sure to set yourself up for success at the beginning: be prepared and enjoy the process.
Extraction and Abstraction: A Conversation with Lloyd Tabb
Co-founder of Looker, Architect of LookML and Malloy
Rafa Jara Simkin, Enterprise Sales Leader @ Monte Carlo, talks with Lloyd about his journey, from how he became one of the most well-known entrepreneurs in the data space to how he created LookML and Malloy.
When you first built LookML, what problems were data teams facing?
Lloyd has been a CTO at a lot of companies, a role responsible for making sure everyone in the organization knows what is going on, and the best way to do that is through data. He wanted to build tooling so that people could analyze the data and understand what is going on in real time. Looker was his 4th product, and he mentioned that the first versions of LookML were not very pretty. He learned that it is hard to make solutions look nice and have people understand how to use them.
Did you see LookML becoming as big as it is?
He didn’t know what kind of company it was going to be, whether it would be a software, product, services, or even a consulting company. He was never sure it was going to work until it was working.
Outside of solving a problem, what is the biggest takeaway you can share?
He saw Looker as more than just providing software. He saw it as an education company, with the job of teaching new ways to work with and look at data. He knew that you need to make a safe environment for customers and make them successful in order for you to be successful.
Can you share your perspective on data reliability, quality, management with different tools in the data stack?
Building reliability in software is really hard. This includes things like version control, testing of changes, having a good team, etc. Continuous integration and continuous testing paired with good engineering software practices are super important.
What is Malloy?
Honestly, this one was hard to capture on paper.
What I was able to capture was that Malloy is all open source with a goal of being used everywhere. Anywhere you are typing SQL, you will be using Malloy. Currently, Malloy supports BigQuery, DuckDB, and PostgreSQL.
Lloyd’s advice: just try it. Here is a link
Do you have any entrepreneur tips?
Acquire lots of different types of skills
When you are small, know that you will not have enough people to do it all very well. When you start to grow, you can hire people to fill the missing gaps. Loved this analogy: it’s like throwing a bunch of paintballs at a wall. You want the whole wall to be covered. Some of the wall is covered very well but there are some holes… that is where you can focus to hire.
Have a clear sense of mission, what you are doing, what you are trying to achieve, and what is your value proposition. Do not focus on the how up front, just know what problem you are trying to solve.
Listen to your customers
Build Your House on Rock, not Sand
Head of Analytics, Understood
The basis for this session was that data needs a strong foundation, otherwise there will be chaos.
Yay analogies! 🏡
If you ask a friend who just purchased a new home, ‘What is your favorite thing?’
Answers: ‘The finished basement’, ‘The upstairs game room’, ‘The farm sink in the kitchen’
NOT Answers: ‘The floors don’t creak’, ‘The water is not brown’, ‘The heater doesn't catch on fire’
People are rarely ever focused on what matters the most; things like these are just expected to exist and work well. But without a strong foundation, your house (data) is going to collapse.
The other thing is that a lot of executives have FOMO with all the hype and buzzwords flying in and out of the scene. They see the things that get people's attention and they don't want to be left behind. However, sometimes they skip the part that matters: do we even have the data (or more importantly, the right data / good data)? No one wants to experience the tragic garbage-in, garbage-out scenario.
Without paying attention to the things that need attention, you will get nowhere
Q&A Time!
Do you recommend a data quality tool like Informatica Data Quality, if you have a data observability tool in place?
Depends. First, figure out what it is that you want to look at and what questions you are asking. (This guy has all the analogies!)
If you have 10 windows in your house, do you want 30 monitors or just 1 monitor on the back door?
Think through what is in your stack, where the failure points are, and how to best mitigate those failures.
What tips do you have for data teams trying to make "unsexy" ideas like data governance and MDM more attractive?
Talk about what is going to happen if they do not pay attention to these things. Think about ‘what’s in it for them’ and explain why it is important to them in order to get people on board. From Monica’s experience: if you can put it in money terms, you have a better shot.
When should you implement a Data Observability solution?
There isn’t a straight answer to this, but do not wait until your data is broken (this is reactive). The earlier you can get on it, the easier it will be.
Data Fitness for a Healthy Business
Shailvi Wakhlu, Sr Director of Data, Strava
There was A LOT of really good stuff here! Shailvi walks us through definitions of bad data, why businesses should care, where bad data is introduced, how to identify bad data, how to figure out the reason for bad data, and how to prevent bad data!
What is Bad Data?
Inaccurate data, incomplete data, misleading data
Bad Data is NOT
Data showing you something you don’t want to see
Why should business care?
Bad data is costly in many different ways
The more widespread bad data is in organizations, the worse it becomes
It lowers trust
Damaging to your representation
Loss of revenue/increase in costs
Bias = harm (harm to real people through bias, hidden biases, $$ and people’s lives)
Liabilities
Productivity & morale
Speaking of morale, people should be an organization's first priority. If your people are low on morale, everyone is going to have a bad time.
Bad data can be introduced at any of the phases of the Data Lifecycle:
Define > Log > Transform > Analyze > Share
“Definition” phase
Uneven feature definition
Myopic definition
Incorrect input parameters
Problems arise with how strict or loose a definition is (example: organizations want to know how many users they have, but each department can define users differently)
“Logging” phase
Incorrect tracking (classification)
Faulty pipeline
Inconsistent timeframes (stale data)
“Transforming” phase
Un-intuitive rules (clear data dictionaries help)
Meaningless aggregations (if all transformations happen at once, you can't work backwards to get back to the raw data)
Logical errors
As any baker knows, a lot can go wrong between the raw ingredients and the finished product.
“Analyzing” phase
Ambiguous problem definition (everyone needs to be on the same page to answer the same question)
Inapplicable model / formulae
Biased algorithm (can cause harm to underrepresented groups in the population)
“Sharing” phase
Faulty reporting
Misinterpreted results
Unintended downstream use
Diagnosing Bad Data
How bad data shows up
Results that don’t match
Suspicious results
Look for the obvious reasons
Was it bad data, or bad assumptions?
Was the dataset built for your use case?
Was there a known bug or pipeline issue?
THEN, Go through data lifecycle in reverse order
At which phase is the data still accurate?
In the phase where it breaks, what could the causes/reasons be?
Shailvi shares this cool Sanity Checklist
Really helps to discuss this with others as well, to validate initial assumptions
Curing bad data is a cross-functional effort
Prevention is better than cure!
Reconcile terminologies
Automate everything - no manual stuff
Simplify your curated datasets
Have a consistent governance & ownership philosophy
Audit at regular intervals
How to Make Data Governance Fun for Everyone
Tiankai Feng, Head of Product Data Governance, Adidas
First off, if you know Tiankai, you know that he makes data fun for everyone!!
We started the session off with a song, all about Data Governance (a great way for us data professionals to explain what we do to our parents/spouse 😂)
Jacq Riseling then asks all the juicy Data Governance questions and Tiankai shares everything!
How do you incentivize data governance?
Data Governance is a long-term initiative, which can be hard to incentivize. Some don't think that it is worth pursuing or even want to talk about it.
A couple things to help:
Start with the right mindset
Get commitment of a stakeholder
Tiankai thinks that Data Governance needs some rebranding; it is oftentimes seen as the police, but Data Governance has the whole company's interests at heart and is a great thing!
How to get stakeholders to care?
Let them know what’s in it for THEM. Different teams will have different answers, so cater your communications to your audience.
As I like to see it, this is not a Halloween costume; there is no one-size-fits-all approach.
What about external stakeholders?
This depends on the industry. Financial services and pharmaceuticals are highly regulated and must follow certain laws; here, it is really just about making it happen. Non-regulated industries, like retail, make things a bit harder because there needs to be a mutual agreement to make them care about Data Governance. Again, it is all about communication and sharing what's in it for them.
What is the role of empathy in Data Governance?
There are a lot of definitions of Data Governance, but in summary it is all about making data usable in the right way for the right people. We should empathize with the user side because they are the ones trying to use the data.
So think about the users, ask the right questions and actively listen.
How do you make Data Governance more creative and fun for leadership, peers, collaborators, and external stakeholders?
Use analogies and examples
Using unconventional and unexpected formats like songs, podcasts, marketing campaigns with prizes, and gamification
Lyric from his song: ‘If data is the force, we are the Jedi Council’
Because the Data Governance council sounds boring… and Star Wars is way more fun!
What is the role of Data Quality in Data Governance?
Data quality is a key outcome of good Data Governance. Data Quality often becomes a KPI that Data Governance is measured on. What counts as good Data Quality can be tricky to define and needs to be agreed upon by everyone.
Data Quality can be quite an emotional topic, and people can be frustrated by it. Data Quality is sometimes used as a catch-all for any bad data, even when the data is not bad, just 'different than expected'.
Top 3 Data Governance challenges:
Data Ownership: Data Governance team cannot fix everything alone
Data Quality: everyone needs to agree on the definition
Perception of Data Governance: can result in getting Data Governance involved too late
Day 2
Starting at the end, I just wanted to say how much I loved ending the two sessions rocking out to the music stylings of OrchKeystra. Y’all should really go check him out!!
Back to Day 2
What's Next for the Modern Data Stack?
Founder's Panel
CEO & Founder at dbt Labs
CEO at Fivetran
CEO Founder
Lior Gavish, Co-founder at Monte Carlo, kicked us off with a panel to talk about What's Next for the Modern Data Stack. Unfortunately, I only caught the very end because my computer decided to reboot itself, my earbuds needed to charge, and my mouse died (no animals were harmed). After I got my life back in order, I caught some discussion around data contracts, and it seems that they have evolved since I was dealing with contracts back in my day (not really that long ago). They are more complicated now, as they have to deal with more than just establishing accountability and ownership for the data. I would be interested in learning more here.
Data Reliability in the Lakehouse:
Fireside Chat
Barr Moses, CEO and Co-Founder, Monte Carlo
Ali Ghodsi, CEO of Databricks
Can you take us through the Databricks origin story?
Back at UC Berkeley, Ali and his team started getting more and more funding from Silicon Valley, and at that time they got to see what other tech companies were working on. They noticed all the cool things those companies were doing: leveraging data, using data science and machine learning, predicting things like couples breaking up (2009 Facebook). He wanted to bring that type of tech to the world and give it to everyone. Through a lot of trial and error, they decided to 'just do it themselves' and started their own company.
One of my favorite ‘failures’ Ali shared was that they tried sending students into orgs as intern trojan horses to try to get the org to adopt the technology. It was worth a try!
How do you think about your customers, when thinking about building a community?
Being open source gives you the opportunity to have a lot of people using / downloading your software. This is an advantage, as it allows you to start building a community and get recognized. However, it is still difficult to get adoption at the same level as B2C companies. So the next question was how to monetize. They saw that there was a gap between users downloading / using their software and monitoring / configuring that software to best fit the org. This is where they found their secret sauce: create the open source software, get mass usage, then offer the SaaS version and charge for the maintenance. EASY!
What was it like to move into the role of CEO?
Going from a tech expert to managing a bunch of functions you know little about was quite different and there were a lot of questions that needed to be explored.
What is a good CEO?
How do you know what good looks like?
How do you give advice on something you know very little about?
How do you get your team to respect you and get along with each other while executing towards a common goal?
How to crack the code:
Learn really really fast
Get out of your comfort zone
Talk to people who have done it and are great at it
Meet as many expert CEOs as you can and pick their brains; you will start seeing patterns
At the end of the day, each company has a strategy, and you need to make things fit into that strategy
Lakehouse: How did we get here? Why now?
What has changed the most is the amount of data. Data warehouses cannot be the answer to everything because not everything can be stored in a data warehouse anymore. This was the start of the switch to data lakes. However, depending on your use case you need more than just a data lake. BI needs a warehouse structure but data science, ML, and real time use cases can live outside of the data warehouse.
These two solutions started creating problems because in some instances you now have two copies of your data: one in the lake and one in the data warehouse. This is a problem because the policies are a little different (files vs. tables) and you need to manage changes in two locations. Databricks wanted to simplify things and did so by making data useful where it was, in the lake, by basically building a house on that lake.
What are the trends and adoption of AI and ML?
Most organizations were really excited about the AI buzzword back when Databricks started. But at that time, organizations didn't have a clue what was involved; they didn't have the readiness or understand the basics.
So then everyone started collecting data, building strong data teams, and understanding the data stack. This led to the data community becoming more and more important for the organizations.
What is Unity Catalog and how does it transform data governance?
Unity Catalog is about specifying who has access to the data. It is a unified governance solution for all data and AI assets, including files, tables, machine learning models, and dashboards in your lakehouse, on any cloud.
As an example, it is helpful if you are a regulated company since you need to focus on data quality and privacy. Your data needs to be appropriately locked down, encrypted, and provisioned on an as needed basis.
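To make 'provisioned on an as-needed basis' a little more concrete, here is a minimal sketch of what granting and revoking access looks like in Unity Catalog from a Databricks notebook; the catalog/schema/table (main.sales.orders) and the analysts group are hypothetical names of my own:

```python
# Minimal sketch of Unity Catalog access control (hypothetical names).
# Privileges are managed with SQL GRANT/REVOKE statements, which you can
# issue from a notebook via spark.sql().
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Give the analysts group read-only access to one table...
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# ...and take it away again when it is no longer needed.
spark.sql("REVOKE SELECT ON TABLE main.sales.orders FROM `analysts`")
```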
What’s next on the roadmap, what are you looking forward to?
Biggest goal is to simplify the lakehouse data stack. Simpler autoML, data processing, and ability to share the dashboards and integrate into other tools.
What is your favorite book?
What Got You Here Won't Get You There: How Successful People Become Even More Successful by Marshall Goldsmith
This is a book about successful leaders, the bad habits they have built up, and how to change those habits.
Making Data Teams Greater Than The Sum Of Their Parts
SVP, Head of Data and Insights at New York Times
High performing teams:
Achieve their goals
Stick around
To do this they:
Communicate clearly
Trust and support each other
Have the right set of skills within the team
Have complementary skills (as individuals) and make use of them (as a team)
Team Values :
Independence
Integrity
Curiosity
Respect
Collaboration
Excellence
How to hire for values:
Ask how they solved a particular problem
Listen to their response: are they bluffing?
Were they interested in the solution, or did they just want to get the job done?
Did they include the right people in the solution and communication?
Building a Data Culture at a High-growth Startup
Director of Data & Analytics at Cribl
Building a data culture is an ongoing process and you can look at it from the perspective of a data philosophy.
Data Philosophy
Data is a Product
Customers First
If you create a report and no one uses it, does it provide value?
Philosophy in Action
Communication
Over communicate; this gives visibility and allows stakeholders to see what's on track.
Public channel
Announcement
Documentation
Absolutely necessary. Try to dedicate 30 min/week; it makes answering questions easier for both parties. Include a last-updated timestamp, use standardized terminology, and have stakeholders sign off.
Guides
Definition docs
Transparency
Every employee has access to data. Use a ticket system for requests and prioritization so requesters know the SLA, and ask stakeholders to select priorities if there are too many requests.
Data for all
Prioritization
Q&A Time!
How do you convince people to adopt your data culture?
Generally data folks tend to be less biased, but it can be a little harder if people are comfortable using a particular tool. When you have stakeholders who want to use different technologies, it is important to investigate whether that tool fits and where it fits. Leadership values are very important in adopting a data culture; support means everything.
Building trust around data in a fast growing scale up business
Global Data Governance Lead at Contentsquare
Data Governance Strategist at Contentsquare
Upside: skyrocketing growth, and with it the need for near-real-time business performance monitoring
Challenge: teams, processes, and data evolving and growing extremely fast
Problem: manual data checks, time-consuming and low-efficiency internal reporting, dashboard downtimes
How are we cracking and solving the problem?
Our purpose is to make data accessible, understandable, and reliable
Little spoiler, the solution is…
… to treat every data team output as a data product.
Execution of Data Governance and Data Quality
Data prioritization
Data catalog and data lineage
Data protection
Data quality
Data policies and processes
To get your data in a good place with data quality, you can identify bad data by conducting data quality checks. This is done by implementing data quality rules, which involve monitoring, alerting, and a process to fix the data.
This can come with a few challenges, including alert fatigue and a lack of ownership and user engagement.
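To ground this a bit, here is a minimal sketch of one such rule in pandas; the table, column, and threshold are made up, and in practice an observability tool would manage this for you:

```python
import pandas as pd

# Hypothetical data quality rule: alert when the null rate of a key column
# crosses a threshold.
NULL_RATE_THRESHOLD = 0.05  # tolerate up to 5% nulls

def null_rate(df: pd.DataFrame, column: str) -> float:
    """Fraction of null values in `column`."""
    return df[column].isna().mean()

def run_quality_check(df: pd.DataFrame, column: str) -> None:
    rate = null_rate(df, column)
    if rate > NULL_RATE_THRESHOLD:
        # In a real pipeline this would page someone or post to Slack.
        # Tuning this threshold is where alert fatigue is won or lost.
        print(f"ALERT: {column} null rate {rate:.1%} exceeds {NULL_RATE_THRESHOLD:.0%}")
    else:
        print(f"OK: {column} null rate {rate:.1%}")

# Toy run: 2 of 5 customer IDs are missing, so this fires an alert.
orders = pd.DataFrame({"customer_id": [1, 2, None, 4, None]})
run_quality_check(orders, "customer_id")
```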
Key Takeaways:
Clear ownership in Data Governance can be a game changer for data quality
Make the data collection side accountable too (operations, information systems); not only technical/data teams are responsible for data issues
Break down your big issues (a high volume of alerts = zero engagement from the business side)
Balance value creation and foundations to yield performance reporting scalability
9 Predictions for Data in 2023
Tomasz Tunguz, Venture Capitalist
Mei Tao, Product, Monte Carlo
Cloud Manages 70% of Data Workloads by 2024
Data Workloads Segment by Use
Metrics Layers Unify Data Architectures
LLMs Change the Role of the Data Engineer
WASM Becomes Essential in Data
Notebooks Win 20% of Excel Users with Data Apps
Cloud-Prem Becomes a Norm
Data Observability
Decade of Data Continues
Q&A Time!
What do you think the impact of synthetic data will be? where will the value be - data amplification or privacy?
Synthetic data is artificially created data.
It provides value in privacy and computer vision use cases.
Privacy use cases: since capturing and using PII comes with a lot of restrictions, it is often easiest to create anonymized data to use for the models, like fake credit card and social security numbers (see the sketch below).
Computer vision use cases: when trying to train self-driving cars during bad weather conditions, there are not a lot of hours of data for extreme situations.
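As a small illustration of the privacy case (my example, not the speakers'), the open source Faker library generates PII that is realistic in shape but tied to no real person:

```python
from faker import Faker  # pip install faker

fake = Faker()

# Stand-in records for model development, no real PII involved.
synthetic_customers = [
    {
        "name": fake.name(),
        "ssn": fake.ssn(),                         # fake social security number
        "credit_card": fake.credit_card_number(),  # fake credit card number
    }
    for _ in range(3)
]

for customer in synthetic_customers:
    print(customer)
```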
Which one of your predictions are you most confident of? Which one will have the largest impact?
Tomasz has put together these predictions for about 5 years and averaged a success rate of about half!
Tomasz believes the cloud prediction is the most straightforward
Mei likes #8
The Role of Decision Bias at Data-Driven Companies
Daniel Kahneman, Nobel Prize winning economist and author of Thinking, Fast and Slow and Noise: A Flaw in Human Judgment
Barr Moses, CEO and Co-Founder, Monte Carlo
Daniel got the chat really excited, many were star struck.
Barr asks Daniel all the questions we are dying to know!
Intuition is flawed and needs to be controlled
How do you define human intuition?
Daniel says that the standard definition is kind of biased. It is knowing something without knowing how or why you know it, and it is biased because for something to be 'known' it has to be true.
He says intuition is better defined as believing you know something without knowing why you believe it. He also mentions that many of our intuitions are on the mark, while some are poor.
What does poor intuition look like?
He shares an example from his book:
Imagine a young woman graduating from university.
Fact: she was reading fluently when she was 4 years old.
Question: What is her GPA?
What is striking is that everyone has a number in mind, and a pretty good idea: not 4, but probably better than 3.8. How did that number get generated? It was an intuitive number; it just came to us.
Some may think they have a good idea: when you hear about a young person reading fluently at age 4, you have an idea of how clever she was, you can put a percentile number on that (probably mid-to-high 90s), and GPAs can be mapped to the same percentiles. However, that is DEAD WRONG.
You cannot have a best guess at all about her GPA, because there is very little information and lots of things could have happened to her since she was 4 years old.
The proper way to guess is to figure out the average GPA at the university and base it on that.
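For what it's worth, Thinking, Fast and Slow offers a recipe for taming intuitive predictions like this one: start from the average and move toward your intuitive guess only in proportion to how predictive the evidence actually is, roughly

$$\text{estimate} = \overline{\text{GPA}} + \rho \cdot \left(\text{intuitive guess} - \overline{\text{GPA}}\right)$$

where ρ is the correlation between early reading and GPA. Since that correlation is tiny, the estimate collapses to the average, exactly as Daniel says.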
What is the role of intuition when it comes to AI and ML?
The interesting thing with ML is that the output of ML is much like intuition. It is a black box and you don’t know where the prediction came from. You are not sure how all the data was used to get to the prediction. This looks very much like intuitive predictions.
Something else interesting: AI predictions are much more accurate than people's intuitions. They just use the data better than humans are capable of.
What can ONLY humans do?
AI is a baby; it is only about 10 years old, and this is just the beginning. You can't see the limits, so Daniel said that he can't imagine what humans can do that is beyond AI. What he does know is that what we have in our heads is a computer; therefore, other computers can simulate what we can do. However, it is going to take a long time.
He goes on to say that AI will not be like humans, it will be better than humans. In the foreseeable future, humans are going to be in charge, and try to control that beast. But in a few decades, the capabilities of AI will increase greatly and a lot of people are worried about that.
What is the difference between system 1 and system 2 thinking?
There are two ways ideas come to mind:
Question: 2 + 2 > the answer comes to mind without a second thought
Question: 17 × 24 > nothing immediately comes to mind, but you can generate an answer if you think hard and long enough
Thoughts that happen to you (system 1): intuitive thoughts
Thoughts you deliberately generate with control and reasoning (system 2): like preparing your tax return or reading a map
Daniel shares an interesting problem with the interaction between system 1 and 2 thinking. He says that a lot of the time we tend to think we are all system 2, that we are a conscious and controlled person aware of everything that we do. But a lot of times we are just accepting our system 1 thinking.
Puzzle Example:
A bat and a ball cost $1.10 in total.
The bat costs $1.00 more than the ball.
How much does the ball cost?
People often fail: system 1 generates an answer and system 2 just accepts it. Only some people slow down, get skeptical, and confirm whether the answer is correct. This predicts a lot about someone and their style of thinking.
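For the record, the system 2 arithmetic that the intuitive '10 cents' answer skips: with the ball costing x,

$$x + (x + 1.00) = 1.10 \;\Rightarrow\; 2x = 0.10 \;\Rightarrow\; x = 0.05$$

so the ball costs 5 cents; at 10 cents the bat would be $1.10 and the total $1.20.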
Can you train yourself to think a certain way, with a certain system?
It is difficult to improve the way that you think. You can recognize situations where your intuitive thinking is wrong. If you recognize this, you should slow down in those situations.
Be suspicious of your own intuition
The Future of Data Observability
Head of Products at Monte Carlo
Head of Product Marketing at Monte Carlo
Biswaroop and Jon closed out the conference with a big announcement on the future of data observability, introducing the Data Reliability Dashboard. This tracks data platform user metrics, number of data quality monitors, table uptime, and recent incidents, providing key indicators of data health over time.
The Data Reliability Dashboard will focus on three main areas that will help leaders better understand the data quality efforts that are happening in their organization:
Stack Coverage
Data Quality Metrics
Incident Metrics and Usage
The team at Monte Carlo also announced two other new capabilities that we're excited about:
Visual Incident Resolution
Integration with Power BI
Another great conference with great people, great learnings, and great music!!