Podcast 🎧: Speaking about data with Mikkel Settnes


Dreamdata’s Lead Data Scientist Mikkel Settnes sat down with Data Science at Home host Francesco Gadaleta to discuss all things B2B Attribution.

In the episode they cover:

  • Why data professionals spend too much time on data wrangling and how Dreamdata helps.

  • The characteristics of B2B data (and how it differs from B2C).

  • How the models employed by Dreamdata help ask better ‘why’ questions.

  • Why Bayesian models are especially useful in this context.

  • Data privacy and taking control of your own data collection.

    and much more…

 
 
 




Transcript

Francesco:

Welcome back to another episode of Data Science at Home Podcast. I'm Francesco, podcasting from the regular office of Amethix Technologies based in Leuven, Belgium. Today, I'm not alone. I am with Mikkel Settnes, lead data scientist at Dreamdata.io. Hi, Mikkel. How you doing?

Mikkel Settnes:

Hi. I'm doing fine. Thank you. Happy to be here.

Francesco:

Yeah, indeed. It's a pleasure. I was saying, I saw your company online and the name, first of all, spot on, Dreamdata.

Mikkel Settnes:

Right. Exactly right. That's what you want.

Francesco:

Really cool. Also, the tagline was something that really attracted me, that goes like, "Connect, analyze, scale, repeat." That's really, really an interesting one. So, what is Dreamdata? What do you guys do?

Mikkel Settnes:

In a nutshell, we are a revenue analytics and attribution platform mainly for B2B companies. That means that we are collecting, transforming, cleaning, and modeling all the data that is relevant for doing revenue analytics for a B2B company. So we collect data from across the entire tech stack, that being ad platforms, marketing automation, product usage, then we organize it and make it ready for analytics and modeling. I normally see us as a first wave of broader, domain-specific tools, because you have all these general-purpose tools coming out, each solving one specific data niche. The problem with those tools is that they can't really solve the data preparation type of problem, because that is usually very domain-specific by nature for different industries and especially for different business models.

Mikkel Settnes:

So the fact that Dreamdata focuses on B2B allows us to do most of this data manipulation, data wrangling and things like that for our customers, so they get it out of the box and don't have to set up a big tech stack for doing this. I mean, we've all heard this saying about 80% of a data professional's time being spent on data wrangling. I think we've all heard that.

Francesco:

Exactly. I wanted to interrupt you, saying that's exactly one of the most time-consuming tasks of the data analytics pipeline, indeed, the wrangling, the cleaning. Well, first of all, choosing the right data for the right thing, right?

Mikkel Settnes:

Exactly. And getting it out of the system where it was, and then how does it match together with the rest of the data that you have, because otherwise it's hard to analyze things in one go if they don't match. So, what I usually say is that we're able to leapfrog our customers from a state where data is scattered all over different business tools into a situation where data is organized, cleaned, and standardized, and then ready for where the fun begins, you could say.

Francesco:

Indeed, I agree with that. I strongly believe actually that you guys do a lot of important and very interesting machine learning models, and we'll speak about that in a few minutes, in a few moments. But before getting there, what are the type of customers? What type of customers do you guys usually deal with?

Mikkel Settnes:

Yeah. So I already mentioned the B2B focus, as that data is quite unique compared to the B2C case. So, most of our customers are SaaS companies. They are between, I don't know, 50 and a couple of hundred employees. I don't think we necessarily have a specific industry for these SaaS companies, but a very common denominator between them is that they are companies that are either growing fast or looking to grow very fast. So they want this opportunity to leapfrog ahead and say, "Okay. It's fine that you need to set up. We acknowledge that we need to collect and clean and transform and orchestrate all these touchpoints we have for our customers, but we don't want to hire a team of 10 people to do that and set that up. We want to get right to where we can reap some value, because we need to grow at a certain rate."

Francesco:

Yeah, that makes sense. I mean, these companies, I believe they want to focus on their business, which clearly is not wrangling data, is other things.

Mikkel Settnes:

Exactly.

Francesco:

Yeah. Okay. So they want to automate as much as they can the most tedious parts and components of a data pipeline, essentially.

Mikkel Settnes:

Exactly. And then, of course, it varies from... Some of our customers are interested in our... We, of course, have some out-of-the-box analytics and models to look at, but we also offer the data free of charge so people can go in and continue the work with that data. So say that you want a very specific machine learning model built on that data, then you can just take that data into your data warehouse and continue your modeling process without worrying about all the orchestration of getting the data collected and cleaned into a state that is ready for the model.

Francesco:

Pretty cool and definitely super useful, especially for those who are not manipulating data as a daily job. Okay. Mikkel, you mentioned already B2B a number of times. The question is really natural for me. We hear a lot of B2C, B2B and the data really is quite diverse, quite different between the two. Now, of course, there are a million differences between a B2B and a B2C type of business. But my question to you is when it comes to data, what's probably the difference that is really worth mentioning today?

Mikkel Settnes:

Yeah. So, great question. I guess it all boils down to the different nature of the business or how you sell in the different cases. So B2B is perhaps a little bit more digital than B2C. Most B2B companies would not advertise on normal television, for example. Also, the sales cycles are much longer compared to normal consumer goods, and you also have more people involved. It's not a single person making the purchase in the end. And then, of course, you probably could also say that you have less impulsive shopping in the B2B world. It's harder to make people see an ad and then, five minutes after, buy the product. That's probably easier if you are selling a shoe or some other consumer good.

Mikkel Settnes:

That, of course, makes the data different, because even our largest customers, and we have companies that are unicorn size, that have many customers, are not even close to getting the same amount of closed deals as Amazon would have when they're selling some kind of shoe to people, or people choosing a movie on Netflix or ordering an Uber. It's not even a contest. It's way fewer. It's not just a matter of, "Oh, they need to be a bigger company." They will never reach that amount of closed deals. So in the data where I spend most of my time, we are compensating for this lack of very many deals or many sales. You're compensating for that by having good data quality. And then you're trying to achieve a very horizontal data set compared to the very vertical data set that you have in the B2C world.

Francesco:

Right. That was exactly where I was going. The volume of data sometimes, actually, many times, determines the methodologies and the type of analytics that you can do, of course, and yet there is a dilemma or, well, there is this major difference between, for example, data-centric AI and model-centric AI, that is pretty much determined by how much data you have, like the volume, the actual volume of the data that you have at your disposal. So, is that reflected in your business as well?

Mikkel Settnes:

It definitely is. Basically, when you do these things and you have a million users, then you get what you would call vast data. Whereas with many of our customers, you feel like you have a lot of data because you have a lot of different information on each deal, each thing you sell. You have a lot of different stakeholders that were doing something. They have a lot of interactions with different tools, but it's still only one deal. So we have very wide data, with many different touchpoints along the way that you have collected carefully, but there's still only one deal. That leads me to a very big point: you need methods that are suitable for these wider data sets, compared to most examples of AI and data science, which come from the big tech companies. Those were pioneered on very vertical data, where you have a lot of different objects, but maybe fewer interactions with each object.

Francesco:

Indeed. I mean, the Googles of the world and the Facebooks of the world are just unique entities, in fact. And so, they are probably the only ones that in these times can, for example, retrain a massive deep learning model on terabytes of data if they want, because they have it, they own it, but there are a million other companies who don't have this data at their disposal. And so, as you already partially mentioned, they have to think of something different. They have to do some kind of analytics in a, let's say, different way. Definitely, they cannot go with a massive deep learning training strategy because they would not have the most important ingredient, which is data. And so, what are the typical challenges at Dreamdata when you have in front of you clients or users or, well, companies with limited volumes of data? What happens in those cases?

Mikkel Settnes:

Yeah. Usually, that's where I say that, one, you need to have a very strong foundation on how to collect that data. You need to be very strict that you're doing it right, because each data point is then a larger portion of your entire dataset. So you need to be a little bit more careful in that sense, but it's also about the methods that you're choosing and the questions that you're able to ask. So with the different models, maybe sometimes the math could be perceived as simpler compared to a huge neural network, but they're also models that give another opportunity, mainly to ask better "why" questions, because the math might be simpler, but the features then become much more important. Maybe sometimes you actually need to be a little bit more wary about what data you put in. You can't expect the model to figure it out by itself, to sort noise from signal. You actually need to guide it a little bit more, meaning that you are relying on well-engineered features much more than you would in a huge neural network.

Francesco:

I'm so glad that you actually said this because when deep learning came out, there was this misconception, indeed, that deep learning would solve all the problems, that it would indeed create or find these features, with the feature engineering task basically completely automated. Anyway, there were people believing in that. Now, that's true only for certain types of problems.

Mikkel Settnes:

Exactly.

Francesco:

Computer vision is probably one of them. But there are many, many other problems, and I believe that at Dreamdata you guys deal more with the other type of problems, where, indeed, automating feature engineering is extremely difficult, if not impossible. And so, you need to be creative. You still need manual intervention. You need human brains in action rather than deep learning, right?

Mikkel Settnes:

Yeah, exactly. It's also the kind of questions in the realm where I'm from, which is attribution. It's a lot about the why. It's almost as important to get the model to explain itself as it is to hit a very accurate prediction. Because as we talked about in the beginning, our catchphrase of figuring out what works and then scaling that, that is exactly the part where it might be that I'm not hitting the best prediction, but I'm hitting what actually works and what drives the sales I'm doing, or the marketing campaign, or product usage, et cetera, et cetera. That's a little bit of a different ball game that's actually maybe better solved without these huge neural networks that are, for many purposes, black boxes.

Francesco:

Exactly. Exactly. We have seen this many times also in healthcare, for example, where it's better to understand what are the drivers of a certain disease or a biological compound, et cetera, et cetera, rather than having 99.999% accuracy and zero explanation of what's going on down there.

Mikkel Settnes:

Exactly. If it's a decision tree, your feature importance almost becomes the output because that's what you're really interested in. I always have this funny example, where people say, "Okay. I want to build a model on how to get more revenue in my store that sells something." That's an important question. "I want to build a model that can explain the amount of customers that I have now." And then the problem is I might get a very good model, but if my model uses the number of people that are heading to the register with their credit card in their hand, if that's a feature of my model, it's not really going to tell me anything other than this: to get more revenue, you need to get more people to go towards the register with a credit card in their hand, which is next to useless information.

Mikkel Settnes:

In that sense, it's much more important to get the features that you're actually able to impact, the features that will drive your decision, and figure out which of those are actually important. In that sense, it's a little bit weird that healthcare is a good example of a similar thing compared to sales and marketing, but it is the same drivers. Here, it's just not health, it's cold cash. You don't want to throw your money and your energy after something that you don't understand and don't know if it works. You want to throw them after something where you can understand why the model is telling you to do A or B.

Francesco:

It makes perfect sense. So, let's try to switch gear a bit here because I'm getting excited. So, what are the typical methods or the methods that you at Dreamdata are mostly busy with when it comes to low volumes of data, explainability and, of course, human-engineered features?

Mikkel Settnes:

Yeah. So of course, you'll start out, as with anything, with simple models, models that you could say are a little bit bookkeeping-ish, because you're tracking what went on. You're assigning weights to it. This is a very important first step, also to make sure that your data makes sense, that there's not something hugely important that you're missing. These are the models that you'll typically hear referred to as multi-touch attribution models. Then, you can go a step further and be a little bit more clever about how you are counting stuff. Then, you end up at Shapley value models. If people haven't heard it already, they should go back to your old episode and hear you talk about Shapley values, highly recommend that.
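As a rough illustration of the "bookkeeping" style of model described here, the following is a minimal sketch of two common multi-touch attribution rules: linear (equal credit to every touchpoint) and first-touch. The journey data, channel names, and function names are all hypothetical and are not taken from Dreamdata's product.

```python
from collections import defaultdict


def linear_attribution(journeys):
    """Split each deal's revenue equally across the touchpoints that preceded it."""
    credit = defaultdict(float)
    for touches, revenue in journeys:
        if not touches:
            continue  # no recorded touchpoints, so there is nothing to credit
        share = revenue / len(touches)
        for channel in touches:
            credit[channel] += share
    return dict(credit)


def first_touch_attribution(journeys):
    """Give the full revenue of each deal to its first recorded touchpoint."""
    credit = defaultdict(float)
    for touches, revenue in journeys:
        if touches:
            credit[touches[0]] += revenue
    return dict(credit)


# Hypothetical closed deals: (ordered list of touchpoints, deal revenue)
journeys = [
    (["paid_search", "webinar", "sales_call"], 9000.0),
    (["organic", "sales_call"], 4000.0),
]

linear = linear_attribution(journeys)
first = first_touch_attribution(journeys)
print(linear)
print(first)
```

Both rules hand out exactly the revenue that was closed; they differ only in how the weights are assigned, which is why such models are easy to sanity-check against the raw data.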

Francesco:

Thanks for the advertising.

Mikkel Settnes:

Yeah, exactly. So, that's a more advanced form of counting, but it all boils down... So, here you're really seeing: are the features you have, are the things you're counting, the right ones? That will really show with these models, because they are very easy to interpret. Then, you go a step further, saying, "Okay. Now, you're not counting just the deals that were closed. You are also counting the ones that are not closed." So now, you've got one step closer to... What do you call this? A more classification-type problem.
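To make the "more advanced form of counting" concrete, here is a minimal sketch of exact Shapley values for two channels. It averages each channel's marginal contribution over every possible ordering. The table of deals per channel combination is invented for illustration, and the brute-force enumeration is only practical for a handful of channels.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = frozenset()
        for p in order:
            with_p = seen | {p}
            # Credit p with what it adds on top of the channels already present.
            totals[p] += value(with_p) - value(seen)
            seen = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical number of closed deals observed for each combination of channels.
deals_by_coalition = {
    frozenset(): 0,
    frozenset({"ads"}): 10,
    frozenset({"webinar"}): 20,
    frozenset({"ads", "webinar"}): 50,
}

credit = shapley_values(
    ["ads", "webinar"],
    lambda coalition: deals_by_coalition[frozenset(coalition)],
)
print(credit)
```

The credits always sum to the value of the full set of channels, so the result stays interpretable in the same way as the simpler counting models.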

Mikkel Settnes:

This is where you're usually using things like Markov models or survival models, survival analysis models, which are, again, something that is usually taken from biology, I think, which, again, has this distinction that you're actually able to explain something in a simple way, because you've already sort of encoded the way things work. You've already encoded which features are affecting each other. And then, of course, you end up in models that are even more advanced, which will be various forms of classification models, multivariate regression models, where I usually like to emphasize that the right way is to do Bayesian models here, and there is a very specific reason why Bayesian models are exceptionally good at these things. It comes with the nature of Bayesian models.

Francesco:

So Bayesian model, let's try to give a brief intro. I mean, of course, it's going to be impossible to introduce Bayesian statistics in one episode.

Mikkel Settnes:

Yeah. Then, I'll return in another episode talking about that.

Francesco:

Maybe another two.

Mikkel Settnes:

Yeah, exactly. Generally, Bayesian models or Bayesian statistics, they're models derived from Bayes' theorem, which is this very old piece of math that suddenly became exciting again with the advent of more computational power. What you'll then see is that it's very similar to normal statistics, but you have this concept of a prior. This prior is basically like an extra set of data. So now, you're not just asking the normal machine learning question, "Given my data, how accurate is this model?" You're saying, "Given my data and my prior knowledge, how accurate is this?"
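One way to see the prior acting "like an extra set of data" is the textbook Beta-Binomial update for a win rate, where the prior parameters behave like previously observed wins and losses. The sketch below is only an illustration of that idea; the prior and the deal counts are made up, and nothing here is claimed to be the model used at Dreamdata.

```python
def beta_binomial_update(prior_alpha, prior_beta, wins, losses):
    """Posterior Beta parameters: prior pseudo-counts plus observed counts."""
    return prior_alpha + wins, prior_beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior belief: roughly a 20% win rate, worth about 10 earlier deals.
prior = (2.0, 8.0)

# Hypothetical new evidence: 3 deals won out of 5.
wins, losses = 3, 2
posterior = beta_binomial_update(*prior, wins=wins, losses=losses)

raw_rate = wins / (wins + losses)       # what the data alone would say
posterior_mean = beta_mean(*posterior)  # data and prior combined
print(raw_rate, posterior_mean)
```

With only five deals, the posterior mean lands between the prior belief and the raw rate; as more deals arrive, the data increasingly outweighs the prior.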

Francesco:

Right. It could also be my belief about a certain phenomenon, right?

Mikkel Settnes:

Exactly. That's the part that makes it powerful in a sense where you have wider data, because usually there's a structure to that data that you actually want to inform the model about, and that's where it becomes important. We've all seen these big neural networks with vision, where it learns stripes and then it learns dots, and then it learns different things in different layers. The big neural nets learn it by themselves because you have a huge amount of data to do it on. When you have less data, you still want to have those kinds of learnings, but now you have to bring them yourself. Sometimes it's actually even advantageous from a business perspective because you can then encode the business knowledge, the business analyst's knowledge, into your model. So the model will actually inform you, "Okay, you thought it was here. Yeah, you're almost right. It needs to go in this direction," but the model cannot shoot off to some weird thing.

Francesco:

Yeah, exactly. So in fact, you can encode... I'm trying to rephrase here for those who are probably less technical than us. You are trying to encode the knowledge of, for example, the domain experts. So instead of firing them and saying, "Hey, we have a deep learning model that solves your problems. Goodbye," you'll say, "No. The last 20-plus years of experience, we can encode that so that the neural network, or whatever model, doesn't have to start from scratch." It can still use the data as evidence of that phenomenon, but I can embed, inject some prior knowledge, as you say, which is in fact the prior information that you already know. And so, the model will even save time in training, right?

Mikkel Settnes:

Yeah. Yeah. Yeah, exactly. Exactly. You can even include the structure of your data, which matters a lot in marketing. You operate with hierarchies, where there is a certain hierarchy that something is... Let's say you are an insurance company. You have damages on different cars. Then, car A is similar to car B, but they are different brands. So they are both cars, but different brands. So now, you want to encode that knowledge of the data into your model. So the model actually knows that there are such things as two different types of cars, and it's similar to the part where you see this in an image, where now it has learned stripes. Now, you are informing it, "There is something called stripes and they look like this." So it is exactly as you say, you encode the domain expertise into your model, which is, of course, a very domain-specific thing to do, but that's also a way to utilize this wide data to the fullest.
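The car-brand example can be sketched with a simple shrinkage estimator, where each brand's average is pulled toward the average of all cars and brands with few observations are pulled the most. This is a deliberately simplified stand-in for a full hierarchical Bayesian model; the claim amounts and the `prior_strength` parameter are hypothetical.

```python
def partial_pooling(groups, prior_strength):
    """Shrink each group's mean toward the overall mean.

    A group with n observations keeps a weight of n / (n + prior_strength)
    on its own mean, so small groups borrow more from the overall mean.
    """
    all_values = [v for values in groups.values() for v in values]
    overall = sum(all_values) / len(all_values)
    estimates = {}
    for name, values in groups.items():
        n = len(values)
        group_mean = sum(values) / n
        weight = n / (n + prior_strength)
        estimates[name] = weight * group_mean + (1 - weight) * overall
    return overall, estimates


# Hypothetical damage claims per car brand.
claims = {
    "brand_a": [1000.0, 1200.0, 1100.0, 900.0, 800.0, 1000.0],  # six claims
    "brand_b": [3000.0, 3000.0],                                 # only two claims
}

overall, estimates = partial_pooling(claims, prior_strength=2.0)
print(overall, estimates)
```

The model "knows" that both groups are cars, so the brand with only two claims is not taken at face value; its estimate sits between its own average and the average across all cars.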

Francesco:

Absolutely right. Well, there is another major difference and I would like to intervene here. When it comes to Bayesian matters, Bayesian models, I'm also a big fan. So, that's why my questions are very, very aggressive in a good way, I hope. Well, the outcome is not a number. It's a distribution-

Mikkel Settnes:

Exactly.

Francesco:

... which is much easier to digest, I find. This is my personal experience, I would love to know yours, is when I try to explain to the business, it's not like 42, but it's like there is a probability of 90-plus percent that it's going to be 42. They seem to grasp this concept much, much easier than saying it's 42 and then probably it's 44. They have to measure an error. They have to do an exercise, a mental exercise that is very far from human. How did you find this in your domain?

Mikkel Settnes:

Yeah, I think it's twofold, because some people are very fond of averages and will never want to move away from an average. That's a number. I understand that. But there's also, I guess, the responsibility of a data scientist to also say, "The world is not as simple as an average." I find that when I present it as uncertainties that naturally come out of a Bayesian model, I'm also a lot more confident in what I'm saying, because I actually know whether or not my model is that certain or if the average is actually just in the middle of... It could be one to 1,000 and now the average turned out to be 100, or is it 100 plus-minus one? If I was a business person, I would actually require that information because it's usually important.
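The difference between "100 plus-minus one" and "somewhere between one and 1,000" is easy to show once a model returns samples rather than a single number. In the sketch below, seeded normal random draws stand in for posterior samples from two hypothetical models; they are an assumption for illustration, not the output of any real Bayesian fit.

```python
import random
import statistics


def summarize(samples):
    """Return the mean and a central 90% interval of a list of samples."""
    cuts = statistics.quantiles(samples, n=20)  # cut points at 5%, 10%, ..., 95%
    return statistics.fmean(samples), (cuts[0], cuts[-1])


rng = random.Random(42)  # fixed seed so the run is repeatable

# Stand-ins for posterior draws: same centre, very different certainty.
confident = [rng.gauss(100, 1) for _ in range(10_000)]
uncertain = [rng.gauss(100, 300) for _ in range(10_000)]

mean_c, interval_c = summarize(confident)
mean_u, interval_u = summarize(uncertain)
print(f"confident model: mean={mean_c:.1f}, 90% interval={interval_c}")
print(f"uncertain model: mean={mean_u:.1f}, 90% interval={interval_u}")
```

Both models report an average near 100, but only the interval reveals which one is safe to base a budget decision on.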

Francesco:

Absolutely.

Mikkel Settnes:

I mean, if I'm going to base my spend on something, on a model, I would like to know how certain that model is. I mean, I would make it a requirement for any use of a model that I would know how certain is the model actually of this.

Francesco:

You're absolutely right. I mean, we cannot expect that these things have zero error. We know that there is error. The important thing is being aware of it and try to quantify that error as well. So that's the most important part of the prediction in my opinion.

Mikkel Settnes:

Yeah. Yeah. I really feel that, and I think that's also when I started to learn Bayesian methods. Maybe it's my background as a physicist that makes me feel that uncertainties are important, but it's really nice that it comes naturally out of the model. It is something that you'll always find, and it's always nice that it actually quantifies it. You don't have to do anything extra, because with a lot of other models, if you want to know how certain it is, you have to do all kinds of extra steps and change the data and manipulate it, do something to figure out how certain it is, whereas in these models, it comes out naturally.

Francesco:

For free.

Mikkel Settnes:

It's part of the way you train the model.

Francesco:

I love it. You love it. We all love Bayesian. I mean, that's a fact.

Mikkel Settnes:

It's the new thing.

Francesco:

That's the only thing about-

Mikkel Settnes:

The new kid on the block.

Francesco:

The only thing about which we are not uncertain. All right. Well, pun apart, let's switch gear again, because I would like to touch on another very important aspect of data, which is privacy, personal information, confidentiality. As we all know, there are a ton of regulations out there, not just GDPR, but GDPR is probably the tip of an iceberg. These regulations keep changing. Now, these regulations also put a lot of limitations on the companies that indeed organize, manipulate and process data, and definitely Dreamdata is one of them. So my question to you is, well, first of all, do you deal with personally identifiable information or PII?

Mikkel Settnes:

Of course, we do, and that's inevitable when you're tracking people. So you need to worry about it, not just because it's the law, but also because you need to be able to figure out who's who, and you need to be sure that you're doing that in the right, ethical manner, and, of course, you need to be careful about how you're collecting it and what you're collecting. As most people are aware, third-party cookies are going away. I think they are mostly gone from Apple's devices. It's also coming to the other browsers now, and that, of course, makes it harder, but it also means that you actually need to take control of your own data collection, because the trick about third-party cookies was that you didn't have to do that much as a company.

Mikkel Settnes:

I mean, then, Google would do it for you. They would follow everybody around on the web, tracking where they were. Now, Google is not able to do that anymore, but you're still able to track the people in your own shop. I mean, that's still something that's legal, as long as you, of course, ask people, "Is it okay that we-

Francesco:

Right. With consent.

Mikkel Settnes:

... do this?" Yeah. Of course, you have a lot of these things that people... You'll probably see more of all these ways to get you to sign up to things, because that is a way that now you are consenting so that you are allowed to be tracked in some sense, not necessarily allowed to be sent spam emails with marketing material, but they are allowed to have your email because you gave it yourself.

Francesco:

Well, definitely, what I'll call the far west of data is pretty much in the past now, thanks to these regulations. To be honest with you, I want to give my personal opinion also. I've done that many times on this show. I do believe that the problem of privacy and confidentiality is not a technological problem. We have the technical tools to solve that problem, but it's more a political problem. It's more a human problem. It's a regulation problem. And so, all the things that you mentioned, the no tracking, no more cookies, consent-based data analytics will change. We'll keep changing the way we analyze data. We are custodians of the data or ownership of the data and conservation and transformation, for sure. Mikkel, I would like to know your team, I mean, the dream team, as I said at the beginning of the show. Who are the people behind Dreamdata?

Mikkel Settnes:

Yeah. So right now, we're still a fairly small company, but also quite a diverse company. We are located in Copenhagen. So of course, there's an overweight of Danish people, but other than that, a multitude of nationalities, and also different backgrounds. I mean, we have people ranging from the more classical software engineering. We have software engineers with not that classical a software engineering background. I myself have a PhD in physics, in quantum physics. We have another person that is a biologist by training. So it's very different skills.

Francesco:

Definitely places where there are a lot of data by nature.

Mikkel Settnes:

Definitely. Definitely. Also, there is the mindset of how to use it. Also, I find that my own background pops up now and again, with the interest of just knowing the why, because that's the reason I got into physics to begin with, to sort of know how things work, which is translating very nicely into now understanding why my model is giving a specific output, what is actually the real-world mechanic that makes it work. But I think it's important when you do a product like Dreamdata that you have different perspectives and different backgrounds, because it complements very well with people... thinking different things are very important, which always tends to make a better product because different views are competing.

Francesco:

Absolutely. Totally agree with you. Diverse teams and heterogeneous teams are always the best. They always get there first. With respect to those with 100% engineers, 100% data scientists, without domain expertise, it's not going to work. Usually, it doesn't work.

Mikkel Settnes:

No. I also find my work as a data scientist is also way easier when I have these other roles to support me, or I to support them... It depends on how you view it, because there is also something... I mean, you cannot do everything yourself. If I was to figure out how to extract something, and also how to build an app, and also how to build routing, how to set up [inaudible 00:33:13], I mean, I would never sleep.

Francesco:

Mikkel, I think we are at the end. Now, there is one other question I'd like to ask because today you were late because you were hiring someone. Are you guys still hiring?

Mikkel Settnes:

We are still hiring. So we have different positions basically across our product team. So, definitely go check that out.

Francesco:

Cool. So we're going to report definitely the link, Dreamdata.io, and some other references of the things that we have been speaking about in the show notes of this episode, as always. We also invite you to the official Discord channel. You will find a link in the show notes of this episode, and also on the official website, datascienceathome.com. This was Mikkel from Dreamdata.io, lead data scientist. Mikkel, it was a great conversation and I really enjoyed it a lot, and I'm sure that the listeners of Data Science at Home will do, too.

Mikkel Settnes:

It was a pleasure.

Francesco:

Take care.

Speaker 3:

You've been listening to Data Science at Home Podcast. Be sure to subscribe on iTunes, Stitcher, or Podbean to get new fresh episodes. For more, please follow us on Instagram, Twitter, and Facebook, or visit our website at datascienceathome.com.
