On the other hand, a security system opening a door based on voice and face recognition with 90% accuracy would certainly be rejected. For instance, Apple claims its facial recognition feature used to unlock iPhones is wrong in roughly one case out of a million (so about 99.9999% accuracy), while fingerprint recognition on smartphones reaches around 99.99%.
In the case of digital marketing, an advertisement (say, a banner inside an article) reaching the target audience with 90% relevance would be acceptable, as the remaining ten percent of viewers would probably be just a little annoyed.
So the number one piece of advice is to think about the business consequences of the remaining percentage of wrong cases.
It’s better to check as early as possible whether those wrong cases can be handled by other processes, so that the benefits outweigh the problem.
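To make that concrete, here is a minimal sketch (in Python, with purely hypothetical daily volumes) of how the remaining percentage of wrong cases translates into absolute numbers:

```python
# Rough back-of-the-envelope check of what a given accuracy means in
# absolute numbers. The daily volumes below are hypothetical examples.

def wrong_cases_per_day(accuracy: float, events_per_day: int) -> float:
    """Expected number of wrongly handled events per day."""
    return (1.0 - accuracy) * events_per_day

scenarios = {
    "ad banner targeting (90% relevance)": (0.90, 1_000_000),
    "fingerprint unlock (99.99%)":         (0.9999, 5_000_000),
    "face recognition unlock (99.9999%)":  (0.999999, 5_000_000),
}

for name, (accuracy, volume) in scenarios.items():
    print(f"{name}: ~{wrong_cases_per_day(accuracy, volume):,.0f} wrong cases/day")
```

The same error rate can be a minor annoyance or a business blocker, depending entirely on what a single wrong case costs.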

Big bang approach
We have already celebrated twenty years of Agile. How much it has diverged from the original idea, and whether it is still relevant, is a topic for another discussion.
The good thing is that agile promoted (though did not invent) the iterative approach and the idea that the project team does not know everything before starting to work on the project (an assumption typical of a pure waterfall approach).
Yet we still observe data projects run in a waterfall fashion, which arrogantly assume perfect prior knowledge of all the data sources, their availability and quality, and the expected output parameters (including the accuracy mentioned before). In data projects, such assumptions are usually even further from reality than in software development projects.
Machine learning projects in particular are almost never right the first time; even the world’s top machine learning experts admit this. There are at least several attempts, followed by tuning and optimization, before the results become acceptable. Yet too many decision makers still believe in miracles instead of embracing the reality of such efforts. Data exploration by data scientists is also a key part of the work that does not produce direct results, but it is a necessary prerequisite to understanding the meaning, structure and statistics of the data. That knowledge will save a lot of time and effort in the next phases.
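As a rough illustration of that iterative reality, here is a minimal sketch using scikit-learn on a toy dataset; the parameter grid and the 0.9 acceptance threshold are assumed examples, not a recipe:

```python
# Illustrative tuning loop: models are rarely acceptable on the first try,
# so we iterate over hyperparameters and only accept a model that clears
# a business-defined accuracy threshold (0.9 here is an assumed example).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"n_estimators": [50, 200], "max_depth": [3, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(f"best params: {search.best_params_}, test accuracy: {test_accuracy:.3f}")

ACCEPTANCE_THRESHOLD = 0.9  # agreed with the business, not a universal constant
if test_accuracy < ACCEPTANCE_THRESHOLD:
    print("Not good enough yet - back to feature engineering and another iteration.")
```

Real projects repeat this loop many times, often going back to the data itself rather than just the hyperparameters.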
And let’s not forget that there’s a new data management strategy called data mesh, which builds on experience with data warehouses and data lakes.
Checking assumptions about data, goals, and methods in smaller, separate iterations, or even in parallel proofs of concept, can help re-define the goals and expectations, making them achievable and realistic.
Data lakes make it possible to separate issues of data discovery and availability from different experimentations in data analytics and changing requirements.
Focus on data visualization
Data visualization is very important, as it is the interface between all the complex pipelines and analytical components and the people who use them, making the final results useful.
We’re all spoiled by nice graphics, infographics, interactivity and colorfulness. It’s definitely become a standard requirement for data projects.
The problem is that it is the easiest and most visible part, and too many discussions focus on it at the expense of … much more complex matters. For example, how to deal with errors in the data, or which data is really required and for whom.
→ Dashboard overload? Long live digital stories and notebooks!
Too many data projects end with nice data visualizations that … nobody trusts, and thus they are not used, except maybe for proving that the project is “completed”. Yes, all the parts are there and technically it’s working, but from the business point of view it is a failure because of the lack of trust.
The advice here is to make sure that conversations about visual things do not occur at the expense of less visible but more important issues.
Data availability
It seems somewhat obvious that organizations should know what their data is and where it resides. However, having an up-to-date and useful data catalog of the entire enterprise is not as common as one might think. There’s usually something resembling one, but it is rarely complete, up to date, or accurate enough. It might be a more or less good starting point, but not something we can take for granted.
Even so, data catalogs usually represent only a static view of the data. Data lineage on an enterprise scale is still a dream that hasn’t come true and probably never will. Many answers about the meaning of data come from the dynamic aspects of its usage and modification. Even partial data lineage is a great help in understanding data.
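To make this more tangible, here is a hypothetical, minimal sketch of what a catalog entry with even partial lineage might record; the field names and the example dataset are invented for illustration:

```python
# Hypothetical, minimal shape of a data catalog entry that records not only
# the static description of a dataset but also partial lineage - where the
# data comes from and which jobs modify it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str
    owner: str                                                # accountable team or department
    description: str
    upstream_sources: List[str] = field(default_factory=list) # partial lineage: inputs
    transformed_by: List[str] = field(default_factory=list)   # jobs that touch the data
    last_profiled: str = "unknown"                            # when statistics were last refreshed

orders = CatalogEntry(
    name="sales.orders_daily",
    owner="Sales Operations",
    description="Daily aggregated orders per region",
    upstream_sources=["crm.orders", "erp.invoices"],
    transformed_by=["etl.aggregate_orders_daily"],
)
print(orders)
```

Even this much metadata, kept up to date, answers a surprising share of the “where does this number come from?” questions.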
Data ownership is another issue that is very often overlooked. The assumption of having easy access to all the needed data may be overoptimistic even inside a single company. For instance, formal ownership (“all the data belongs to the company”) does not prevent departments from being unwilling to share their data with others, defending it as too sensitive to be shared elsewhere within the company; these protests and delays can impair or even block data discovery processes.
It’s much more difficult with external partners.
And let’s not forget about privacy and AI regulations. They are only getting stricter, and more of them are coming.
Fortunately, there are new technologies, such as federated learning, that enable machine learning in scenarios where one party cannot access all the data, thus preserving data protection and privacy where it is critical.
→ Federated learning as a new approach to machine learning
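For the curious, here is a toy sketch of the core idea behind federated averaging: each data owner trains locally and only model weights, never raw data, are shared and averaged. The tiny numpy linear model is only a stand-in for a real training framework:

```python
# Toy sketch of federated averaging (FedAvg): each party updates a model on
# its own private data; only the weights are exchanged and averaged, so raw
# data never leaves its owner.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local training on its private data (simple linear regression)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

# Three parties holding private datasets drawn from the same underlying relation.
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    parties.append((X, y))

global_w = np.zeros(2)
for _ in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in parties]
    global_w = np.mean(local_weights, axis=0)  # the server only averages weights

print("learned weights:", np.round(global_w, 2), "true weights:", true_w)
```

Production setups add secure aggregation, weighting by dataset size and much more, but the division of labor stays the same: data stays local, only model updates travel.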
And to add to this, even publicly available data sources may not be allowed for business use; and they are almost never ready to use out of the box, nor do they have perfect data quality.
Let’s imagine we have all the data readily available without any hiccups. Still, for machine learning purposes, the company might simply not have enough of it. This means lower accuracy of the models and a much worse outcome of the project.
The advice here is not to underestimate data availability issues. It’s better to check all these aspects before the design and implementation of the project, as it might save a lot of time.
There is also a rise of data generation techniques (including GANs), which can augment real data with artificial data of similar characteristics and thus help build better ML models.
→ Our take on Generative AI – creative AI of the future
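As a simple illustration of the “learn the distribution, then sample from it” idea behind such techniques, here is a minimal sketch that fits a plain multivariate Gaussian instead of a full GAN; the numbers are invented for the example:

```python
# Minimal illustration of augmenting a small real dataset with synthetic rows
# that share its statistical characteristics. A multivariate Gaussian stands
# in for a heavier generative model such as a GAN; the principle is the same:
# learn the distribution of the real data, then sample new records from it.
import numpy as np

rng = np.random.default_rng(42)

# Pretend "real" data: 200 rows, 3 correlated numeric features.
real = rng.multivariate_normal(
    mean=[10.0, 5.0, 1.0],
    cov=[[2.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 0.5]],
    size=200,
)

# Fit the distribution of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and generate synthetic rows with similar characteristics.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
augmented = np.vstack([real, synthetic])

print("real:", real.shape, "synthetic:", synthetic.shape, "augmented:", augmented.shape)
```

The quality of the downstream model then depends on how faithfully the generated data reflects the real distribution, which is exactly what GANs and similar models aim to improve.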

Data quality
Trust in the final result is the key to the true business success of a data project.
No organization’s data is perfect. So there’s always a big part of a data project dedicated to data quality assessment, cleansing, augmentation and transformation. Estimates of the average share vary, and some reach up to 80% of the overall project workload.
→ Essentially, Data is good. It’s the use cases that can be problematic.
Data quality management should not be part of a single data project, but an ongoing process that covers all data sources, constantly detecting and fixing data quality issues. Unfortunately, that’s not always the case: such a process often exists, but sometimes it resembles a facade activity more than real data quality management.
We always recommend starting as early as possible and starting from the data sources themselves.
→ Look at Data quality at the source pattern
Not all data sources can be controlled by the organization, so data cleansing and augmentation can also be very useful at later stages of the data processing pipeline. Properly implemented Master Data Management within the organization can be very helpful at this point.
→ A closer look at Data Governance
Data quality should not be measured as just a one or a zero (meeting the requirements or not). There are shades of gray, and threshold values can be established to determine when the data is good enough for the purpose of the project.
Trying to achieve perfect data quality may turn out not to be feasible at all, so the question should be: what level of data quality will still allow us to achieve our business goals?
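A minimal sketch of what such “shades of gray” measurement could look like in practice, using pandas; the columns, rules and threshold values are illustrative assumptions to be agreed per project:

```python
# Illustrative data quality check: quality is scored per dimension on a 0-1
# scale and compared with thresholds agreed for this particular project,
# instead of a binary pass/fail.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, None, 6],
    "email": ["a@x.com", "b@x.com", "not-an-email", "d@x.com", "e@x.com", None],
    "order_total": [10.5, -3.0, 25.0, 40.0, 12.0, 8.0],
})

scores = {
    # completeness: share of rows with all key columns filled in
    "completeness": df[["customer_id", "email"]].notna().all(axis=1).mean(),
    # validity: share of emails matching a (deliberately crude) pattern
    "validity": df["email"].str.contains("@", na=False).mean(),
    # plausibility: share of non-negative order totals
    "plausibility": (df["order_total"] >= 0).mean(),
}

# "Good enough for this project" thresholds - assumed examples.
thresholds = {"completeness": 0.95, "validity": 0.90, "plausibility": 0.99}

for dim, score in scores.items():
    status = "OK" if score >= thresholds[dim] else "below threshold"
    print(f"{dim}: {score:.2f} (target {thresholds[dim]:.2f}) -> {status}")
```

The point is not the specific rules but that “good enough” becomes an explicit, measurable agreement instead of a gut feeling.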
Resource shortage
According to many statistics, machine learning is the most popular subject among IT students today. So many people leave their universities with knowledge of the underlying math, methods and tools, but it will take some time for them to gain enough practical skills and experience to be really effective.
The data space is very fragmented, with hundreds of tools even for traditional ETL and reporting activities. So the people who know them well are always in great demand.
Enterprise software development, in contrast, is dominated by Java and .NET, where the competences are easier to find, unless the project involves a very niche technology.
Next steps
This article just scratches the surface of the real-life picture of data projects, with a few examples and pieces of advice that always depend on the context and goals of the project.
The positive here is the fact that we are all learning more and more about how to improve the efficiency of data projects, and thankfully many of the issues and their solutions are already known.
The key takeaway is that these types of projects require experience and care, along with skills in data tools and domain knowledge.
All this is what the Avenga data team delivers, and with over twenty years of experience we have every capability to live up to your expectations.