On the other hand, a security system opening a door based on voice and face recognition with 90% accuracy would certainly be rejected. For instance, Apple claims its facial recognition feature used to unlock iPhones is wrong in roughly one case out of a million (so about 99.9999% accuracy), while fingerprint recognition on smartphones reaches around 99.99%.
In the case of digital marketing, an advertisement (say, a banner inside an article) reaching the target audience with 90% relevance would be acceptable, as the remaining ten percent of viewers would probably be just a little annoyed.
So the number one piece of advice is to think about the business consequences of the remaining percentage of wrong cases.
It’s better to check as early as possible whether those wrong cases can be handled by other processes, so that the benefits outweigh the problem.
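To make that concrete, here is a minimal sketch (in Python, with purely hypothetical daily volumes) of how the remaining percentage of wrong cases translates into absolute numbers:

```python
# Rough back-of-the-envelope check of what a given accuracy means in
# absolute numbers. The daily volumes below are hypothetical examples.

def wrong_cases_per_day(accuracy: float, events_per_day: int) -> float:
    """Expected number of wrongly handled events per day."""
    return (1.0 - accuracy) * events_per_day

scenarios = {
    "ad banner targeting (90% relevance)": (0.90, 1_000_000),
    "fingerprint unlock (99.99%)":         (0.9999, 5_000_000),
    "face recognition unlock (99.9999%)":  (0.999999, 5_000_000),
}

for name, (accuracy, volume) in scenarios.items():
    print(f"{name}: ~{wrong_cases_per_day(accuracy, volume):,.0f} wrong cases/day")
```

The same error rate can be a minor annoyance or a business blocker, depending entirely on what a single wrong case costs.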

Big bang approach
We have already celebrated twenty years of Agile. How much it has diverged from the original idea, and whether it is still relevant, is a topic for another discussion.
The good thing is that agile promoted (though did not invent) the iterative approach and the idea that the project team does not know everything before starting to work on the project (an assumption typical of a pure waterfall approach).
Yet we still observe data projects run in a waterfall fashion, which arrogantly assume perfect prior knowledge of all the data sources, their availability and quality, and the expected output parameters (including the accuracy mentioned before). In data projects, such assumptions are usually even further from reality than in software development projects.
Machine learning projects in particular are almost never right the first time; even the world’s top machine learning experts admit this. There are at least several attempts, followed by tuning and optimization, before the results become acceptable. Yet too many decision makers still believe in miracles instead of embracing the reality of such efforts. Data exploration by data scientists is also a key part of the work that does not produce direct results, but it is a necessary prerequisite to understanding the meaning, structure and statistics of the data. That knowledge will save a lot of time and effort in the next phases.
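As a rough illustration of that iterative reality, here is a minimal sketch using scikit-learn on a toy dataset; the parameter grid and the 0.9 acceptance threshold are assumed examples, not a recipe:

```python
# Illustrative tuning loop: models are rarely acceptable on the first try,
# so we iterate over hyperparameters and only accept a model that clears
# a business-defined accuracy threshold (0.9 here is an assumed example).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

param_grid = {"n_estimators": [50, 200], "max_depth": [3, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(f"best params: {search.best_params_}, test accuracy: {test_accuracy:.3f}")

ACCEPTANCE_THRESHOLD = 0.9  # agreed with the business, not a universal constant
if test_accuracy < ACCEPTANCE_THRESHOLD:
    print("Not good enough yet - back to feature engineering and another iteration.")
```

Real projects repeat this loop many times, often going back to the data itself rather than just the hyperparameters.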
And let’s not forget that there’s a new data management strategy called data mesh, which builds on experience with data warehouses and data lakes.
Checking assumptions about data, goals, and methods in smaller, separate iterations, or even in parallel proofs of concept, can help re-define the goals and expectations, making them achievable and realistic.
Data lakes make it possible to separate issues of data discovery and availability from different experimentations in data analytics and changing requirements.
Focus on data visualization
Data visualization is very important, as it is the interface between all the complex pipelines and analytical components and the people who use them, making the final results useful.
We’re all spoiled by nice graphics, infographics, interactivity and colorfulness. It’s definitely become a standard requirement for data projects.
The problem is that it is the easiest and most visible part, and too many discussions focus on it at the expense of … much more complex matters. For example, how to deal with errors in the data, or which data is really required and for whom.
→ Dashboard overload? Long live digital stories and notebooks!
Too many data projects end with nice data visualizations that … nobody trusts, and thus they are not used, except maybe for proving that the project is “completed”. Yes, all the parts are there and technically it’s working, but from the business point of view it is a failure because of the lack of trust.
The advice here is to make sure that conversations about visual things do not occur at the expense of less visible but more important issues.
Data availability
It seems somewhat obvious that organizations should know what their data is and where it resides. However, having an up-to-date and useful data catalog of the entire enterprise is not as common as one might think. There’s usually something resembling one, but it is rarely complete, up to date, or accurate enough. It might be a more or less good starting point, but not something we can take for granted.
Even so, data catalogs usually represent only a static view of the data. Data lineage on an enterprise scale is still a dream that hasn’t come true and probably never will. Many answers about the meaning of data come from the dynamic aspects of its usage and modification. Even partial data lineage is a great help in understanding data.
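To make this more tangible, here is a hypothetical, minimal sketch of what a catalog entry with even partial lineage might record; the field names and the example dataset are invented for illustration:

```python
# Hypothetical, minimal shape of a data catalog entry that records not only
# the static description of a dataset but also partial lineage - where the
# data comes from and which jobs modify it.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    name: str
    owner: str                                                # accountable team or department
    description: str
    upstream_sources: List[str] = field(default_factory=list) # partial lineage: inputs
    transformed_by: List[str] = field(default_factory=list)   # jobs that touch the data
    last_profiled: str = "unknown"                            # when statistics were last refreshed

orders = CatalogEntry(
    name="sales.orders_daily",
    owner="Sales Operations",
    description="Daily aggregated orders per region",
    upstream_sources=["crm.orders", "erp.invoices"],
    transformed_by=["etl.aggregate_orders_daily"],
)
print(orders)
```

Even this much metadata, kept up to date, answers a surprising share of the “where does this number come from?” questions.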
Data ownership is another issue that is very often overlooked. The assumption of having easy access to all the needed data may be overoptimistic even inside a single company. For instance, formal ownership (“all the data belongs to the company”) does not prevent departments from being unwilling to share their data with others, defending it as too sensitive to be shared elsewhere within the company; these protests and delays can impair or even block data discovery processes.
It’s much more difficult with external partners.
And let’s not forget about privacy and AI regulations. They are only getting stricter, and more of them are coming.
Fortunately, there are new technologies, such as federated learning, that enable machine learning in scenarios where one party cannot access all the data, thus preserving data protection and privacy where it is critical.
→ Federated learning as a new approach to machine learning
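For the curious, here is a toy sketch of the core idea behind federated averaging: each data owner trains locally and only model weights, never raw data, are shared and averaged. The tiny numpy linear model is only a stand-in for a real training framework:

```python
# Toy sketch of federated averaging (FedAvg): each party updates a model on
# its own private data; only the weights are exchanged and averaged, so raw
# data never leaves its owner.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One party's local training on its private data (simple linear regression)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

# Three parties holding private datasets drawn from the same underlying relation.
true_w = np.array([2.0, -1.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    parties.append((X, y))

global_w = np.zeros(2)
for _ in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in parties]
    global_w = np.mean(local_weights, axis=0)  # the server only averages weights

print("learned weights:", np.round(global_w, 2), "true weights:", true_w)
```

Production setups add secure aggregation, weighting by dataset size and much more, but the division of labor stays the same: data stays local, only model updates travel.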
And to add to this, even publicly available data sources may not be allowed for business use; and they are almost never ready to use out of the box, nor do they have perfect data quality.
Let’s imagine we have all the data readily available without any hiccups. Still, for machine learning purposes, the company might simply not have enough of it. This means lower accuracy of the models and a much worse outcome of the project.
The advice here is not to underestimate data availability issues. It’s better to check all these aspects before the design and implementation of the project, as it might save a lot of time.
There is also a rise of data generation techniques (including GANs), which can augment real data with artificial data of similar characteristics and thus help build better ML models.
→ Our take on Generative AI – creative AI of the future
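As a simple illustration of the “learn the distribution, then sample from it” idea behind such techniques, here is a minimal sketch that fits a plain multivariate Gaussian instead of a full GAN; the numbers are invented for the example:

```python
# Minimal illustration of augmenting a small real dataset with synthetic rows
# that share its statistical characteristics. A multivariate Gaussian stands
# in for a heavier generative model such as a GAN; the principle is the same:
# learn the distribution of the real data, then sample new records from it.
import numpy as np

rng = np.random.default_rng(42)

# Pretend "real" data: 200 rows, 3 correlated numeric features.
real = rng.multivariate_normal(
    mean=[10.0, 5.0, 1.0],
    cov=[[2.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 0.5]],
    size=200,
)

# Fit the distribution of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and generate synthetic rows with similar characteristics.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
augmented = np.vstack([real, synthetic])

print("real:", real.shape, "synthetic:", synthetic.shape, "augmented:", augmented.shape)
```

The quality of the downstream model then depends on how faithfully the generated data reflects the real distribution, which is exactly what GANs and similar models aim to improve.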

Data quality
Trust in the final result is the key to the true business success of a data project.
No organization’s data is perfect. So there’s always a big part of a data project dedicated to data quality assessment, cleansing, augmentation and transformation. Estimates of the average share vary, and some reach up to 80% of the overall project workload.
→ Essentially, Data is good. It’s the use cases that can be problematic.
Data quality management should not be part of a single data project, but an ongoing process that covers all data sources, constantly detecting and fixing data quality issues. Unfortunately, that’s not always the case: such a process often exists, but sometimes it resembles a facade activity more than real data quality management.
We always recommend starting as early as possible and starting from the data sources themselves.
→ Look at Data quality at the source pattern
Not all data sources can be controlled by the organization, so data cleansing and augmentation can also be very useful at later stages of the data processing pipeline. Properly implemented Master Data Management within the organization can be very helpful at this point.
→ A closer look at Data Governance
Data quality should not be measured as just a one or a zero (meeting the requirements or not). There are shades of gray, and threshold values can be established to determine when the data is good enough for the purpose of the project.
Trying to achieve perfect data quality may turn out not to be feasible at all, so the question should be: what level of data quality will still allow us to achieve our business goals?
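A minimal sketch of what such “shades of gray” measurement could look like in practice, using pandas; the columns, rules and threshold values are illustrative assumptions to be agreed per project:

```python
# Illustrative data quality check: quality is scored per dimension on a 0-1
# scale and compared with thresholds agreed for this particular project,
# instead of a binary pass/fail.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, None, 6],
    "email": ["a@x.com", "b@x.com", "not-an-email", "d@x.com", "e@x.com", None],
    "order_total": [10.5, -3.0, 25.0, 40.0, 12.0, 8.0],
})

scores = {
    # completeness: share of rows with all key columns filled in
    "completeness": df[["customer_id", "email"]].notna().all(axis=1).mean(),
    # validity: share of emails matching a (deliberately crude) pattern
    "validity": df["email"].str.contains("@", na=False).mean(),
    # plausibility: share of non-negative order totals
    "plausibility": (df["order_total"] >= 0).mean(),
}

# "Good enough for this project" thresholds - assumed examples.
thresholds = {"completeness": 0.95, "validity": 0.90, "plausibility": 0.99}

for dim, score in scores.items():
    status = "OK" if score >= thresholds[dim] else "below threshold"
    print(f"{dim}: {score:.2f} (target {thresholds[dim]:.2f}) -> {status}")
```

The point is not the specific rules but that “good enough” becomes an explicit, measurable agreement instead of a gut feeling.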
Resource shortage
According to many statistics, machine learning is the most popular subject among IT students today. So many people leave their universities with knowledge of the underlying math, methods and tools, but it will take some time for them to gain enough practical skills and experience to be really effective.
The data space is very fragmented, with hundreds of tools even for traditional ETL and reporting activities. So the people who know them well are always in great demand.
Enterprise software development, in contrast, is dominated by Java and .NET, where the competences are easier to find, unless the project involves a very niche technology.
Next steps
This article just scratches the surface of the real-life picture of data projects, with a few examples and pieces of advice that always depend on the context and goals of the project.
The positive here is the fact that we are all learning more and more about how to improve the efficiency of data projects, and thankfully many of the issues and their solutions are already known.
The key takeaway is that these types of projects require experience and care, along with skills in data tools and domain knowledge.
All this is what the Avenga data team delivers, and with over twenty years of experience we have every capability to live up to your expectations.