The Enormous Exaggeration and Lack of Fact Checking of Big Data Claims

Executive Summary

  • AI has engaged in the projection of human consciousness onto software since its beginnings.
  • We cover the problems with these projections.


While both AI and Data Science precede Big Data as concepts, Big Data was the first of the three to go through a recent hype cycle (that is, in recent times, excluding the earlier AI hype cycles that hark back to the 1960s).

The problems that arose from the promises around Big Data are expressed in the following quotation.

“For all of the publicity celebrating Big Data, sometimes Small Data is more useful. Instead of ransacking mountains of essentially random numbers looking for something interesting, it can be more productive to collect good data that are focused on the question that a study is intended to answer.

Computer algorithms do mathematical calculations consistently and perfectly because software engineers know exactly what they want the algorithms to do and write the code that does it. Not so with data mining algorithms, where the intentions are vague and the results are unpredictable.” – The AI Delusion 

This article was published in 2020 and projects that the bubble in AI and Data Science will burst. Both are currently riding high on a massive bubble of expectations that, as this book has explained up to this point, are not based on a realistic assessment of the evidence. However, the one area that has essentially already burst in terms of its performance is Big Data.

Because there is virtually no fact-checking performed in the enterprise software market, this failure is rarely pointed out. Even though Big Data’s major tool, Hadoop, is widely known to have not met expectations, Big Data has not been widely lampooned by IT media.

One reason is that the promises of Big Data were simply transferred to new areas, namely AI and data science. Big Data projects have numerous problems; one example is found in the following quotation.

“We’re all familiar with the stories that come out of Google and Netflix about the clever insights they’ve derived from the vast drifts of bits that pile up at their doors. And I suspect a lot of us have wondered why those drifts aren’t accumulating in front of our pet projects yet, despite the fact that we’re supposedly now living in the era of Big Data and have been for several years. Now we know why not. It’s because Google and Netflix are the huge outliers in the distribution, while almost every other organization on Earth fills in the curve that lies behind them.

What this distribution also tells us is that, rosy expectations aside, this situation isn’t going to change. Why not? Because scale-free distributions that show up in social systems are generally driven by a process called preferential attachment.

In other words, rather than the data revolution equalizing out so that big data becomes ubiquitous as we were all promised, it’s instead going to pile up in the places where we already see it. That’s because it’s basically impossible for the machinery that generated the current imbalance in global information to throw itself into reverse. At the same time, the value of data is rising, so people aren’t about to start giving what they’ve collected away for free.

What this means is that all that lovely content we needed to train the clever machine learning models we’ve been building isn’t likely to show up. And those who already have enough will end up with more than they can possibly manage. The big data reality is already here, and it’s vastly uneven.

For many of us, though, scraping the internet still provides the best source of proxy training data, particularly when it comes to NLP projects. But I can’t imagine this situation will last long. As the recent causal revolution in statistics kicked off by the work of Judea Pearl gathers steam, it’ll become easier and easier to build out tools that produce synthetic data shaped to test the functional limits of any pipeline.

Big data is dead. Long live big (fake) data!” – Towards Data Science
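The preferential attachment process named in the quotation can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the model used by the quoted author: each new unit of data either goes to a brand-new organization (with a small probability) or to an existing organization chosen in proportion to the data it already holds. The function names and the parameters n_units, new_org_prob, and seed are hypothetical and chosen for illustration.

```python
import random


def simulate_data_accumulation(n_units=100_000, new_org_prob=0.05, seed=0):
    """Toy preferential-attachment model (illustrative assumption only).

    Returns a list where entry i is the number of data units held by
    organization i at the end of the run.
    """
    rng = random.Random(seed)
    holdings = [1]    # start with one organization holding one unit
    unit_owner = [0]  # one entry per unit; records which organization owns it
    for _ in range(n_units):
        if rng.random() < new_org_prob:
            # A new organization enters with a single unit of data.
            holdings.append(1)
            unit_owner.append(len(holdings) - 1)
        else:
            # Picking a random existing unit selects its owner with
            # probability proportional to that owner's current holdings.
            owner = rng.choice(unit_owner)
            holdings[owner] += 1
            unit_owner.append(owner)
    return holdings


def top_share(holdings, fraction=0.01):
    """Share of all data held by the largest `fraction` of organizations."""
    k = max(1, int(len(holdings) * fraction))
    largest = sorted(holdings, reverse=True)[:k]
    return sum(largest) / sum(holdings)


holdings = simulate_data_accumulation()
print(f"organizations: {len(holdings)}")
print(f"share of data held by the top 1%: {top_share(holdings):.1%}")
```

Under these assumptions, the organizations that arrive early and grow large keep attracting most of the new data, while most later entrants stay near a single unit. That is the uneven distribution the quotation describes.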

Big “Real” Data is by and large not realistic for most companies, but why not merely use Big Fake Data? Is that the basis on which these investments were justified before the sale was made?

Probably not.

Vendors, consulting companies, and their compliant media entities built up Big Data in part on the basis of the benefits received by entities like Facebook, which accumulate Big Data as part of the normal functioning of their websites. This same Big Data is not available to the vast majority of companies that spent on Big Data projects.

Because of the rise in the hype cycles of data science and AI, Big Data has been shielded from widespread acknowledgment of its failure to deliver on its promises. And many of the claims around Big Data have been transposed onto data science.

The Fatal Flaw with Big Data?

The flaw is expressed in the following quotation.

“Real data miners who unleash their data mining algorithms on big data commonly have billions or trillions of observations, and their algorithms not only look for patterns within each data set and their intercorrelations among different data sets, but even more complex relationships. They will inevitably uncover remarkable patterns but, just like this stock market example, the software cannot distinguish between causal and coincidental.

Here is another example of the perils of data mining. Data miners routinely sift through data that are tangentially related to what they are trying to predict, even if there is no compelling reason why the data should be of any real value. Suppose for example, that I want to predict tomorrow’s temperature. Real weather forecasters use complicated computer models that divide the atmosphere up into the cubes, using satellite data to estimate the temperature, humidity, wind speed, and the like for each cube. Using physics, fluid dynamics, and other scientific principles, the computer models predict how the weather will evolve as the cubes interact with each other. That sounds like work. I don’t have the resources, I don’t understand the science. Instead, I used data mining software to make weather forecasts based on knowledge discovery. Specifically, I tried to predict tomorrow’s temperature in City A based on yesterday’s temperature in City B. I could use yesterday’s temperature in City A, but that wouldn’t be knowledge discovery, would it?

A mindless data mining program (and all data mining programs are mindless) might conclude that this is a knowledge discovery of a useful tool for forecasting the weather in Curtin. A mindful human would think that it is preposterous to think that the best way to forecast the low temperature tomorrow in a town in Australia is to look at the high temperature today in a town in Washington.

With modern computers, it would be easy for me to look at thousands or millions of random variables until I stumble upon one with an astonishingly close correlation with the temperature in Curtin, or any other city for that matter. And what, exactly, would I prove? Nothing at all — and that is the first thing to remember about data mining. If we scrutinize lots of data, we will find statistical patterns no matter whether something real is going on or not. The second thing to remember is that, even though it is called artificial intelligence, data mining software is not intelligent enough to tell the difference between patterns that reflect real relationships and patterns that are coincidental. Only humans can do that.” – The AI Delusion
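The point about stumbling on a close correlation by searching enough random variables can be checked with a short experiment. The following is a minimal sketch, not the procedure used in The AI Delusion: the "temperature" series and every candidate predictor are independent random numbers, so any correlation found is coincidental by construction. The function names and the parameters n_days, n_candidates, and seed are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_match(n_days=10, n_candidates=5000, seed=0):
    """Search random, unrelated series for the one that best 'predicts'
    a random target, and return that best correlation."""
    rng = random.Random(seed)
    # Stand-in for ten days of temperatures: pure noise.
    target = [rng.gauss(0, 1) for _ in range(n_days)]
    best_r = 0.0
    for _ in range(n_candidates):
        # Each candidate "predictor" is independent noise as well.
        candidate = [rng.gauss(0, 1) for _ in range(n_days)]
        r = pearson(target, candidate)
        if abs(r) > abs(best_r):
            best_r = r
    return best_r


few = abs(best_spurious_match(n_candidates=10))
many = abs(best_spurious_match(n_candidates=5000))
print(f"best |r| after trying 10 random variables:   {few:.2f}")
print(f"best |r| after trying 5000 random variables: {many:.2f}")
```

Because nothing in this setup is related to anything else, the strong correlation found after thousands of attempts reflects only the size of the search. A data mining program that reported it as a discovery would be reporting noise.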

Is the Conclusion Backward Engineered from the History? 

This is expressed in the following quotation. 

“Fallacy #2 is also called the Feynman Trap, a reference to Nobel Laureate Richard Feynman. Feynman asked his Caltech students to calculate the probability that, if he walked outside the classroom, the first car in the parking lot would have a specific license plate, say 8NSR26. The students calculated a probability by assuming each number and letter were equally likely and independently determined. The answer is less than 1 in 17 million. When the students finished their calculations, Feynman revealed that the correct probability was 1 because he had seen this license plate on his way to class. Something extremely unlikely is not unlikely at all if it has already happened.” – The AI Delusion
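The students' arithmetic can be reproduced in a few lines. This sketch rests on the same assumptions the quotation attributes to them: every digit position has 10 equally likely values, every letter position has 26, and the positions are independent. Real license plate schemes exclude some combinations, so the figure is illustrative. The function names are hypothetical.

```python
from fractions import Fraction


def plate_probability(plate):
    """Probability of one specific plate, assuming each character is
    drawn uniformly and independently (10 digits, 26 letters)."""
    probability = Fraction(1)
    for ch in plate:
        if ch.isdigit():
            probability *= Fraction(1, 10)
        elif ch.isalpha():
            probability *= Fraction(1, 26)
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return probability


def probability_given_seen(plate, seen_plate):
    """Probability that the first car has `plate` once the plate on that
    car has already been observed: 1 if it matches, otherwise 0."""
    return Fraction(1) if plate == seen_plate else Fraction(0)


before = plate_probability("8NSR26")
after = probability_given_seen("8NSR26", seen_plate="8NSR26")
print(before)  # 1/17576000
print(after)   # 1
```

The first number is what the students computed: three digits and three letters give 10 × 10 × 10 × 26 × 26 × 26 = 17,576,000 equally likely plates. The second is Feynman's answer. The same trap appears when a pattern is found in the data first and its improbability is calculated afterward.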

This issue is further illuminated in the following quotation. 

“When you collect a lot of data you are using that data to build systems that are primarily driven by statistics. Luis says that we latch onto statistics when we feed AI so much data, and that we ascribe to systems intelligence, when in reality, all we have done is created large probabilistic systems that by virtue of large data sets exhibit things we ascribe to intelligence. He says that when our systems aren’t learning as we want, the primary gut reaction is to give these AI systems more data so that we don’t have to think as much about the hard parts of generalization and intelligence.” – Forbes

Yes, generalized intelligence is a complicated problem. However, one can always argue that more data is necessary. This has been the long-term proposal of AI: that all that is required is more data and more processing power.


Proposals around Big Data have been tremendously inaccurate, and those who made these claims have not been held accountable for their inaccuracy. Rather, the expectations around Big Data have been seamlessly transferred to data science, putting off recognition of the claims’ shortcomings.