The Problem with the Term Data Science

Executive Summary

  • Data science is a term that is frequently used without questioning if the term is accurate.
  • We explain why people should think twice when using the term data science.

Introduction

The term data science is a recent one, and most people who hear it do not question whether it accurately describes the field. This is one of the very few articles to ask where the term came from and whether it describes what the field does.

How Accurate Is the Term Data Science?

The term data science seems ok unless one ponders the term. First, let us review the definition of science.

Science is…

“a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.” – Wikipedia

However, the problem appears very quickly, because data science is not the actual study of data itself. That is, it is not the study of how data is formatted or how it acquires meaning. Instead, it is the application of methods related to statistics, extracting data, formatting and cleaning data, running algorithms against data, and related activities that help one obtain an insight.

Using Data Science for Science?

Is data “the science” or the subject being studied using data science? If data science were deployed for a science like geology or physics, would data science then be a science deployed for another science? Is the objective of data science to extend the field of data science?

The answer is no.

Data science is a way of obtaining insights into a specific domain. Here is another hint: what other area of an IT department is engaged in a scientific endeavor?

The answer is none. 

IT Departments Engage in Science?

IT departments leverage hardware and software, which is based upon computer science, but they are not themselves engaged in a scientific endeavor. As a person who has repeatedly tried to bring scientific approaches to testing to a number of IT departments, I have first-hand experience with this topic. Beyond being unscientific, IT departments typically have considerable limitations in even DOCUMENTING what they are doing (no time!) 

Interdisciplinary?

Another hint that data science is not a science is the term “interdisciplinary,” which is used in the definition of data science.

The term interdisciplinary has long been used in the academic community to sound like something quite elevated, when, in fact, it is a signal that the area is not a real thing itself. If an area is interdisciplinary, then it is merely a combination of other fields. If one studies the impact of a change in biodiversity in relation to the building of a dam, the study is interdisciplinary, as it includes at least three disciplines (biology, geology, and civil engineering) and perhaps more depending upon the scope of the study.

Interdisciplinary research happens all the time.

However, if your entire science or area is defined by being interdisciplinary, then this is an excellent indicator that the field is not itself its own distinct area. 

See the following quotation for some clues. 

“Data science is a “concept to unify statistics, data analysis, machine learning and their related methods” in order to “understand and analyze actual phenomena” with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing award winner Jim Gray imagined data science as a “fourth paradigm” of science (empirical, theoretical, computational and now data-driven) and asserted that “everything about science is changing because of the impact of information technology” and the data deluge.” – Wikipedia

Every area listed in this description existed long before data science appeared on the scene. Data science is nothing more than a meta term used as a broad category of sub-areas, which, by the way, are not actually sciences. Jim Gray makes a strange contention in calling data science a fourth paradigm, as data science is just a modern approach to evidence gathering.

Science is changing because of information technology (not all of it for the good, by the way), as computers with massive processing capabilities are also driving false conclusions, as is covered in the academic article by John Ioannidis titled Why Most Published Research Findings Are False. A quote from it is included below.

“Several methodologists have pointed out [9–11] that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors, or associations. Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies [1–3] to the most modern molecular research [4,5]. There is increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims [6–8]. However, this should not be surprising. It can be proven that most claimed research findings are false.”

This is expressed slightly differently in the following quotation. 

“Today, with terabytes of data and lightning-fast computers, it is too easy to calculate first, think later. This is a flaw not a feature. It has been better to think hard before calculating. Who thinks? Just let the software ransack the data looking for unexpected relationships. 

Do serious researchers really torture data? Far too often. It’s how well respected people came up with the now-discredited ideas that coffee causes pancreatic cancer, that people can be healed by positive energy from healers living thousands of miles away, and that hurricanes are deadlier if they have female names. 

The next day at the Googleplex, I heard chemists, biologists, astrophysicists, and other scientists express their concerns about the “replication crisis,” fanciful claims that are published in reputable journals but cannot be replicated, and how this is undermined the credibility of scientific research. One prominent social psychologist said that his field is the poster child for irreproducible research and that his default assumption is that anything published in his field is wrong, evidently because too many social psychologists do not understand the fundamental truth that discovering a pattern in ransacked data proves nothing more than that they ransacked the data looking for patterns. I also overheard a group talking about the implausibility of the claim that children are affected by seeing or handling money. 

Identifying successful companies and discovering common characteristics (which is exactly what data mining software would do) is meaningless because, after the fact, there are always successful companies and they always have common characteristics.” – The AI Delusion

Ransack the Data

As is covered in other areas of the book, ransacking large amounts of data with automated algorithms that search for relationships, oftentimes without any driving hypothesis, increases the number of false associative or correlative claims.
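
The effect can be illustrated with a small simulation. The following is a minimal sketch, not taken from the book: it assumes a hypothetical target variable and a set of candidate predictors that are all pure random noise, and it counts how many predictors clear the conventional p < 0.05 bar anyway. The function names, the sample size, and the threshold of 0.361 (the approximate critical correlation for 30 observations at a two-sided 5% level) are illustrative assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_predictors=1000, n_samples=30, r_critical=0.361, seed=0):
    """Count noise predictors whose correlation with a noise target looks 'significant'.

    Every series is independent random noise, so any hit is a false positive.
    r_critical is an assumed cutoff: roughly p < 0.05 (two-sided) for n = 30.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    hits = 0
    for _ in range(n_predictors):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        if abs(pearson_r(noise, target)) > r_critical:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_spurious()
    print(f"{hits} of 1000 pure-noise predictors look 'significant'")
```

Roughly one in twenty noise predictors should pass the filter by chance alone, so a search over enough variables will report relationships whether or not any exist.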

But regardless, the fact that something is affecting scientific research does not mean that it is necessarily a science. And I can prove it. The Internet has dramatically changed scientific research, allowing researchers to find the published research in an area quickly and easily.

However, is the Internet also a science?

No. 

The Internet is an emergent property based upon physics, electrical engineering, networking, telephony, etc., but the Internet itself is not a science.

Data Science…an Inaccurate Term

From this analysis, it is apparent that data science is not an accurate term for what happens in the field of data science. Data science can be deployed by an actual science to obtain insight, but it is not a science. 

So if this is true, then why is data science misnamed?

Whenever a term is inaccurate, the most common explanation is that the term helps to “sell” the field. The term data science borrows credibility from the established status of science. This is not the first time that science has been co-opted for its credibility. Scientology, an amalgam of self-help, religion, psychology, and money-making scheme started by L. Ron Hubbard, a con man who never completed his undergraduate engineering degree, did the same thing. Scientology claims to be either a science or a type of science. The lower levels of the Scientology curriculum borrow from established academic fields like psychology and physics (claiming to have contributed original research in these fields while copying directly from textbooks), with the originator claiming in-depth knowledge of nuclear physics. All of this has led to a lot of money for the top hierarchy of Scientology. And none of it has anything to do with science. Something is not a science because it calls itself a science or scientific, but because it follows scientific principles. Conversely, work can sit within an established science and still not be science if the experiment is faked, such as when drug companies run rigged clinical trials.

What Was the Effect of Making a False Claim to Science?

One could then pivot to the question of how well the reframing of a number of data tools into “data science” has worked for the field.

The answer is that data science is probably the hottest thing to have on one’s resume in the IT field. Data science is a recent field, and those with experience in its supporting areas seek to rebrand themselves as data scientists rather than just a “regular old analyst.”

This is covered in the following quote.

“There are lots of proper Statisticians out there who do great work, day in and day out. We really never hear about those folks anymore in the data world. Lots of the really good Data Scientists out there are really just statisticians with a new title.(emphasis added) Others are a mix of stats, engineering, math, and programming that are a jack of all trades, master of none.

It’s no secret why everyone is calling themselves a Data Scientist in 2020. Imagine you’re a Statistician who does great, honest work (science?), day in and day out. According to Indeed, Glassdoor, etc, your salary is somewhere between $70k and $110k each year. And then you look over your shoulder and see a Data Scientist bringing in somewhere between $90k and $165k each year. And in your heart you know that “Data Scientist” doesn’t have nearly the Statistics chops that you’ve got. Maybe they can write a little bit more Python than you or maybe a little bit of R, probably nothing you couldn’t pick up in a few months… Maybe they can do SELECT * FROM Table; and get some data but nothing too sophisticated.

So you go to a few company websites, a few job boards, email a few friends and colleagues, and all of a sudden you’ve landed a shiny new job at a different company where you’re now the proud wearer of the title “Data Scientist.”

Congrats! Conveniently, you also came in with your shiny graduate degree, your prestigious research background, all of a sudden your $95k salary at your previous company is $130k at your new company. Wow! What an upgrade. And then your boss meets with you and assigns you your work plan. You toil over the work and very quickly realize your day-to-day has gone from a highly skilled statistician workload to that of a SQL warrior. All of a sudden you’re spending 90% of your time building reports, delivering PowerPoints, and building Power BI dashboards to share daily user metrics. You half laugh, half cry as you realize you’re now doing the work of an entry-level data analyst. 10 years of graduate school and another 5 as a postdoc to spend your day writing a few SQL queries and maintain old dashboards.

Not only has this statistician found herself in a win-lose situation (win because she’s making lots of money, lose because she hates her job), the business has found itself in a lose-lose situation. The business is paying $130k for a role they could have filled with a highly skilled analyst for $75–100k. They’re also getting a lower quality of work because the statistician just isn’t interested in doing the work. The statistician is clearly overqualified for this work, but studies show that overqualified workers underperform when their heart isn’t in the work.” – Towards Data Science

Old Wine New Bottles?

These vats are much more valuable if they are filled with data science, rather than good old fashioned data analysis. 

The IT field is constantly putting old wine into new bottles. This is particularly important as most new claims don’t meet expectations. With few people noticing, the claims around Big Data have been transposed to data science and machine learning. If transposition can continue indefinitely, no area ever has to be held accountable for what was promised.

Accountability is bad. It is bad for vendors, for consulting firms, and for IT decision-makers who were tricked by inaccurate claims. Is it any wonder that there is close to no effort to fact check claims? Who wants to be proven wrong?

This is what IT buyers see. Vendors and consulting firms present them with an unlimited number of options. Some of them are renamed old things that did not work out. This is an environment with close to zero concern for whether any of the terms are accurate, or for holding previous claims accountable.

What Changed in the Technical Sense

Now let’s switch from whether data science is an actual science to the related and equally important question of how the overall field has changed since the introduction of the term “data science.”

The answer is that nothing changed to create a new field. If you were doing a Big Data project using a variety of tools before the introduction of data science, you are now using either the same tools or upgraded versions of the tools that you were using before data science appeared on the scene as a term.

Secondly, the terminology for gaining insights from data has changed repeatedly in the past.

This is explained in the following quotation. 

“1996 Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish “From Data Mining to Knowledge Discovery in Databases.” They write: “Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archeology, and data pattern processing” – Forbes

New Term Needed for “Bigger Data?”

One of the proposals is that a new term was necessary because the size of the data being analyzed increased, as explained in the following quotation. 

“December 1999 Jacob Zahavi is quoted in “Mining Data for Nuggets of Knowledge” in Knowledge@Wharton: “Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Scalability is a huge issue in data mining. Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.” – Forbes

However, this still does not present a logical argument to come up with a new term. And data science is not a set of special data mining tools.

Secondly, ordinary statistics are still deployed when analyzing large data sets. That is, of course, the strength of statistics, and statistics do not weaken as the data sets grow in size. 

Third, neither Forbes nor Knowledge@Wharton is a credible source. We routinely lampoon Forbes articles, as Forbes is just a Chinese-owned media outlet that allows anyone who pays them to publish on their site (call them and ask for their price sheet), as we cover in the article Can You Trust IDC and Their Now China Based Owners? We caught Knowledge@Wharton allowing Vishal Sikka to make a number of false claims unchecked in the article How Much of Vishal Sikka’s Explanations on Artificial Intelligence are Complete BS?.

The Reason — According to the Wall Street Journal

Below we can see another attempt to justify the creation and use of the term data science.

“He defines data science as being essentially the systematic study of the extraction of knowledge from data. But analyzing data is something people have been doing with statistics and related methods for a while. “Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.”

In short, it’s all about the difference between explaining and predicting. Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.” – Wall Street Journal

Talk about double talk.

This is also a false distinction. Was pre-data science data analysis not concerned with supporting decisions or making predictions? I can say that it certainly was. I performed a large amount of data analysis before the term data science became popular, and the companies I performed the data analysis for were extremely concerned with supporting decisions and making predictions.

For what reason does the Wall Street Journal think we were doing (pre-data science) data analysis….for fun? 

This is a picture of me, circa 2003, before the term Data Science was invented.

Life was great back then. Companies would hire you to do data analysis with no concern for supporting decision making or prediction.

Then Data Science came along and… well, the party was over.

Enough With the Fake New Proposals

Data science is not a new type of prediction. Recall, it was just a few years ago that we were told visualization analytics applications like Tableau were new types of analysis, and of course nothing like the graphics we had been using for decades. 

Data science uses more sophisticated computing techniques to perform prediction, but prediction has always been considered the highest level of knowledge. Long before data science was coined as a term, “overfitting,” where a model fits the past well but is poor at predicting the future, was a core concept of statistics. In fact, as we cover in the article The Statistical Falseness of Much AI, Data Science, and Big Data Research, data science is becoming known for not worrying about the validity of relationships it finds through data ransacking.
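
Overfitting is easy to demonstrate. The sketch below is an illustration under stated assumptions, not a description of any particular data science tool: the data come from a hypothetical straight-line relationship plus random noise, a "memorizer" model (nearest-neighbour lookup) reproduces its training data exactly, and a plain least-squares line is fitted for comparison. All names and parameter values are made up for the example.

```python
import random


def make_data(n, rng):
    """Hypothetical process (an assumption): y = 2x + 1 plus Gaussian noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Overfit model: answer with the y value of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(predict, xs, ys):
    """Mean squared error of a predict function on a data set."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(seed=0, n_train=100, n_test=500):
    """Return {model name: (training error, test error)} for both models."""
    rng = random.Random(seed)
    train_x, train_y = make_data(n_train, rng)
    test_x, test_y = make_data(n_test, rng)
    results = {}
    for name, fit in (("line", fit_line), ("memorizer", fit_memorizer)):
        model = fit(train_x, train_y)
        results[name] = (mse(model, train_x, train_y), mse(model, test_x, test_y))
    return results


if __name__ == "__main__":
    for name, (train_err, test_err) in compare().items():
        print(f"{name:10s} train MSE = {train_err:.2f}  test MSE = {test_err:.2f}")
```

The memorizer scores perfectly on the data it has already seen because it has also memorized the noise, and it should do noticeably worse than the simple line on fresh data. That gap between fitting the past and predicting the future is what statisticians have long meant by overfitting.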

The same Wall Street Journal article goes on to make another contention that undermines the use of the term science. 

“Most of us are trained to believe theory must originate in the human mind based on prior theory, with data then gathered to demonstrate the validity of the theory. Machine learning turns this process around. Given a large trove of data, the computer taunts us by saying, If only you knew what question to ask me, I would give you some very interesting answers based on the data. Such a capability is powerful since we often do not know what question to ask. . .“Suitably designed machine learning algorithms help find such patterns for us. To be useful both practically and scientifically, the patterns must be predictive.” – Wall Street Journal

This touches on the question of what the domain is. As I stated earlier, data science is not the study of data; it is the use of data management and statistical techniques to gain insight from data.

Secondly, according to the quote, machine learning runs on data sets and then reports correlations (paraphrasing the quotation). That is not even a scientific method.

I have now identified multiple dimensions by which the term data science is misleading as to what the area or field of study actually does. 

The Wall Street Journal article goes on to make another inaccurate claim.

“Physics, chemistry, biology and other natural science disciplines have long been practicing their own version of data science. In physics, for example, “a theory is expected to be complete in the sense a relationship among certain variables is intended to explain the phenomenon completely, with no exceptions. . . In such domains, the explanatory and predictive models are synonymous.”

No, they have not. 

This quotation asserts that looking for evidence is data science. This is a ludicrous proposal. Let us take an example. Einstein’s general theory of relativity, published in 1915, included the prediction that gravity bends light. This prediction was tested four years later by measurements taken during the May 29, 1919, total solar eclipse, and measurements showing light bending around the sun validated Einstein’s theory. Was it data science that proved the general theory of relativity?

Of course not.

Gathering evidence has always been how hypotheses are proven or disproven. At least some quotes on data science not only misrepresent what data science is but reach back in time to claim all past efforts to gather evidence to test a hypothesis.

Talk about grandiose claims. 

A Realistic Explanation Around the Creation of the Term Data Science

The following quote from Pete Warden is one of the most honest we could find on creating the term data science. 

“Data and the tools to process it are suddenly abundant and cheap. Thousands of people are exploiting this change, making things that would have been impossible or impractical before now, using a whole new set of techniques. We need a term to describe this movement, so we can create job ads, conferences, training and books that reach the right people. Those goals might sound very mundane, but without an agreed-upon term we just can’t communicate.” 

That might be true, but it still does not address whether the term is accurate. One can have a good motivation to create a term or categorization and still end up with a poorly defined term. This is what happened with data science. The people who first coined it should have looked up the definition of science. The term science is not something you just spread like butter on bread over anything whose prestige you want to enhance.

The Problems with the Term Data Science

The first problem with the term data science is that it is inaccurate. The second problem is that it seeks to adopt the prestige of a scientific discipline, without actually being a scientific discipline. As is nearly always the case with promoting something that is false, the term data science also has negative repercussions on the implementation of what the term seeks to describe.

How?

By creating an unrealistic standard for those that are to work in the field. 

Exaggerated Expectations of “Data Scientists”

A serious problem with the term data science is that it implies that data scientists will be deeply skilled in all of the supporting disciplines that make up data science.

This is expressed in the following quotation. 

“It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path. As Portillo told us, “The traditional backgrounds of people you saw 10 to 15 years ago just don’t cut it these days.” A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data—and also not at actually analyzing the data. And while people without strong social skills might thrive in traditional data professions, data scientists must have such skills to be effective.” – Harvard Business Review

The descriptions laid out in this quote, as well as in innumerable data scientist job specifications, describe a type of superhuman whose existence we are skeptical of.

People are not excellent Python programmers, skilled statisticians, and expert data mungers all at once, on top of all the other skills listed in these job specifications. Data science may have been a convenient, if false, label to apply to the categories of knowledge that underlie it. However, the term also created an exaggerated template of skills that no human can master. The technologies of each of these subfields change constantly, meaning that the data scientist is responsible not only for mastering all of these fields but also for keeping current with the changes.

This is expressed in the following quotation. 

“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it… — Dan Ariely

This quote is so apt. Many junior data scientists I know (this includes myself) wanted to get into data science because it was all about solving complex problems with cool new machine learning algorithms that make huge impact on a business. This was a chance to feel like the work we were doing was more important than anything we’ve done before. However, this is often not the case.

Following on from doing anything to please the right people, those very same people with all of the clout often don’t understand what is meant by “data scientist”. This means that you’ll be the analytics expert as well as the go-to reporting guy and let’s not forget that you’ll be the database expert too.

It isn’t just non-technical executives that make too many assumptions about your skills. Other colleagues in technology assume you know everything data related. You know your way around Spark, Hadoop, Hive, Pig, SQL, Neo4J, MySQL, Python, R, Scala, Tensorflow, A/B Testing, NLP, anything machine learning (and anything else data related that you can think of — BTW if you see a job specification with all of these written on it, stay well clear. It reeks of a job spec from a company that has no idea what their data strategy is and they’ll hire anyone because they think that hiring any data person will fix all of their data problems).

But it doesn’t stop there. Because you know all of this and you obviously have access to ALL of the data, you are expected to have the answers to ALL of the questions by……. well, it should’ve landed in the relevant person’s inbox 5 minutes ago. In my opinion, the fact that expectation does not match reality is the ultimate reason why many data scientists leave. There are many reasons for this and I can’t possibly come up with an exhaustive list but this post is essentially a list of some of the reasons that I encountered.” – Towards Data Science

Conclusion

The term data science is inaccurate and was coined in order to “sex up” the field of data set management, data munging, statistics, and other areas that ultimately all support data analysis. The term pretends to be something new when everything that makes up data science already existed, in many cases for decades or even hundreds of years, before the term was coined. Furthermore, at least some data science proponents claim that data science is the phase or approach in which a hypothesis is tested, which is just false.

The only things that should be called a science are actual sciences. Anything that calls itself a science but is not a science, be it Scientology or data science, should be viewed with immediate suspicion. This is similar to those who try to pass off honorary PhDs as academic PhDs, as we cover in the article It is Official If You Work for SAP Its Ok to Lie About Having a Ph.D., or who work for a non-research, non-academic entity and hold the title of “research fellow,” as we cover in the article Why PwC’s Research Fellows are Fake and Pretend to be Academic. This is because the term is designed to trick the message recipient.

References

https://blogs.wsj.com/cio/2014/05/02/why-do-we-need-data-science-when-weve-had-statistics-for-centuries/

https://www.theguardian.com/technology/2018/jul/25/ai-artificial-intelligence-social-media-bots-wrong

https://www.amazon.com/AI-Delusion-Gary-Smith/

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

http://radar.oreilly.com/2011/05/data-science-terminology.html

https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#ee628ab55cfc

https://en.wikipedia.org/wiki/Science

https://en.wikipedia.org/wiki/Data_science

https://towardsdatascience.com/is-data-science-really-a-science-9c2249ee2ce4