What is data science?


The term “data science” likely first appeared in the literature in the preface to Naur’s book “Concise Survey of Computer Methods” [4] in 1974. That preface defined data science as “the science of dealing with data, once they have been established, while the relation of the data to what they represent is delegated to other fields and sciences.” The concept has also been discussed progressively in the statistical and mathematical communities [9], where it essentially concerned data analysis. Those discussions nonetheless inspired today’s significant move toward a comprehensive exploration of the field’s scientific content and development.

Today, the art of data science [8] goes beyond specific areas such as data mining and analysis, and beyond the argument that data science is the next generation of statistics [5, 6, 7]. Data science is becoming a very rich concept that carries the vision and responsibilities of an independent scientific field, one that is systematic and interdisciplinary.

So what is data science? We can define it from a high-level, object-focused, process-based, or discipline-oriented perspective.


Data science high-level definition

A high-level statement is:

Definition 1. Data science is the science of data, or the study of data. [1]


Data science disciplinary definition

From the disciplinary perspective, data science is a new interdisciplinary field that synthesizes and builds on statistics, informatics, computing, communication, management and sociology. It studies data and its environments (including domains and other contextual aspects, such as organizational and social aspects) in order to transform data into insights and decisions, following a data-to-knowledge-to-intelligence-to-decision way of thinking and methodology. Accordingly, a discipline-based data science formula is given below:

Definition 2. Data science = statistics + mathematics + informatics + computing + communication + sociology + management + decision | data + domain + thinking [1, 2]
where “|” means “conditional on.”


Data science process-based definition

From the process perspective, data science can be defined as follows:

Definition 3. Data science is a systematic approach to “thinking with wisdom,” “understanding domain,” “managing data,” “computing with data,” “mining on knowledge,” “communicating with stakeholders,” “delivering products,” and “acting on insights.” [3]

In contrast, data analytics focuses on understanding data and the underlying business, discovering knowledge, delivering actionable insights, and enabling decision-making. From this perspective, we can say that analytics is a keystone of data science.



[1] L. Cao. Data science: A comprehensive overview. Submitted to ACM Computing Surveys for review.

[2] L. Cao and U. Fayyad. Data science: Challenges and directions. Communications of the ACM, 2016.

[3] L. Cao. Data science: Nature and pitfalls. IEEE Intelligent Systems, 31(5):66–75, 2016.

[4] P. Naur. Concise Survey of Computer Methods. Studentlitteratur, Lund, Sweden, 1974.

[5] W. S. Cleveland. Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1):21–26, 2001.

[6] D. Donoho. 50 years of data science, 2015.

[7] P. J. Huber. Data Analysis: What Can Be Learned from the Past 50 Years. John Wiley & Sons, 2011.

[8] K. Matsudaira. The science of managing data science. Communications of the ACM, 58(6):44–47, 2015.

[9] J. W. Tukey. The future of data analysis. Ann. Math. Statist., 33(1):1–67, 1962.