Data DNA

As a result of data quantification, data is everywhere: on the Internet; in the IoT and sensor networks; in sociocultural, economic, and geographical repositories; and in quantified personalized sensors, including mobile, social, living, entertainment, and emotional sources. Underlying all of this data is a “datalogical” constituent, data DNA, which plays a critical role in data organisms and performs a function similar to that of biological DNA in living organisms.

Definition. Data DNA is the datalogical “molecule” of data, consisting of fundamental and generic constituents: entity (E), property (P), and relationship (R). Here, “datalogical” means that data DNA plays a role in data organisms similar to the one biological DNA plays in living organisms. An entity can be an object, instance, human, organization, system, or part of a subsystem. A property is an attribute that describes an entity. A relationship corresponds to entity interactions and property interactions, including property value interactions.
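The definition above can be made concrete with a minimal Python sketch of the three constituents. The class names `Entity`, `Relationship`, and `DataDNA`, the method `related_to`, and the sample student and lecturer values are all illustrative assumptions; the source defines the concepts only, not a data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Entity:
    """An entity (E): an object, instance, human, organization, or system."""
    name: str
    # Properties (P): attributes that describe the entity.
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A relationship (R), here limited to an interaction between two entities."""
    source: Entity
    target: Entity
    kind: str


@dataclass
class DataDNA:
    """Hypothetical container holding the E, P, and R constituents together."""
    entities: List[Entity] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)

    def related_to(self, entity_name: str) -> List[str]:
        """Return the names of entities that the named entity points to."""
        return [
            r.target.name
            for r in self.relationships
            if r.source.name == entity_name
        ]


# Hypothetical sample values, chosen only for illustration.
student = Entity("student", {"degree": "BSc"})
lecturer = Entity("lecturer", {"faculty": "Engineering"})
dna = DataDNA(
    entities=[student, lecturer],
    relationships=[Relationship(student, lecturer, "taught_by")],
)
print(dna.related_to("student"))
```

This sketch covers entity interactions only. Property interactions and property value interactions, which the definition also counts as relationships, would need a richer model.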

Entity, property, and relationship present different characteristics in terms of quantity, type, hierarchy, structure, distribution, and organization. A data-intensive application or system often comprises many diverse entities, each of which has specific properties, and different relationships are embedded within and between properties and entities. From the lowest to the highest levels, data DNA presents heterogeneity and hierarchical couplings across levels. On each level, it maintains consistency (inheritance of properties and relationships) as well as variations (mutations) across entities, properties, and relationships, while supporting personalized characteristics for each individual entity, property, and relationship.
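The combination of consistency (inheritance) and variation (mutation) across levels can be illustrated with a small Python sketch. The helper `derive` and all property names and values are hypothetical, chosen only to show the idea; they are not part of the source.

```python
from typing import Any, Dict


def derive(parent: Dict[str, Any], variations: Dict[str, Any]) -> Dict[str, Any]:
    """Build lower-level properties from a higher level (illustrative helper)."""
    child = dict(parent)      # consistency: inherit every parent property
    child.update(variations)  # variation: override or add level-specific ones
    return child


# Hypothetical property sets for two levels of a hierarchy.
university = {"domain": "education", "id_scheme": "staff_or_student_number"}
student = derive(university, {"id_scheme": "student_number", "enrolment_year": 2016})
print(student)
```

Here `domain` is inherited unchanged, `id_scheme` is varied, and `enrolment_year` is a characteristic specific to the lower level. The parent dictionary is left unmodified.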

For a given dataset, its entities, properties, and relationships are instantiated into diverse and domain-specific forms, which carry most of the data’s ecological and genetic information in data generation, development, functioning, reproduction, and evolution. In the data world, data DNA is embedded in the whole body of personal [1] and non-personal data organisms, and in the generation, development, functioning, management, analysis, and use of all data-based applications and systems. Data DNA drives the evolution of a data-intensive organism. For example, university data DNA connects the data of students, lecturers, administrative systems, corporate services, and operations. The student data DNA further consists of academic, pathway, library access, online access, social media, mobile service, GPS, and Wi-Fi usage data. Such student data DNA is both fixed and evolving. In complex data, data DNA is embedded within various X-complexities [2,3,4] and ubiquitous X-intelligence [3,4,5] in a data organism. This makes data rich in content, characteristics, semantics, and value, but challenging in acquisition, preparation, presentation, analysis, and interpretation.


Note: Excerpted from L. Cao, “Data Science: Nature and Pitfalls,” IEEE Intelligent Systems, vol. 31, no. 5, pp. 66-75, 2016.

[1] K. Schwab, The Global Competitiveness Report 2011–2012, report, World Economic Forum, 2011.

[2] M. Mitchell, Complexity: A Guided Tour, Oxford Univ. Press, 2011.

[3] X. S. Qian, J. Y. Yu, and R. W. Dai, A new discipline of science: The study of open complex giant system and its methodology, Chin. J. Syst. Eng. Electron. 4, 2 (1993), 2–12.

[4] L. Cao, Metasynthetic Computing and Engineering of Complex Systems, Springer, 2015.

[5] L. Cao and U. Fayyad, Data Science: Challenges and Directions, Communications of the ACM, 2016.