Disciplinary challenges and directions
Data-to-capability development gaps

The rapid growth of big data has opened significant gaps between what the data contains and how much of it we can understand. Figure 1 shows empirically the development gaps between the growth of data potential and state-of-the-art capabilities. These gaps have widened over the past 10 years, especially recently, owing to the imbalance between the potentially exponential increase in data and the incremental development of state-of-the-art capabilities. Examples include the gaps between

  • data availability and the currently understandable data level, scale, and degree;
  • data complexities and the currently available analytics theories and tools;
  • data complexities and the currently available technical capabilities;
  • possible values and impact and currently achievable outcomes and benefits;
  • organizational needs and the currently available talent (that is, data scientists); and
  • potential opportunities and the current outcomes and benefits achievable.


Such growth gaps are driven by critical challenges for which there is a shortage of effective theories and tools. For example, a typical challenge in complex data concerns intrinsic complex coupling relationships and heterogeneity, which form data that is not independent and identically distributed (non-IID) and cannot be simplified in such a way that classic IID learning theories and systems can handle it. Other examples include the real-time learning of large-scale online data, such as learning shopping manipulation and making real-time recommendations on high-frequency data during the “11-11” shopping season launched by Alibaba, or identifying suspects in imbalanced, multisource data and environments, as in fraud detection in high-frequency market trading. Other challenges are high invisibility, high frequency, high uncertainty, high dimensionality, dynamic nature, mixed sources, online learning at Web scale, and the development of human-like thinking.
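To make the non-IID point concrete, the following minimal sketch (Python, standard library only) contrasts independent draws with a coupled series in which each value depends on its predecessor. The AR(1) generator, the coefficient phi = 0.9, and the use of lag-1 autocorrelation as a dependence check are illustrative assumptions and do not come from the source; the coupling and heterogeneity found in real data are far richer than this toy case.

```python
import random


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation; close to 0 for independent draws."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def make_iid(n, rng):
    """Independent, identically distributed Gaussian draws."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def make_coupled(n, rng, phi=0.9):
    """Hypothetical AR(1) series: each value is coupled to the previous one."""
    xs = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, 1.0))
    return xs


rng = random.Random(0)  # fixed seed so the run is repeatable
iid_r = lag1_autocorrelation(make_iid(5000, rng))
coupled_r = lag1_autocorrelation(make_coupled(5000, rng))
print(f"IID lag-1 autocorrelation:     {iid_r:.3f}")
print(f"coupled lag-1 autocorrelation: {coupled_r:.3f}")
```

A learner that assumes IID samples would treat both series the same way, although consecutive observations in the second series carry strongly overlapping information.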


Figure 1. Critical development gaps between data potential and state-of-the-art capabilities.


Note: Excerpted from “L. Cao. Data Science: Nature and Pitfalls, IEEE Intelligent Systems, Volume: 31, Issue: 5, 66-75, 2016”


The extreme challenge

Different types and levels of analytical problems trouble the existing knowledge base, and we are especially challenged by problems in complex data and environments. Our focus in data science research and innovation is on what we call the extreme data challenge in data science and analytics. The extreme data challenge illustrated in Figure 2 seeks to discover and deliver complex knowledge in complex data, taking into account complex behavior within a complex environment, to achieve actionable insights that will inform and enable decision making and action taking in complex business problems that cannot be handled better by other methods.


The critical future directions of data science research and innovation in this case focus on the following:

  • complex data with complex characteristics;
  • complex behaviors with complex relationships and dynamics;
  • complex environments in which complex data and behaviors are embedded and interacted with;
  • complex models to address the data and behavior complexities in a complex environment;
  • complex findings to uncover hidden but technically interesting and business-friendly observations, indicators or evidence, statements, or presentations; and
  • actionable insights to demonstrate the next best or worst situation and inform the optimal strategies to support effective business decision making.


Many real-life problems reach this level of complexity and challenge, as the extreme data challenge shows, and they have not been addressed well. One example is understanding the group behaviors of multiple actors who have complex interactions and relationships, such as the pool manipulation of large-scale cross-capital markets by internationally collaborative investors, each of whom plays a role by connecting information from the underlying markets, social media, other financial markets, socioeconomic data, and policies. Another example is predicting local climate change and its effects by connecting local, regional, and global climate, geographical, and agricultural data and other information.


Note: Excerpted from “L. Cao. Data Science: Nature and Pitfalls, IEEE Intelligent Systems, Volume: 31, Issue: 5, 66-75, 2016”


Five major disciplinary directions

Figure 2 illustrates the conceptual landscape of data science and its major research issues from an interdisciplinary, complex system-based, and hierarchical view. As shown in Figure 2, the data science landscape consists of three layers: the data input, which includes domain-specific data applications and systems and the X-complexity and X-intelligence in the data and business; the data-driven discovery, which consists of a collection of discovery tasks and challenges; and the data output, which is composed of various results and outcomes.


Figure 2. Data science conceptual landscape.


Research challenges and opportunities emerge from all three layers and are categorized into five major areas that existing methodologies, theories, and systems cannot manage well.

Data/business understanding challenges:

This is to identify, specify, represent and quantify the X-complexities and X-intelligence that cannot be managed well by existing theories and techniques but nevertheless exist and are embedded in a domain-specific data and business problem. Examples are to understand in what forms, at what level, and to what extent the respective complexities and intelligence interact and integrate with each other, and to devise effective methodologies and technologies for incorporating them into data science tasks and processes.

Mathematical and statistical foundation challenges:

This is to discover and explore whether, how and why existing theoretical foundations are insufficient, missing, or problematic in disclosing, describing, representing, and capturing the above complexities and intelligence and in obtaining actionable insights. Existing theories may need to be extended or substantially redeveloped to cater to the complexities in complex data and business. Examples include supporting multiple, heterogeneous and large-scale hypothesis testing and survey design; learning inconsistency, change and uncertainty across multiple sources of data; enabling large-scale, fine-grained personalized predictions; supporting non-IID data analysis; and creating scalable, transparent, flexible, interpretable, personalized and parameter-free modeling.
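As a point of reference for the hypothesis-testing example, one classic baseline for testing many hypotheses at once is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a minimal illustration of that existing procedure, not a method proposed in the source; the p-values are made up, and the procedure assumes independent (or positively dependent) tests, which is exactly the kind of assumption that heterogeneous, coupled data can violate.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True where it is rejected.

    Classic Benjamini-Hochberg step-up rule: find the largest rank k
    such that the k-th smallest p-value is <= (k / m) * alpha, then
    reject the hypotheses with the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
print(rejected)
```

With these made-up p-values the procedure rejects the two smallest, whereas a Bonferroni cutoff of alpha / m = 0.005 would reject only the first.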

Data/knowledge engineering and X-analytics challenges:

This is to develop domain-specific analytic theories, tools and systems that are not available in the body of knowledge, to represent, discover, implement and manage the relevant and resultant data, knowledge and intelligence, and to support the corresponding data and analytics engineering. Examples are autonomous and automated analytical software that can automate the process, and self-monitor, self-diagnose and self-adapt to data characteristics and domain-specific context, and learning algorithms that can recognize data complexities and self-train the corresponding optimal models customized for the data.


Quality and social issues challenges:

This is to identify, specify and respect social issues related to the domain-specific data and business understanding and data science processes, including processing and protecting privacy, security and trust and enabling social issues-based data science tasks, which have not previously been handled well. Examples are privacy-preserving analytical algorithms and benchmarks for the trustworthiness of analytical outcomes.
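One well-known building block for privacy-preserving analytics is the Laplace mechanism from differential privacy, which adds calibrated noise to a query result. The sketch below applies it to a counting query, whose sensitivity is 1 because adding or removing one record changes the count by at most 1. The record layout, the `is_flagged` predicate, and the choice of epsilon are hypothetical and serve only as an illustration; the source does not prescribe this technique.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Noisy count of records matching predicate (Laplace mechanism).

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical records: (customer id, flagged-as-suspicious).
records = [(1, True), (2, False), (3, True), (4, False), (5, True)]


def is_flagged(record):
    return record[1]


rng = random.Random(42)  # fixed seed so the run is repeatable
noisy = private_count(records, is_flagged, epsilon=1.0, rng=rng)
print(f"noisy count of flagged records: {noisy:.2f}")
```

Each released value is perturbed, but the noise has zero mean, so aggregate conclusions remain usable while any single record's presence is masked.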


Data value, impact and utility challenges:

This is to identify, specify, quantify and evaluate the value, impact and utility associated with domain-specific data that cannot be addressed by existing theories and systems, from technical, business, subjective and objective perspectives. Examples are the development of measures for the actionability, utility and value of data.


Data-to-decision and action-taking challenges:

This is to develop decision-support theories and systems that enable data-driven decision generation, insight-to-decision transformation, and decision-making action generation; that incorporate prescriptive actions and strategies into production; and that support data-driven decision management and governance, none of which can be managed by existing technologies and systems. Examples include tools for transforming analytical findings into decision-making actions or intervention strategies.


Note: Excerpted from “Longbing Cao and Usama Fayyad. Data Science: Challenges and Directions. Communications of the ACM, 2016.”