Non-IID learning


Big data is complex: it exhibits certain X-complexities, including complex coupling relationships and/or mixed distributions, formats, types and variables, and unstructured and weakly structured data. Such complex data poses significant challenges to many existing mathematical, statistical, and analytical methods, which were built on assumptions that big data violates. One such violation highlighted here concerns the independent and identically distributed (IID) assumption: big/complex data (referring to objects, values, attributes, and other aspects [2]) is essentially non-IID, whereas most existing analytical methods assume IID data [2, 11].
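For reference, the IID assumption can be stated compactly. The notation below (samples x_1, ..., x_n and a shared density p) is introduced here for illustration and is not taken from the source.

```latex
% Independence: the joint distribution factorizes into per-sample marginals.
% Identical distribution: every marginal is the same density p.
p(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p(x_i)
```

Data is non-IID when this fails in either way: the joint distribution does not factorize (the samples are coupled), or the samples do not share a single distribution (they are heterogeneous).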

In a non-IID data problem (see Figure 1(a)), non-IIDness (see Figure 1(c)) refers to any couplings and heterogeneity that exist within and between two or more aspects. Couplings include both well-explored relationships (such as co-occurrence, neighborhood, dependency, linkage, correlation, and causality) and poorly explored, ill-structured ones (such as sophisticated cultural and religious connections and influence). The aspects may be an entity, entity class, entity property (variable), process, fact or state of affairs, or other types of entities or properties (such as learners and learned results) that appear or are produced prior to, during and after a target process (such as a learning task). By contrast, IIDness ignores or simplifies these couplings and heterogeneity, as shown in Figure 1(b).


Figure 1: IIDness vs. non-IIDness in data science problems.

Learning visible and especially invisible non-IIDness is fundamental for a deep understanding of data with weak and/or unclear structures, distributions, relationships, and semantics. In many cases, non-IIDness that is locally visible but globally invisible (or vice versa) appears in a range of forms, structures, and layers and on diverse entities. Often, individual learners cannot tell the whole story because they are unable to identify such complex non-IIDness. Effectively learning the widespread, varied, visible and invisible non-IIDness is thus crucial for obtaining the truth and a complete picture of the underlying problem.

We frequently focus only on explicit non-IIDness, which is visible to us and easy to learn. Typically, work on hybridizing multiple methods and on combining multiple sources of data into one big table for analysis falls into this category. Computing non-IIDness refers to understanding, formalizing and quantifying the non-IID aspects, entities, interactions, layers, forms and strength. This includes extracting, discovering and estimating the interactions and heterogeneity between learning components, including the method, objective, task, level, dimension, process, measure and outcome, especially when the learning involves multiple instances of one of these components, such as multiple methods or multiple tasks. We are concerned with understanding non-IIDness at a range of levels, from values, attributes, objects, methods and measures to processing outcomes (such as mined patterns). Such non-IIDness is both comprehensive and complex.
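As a minimal sketch of what quantifying non-IIDness could look like at the attribute level, the following uses two simple proxies: Pearson correlation for the coupling between two attributes, and a standardized gap between group means for heterogeneity. The function names, the toy data, and the choice of these two proxies are assumptions made for illustration; the source does not prescribe specific measures.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation, used here as one simple proxy for attribute coupling."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def heterogeneity(group_a, group_b):
    """Gap between two group means, scaled by the pooled standard deviation.

    A crude proxy for 'not identically distributed': 0 means the means coincide.
    """
    pooled = pstdev(group_a + group_b)
    return abs(mean(group_a) - mean(group_b)) / pooled


# Hypothetical toy data: two attributes observed on six objects in two segments.
income = [20, 25, 30, 60, 70, 80]
spend = [10, 12, 16, 33, 34, 41]
segment = ["a", "a", "a", "b", "b", "b"]

coupling = pearson(income, spend)
group_a = [v for v, s in zip(income, segment) if s == "a"]
group_b = [v for v, s in zip(income, segment) if s == "b"]
gap = heterogeneity(group_a, group_b)

print(f"coupling(income, spend) = {coupling:.3f}")
print(f"heterogeneity(segment a vs b) = {gap:.3f}")
```

A strong coupling suggests the attributes should not be treated as independent, and a large gap suggests the two segments should not be modeled with a single shared distribution. Both proxies capture only explicit, linear effects; the implicit non-IIDness discussed above would need richer measures.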


Non-IIDness in complex data

Non-IIDness refers to any relationships (for instance, co-occurrence, neighborhood, dependency, linkage, correlation, or causality) and heterogeneity between two or more aspects, such as object, object class, object property (variable), process, fact and state of affairs, or other types of entities or properties (such as learners and learned results) appearing or produced prior to, during and after a target process (such as a learning task).

In a learning system, as shown in Fig. 2, non-IIDness may exist within and/or between aspects, such as entity (objects, object class, instance, or group/community) and its/their properties (variables), context (environment) and its constraints, interactions (exchange of information, material or energy) between entities or between the entity and its/their environment, learning objectives (targets, such as risk level or fraud), the corresponding learning methods (models, algorithms or systems) and resultant outcomes (such as patterns or clusters).


Fig. 2: Various aspects of non-IIDness and hierarchical non-IIDness.


Non-IID research directions

Below, we illustrate the main prospects of inventing new and effective data science theories and tools for non-IID learning (also called non-IIDness learning, non-IID data learning, or learning from non-IID data [2]). We examine how to address non-IID data characteristics (note, not just about IID objects) in terms of new feature analysis that considers feature relations and distributions; new learning theories, algorithms and models for analytics; and new metrics for similarity measurement and evaluation.

  • Deep understanding of non-IID data characteristics: This is to identify, specify and quantify the characteristics, factors, aspects, forms, types and levels of non-IIDness in data and business, and to identify the difference between what existing data/business understanding technologies and systems can capture and what they leave out.
  • New and effective non-IID feature analysis and construction: This is to invent new theories and tools for analyzing feature relationships by considering non-IIDness within and between features and objects, and to develop new theories and algorithms for selecting, mining and constructing features.
  • New non-IID learning theories, algorithms and models: This is to create new theories, algorithms and models for analyzing, learning, and mining non-IID data by considering value-to-object couplings and heterogeneity.
  • New non-IID similarity and evaluation metrics: This is to develop new similarity and dissimilarity learning methods and metrics, as well as evaluation metrics that consider non-IIDness in data and business.
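
To make the last direction more concrete, here is a minimal sketch of a similarity measure for categorical values that takes inter-attribute coupling into account. Two values of one attribute are treated as similar when they co-occur with similar values of another attribute. The function names, the toy records, and the overlap formula are illustrative assumptions; this is a simplification and not a metric defined in the source.

```python
from collections import Counter, defaultdict


def conditional_dist(rows, attr, other):
    """Estimate P(other = b | attr = a) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attr]][row[other]] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }


def coupled_value_similarity(rows, attr, other, v1, v2):
    """Overlap of the two conditional distributions, a value in [0, 1]."""
    dist = conditional_dist(rows, attr, other)
    p, q = dist[v1], dist[v2]
    return sum(min(p.get(b, 0.0), q.get(b, 0.0)) for b in set(p) | set(q))


# Hypothetical toy records with two categorical attributes.
rows = [
    {"job": "nurse", "sector": "health"},
    {"job": "nurse", "sector": "health"},
    {"job": "doctor", "sector": "health"},
    {"job": "doctor", "sector": "research"},
    {"job": "trader", "sector": "finance"},
    {"job": "trader", "sector": "finance"},
]

sim_close = coupled_value_similarity(rows, "job", "sector", "nurse", "doctor")
sim_far = coupled_value_similarity(rows, "job", "sector", "nurse", "trader")
print(sim_close, sim_far)  # prints 0.5 0.0
```

A plain matching measure that treats values as independent would score "nurse" against "doctor" and "nurse" against "trader" identically, since neither pair matches. Using the coupling with the "sector" attribute separates the two cases.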

More broadly, many existing data-oriented theories, designs, mechanisms, systems and tools may need to be reinvented when non-IIDness is taken into consideration. In addition to non-IID learning for data mining, machine learning and general data analytics, this involves well-established bodies of knowledge, including mathematical and statistical foundations, descriptive analytics theories and tools, data management theories and systems, information retrieval theories and tools, multi-media analysis, and X-analytics.


Relevant research on non-IID learning


Some relevant activities on non-IID learning