What is EDM

Educational Data Mining

In a nutshell, Educational Data Mining (EDM) is an emerging interdisciplinary research field that applies Knowledge Discovery and Data Mining techniques to analyze data from educational settings, including interactive learning systems, intelligent tutoring systems, and institutional administrative data. The primary goal of EDM is to uncover scientific evidence or patterns that help us gain insight into and explain educational phenomena.

Educational Data

Comes from educational settings:

  • Interactive learning environments (multiple-choice questions, response times)
  • Computer-aided collaborative learning (hints provided by the computer)
  • Administrative data (demographics, enrollment)

Has the following typical characteristics:

  • Multiple levels of meaningful hierarchy (subject, assignment, question levels)
  • Time
  • Sequence
  • Context (a particular student in a particular class encountering a particular question in a particular problem on a particular computer at a particular time on a particular date)
  • Fine-grained (data are recorded at different resolutions to facilitate different analyses, e.g. every 20 s)
  • Longitudinal (large amounts of data recorded over many sessions and a long period of time, e.g. spanning semester- and year-long courses)

Both the outcomes of student work and the learning process are important for EDM:

  • Products of student work (homework, exercises, labs) can be analyzed, but the process of task performance can be even more important. Model tracing tracks the problem-solving steps, detects missteps, and provides hints; knowledge tracing estimates a student’s mastery of different skills and guides tutorial decisions on which skills to teach and which problems to focus on.
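
As one illustration of knowledge tracing, the sketch below assumes the standard Bayesian Knowledge Tracing update, which the text above does not name explicitly. The function name, the four parameter values, and the starting mastery of 0.3 are hypothetical choices made only for illustration.

```python
def bkt_update(p_mastery, correct, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """Return the updated probability that a student has mastered a skill.

    Assumes the standard Bayesian Knowledge Tracing update; the default
    parameter values are illustrative, not fitted to any data.
    """
    if correct:
        # P(mastered | correct answer): mastered students answer correctly
        # unless they slip; unmastered students can still guess.
        evidence = p_mastery * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_mastery) * p_guess)
    else:
        # P(mastered | wrong answer)
        evidence = p_mastery * p_slip
        posterior = evidence / (evidence + (1 - p_mastery) * (1 - p_guess))
    # The student may also learn the skill during this practice opportunity.
    return posterior + (1 - posterior) * p_learn


# Trace one student's mastery estimate over a sequence of answers.
p = 0.3  # hypothetical prior probability of mastery
history = []
for answer in [False, True, True, True]:
    p = bkt_update(p, answer)
    history.append(p)
```

A tutor could use such an estimate to decide when a skill needs more practice, for example by continuing to select problems for a skill until the estimate passes a chosen threshold.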

How to make educational data minable?

  • Formulating educational data as “machine analyzable” and “machine understandable” experimental trials
  • Experimental trials: decisions, contexts, and outcomes
  • Formulation methods: Segmentation, slicing, formulation, and aggregation tactics.
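
One way to picture such a trial is as a record holding a decision, its context, and the outcome. The sketch below is only an assumption about how that could look: the Trial class, its field names, and the sample log are hypothetical and do not come from any particular tutoring system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One machine-analyzable trial (hypothetical field names)."""
    student: str    # context: who was working
    skill: str      # context: what was practiced
    decision: str   # tutor decision, e.g. "hint" or "no_hint"
    correct: bool   # outcome


def error_rate_by_decision(trials):
    """Aggregate trials into an error rate per tutor decision."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for t in trials:
        totals[t.decision] += 1
        if not t.correct:
            errors[t.decision] += 1
    return {d: errors[d] / totals[d] for d in totals}


# Segmented log: each entry is one trial (made-up data).
log = [
    Trial("s1", "addition", "hint", True),
    Trial("s1", "addition", "no_hint", False),
    Trial("s2", "addition", "hint", True),
    Trial("s2", "multiplication", "no_hint", True),
]
rates = error_rate_by_decision(log)
```

Slicing works the same way: filtering the list by student or skill before aggregating gives the error rate for that slice.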



EDM’s major goal is to improve educational outcomes, i.e. to enhance the training and learning performance of educational institutions. In particular,

For students

  • Improve learning performance
  • Evaluate learning effectiveness (learning is defined as a reduction in errors made over repeated practice)
  • Understand social, cognitive and behavioral aspects

For teachers

  • Improve teaching performance
  • Understand student preferences and learning experience
  • Respond to student needs
  • Adapt to each individual student

For institutions

  • Distance learning (online)
  • Adaptive tutorials (focus on individual student’s needs and weak points)
  • Evaluate teaching effectiveness (how well teachers respond to student needs)
  • Predict student performance (risks of dropping out or failing a subject)



Educational Data Mining Methods

The following are typical research topics in the Educational Data Mining literature.

1. User driven approach

2. Data driven approach

3. Weighted Mining

4. Feature selection

5. Performance Prediction

Final marks / Results (pass/fail)

6. Cognitive model

A cognitive model is a set of production rules or skills encoded in a computerized tutoring system to model how students solve problems. Production rules embody the knowledge that students are trying to acquire, and allow the tutor to estimate each student’s learning of each skill as the student works through the exercises.

“How do we know when learning occurs?”

One measure of the performance of a cognitive model is how well the data (error rate over practice opportunities) fit the model (an inverse relationship). Newell and Rosenbloom (1993) [1] found an inverse relationship between the error rate and the amount of practice: the error rate decreases as the amount of practice increases. In other words, the more practice, the fewer errors, which is what learning is all about. The relationship can be depicted as a power function:


Where Y is the error rate, X is the opportunity to practice a skill, a is the error rate on the first trial, and b is the learning rate.
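
The formula itself does not appear in this text. A form consistent with the definitions above, assuming b is stated as a positive rate so that the error falls with practice, would be:

```latex
Y = a X^{-b}, \qquad a > 0,\; b > 0,\; X \ge 1
```

At X = 1 this gives Y = a, matching the definition of a as the error rate on the first trial. As a made-up example, a = 0.5 and b = 0.5 give Y = 0.5 × 4^(-0.5) = 0.25 at the fourth opportunity. Some presentations write Y = aX^b with a negative b instead; the two forms are equivalent.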


“How can the cognitive model be adjusted to better fit the data?”
A complex skill can be decomposed into subskills to produce smoother learning curves.

Splitting skills into subskills allows the cognitive model to make a needed distinction between production rules and thus provide better hints and do more accurate student modeling.

By considering changes in student performance over time (the “opportunity” variable on the x-axis), a method called “Learning Factor Analysis” [2] goes a step further. Rather than simply inspecting learning curves visually for “blips”, we can automatically test whether including or excluding factors (skills) leads to better-fitting learning curves. A better fit means a cognitive model that better characterizes a student’s learning difficulties and shows which factors do or do not change how well one practice opportunity transfers to another.
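
The idea of testing whether a skill split gives better-fitting curves can be sketched as follows. This is not Learning Factor Analysis itself, which according to [2] relies on logistic regression and A* search. It is a much simpler stand-in that fits a power-law curve of the assumed form Y = a * X**(-b) by least squares in log-log space and compares the residuals. The function name and the data are made up for illustration.

```python
import math


def fit_power_law(error_rates):
    """Least-squares fit of Y = a * X**(-b) in log-log space.

    error_rates[i] is the observed error rate at opportunity X = i + 1.
    Returns (a, b, sse), where sse is the squared error in log space.
    """
    xs = [math.log(i + 1) for i in range(len(error_rates))]
    ys = [math.log(y) for y in error_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.exp(intercept), -slope, sse


# Made-up error rates for two subskills, each following its own curve.
skill_a = [0.6 * x ** -0.5 for x in range(1, 5)]
skill_b = [0.8 * x ** -0.7 for x in range(1, 5)]

# Treating both as one skill produces a "blip" at the fifth opportunity.
lumped = skill_a + skill_b

lumped_fit = fit_power_law(lumped)
split_sse = fit_power_law(skill_a)[2] + fit_power_law(skill_b)[2]
```

Here the split model leaves almost no residual while the lumped model does not fit. A real comparison would also need to penalize the extra parameters of the split model, since more parameters always fit at least as well.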


[1] A. Newell and P. S. Rosenbloom, “Mechanisms of skill acquisition and the law of practice,” in MIT Press Series in Research in Integrated Intelligence. Cambridge, MA, USA: MIT Press, 1993.
[2] H. Cen, K. Koedinger, and B. Junker, “Automating Cognitive Model Improvement by A*Search and Logistic Regression,” in Proceedings of AAAI 2005 Workshop on Educational Data Mining, 2005.

7. Topic extraction

Consider the question “a+b*c = ?”. It tests three skills or topics: addition, multiplication, and knowing which operator to apply first. From student response data we therefore expect to find three types of errors (apart from correct answers), one for each topic tested. Without human intervention, EDM can analyze the errors and discover these three topics (topic extraction). The next step is to name the discovered topics so that they are meaningful (i.e., multiplication, addition, operator order).

Manually identifying topic information for each question can be tedious, and may not match the statistical model supported by the data itself. An ability to automatically extract topic information from student scores would avoid such tedium and allow instructors to:

  • Determine which topic a student is struggling with the most (problem identification)
  • Give targeted learning suggestions (suggestions)
  • Build a bank of questions with known difficulty levels and known relevancy to each topic in the course (question generation)
  • Generate automatically a test or quiz covering specific topics with a known expected average score (question generation)
  • Compare the data-supported topics (generated by DM) with the instructor’s view on the topics (human expert) to gain a better understanding of the mental model of the students (data-driven and domain knowledge comparison).
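
As a rough illustration of extracting topics from scores alone, the sketch below links questions whose correct/incorrect patterns across students are strongly correlated and reports the linked groups as candidate topics. This is a deliberately simple stand-in, not the method of any paper listed on this page. The function names, the threshold of 0.5, and the score matrix are assumptions made for the example.

```python
def pearson(u, v):
    """Pearson correlation of two equal-length score columns."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # a constant column carries no grouping information
    return cov / (var_u * var_v) ** 0.5


def group_questions(scores, threshold=0.5):
    """Group questions whose score columns correlate above the threshold.

    scores[s][q] is 1 if student s answered question q correctly, else 0.
    Returns a sorted list of groups, each a list of question indices.
    """
    n_q = len(scores[0])
    columns = [[row[q] for row in scores] for q in range(n_q)]
    group_of = list(range(n_q))  # each question starts in its own group
    for i in range(n_q):
        for j in range(i + 1, n_q):
            if pearson(columns[i], columns[j]) > threshold:
                # Merge the two groups by relabelling one of them.
                old, new = group_of[j], group_of[i]
                group_of = [new if g == old else g for g in group_of]
    groups = {}
    for q, g in enumerate(group_of):
        groups.setdefault(g, []).append(q)
    return sorted(groups.values())


# Made-up scores: questions 0-1 and questions 2-3 rise and fall together.
scores = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
topics = group_questions(scores)
```

The groups come back unnamed; deciding that one group is "addition" and another "multiplication" remains a job for the instructor.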

8. Missing data

There are normally students who do not take one or more instruments during the course offering. This produces many missing values in educational data. For any single instrument we can simply discard the students with missing values, but an analysis of the data as a whole needs a strategy for handling them.
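
The difference between discarding per instrument and discarding across the whole data set can be sketched as below. The function names and the score table are hypothetical, and None is assumed to mark an instrument a student did not take.

```python
def instrument_means(scores):
    """Mean score per instrument, skipping only the missing entries.

    scores maps each student to a list of scores, with None marking an
    instrument the student did not take.
    """
    n_instruments = len(next(iter(scores.values())))
    means = []
    for i in range(n_instruments):
        present = [row[i] for row in scores.values() if row[i] is not None]
        means.append(sum(present) / len(present) if present else None)
    return means


def complete_cases(scores):
    """Listwise deletion: keep only students with no missing score."""
    return {s: row for s, row in scores.items() if None not in row}


# Made-up scores for three students on three instruments.
scores = {
    "s1": [80, 70, None],
    "s2": [60, None, 90],
    "s3": [70, 50, 70],
}
means = instrument_means(scores)
kept = complete_cases(scores)
```

In this example the per-instrument means still use every available score, while listwise deletion keeps only one of the three students, which shows how quickly whole-data analyses lose students when missing values are simply dropped.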


EDM Topics per Paper

To be continued

EDM 08 conference (The 1st International Conference on Educational Data Mining)

Montréal, Québec, Canada, June 20-21, 2008

  • DM to classify students (Cristobal)
  • Acquiring background knowledge (Claudia)
  • Evaluate tutorial behaviors (Jack)
  • Labeling student behaviors (Ryan)
  • Adaptive test design with NB framework (Michel)
  • Interestingness measures for Association rules in educational data (Agathe)
  • Item type performance covariance
  • Data driven modeling of student interactions
  • Integrating DM and Pedagogical knowledge
  • Predict Math proficiency and standardized test
  • Response time model
  • Mining student behaviors
  • Classification with genetic programming and C4.5
  • The composition effect: Conjunctive or Compensatory?
  • DM by adding rich reporting capabilities.
  • Rule evaluation measures
  • Mining and visualizing trails in web-based systems
  • Mining student assessment data
  • Logic proof tutoring using hints from historical data
  • Skill set profile clustering based on weighted student responses.

[1] Cristobal Romero, Sebastián Ventura, Pedro G. Espejo and Cesar Hervas. Data Mining Algorithms to Classify Students.
[2] Claudia Antunes. Acquiring Background Knowledge for Intelligent Tutoring Systems.
[3] Jack Mostow and Xiaonan Zhang. Analytic Comparison of Three Methods to Evaluate Tutorial Behaviors.
[4] Ryan Baker and Adriana de Carvalho. Labeling Student Behavior Faster and More Precisely with Text Replays.
[5] Michel Desmarais, Alejandro Villarreal and Michel Gagnon. Adaptive Test Design with a Naive Bayes Framework.
[6] Agathe Merceron and Kalina Yacef. Interestingness Measures for Association Rules in Educational Data.
[7] Ryan Baker, Albert Corbett and Vincent Aleven. Improving Contextual Models of Guessing and Slipping with a Truncated Training Set.
[8] Philip Pavlik, Hao Cen, Lili Wu, and Ken Koedinger. Using Item-type Performance Covariance to Improve the Skill Model of an Existing Tutor.
[9] Manolis Mavrikis. Data-driven modelling of students’ interactions in an ILE.
[10] Roland Hubscher and Sadhana Puntambekar. Integrating Knowledge Gained From Data Mining With Pedagogical Knowledge.
[11] Mingyu Feng, Joseph Beck, Neil Heffernan and Kenneth Koedinger. Can an Intelligent Tutoring System Predict Math Proficiency as Well as a Standardized Test?
[12] Benjamin Shih, Kenneth Koedinger and Richard Scheines. A Response Time Model for Bottom-Out Hints as Worked Examples.
[13] Hogyeong Jeong and Gautam Biswas. Mining Student Behavior Models in Learning by-Teaching Environments.
[14] Collin Lynch, Kevin Ashley, Niels Pinkwart and Vincent Aleven. Argument graph classification with Genetic Programming and C4.5.
[15] Zachary Pardos, Neil Heffernan, Carolina Ruiz and Joseph Beck. The Composition Effect: Conjunctive or Compensatory? An Analysis of Multi-Skill Math Questions in ITS.
[16] Ken Koedinger, Kyle Cunningham, Alida Skogsholm and Brett Leber. An open repository and analysis tools for fine-grained, longitudinal learner data.
[17] Anthony Allevato, Matthew Thornton, Stephen Edwards and Manuel Perez-Quinones. Mining Data from an Automated Grading and Testing System by Adding Rich Reporting Capabilities.

EDM 09 conference (The 2nd International Conference on Educational Data Mining)

Cordoba, Spain, July 1-3, 2009

  • User-driven and data-driven approach for supporting teachers in reflection and adaptation of adaptive tutorials.
  • Does self-discipline impact students’ knowledge and learning?
  • Consistency of students’ pace in online learning
  • Book recommendation
  • Identify skills that separate students (Data grouping)
  • Dirichlet priors
  • Unsupervised, frequency based metric for selecting hints
  • Recommendation in higher education using data mining
  • Collaboration frameworks
  • Predicting correctness of problem solving
  • Question classification
  • Semantic educational data mining
  • Log (what, how, why)
  • Predicting students’ grades using multiple instance genetic programming.
  • Visualization of data


[1] Elizabeth Ayers, Rebecca Nugent, Nema Dean, A Comparison of Student Skill Knowledge Estimates
[2] Ryan Baker, Differences Between Intelligent Tutor Lessons, and the Choice to Go Off-Task
[3] Dror Ben-Naim, Michael Bain, Nadine Marcus, User-Driven and Data-Driven Approach for Supporting teachers in Reflection and Adaptation of Adaptive Tutorials
[4] Javier Bravo Agapito, Alvaro Ortigosa, Detecting Symptoms of Low Performance Using Production Rules
[5] Gerben Dekker, Mykola Pechenizkiy, Jan Vleeshouwers, Predicting Students Drop Out: A Case Study
[6] Mingyu Feng, Joseph Beck, Neil Heffernan, Using Learning Decomposition and Bootstrapping with Randomization to Compare the Impact of Different Educational Interventions on Learning
[7] Yue Gong, Dovan Rai, Joseph Beck, Neil Heffernan, Does Self-Discipline impact students’ knowledge and learning?
[8] Arnon Hershkovitz, Rafi Nachmias, Consistency of Students’ Pace in Online Learning
[9] Tara Madhyastha, Steven Tanimoto, Student Consistency and Implications for Feedback in Online Assessment Systems
[10] Ryo Nagata, Keigo Takeda, Koji Suda, Junichi Kakegawa, Koichiro Morihiro, Edu-mining for Book Recommendation for Pupils
[11] Rebecca Nugent, Elizabeth Ayers, Nema Dean, Conditional Subspace Clustering of Skill Mastery: Identifying Skills that Separate Students
[12] Zachary Pardos, Neil Heffernan, Determining the Significance of Item Order In Randomized Problem Sets
[13] Philip I Pavlik Jr., Hao Cen, Kenneth R. Koedinger, Learning Factors Transfer Analysis: Using Learning Curve Analysis to Automatically Generate Domain Models
[14] David Prata, Ryan Baker, Evandro Costa, Carolyn Rose, Yue Cui, Detecting and Understanding the Impact of Cognitive and Interpersonal Conflict in Computer Supported Collaborative Learning Environments
[15] Dovan Rai, Yue Gong, Joseph Beck, Using Dirichlet priors to improve model parameter plausibility
[16] Steven Ritter, Thomas Harris, Tristan Nixon, Daniel Dickison, R. Charles Murray, Brendon Towle, Reducing the Knowledge Tracing Space
[17] Vasile Rus, Mihai Lintean, Roger Azevedo, Automatic Detection of Student Mental Models During Prior Knowledge Activation in MetaTutor
[18] Marian Simko, Maria Bielikova, Automatic Concept Relationships Discovery for an Adaptive E-course
[19] John Stamper, Tiffany Barnes, An unsupervised, frequency-based metric for selecting hints in an MDP-based tutor
[20] Cesar Vialardi Sacin, Javier Bravo Agapito, Leila Shafti, Alvaro Ortigosa, Recommendation in Higher Education Using Data Mining Techniques

The proceedings are available for internal reference.



Listed below are a few articles in the Educational Data Mining area:


  1. J. Beck, “Workshop on Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes,” in Lecture notes in computer science vol. 3220: Springer-Verlag 2004.
  2. J. Mostow, “Some useful design tactics for mining ITS data,” in Proceedings of the ITS2004 Workshop on Analyzing Student-Tutor Interaction Logs to Improve Educational Outcomes, Alagoas, Brazil, 2004.
  3. V. Gueraud and J.-M. Cagnat, “Suivi à distance de classe virtuelle active,” in Proceedings of Technologies de l’Information et de la Connaissance dans l’Enseignement Supérieur et l’Industrie (TICE 2004), UTC Compiègne, France, 2004, pp. 377-383.
  4. P. Duval, A. Merceron, M. Scholl, and L. Wargon, “Empowering learning objects: an experiment with the ganesha platform,” in Proceedings of ED-MEDIA 2005, Montreal, Canada, 2005.
  5. R. Mazza and V. Dimitrova, “CourseVis: Externalising Student Information to Facilitate Instructors in Distance Learning,” in Proceedings of 11th International Conference on Artificial Intelligence in Education (AIED03), Sydney, Australia, 2003.
  6. B. Minaei-Bidgoli, D. A. Kashy, G. Kortemeyer, and W. F. Punch, “Predicting student performance: an application of data mining methods with the educational web-based system LON-CAPA,” in Proceedings of ASEE/IEEE Frontiers in Education Conference, 2003.
  7. A. Merceron and K. Yacef, “A Web-based Tutoring Tool with Mining Facilities to Improve Learning and Teaching,” in Proceedings of 11th International Conference on Artificial Intelligence in Education, Sydney, Australia, 2003, pp. 201-208.
  8. C. Romero, S. Ventura, C. d. Castro, W. Hall, and M. H. Ng, “Using Genetic Algorithms for Data Mining in Web-based Educational Hypermedia Systems,” in Proceedings of AH2002 workshop Adaptive Systems for Web-based Education, Malaga, Spain, 2002.
  9. A. Merceron and K. Yacef, “Educational Data Mining: a Case Study,” in Artificial Intelligence in Education, AIED2005 Amsterdam, The Netherlands 2005.
  10. H. Cen, K. Koedinger, and B. Junker, “Automating Cognitive Model Improvement by A*Search and Logistic Regression,” in Proceedings of AAAI 2005 Workshop on Educational Data Mining, 2005.
  11. A. Jonsson, J. Johns, H. Mehranian, I. Arroyo, B. Woolf, A. Barto, D. Fisher, and S. Mahadevan, “Evaluating the Feasibility of Learning Student Models from Data,” in AAAI Workshop on Educational Data Mining, Pittsburgh, PA, 2005.
  12. A. Dempster, N. Laird, and D. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1–38, 1977.
  13. T. Dean and K. Kanazawa, “A model for reasoning about persistence and causation,” International Journal of Computational Intelligence, vol. 5, pp. 142–150, 1989.
  14. M. Mayo and A. Mitrovic, “Optimising ITS behaviour with Bayesian networks and decision theory,” International Journal of Artificial Intelligence in Education, vol. 12, pp. 124–153, 2001.
  15. T. Winters, C. Shelton, T. Payne, and G. Mei, “Topic Extraction from Item-Level Grades,” in AAAI-05 Workshop on Educational Data Mining, 2005.


Full texts are available for internal reference.