Did You Lose Them? Predicting the Exact Moment of Disengagement via Multimodal VLM Classroom Orchestration in Education

Abstract

We propose a multimodal vision-language model (VLM) framework that acts as a digital teaching coordinator. It integrates visual behavior (YOLOv12 + DINOv2), slide semantics (TrOCR), and teacher speech (Fairseq S2T), and fuses them via a Llama-2-based VLM. Deployed in a Python programming course, the system reaches 83.22% overall engagement accuracy and 81.62% precision on disengagement onset, with an average detection latency of 12.3 s.

Problem & Motivation

In traditional classrooms, instructors struggle to detect shifts in student attention in real time, and by the time disengagement is obvious, the teachable moment has often passed. Existing AI monitoring systems focus on a single visual modality, lack pedagogical context, and can induce surveillance anxiety that lowers engagement further.

Method

We propose a multimodal vision-language model framework serving as a digital teaching coordinator. It integrates three data streams — privacy-preserving visual-behavior features (YOLOv12 + DINOv2), slide semantic complexity (TrOCR), and teacher speech patterns (Fairseq S2T) — and fuses them via a Llama-2-based VLM for cross-modal reasoning. The system was evaluated in an undergraduate Python programming course at National Central University with over 30 students.
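
As a rough illustration of how three streams might be combined for one time window, the sketch below packs per-window summaries into a single text prompt for a language model. It is not the authors' pipeline: text-level fusion is swapped in here for simplicity, and the `WindowFeatures` fields, the `build_fusion_prompt` helper, the prompt wording, and the label names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class WindowFeatures:
    """Per-window summaries of the three streams (hypothetical fields)."""
    visual_summary: str      # aggregated, non-identifying behavior description
    slide_text: str          # text recovered from the current slide
    speech_transcript: str   # transcript of the teacher's speech in the window


def build_fusion_prompt(window: WindowFeatures, labels: Sequence[str]) -> str:
    """Combine the three streams into one prompt (assumed text-level fusion)."""
    if not labels:
        raise ValueError("at least one engagement label is required")
    return "\n".join([
        "You are assisting an instructor during a live lecture.",
        f"Visual behavior (anonymized): {window.visual_summary}",
        f"Slide content: {window.slide_text}",
        f"Teacher speech: {window.speech_transcript}",
        "Choose one engagement label from: " + ", ".join(labels) + ".",
        "Then suggest one teaching adjustment.",
    ])


if __name__ == "__main__":
    # Placeholder labels: the source reports five classes but does not name them.
    placeholder_labels = [f"level_{i}" for i in range(5)]
    example = WindowFeatures(
        visual_summary="most students looking down, little note-taking",
        slide_text="Recursion: base case and recursive case",
        speech_transcript="so the function calls itself until the base case",
    )
    print(build_fusion_prompt(example, placeholder_labels))
```

The prompt string would then be passed to whichever model performs the cross-modal reasoning; that call is omitted because the source does not describe its interface.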

Findings

83.22% overall accuracy across five engagement classes, with disengagement F1 = 0.81.
81.62% precision on disengagement onset, with mean detection latency of 12.3 seconds.
Multimodal fusion outperforms a vision-only baseline, which reaches 67.31% against 83.22% for the full system.
The VLM produces context-aware pedagogical suggestions such as simplifying content, slowing pace, or increasing interactivity.
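
For readers unfamiliar with the reported metrics, the following minimal sketch shows how precision, F1, and mean detection latency are conventionally computed. The counts and timestamps are toy values invented for illustration, not data from the study, and the assumption that each true onset is already paired with one detection is ours; the function names are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_onset_latency(true_onsets: Sequence[float],
                       detected_onsets: Sequence[float]) -> float:
    """Mean delay in seconds between each true onset and its paired detection.

    Assumes the two sequences are aligned one-to-one (a simplification).
    """
    if len(true_onsets) != len(detected_onsets) or not true_onsets:
        raise ValueError("need equally sized, non-empty onset sequences")
    delays = [d - t for t, d in zip(true_onsets, detected_onsets)]
    return sum(delays) / len(delays)


if __name__ == "__main__":
    # Toy numbers only, chosen so the arithmetic is easy to follow.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(round(p, 2), round(r, 2), round(f1, 2))          # 0.8 0.8 0.8
    print(mean_onset_latency([10.0, 50.0], [20.0, 64.0]))  # 12.0
```

Precision on onset answers "of the onsets the system flagged, how many were real", while latency measures how long after a real onset the flag arrives; the two are reported separately because a system can be accurate but slow.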

Implications

The work argues for a paradigm shift toward Synchronous Pedagogy, where AI is positioned not as a monitoring tool but as an ethical cognitive co-pilot — coordinating instructional intent with student engagement while preserving privacy and dignity, and letting instructors adjust strategy in-flight rather than post-hoc.

Citation

E. N. Furqon and C.-K. Chang, “Did You Lose Them? Predicting the Exact Moment of Disengagement via Multimodal VLM Classroom Orchestration in Education,” in IEEE ICALT 2026, 2026.

BibTeX

@inproceedings{furqon2026vlm_disengagement,
  author    = {Elvin Nur Furqon and Chia-Kai Chang},
  title     = {Did You Lose Them? Predicting the Exact Moment of Disengagement via Multimodal {VLM} Classroom Orchestration in Education},
  booktitle = {Proc. IEEE Int. Conf. on Advanced Learning Technologies (ICALT)},
  year      = {2026},
  month     = jul,
}
