Design of a Distributed Semantic Analysis System for Scalable Processing of Heterogeneous Text Data

Abstract

Quantifying the semantic divergence of student questioning in large-scale tutor dialogue requires more than pairwise cosine distance. We propose CSVI (Core Semantic Variance Index), which is based on the eigenvalue spectrum of the embedding covariance matrix, and pair it with a distributed analytics service built on Apache Spark. CSVI outperforms both the ACD and Entropy baselines at sentence, dialogue, and document granularities, leading ACD, the stronger of the two baselines, by +3.81 to +13.90 percentage points.

Problem & Motivation

Dialogue text accumulating from large-scale AI course tutors spans sentence-, dialogue-, and document-level granularities. Existing methods for quantifying the semantic divergence of student questioning rely on term frequency, TF-IDF distance, or pairwise cosine distance. These methods capture only local pairwise relationships, fail to characterize the global structure of the high-dimensional semantic space, and produce inconsistent readings across text types.

Method

We propose CSVI (Core Semantic Variance Index). After embedding the text, we compute the eigenvalue spectrum of the embedding covariance matrix and quantify the effective dimensionality of the semantic distribution with the Participation Ratio. A sample-size correction then maps the score into the range 0 to 1, so that values are comparable across dialogues of different lengths.

At the system level we implement an Apache Spark-based distributed semantic-analytics service with three independently scalable stages: ingestion and preprocessing (distributed structure detection, length-based partitioning, text cleaning), semantic embedding (Transformer inference with broadcast model weights to minimize shuffle cost), and CSVI computation (PCA-based eigen-decomposition on Spark MLlib distributed linear-algebra primitives). Stages exchange data through in-memory pipelines to minimize I/O, and partitioned key-value stores track per-user semantic-divergence trajectories.
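As a rough single-machine illustration of the CSVI idea, the sketch below computes the Participation Ratio, (sum of eigenvalues)^2 / (sum of squared eigenvalues), from the covariance matrix of a set of embeddings using NumPy. The summary does not spell out the sample-size correction, so the normalization used here (dividing by the largest value the ratio can reach, min(n - 1, d)) is an assumption made for illustration, and the function names are hypothetical. It is not the Spark MLlib implementation described above.

```python
import numpy as np


def participation_ratio(embeddings: np.ndarray) -> float:
    """Effective dimensionality: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    if embeddings.shape[0] < 2:
        return 0.0  # covariance is undefined for fewer than two texts
    # Rows are texts, columns are embedding dimensions.
    cov = np.atleast_2d(np.cov(embeddings, rowvar=False))
    # eigvalsh handles symmetric matrices; clip tiny negative round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    if total == 0.0:
        return 0.0  # all embeddings identical: no variance
    return float(total ** 2 / np.square(eigvals).sum())


def csvi_sketch(embeddings: np.ndarray) -> float:
    """ASSUMED correction: divide by min(n - 1, d), the ceiling of the ratio."""
    n, d = embeddings.shape
    if n < 2:
        return 0.0
    max_rank = min(n - 1, d)  # a sample covariance has rank at most n - 1
    return min(1.0, participation_ratio(embeddings) / max_rank)


# Synthetic stand-ins for embeddings (not real tutor dialogue data).
rng = np.random.default_rng(0)
focused = rng.normal(size=(50, 8)) * np.array([1.0] + [0.01] * 7)
diverse = rng.normal(size=(50, 8))
print("focused:", csvi_sketch(focused))
print("diverse:", csvi_sketch(diverse))
```

Under this assumed normalization, questions whose embeddings vary along a single direction score near 1/d, while questions spread evenly over many directions score closer to 1.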

Findings

Multilingual-STSB (sentence-level): CSVI 86.35% / ACD 82.54% / Entropy 73.55%.
Python Class GPT Dialogue (dialogue-level): CSVI 93.05% / ACD 79.15% / Entropy 65.34%.
20 Newsgroups (document-level): CSVI 88.32% / ACD 77.75% / Entropy 70.62%.
CSVI leads ACD by +3.81 / +13.90 / +10.57 percentage points on the three corpora, respectively.
Embedding-based methods (CSVI, ACD) outperform the lexical-statistical method (Entropy) on all three corpora.
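The margins above follow directly from the reported scores. The short check below recomputes them; the dictionary layout and variable names are illustrative only, and the numbers are copied from the findings.

```python
# Reported scores in percent, copied from the findings above.
scores = {
    "Multilingual-STSB": {"CSVI": 86.35, "ACD": 82.54, "Entropy": 73.55},
    "Python Class GPT Dialogue": {"CSVI": 93.05, "ACD": 79.15, "Entropy": 65.34},
    "20 Newsgroups": {"CSVI": 88.32, "ACD": 77.75, "Entropy": 70.62},
}

# Margin of CSVI over ACD, in percentage points, per corpus.
margins = {
    corpus: round(s["CSVI"] - s["ACD"], 2) for corpus, s in scores.items()
}

# True if both embedding-based methods beat Entropy on every corpus.
embedding_beats_entropy = all(
    min(s["CSVI"], s["ACD"]) > s["Entropy"] for s in scores.values()
)

for corpus, margin in margins.items():
    print(f"{corpus}: CSVI leads ACD by {margin:+.2f} points")
print("Embedding-based methods beat Entropy everywhere:", embedding_beats_entropy)
```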

Implications

Quantifying the semantic divergence and convergence of student questioning gives instructors a way to infer learner cognitive state and depth of understanding without relying solely on exam scores. CSVI provides a fine-grained, continuous view of the learning process and can underpin differentiated instruction and adaptive intervention. Paired with the distributed service architecture, semantic analysis remains timely at the scale of full courses and supports monitoring of semantic trends across courses and classes.

Citation

Y.-Y. Chang, M.-C. Tsai, Y.-C. Chien, Y.-Z. Chai, and C.-K. Chang, “Design of a Distributed Semantic Analysis System for Scalable Processing of Heterogeneous Text Data,” in IEEE BigDataService 2026, 2026.

BibTeX

@inproceedings{chang2026distributed_semantic,
  author    = {Yan-Yu Chang and Min-Chun Tsai and Yu-Chen Chien and Yu-Zhen Chai and Chia-Kai Chang},
  title     = {Design of a Distributed Semantic Analysis System for Scalable Processing of Heterogeneous Text Data},
  booktitle = {Proc. IEEE Int. Conf. on Big Data Computing Service and Machine Learning Applications (BigDataService)},
  address   = {Fukuoka, Japan},
  year      = {2026},
  month     = jul,
}
