SemanticGPT In-Class User Study: VR vs. 2D Comparison
Sanil Desai, Spring 2026
Overview
A within-subjects, counterbalanced crossover study comparing a VR visualization of GPT-2 attention/embedding trajectories (SemanticGPT, on Meta Quest 3) against an equivalent 2D Unity web visualization. Goal: test whether immersive 3D adds anything over a flat screen for understanding how transformer layers progressively resolve word ambiguity.
Methodology
Participants: 8 classmates from CSCI 1951T.
Design: Counterbalanced crossover, two order groups of 4 participants each. Group A did VR first then 2D. Group B did 2D first then VR.
Stimuli: Four polysemous-word sentences with known correct clusters — bat_cave (Animals), bat_swing (Sports), bank_deposit (Finance), bank_river (Nature). In each modality, participants stepped through 13 GPT-2 representation stages (the initial embedding plus the outputs of the 12 transformer layers) and identified which semantic cluster the target word migrated toward.
Measures (collected via Google Form): cluster identification accuracy, NASA-TLX subscales (Mental Demand, Physical Demand, Effort, Frustration; each rated 1–7 for each modality), three preference comparisons (cluster identification, path understanding, learning), self-rated confidence in explaining contextual embeddings, and open-response fields for what each modality made easier or harder.
Results
Accuracy — no measurable difference
VR: 9/16 correct. 2D: 8/16 correct. Per-participant Wilcoxon test: W = 16.0, p = 1.00. Differences split evenly in both directions. Two sentences (bat_swing, bank_river) were hard in both modalities and dragged overall accuracy down.
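The accuracy comparison can be sketched with SciPy's paired Wilcoxon test. The per-participant scores below are hypothetical placeholders (the study's raw data are not reproduced here), so the resulting statistic and p-value will not match W = 16.0, p = 1.00 exactly:

```python
from scipy.stats import wilcoxon

# Hypothetical per-participant correct counts out of 4 sentences
# (placeholders, NOT the study's raw data): VR sums to 9, 2D to 8.
vr_correct = [2, 1, 1, 2, 1, 0, 1, 1]
d2_correct = [1, 2, 2, 1, 0, 1, 1, 0]

# Paired (within-subjects) test on per-participant differences;
# zero differences are dropped by the default zero_method="wilcox".
res = wilcoxon(vr_correct, d2_correct)
print(res.statistic, res.pvalue)
```

With counts this coarse (many tied |differences| of 1), SciPy falls back to a normal approximation rather than the exact distribution, which is one more reason the test has little to say at n = 8.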
NASA-TLX — medium effects favor VR on Effort and Frustration
All scales are 1–7 with lower being better. Effect size r is rank-biserial.
Mental Demand: VR mean 3.50, 2D mean 3.25, r = 0.14 (negligible)
Physical Demand: VR mean 2.25, 2D mean 1.62, r = 0.48 (2D easier, medium effect)
Effort: VR mean 3.00, 2D mean 3.75, r = 0.43 (VR easier, medium effect)
Frustration: VR mean 2.75, 2D mean 3.38, r = 0.39 (VR less frustrating, medium effect)
No subscale reaches conventional significance at n = 8, but the effect-size pattern is consistent: VR is slightly more physically demanding, but participants reported working less hard and feeling less frustrated.
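The rank-biserial values above can be computed from the paired ratings in a few lines. This is a generic matched-pairs rank-biserial sketch, r = (T+ − T−) / (T+ + T−), not the study's actual analysis script:

```python
from scipy.stats import rankdata

def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation for paired samples x, y.

    r = (T+ - T-) / (T+ + T-), where T+ and T- are the Wilcoxon
    signed-rank sums for positive and negative differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]          # drop zero differences
    ranks = rankdata([abs(d) for d in diffs])     # rank |d|, ties averaged
    t_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (t_pos - t_neg) / (t_pos + t_neg)

# All differences in one direction -> maximal effect, r = 1.0
print(rank_biserial([2, 3, 4], [1, 1, 1]))
```

The sign convention is directional (positive r means the first sample's ratings are higher), so with 1–7 scales where lower is better, the sign indicates which modality was easier.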
Preferences — unanimous on cluster identification
Easier to identify clusters: 8/8 (100%) preferred VR. Zero same. Zero preferred 2D.
Easier to understand path: 6/8 (75%) preferred VR. 2/8 preferred 2D.
Prefer for learning AI/NLP: 6/8 (75%) preferred VR. 1/8 same. 1/8 preferred 2D.
Unanimity on cluster identification is the headline finding: binomial p = 0.008, statistically significant despite the small sample. Four participants said "much easier in VR," four said "slightly easier in VR," zero favored 2D.
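The binomial p-value is easy to verify: under a no-preference null, each participant picks VR or 2D with probability 0.5, so an 8-of-8 split has two-sided p = 2/2⁸ ≈ 0.0078. A minimal check with SciPy:

```python
from scipy.stats import binomtest

# 8 of 8 participants preferred VR; null hypothesis: each side
# is equally likely (p = 0.5).
res = binomtest(k=8, n=8, p=0.5, alternative="two-sided")
print(res.pvalue)  # 0.0078125 == 2 / 2**8
```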
Qualitative Themes
VR advantages. Participants consistently called out 3D spatial separation as the key win: clusters that overlapped in 2D became visibly distinct in VR, and physical movement around the scene made distance estimation more intuitive. Representative quote: "Much easier to see spatial differences between word clusters while moving around in VR."
2D advantage. The one place 2D came out ahead was tracing the word's path across layers — easier to see drastic shifts from a fixed external viewpoint. "The path the words followed was slightly easier to notice in 2D."
Suggestions from participants. Add depth shading on spheres, ground-plane shadows, a color-distance gradient (gray for far clusters, saturated for near), and explore an AR variant.
Takeaways
VR is the unanimous preference for identifying which sense a word resolves to (8/8), even though objective accuracy did not differ.
VR reduces subjective workload (Effort, Frustration) at a small cost in Physical Demand — a reasonable trade for exploratory tasks.
Path understanding is the one place 2D still has a niche; sequential transitions read more cleanly on a flat screen.
Modality did not change accuracy — VR helps users feel more confident, but does not help them get the right answer on hard sentences.
Limitations
With n = 8, the paired Wilcoxon test has a discrete null distribution, and the p-values attainable from it keep medium effects on the NASA-TLX subscales well short of conventional significance. The binomial preference test handles this cleanly, which is why cluster identification is the strongest result. Two stimulus sentences were hard in both conditions and should be refined before a larger replication.
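The discreteness argument can be made concrete by enumerating the exact Wilcoxon null distribution for n = 8 pairs. This sketch assumes untied, non-zero differences (a simplification: real 1–7 ratings produce ties and zeros, which coarsen the distribution further). Even a lopsided rank split with a smaller rank sum of 10, roughly the size of the medium effects above, stays far from p = .05:

```python
from itertools import combinations

def wilcoxon_exact_p(t_small, n=8):
    """Exact two-sided p for a Wilcoxon signed-rank statistic t_small
    (the smaller rank sum), assuming n untied non-zero pairs: every
    sign pattern over ranks 1..n is equally likely under the null."""
    subset_sums = [sum(c) for k in range(n + 1)
                   for c in combinations(range(1, n + 1), k)]
    p = 2 * sum(s <= t_small for s in subset_sums) / len(subset_sums)
    return min(1.0, p)

print(wilcoxon_exact_p(0))   # 0.0078125 -- smallest attainable p at n = 8
print(wilcoxon_exact_p(10))  # 0.3125    -- medium-effect-sized split
```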
Related Pages
Spring 2026 Project 1 Complete Walkthrough
Tutorial: Visualizing Word Embeddings in 3D Using Unity + Meta Quest
Comparison Table: 2D Projection vs Unity 3D
Word2Vec to Unity VR Pipeline Tutorial