Word embeddings are typically treated as point data — each word is a location in space, and visualization means showing those locations as dots. This project reframes the same data as a scientific field: a continuous function defined everywhere in space, sampled and reconstructed from the sparse word positions. This framing connects directly to the core methodology of scientific visualization, which has used scalar and vector field techniques for decades in domains from fluid dynamics to meteorology to neuroscience.
In computational fluid dynamics (CFD), a velocity field assigns a vector to every point in space, describing the direction and speed of fluid flow. In meteorology, a pressure field assigns a scalar to every point, and gradient lines show how pressure changes across a weather system. In MRI diffusion tensor imaging, a tensor field describes the preferred direction of water molecule diffusion at every voxel, revealing white matter tract structure in the brain.
This project applies the same representational structure to semantic space:
- The KDE density field is the scalar field — f(x, y, z) gives the semantic intensity of each category at every point.
- The gradient of the density field is the vector field — ∇f(x, y, z) gives the direction of steepest semantic increase at every point.
- The streamlines integrated along the gradient are the equivalent of velocity streamlines in CFD — paths that follow the flow from lower to higher semantic concentration.
- The fog cloud volumes rendered from density thresholds are the equivalent of isosurfaces in volume rendering — surfaces of constant field value that reveal the shape of the semantic landscape.
The only difference from traditional scientific field visualization is the origin of the data: instead of solving partial differential equations or measuring physical quantities, the field is reconstructed from sparse samples (word positions) using kernel density estimation.
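That reconstruction step can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation in build_scalar_field.py: the kernel is an isotropic Gaussian, and the bandwidth, grid extent, and sample data are invented for the example.

```python
import numpy as np

def kde_field(points, grid_axes, bandwidth=0.3):
    """Reconstruct a continuous density field from sparse 3D samples.

    points: (N, 3) word positions for one category.
    grid_axes: (xs, ys, zs) 1D arrays defining the sample grid.
    Returns a (len(xs), len(ys), len(zs)) scalar field.
    """
    xs, ys, zs = grid_axes
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1)           # (X, Y, Z, 3)
    # Sum an isotropic Gaussian kernel centred on each sample point.
    diff = grid[..., None, :] - points               # (X, Y, Z, N, 3)
    sq = np.sum(diff ** 2, axis=-1)                  # squared distances
    return np.exp(-sq / (2 * bandwidth ** 2)).sum(axis=-1)

# Two tight clusters standing in for word positions of one category
rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(-1, 0.1, (30, 3)),
                      rng.normal(+1, 0.1, (30, 3))])
axes = tuple(np.linspace(-2, 2, 20) for _ in range(3))
field = kde_field(pts, axes)   # dense near the clusters, near zero elsewhere
```

The point of the sketch is the shape of the output: a dense grid of scalar values defined everywhere, recovered from a sparse point set, exactly as a particle-based fluid solver recovers a continuous pressure field.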
Treating embeddings as field data rather than point data changes what can be visualized and what questions can be asked.
Point data answers: Which words are close to which other words? Which cluster does this word belong to?
Field data answers: How does meaning change as you move from one region to another? Where are the boundaries between semantic categories? Is the transition from Emotions to Moral Concepts gradual or sharp? Which regions act as semantic attractors — pulling the gradient flow toward them — and which are saddle points where two categories compete?
These are structurally different questions, and they require different visualization techniques to answer. A scatterplot cannot show the direction of semantic change. A scalar field can.
| This Project | Scientific Visualization |
|---|---|
| KDE density grid | Scalar field (pressure, temperature, concentration) |
| `numpy.gradient` on density | Central-difference gradient approximation |
| Gradient vector field | Velocity field / force field |
| RK2 streamline integration | Standard streamline integration in CFD |
| Fog volume voxels | Volume rendering / isosurface rendering |
| Density threshold | Isovalue in marching cubes / isosurface extraction |
| Probe % readout | Field probe / point query in scientific visualization tools |
| Category boundary (where gradient splits) | Separatrix in dynamical systems / watershed boundary |
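To illustrate the probe row: a field probe is just an interpolated point query against the density grid. A minimal trilinear-interpolation sketch (the function name and parameters are hypothetical, not taken from the project):

```python
import numpy as np

def probe(field, point, origin, cell_size):
    """Trilinearly interpolate a scalar field at an arbitrary world point.

    field: (X, Y, Z) density grid; origin: world position of voxel (0,0,0);
    cell_size: edge length of one voxel.
    """
    # Continuous grid coordinates of the query point
    u = (np.asarray(point) - origin) / cell_size
    i0 = np.clip(np.floor(u).astype(int), 0, np.array(field.shape) - 2)
    f = u - i0                                  # fractional offsets in [0, 1)
    x, y, z = i0
    fx, fy, fz = f
    c = field[x:x + 2, y:y + 2, z:z + 2]        # the 8 surrounding voxels
    # Interpolate along x, then y, then z
    c = c[0] * (1 - fx) + c[1] * fx
    c = c[0] * (1 - fy) + c[1] * fy
    return c[0] * (1 - fz) + c[1] * fz
```

This is the same point query a tool like ParaView performs when a probe widget reports the field value under the cursor; normalizing the result against the field maximum would give a percentage readout.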
The RK2 integration used in FlowLineRenderer.cs is the same midpoint method used in ParaView, VTK, and other standard scientific visualization tools for integrating streamlines through velocity fields. The KDE reconstruction used in build_scalar_field.py is functionally equivalent to the kernel smoothing used in particle-based fluid simulations to reconstruct continuous pressure fields from discrete particle positions.
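A Python sketch of that midpoint scheme, applied to climbing a density field. The project's version lives in C# in FlowLineRenderer.cs; the nearest-voxel gradient lookup, the step size, and the threshold value here are simplifications for illustration. The `min_density` parameter mirrors the minDensityToStart threshold described below.

```python
import numpy as np

def trace_streamline(density, start, step=0.5, n_steps=100, min_density=0.05):
    """Integrate a streamline up the density gradient with the midpoint (RK2) method."""
    # Central-difference gradient: one vector per voxel (the vector field)
    gx, gy, gz = np.gradient(density)
    grad = np.stack([gx, gy, gz], axis=-1)

    def sample(p):  # nearest-voxel lookup, clamped to the grid
        i = tuple(np.clip(np.round(p).astype(int), 0, np.array(density.shape) - 1))
        return grad[i], density[i]

    p = np.asarray(start, dtype=float)
    _, d = sample(p)
    if d < min_density:              # don't seed streamlines in sparse regions
        return np.array([p])
    path = [p.copy()]
    for _ in range(n_steps):
        g1, _ = sample(p)
        g2, _ = sample(p + 0.5 * step * g1)   # midpoint evaluation (RK2)
        if np.linalg.norm(g2) < 1e-8:
            break                              # flat region or local peak
        p = p + step * g2
        path.append(p.copy())
    return np.array(path)

# Demo: a single Gaussian bump at the centre of a 20^3 grid
ii = np.indices((20, 20, 20))
density = np.exp(-np.sum((ii - 10.0) ** 2, axis=0) / (2 * 3.0 ** 2))
path = trace_streamline(density, start=(6.0, 10.0, 10.0))
```

Run on the demo field, the streamline marches from the seed toward the density peak and stops where the gradient vanishes, while a seed placed in the near-zero corner is rejected by the density threshold.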
The scalar field + gradient flow pipeline used here is not specific to word embeddings. Any dataset that can be embedded in 3D space and assigned to categories or scalar values can be visualized using the same approach:
- Image embeddings from a CNN — regions of visual similarity rendered as density fields
- Sentence embeddings from a transformer — topic clusters visualized as semantic terrain
- Protein embeddings from a structure prediction model — functional similarity regions shown as field landscape
- Any tabular dataset after UMAP or t-SNE reduction — the field reconstruction step is upstream of any domain-specific meaning
The pipeline steps are: embed → reduce to 3D → KDE per category → gradient → export → render. All six steps are domain-agnostic.
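Under stated assumptions, the numerical core of that pipeline can be sketched end to end. Everything here is a stand-in: random vectors play the role of real embeddings, and PCA via SVD substitutes for UMAP, which is enough to show that the steps compose without any domain-specific logic.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. Embed: stand-in high-dimensional embeddings, two categories
emb = rng.normal(size=(100, 50))
labels = np.array([0] * 50 + [1] * 50)
emb[labels == 1] += 2.0                      # separate the categories

# 2. Reduce to 3D: PCA via SVD stands in for UMAP here
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pos3d = centered @ vt[:3].T                  # (100, 3) point positions

# 3. KDE per category on a shared 20^3 grid (isotropic Gaussian kernel)
axes = [np.linspace(pos3d[:, k].min(), pos3d[:, k].max(), 20) for k in range(3)]
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)
fields = {}
for c in (0, 1):
    pts = pos3d[labels == c]
    sq = np.sum((grid[..., None, :] - pts) ** 2, axis=-1)
    fields[c] = np.exp(-sq / (2 * 1.0 ** 2)).sum(axis=-1)

# 4. Gradient of each scalar field (central differences)
grads = {c: np.stack(np.gradient(f), axis=-1) for c, f in fields.items()}

# 5. Export: the arrays above are what gets serialized for the renderer
# 6. Render: engine-side, out of scope for this sketch
```

Swapping in CNN, transformer, or protein embeddings changes only step 1; steps 2 through 6 are untouched.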
The analogy to physical fields has real limits that are worth making explicit.
The reconstructed field is not the true field. KDE is an estimate of the density from which the word positions were sampled — it is not a measurement of an actual underlying continuous function. The gradient of a KDE estimate has high variance in sparse regions, which is why streamlines outside the dense cluster areas are unreliable and were excluded by the minDensityToStart threshold.
The gradient does not have physical units. In CFD, a pressure gradient has units of Pascals per meter and directly corresponds to a force. The semantic gradient has no such interpretation — it is a direction in an arbitrary coordinate system produced by UMAP, which does not preserve global distances. Moving "uphill" in the semantic gradient means moving toward higher KDE density, not toward any physically meaningful quantity.
The field is static. Physical fields evolve over time — weather systems move, fluid flows change. The semantic field here is computed offline from a fixed set of word positions and does not change. A dynamic semantic field — for example, tracking how a language model's internal representations shift during fine-tuning — would be a natural extension of this work but would require animated field rendering.
Grid resolution limits accuracy. The 20×20×20 grid used here has a cell size of approximately 12.5 cm in a 2.5 m cloud. Features smaller than this resolution — tight word clusters, sharp boundaries — are smoothed out by the grid. Higher resolution improves accuracy but increases rendering cost cubically.
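The arithmetic behind that trade-off is easy to check and makes the cubic cost concrete: doubling the per-axis resolution multiplies the voxel count (and with it memory and rendering work) by eight.

```python
# Cell size and voxel count at a few grid resolutions for a 2.5 m cloud
extent = 2.5  # metres
for n in (20, 40, 80):
    print(f"{n:>3} per axis: cell = {extent / n * 100:.2f} cm, voxels = {n**3:,}")
# →  20 per axis: cell = 12.50 cm, voxels = 8,000
```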