The Index for Everything

Data collection is a solved problem. Search isn't.

Petabytes of autonomous vehicle footage sit in cold storage. Molecular structures that could unlock next-generation batteries have never been queried. Surveillance archives go unsearched. Advances in multimodal embeddings - models that can jointly annotate, label, and parse thousands of data points across video, sensor data, and molecular structures - are beginning to make search at this scale practical.

The "Google for Materials"

Discovering a new material has traditionally taken years of physical experimentation. The possibility space - every molecular configuration that could theoretically exist - is functionally infinite, and researchers have always been able to explore only a fraction of it. Not because the data didn't exist, but because there was no way to search it.

CuspAI is building the index for the rest, training its models on proprietary materials datasets, licensed scientific literature, and experimental feedback piped in directly from commercial partners. Where AlphaFold taught us the shape of biological molecules that already exist, CuspAI is designing for the unknown. The platform inverts the traditional scientific method: rather than starting with a material and asking what it can do, scientists describe the properties they need and the platform works backwards - generating candidates, running each through a built-in physics engine that simulates electron density and structural stability, and adjusting based on that feedback. The result is a ranked list of synthesizable structures, some known materials never considered for this application, others novel configurations that don't yet exist.
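
To make the "describe the property, work backwards" loop concrete, here is a toy sketch in Python. It is not CuspAI's method: the candidate encoding (four numbers between 0 and 1), the `predicted_property` and `simulate_stability` functions, and the mutate-and-keep-the-best search are all hypothetical stand-ins for a real generative model and physics engine. It only illustrates the shape of the loop: generate, simulate, adjust, rank.

```python
import random

def simulate_stability(candidate):
    # Hypothetical stand-in for a physics engine: lower value = "more stable".
    return sum((x - 0.5) ** 2 for x in candidate)

def predicted_property(candidate):
    # Toy property model (an assumption): the mean of the parameters.
    return sum(candidate) / len(candidate)

def score(candidate, target):
    # Lower is better: distance from the requested property plus a stability penalty.
    return abs(predicted_property(candidate) - target) + 0.1 * simulate_stability(candidate)

def inverse_design(target, n_candidates=20, n_rounds=30, keep=5, seed=0):
    rng = random.Random(seed)
    # Generate an initial pool of random candidates.
    pool = [[rng.random() for _ in range(4)] for _ in range(n_candidates)]
    for _ in range(n_rounds):
        # Score every candidate and keep the best few.
        pool.sort(key=lambda c: score(c, target))
        survivors = pool[:keep]
        # Adjust: propose new candidates by perturbing the survivors.
        children = []
        for _ in range(n_candidates - keep):
            parent = rng.choice(survivors)
            children.append([min(1.0, max(0.0, x + rng.gauss(0, 0.05))) for x in parent])
        pool = survivors + children
    # Return the final pool as a ranked list, best first.
    return sorted(pool, key=lambda c: score(c, target))

ranked = inverse_design(target=0.7)
print(ranked[0], score(ranked[0], 0.7))
```

The point of the sketch is the direction of the search: the input is the property you want, and the output is a ranked list of candidates, rather than the other way round.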

Meta, Georgia Tech, And Cusp AI Launch The Largest Open Direct Air Capture Dataset

[According to Meta's official release, this dataset is powered by machine learning models that simulate and predict the properties of sorbent materials, allowing users to rapidly screen candidate materials]

Making Invisible Data Visible

The autonomous vehicle industry has a data problem - not a shortage of it, but a surplus. A single vehicle can generate terabytes of footage daily; multiply that across an entire fleet and you have a petabyte-scale archive that is, for practical purposes, invisible. Companies developing self-driving cars, robots, and autonomous construction equipment collect millions of hours of footage, and until recently, organizing it was a human job. Someone had to watch it. Even on fast-forward, how can that scale?

Nomadic converts raw footage into a structured, searchable dataset by processing RGB camera feeds, LiDAR point clouds, and vehicle telemetry jointly within a single model. The smarter trick is what happens after indexing - query "find instances of aggressive lane-merging with under three seconds of headway" and the system extracts the math behind the scene (here, the time gap between vehicles), surfacing edge cases and feeding them directly into training pipelines.
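
As an illustration of the "math behind the scene," time headway is the gap to the following vehicle divided by that vehicle's speed. The sketch below is a minimal Python example under stated assumptions: the `MergeEvent` record, its field names, and the clip IDs are hypothetical, and it assumes gap and speed have already been extracted from the footage and telemetry. It is not Nomadic's implementation.

```python
from dataclasses import dataclass

@dataclass
class MergeEvent:
    # Hypothetical record for one detected lane merge.
    clip_id: str
    gap_m: float               # distance to the following vehicle, in metres
    follower_speed_mps: float  # speed of the following vehicle, in metres per second

def time_headway_s(event):
    # Time headway = gap / follower speed; a stationary follower never closes the gap.
    if event.follower_speed_mps <= 0:
        return float("inf")
    return event.gap_m / event.follower_speed_mps

def find_tight_merges(events, max_headway_s=3.0):
    # Keep only merges whose headway is below the threshold.
    return [e for e in events if time_headway_s(e) < max_headway_s]

events = [
    MergeEvent("clip-001", gap_m=20.0, follower_speed_mps=25.0),  # 0.8 s headway
    MergeEvent("clip-002", gap_m=90.0, follower_speed_mps=25.0),  # 3.6 s headway
    MergeEvent("clip-003", gap_m=15.0, follower_speed_mps=0.0),   # stationary follower
]
hits = find_tight_merges(events)
print([e.clip_id for e in hits])  # ['clip-001']
```

Once a natural-language query is reduced to a threshold on a computed quantity like this, the matching clips can be pulled out automatically instead of being found by someone watching the footage.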

In security, Conntour is making previously unsearchable surveillance archives instantly queryable. Most organizations run hundreds or thousands of cameras around the clock, but legacy systems can only flag what they were pre-programmed to detect - anything outside those predefined categories disappears into the footage forever. Conntour lets security teams query footage the way you'd run a search - "find every instance of someone passing a bag near the south entrance after midnight" - returning results in real time across thousands of cameras simultaneously.
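
One common way to make footage queryable like this is embedding search: clips and text queries are mapped into the same vector space, and the closest clips are returned. The Python sketch below shows only the retrieval step, with hand-written three-dimensional vectors standing in for real embeddings. The clip IDs, the vectors, and the `search` helper are hypothetical, and nothing here is confirmed as Conntour's approach.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query_vec, index, top_k=2):
    # Rank every indexed clip by similarity to the query and return the top matches.
    scored = [(cosine(query_vec, vec), clip_id) for clip_id, vec in index.items()]
    scored.sort(reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]

# Toy index: in practice these vectors would come from a multimodal embedding model.
index = {
    "cam12_0013": [0.9, 0.1, 0.0],
    "cam03_2201": [0.1, 0.9, 0.1],
    "cam07_0042": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded text query
results = search(query_vec, index)
print(results)  # ['cam12_0013', 'cam07_0042']
```

A production system would replace the linear scan with an approximate nearest-neighbour index, but the contract is the same: a query goes in, and ranked clips come out.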

Whether the data is sitting untouched in an AV archive, locked in the laws of chemistry, or buried across thousands of surveillance feeds, the moat goes to whoever builds the best retrieval layer for their specific domain. CuspAI, Nomadic, and Conntour are each doing this in fundamentally different ways, but the same opportunity runs through all three: the world is full of datasets that exist but have never been made legible.

A Functional Taxonomy of World Models

⁍ "Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on"
⁍ "Language gave machines a way to talk about that world. World models are how machines will finally come to understand, imagine, reason and interact with it"