From AI Outputs to Searchable Knowledge
By Dimitris Xenakis
The role of SKIM in the DaFab system architecture
Introduction: Turning a Copernicus-scale data archive into something you can search
Copernicus has grown into an archive where the limiting factor is rarely the availability of pixels but the ability to find the right ones. To do so, users still have to start from Copernicus product descriptors (e.g. Sentinel-2 tile, date, processing level) and only later test whether the data contains the signal of interest. The DaFab system addresses this gap by generating secondary, AI‑derived metadata at scale and exposing it as a discovery surface, so that users can begin with a thematic question instead of file selection (e.g. “How many agricultural parcels are there?” for the smart-agriculture theme and “Where can I find water anomalies?” for the water-analysis theme).
AI workflows can produce rich outputs such as scores, masks, vectors and derived rasters, but those results are not automatically reusable. Without stable identifiers, provenance and a consistent catalogue representation, outputs remain bound to a specific pipeline execution and are difficult to integrate into standard geospatial tooling. This is the point where the SKIM (Semantic Knowledge IMprover) component becomes essential: it turns workflow outputs into catalogue files that can be indexed, navigated and referenced over time.
Why use the STAC standard?
DaFab system knowledge must remain readable by both machines and humans, and it must stay usable as models evolve. Vector embeddings are effective for similarity search, but they are model‑dependent and opaque: changing a backbone or retraining can change the vector space and invalidate downstream assumptions. For operational Earth Observation data management, the system needs a representation that remains interpretable without shipping a specific neural network alongside the data.
The STAC (SpatioTemporal Asset Catalog) standard provides that representation. It anchors each result in space and time, links it into a navigable graph, and points to the produced artefacts through explicit assets. This turns AI outputs into deterministic objects that can be queried with standard spatio‑temporal constraints and consumed by generic users, whereas project‑specific payloads on their own are difficult to handle.
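As an illustration of what "standard spatio‑temporal constraints" can mean in practice, the following minimal sketch filters STAC-like Items by bounding box and time range using only the Python standard library. The Item identifiers, footprints and dates are hypothetical, and the in-memory filter is an assumption made for illustration; it is not the DaFab query mechanism.

```python
from datetime import datetime, timezone


def bbox_intersects(a, b):
    """Return True when two [west, south, east, north] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def parse_time(value):
    """Parse an RFC 3339 timestamp such as 2024-06-01T09:20:31Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def filter_items(items, bbox, start, end):
    """Keep Items whose footprint overlaps bbox and whose datetime lies in [start, end]."""
    selected = []
    for item in items:
        when = parse_time(item["properties"]["datetime"])
        if bbox_intersects(item["bbox"], bbox) and start <= when <= end:
            selected.append(item)
    return selected


# Hypothetical Items, reduced to the fields the filter needs.
items = [
    {"id": "item-a", "bbox": [23.0, 37.5, 24.0, 38.5],
     "properties": {"datetime": "2024-06-01T09:20:31Z"}},
    {"id": "item-b", "bbox": [2.0, 48.0, 3.0, 49.0],
     "properties": {"datetime": "2024-06-01T10:50:00Z"}},
    {"id": "item-c", "bbox": [23.5, 37.8, 24.5, 38.8],
     "properties": {"datetime": "2023-01-15T09:20:31Z"}},
]

matches = filter_items(
    items,
    bbox=[23.2, 37.6, 23.9, 38.2],
    start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    end=datetime(2024, 12, 31, tzinfo=timezone.utc),
)
matched_ids = [item["id"] for item in matches]
print(matched_ids)
```

Because the filter relies only on the bbox and datetime fields that every STAC Item carries, it does not need to know anything about the model that produced the result.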
In the DaFab system, SKIM uses the STAC standard as an output layer rather than an internal research format. The thematic payload produced by AI workflows is preserved under namespaced properties (e.g. dafab:water-analysis and dafab:smart-agriculture) so that the STAC core remains stable while the theme‑specific content can evolve.
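The separation between STAC core fields and a namespaced thematic payload can be sketched as follows. Only the dafab:water-analysis key and the water_analysis Collection name come from the text above; the payload fields (anomaly_score, model_version), the Item identifier, the footprint and the stac_version value are hypothetical placeholders, not the actual SKIM schema.

```python
import json


def bbox_to_polygon(bbox):
    """Turn [west, south, east, north] into a closed GeoJSON Polygon."""
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {"type": "Polygon", "coordinates": [ring]}


def build_derived_item(item_id, collection, bbox, when, namespace, payload):
    """Assemble a minimal STAC Item dict with the thematic payload under one namespaced key."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",  # assumed version, for illustration only
        "id": item_id,
        "collection": collection,
        "geometry": bbox_to_polygon(bbox),
        "bbox": bbox,
        "properties": {
            "datetime": when,    # STAC core property
            namespace: payload,  # thematic content, free to evolve
        },
        "links": [],
        "assets": {},
    }


def split_properties(item, prefix="dafab:"):
    """Separate core properties from namespaced ones."""
    core, thematic = {}, {}
    for key, value in item["properties"].items():
        target = thematic if key.startswith(prefix) else core
        target[key] = value
    return core, thematic


item = build_derived_item(
    item_id="derived-example-1",
    collection="water_analysis",
    bbox=[23.0, 37.5, 24.0, 38.5],
    when="2024-06-01T09:20:31Z",
    namespace="dafab:water-analysis",
    payload={"anomaly_score": 0.82, "model_version": "example-v1"},  # hypothetical fields
)
core, thematic = split_properties(item)
print(json.dumps(thematic, indent=2))
```

A generic STAC client only needs the core part; a thematic client can read the namespaced part without the core layout changing when the payload does.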
SKIM responsibility in the DaFab system STAC structure
The STAC navigation service exposed to users is a shared product of several components, and the division of responsibilities is intentional. The DASI component publishes STAC-Items for original Earth Observation inputs (e.g. Sentinel‑2 L2A product) using native Copernicus metadata. The SKIM component publishes STAC-Items for derived, AI‑enriched products and the facet value catalogues that enumerate those derived Items for browsing. The RUCIO component exposes the STAC navigation surface: it serves the root and Collection documents, maintains the facet index catalogues that advertise available facet values, and updates cross‑links so that original observations can point to derived products.
This split keeps responsibilities stable as the system grows. Original Earth Observation Items can be ingested independently of downstream processing. Derived products can be republished when algorithms or parameters change, without requiring producers to own the global navigation skeleton. The catalogue backend can evolve its indexing and storage strategies while preserving the external STAC contract that users rely on, users being both machines and humans.
This split is represented in the DaFab system high-level architecture below.
Figure 1: DaFab system architecture
From a workflow run to a searchable knowledge object
A typical DaFab functional chain starts from an original Earth Observation Item and produces one or more thematic results. SKIM packages each thematic result as a STAC-Item in the appropriate derived STAC-Collection (e.g. water_analysis and smart_agriculture). The derived Item carries the observation footprint and a time anchor aligned with the source observation, so that it can be filtered consistently alongside the original data.
Provenance is explicit and navigable. When SKIM publishes a derived STAC-Item, it writes a forward provenance link (derived_from) back to the original source STAC-Item. The RUCIO component then complements this with reverse links (related) on the original Earth Observation STAC-Item, enabling a second discovery path: starting from a Copernicus observation (e.g. Sentinel-2), a user can immediately see which derived products exist, without issuing cross‑collection queries.
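The two link directions can be sketched with plain dictionaries. The relation names derived_from and related are the ones named above; the function names, the example.org URLs and the sentinel-2-l2a Collection identifier are hypothetical, and the duplicate check is an assumption about how republishing might be kept idempotent.

```python
def add_link(item, rel, href):
    """Append a link unless an identical rel/href pair is already present."""
    for link in item["links"]:
        if link["rel"] == rel and link["href"] == href:
            return False
    item["links"].append({"rel": rel, "href": href, "type": "application/geo+json"})
    return True


def write_forward_link(derived, original_href):
    """Producer side: point the derived Item back to its source observation."""
    return add_link(derived, "derived_from", original_href)


def write_reverse_link(original, derived_href):
    """Catalogue side: advertise the derived Item on the original observation."""
    return add_link(original, "related", derived_href)


def derived_products(original):
    """List the derived Items reachable from an original observation."""
    return [link["href"] for link in original["links"] if link["rel"] == "related"]


# Hypothetical locations, for illustration only.
ORIGINAL_HREF = "https://example.org/stac/collections/sentinel-2-l2a/items/original-1"
DERIVED_HREF = "https://example.org/stac/collections/water_analysis/items/derived-1"

original = {"id": "original-1", "links": []}
derived = {"id": "derived-1", "links": []}

write_forward_link(derived, ORIGINAL_HREF)
write_reverse_link(original, DERIVED_HREF)
write_reverse_link(original, DERIVED_HREF)  # republishing does not duplicate the link

print(derived_products(original))
```

With both links in place, a client holding either Item can reach the other by following a single link, which is what removes the need for a cross‑collection query.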
To make the catalogue usable without requiring complex queries, the DaFab system also supports deterministic browsing through facet navigation. The SKIM component publishes the facet value catalogues that list the matching derived STAC-Items, and the RUCIO component publishes the facet index catalogues that advertise which facet values exist. In the current pilot, this is intentionally constrained to a small set of facets (e.g. water-basin and water-anomaly for water_analysis, agriculture-season for smart_agriculture) chosen for their stability and their relevance to early user exploration. The same derived STAC-Item can appear under multiple facet paths without duplication of content, because facets are link‑based views over a single store of STAC-Items.
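The idea of facets as link‑based views can be sketched as two small builders: one for facet value catalogues and one for a facet index catalogue. The facet names water-basin and water-anomaly come from the text; the facet values (basin-a, high, and so on), the file paths, the simplified "facets" field and the use of the item and child relations are assumptions made for illustration, not the published DaFab layout.

```python
from collections import defaultdict


def build_facet_value_catalogs(items, facet_keys):
    """Group Item hrefs by (facet, value) and emit one Catalog dict per pair."""
    groups = defaultdict(list)
    for item in items:
        for facet in facet_keys:
            value = item["facets"].get(facet)
            if value is not None:
                groups[(facet, value)].append(item["href"])
    catalogs = {}
    for (facet, value), hrefs in sorted(groups.items()):
        catalogs[(facet, value)] = {
            "type": "Catalog",
            "stac_version": "1.0.0",  # assumed version
            "id": f"{facet}-{value}",
            "description": f"Derived Items with {facet} = {value}",
            # Items are referenced by link, never copied into the catalogue.
            "links": [{"rel": "item", "href": href} for href in hrefs],
        }
    return catalogs


def build_facet_index(facet, value_catalogs):
    """Advertise which values exist for one facet."""
    children = [cat["id"] for (name, _), cat in value_catalogs.items() if name == facet]
    return {
        "type": "Catalog",
        "stac_version": "1.0.0",
        "id": f"{facet}-index",
        "description": f"Available values for facet {facet}",
        "links": [{"rel": "child", "href": f"./{child}.json"} for child in children],
    }


# Hypothetical derived Items; "facets" is a simplified stand-in for values
# that would be read from the namespaced properties.
items = [
    {"href": "./items/derived-1.json",
     "facets": {"water-basin": "basin-a", "water-anomaly": "high"}},
    {"href": "./items/derived-2.json",
     "facets": {"water-basin": "basin-a", "water-anomaly": "low"}},
    {"href": "./items/derived-3.json",
     "facets": {"water-basin": "basin-b"}},
]

value_catalogs = build_facet_value_catalogs(items, ["water-basin", "water-anomaly"])
basin_index = build_facet_index("water-basin", value_catalogs)
print([link["href"] for link in basin_index["links"]])
```

In this sketch derived-1 is listed under both water-basin/basin-a and water-anomaly/high, yet only its href is repeated, which mirrors the "single store, multiple views" property described above.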
The resulting navigation graph lets a client start at the STAC entry point, reach a Collection, narrow by facet, inspect derived Items, retrieve the referenced assets, and always navigate back to the source observation. The targeted navigation contract is captured in the companion “STAC Navigation diagram” shown below, which is treated as the reference for link relations and traversal expectations.
This navigation structure is natively extensible and compatible with the scale of Copernicus data, which allows the DaFab system to scale.
Figure 2: STAC navigation diagram
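The traversal described in this section can be sketched as a walk over an in-memory set of documents, following one link relation per step. The document keys and the choice of the child relation between the Collection and the facet catalogues are assumptions for illustration; the reference for the actual relations is the navigation diagram.

```python
def hrefs(doc, rel):
    """Return the targets of all links with the given relation."""
    return [link["href"] for link in doc.get("links", []) if link["rel"] == rel]


def walk(store, start, rels):
    """Follow the first link of each relation in turn, returning the visited keys."""
    path = [start]
    current = store[start]
    for rel in rels:
        targets = hrefs(current, rel)
        if not targets:
            raise KeyError(f"no '{rel}' link on {path[-1]}")
        path.append(targets[0])
        current = store[targets[0]]
    return path


# Hypothetical miniature catalogue, keyed by document location.
store = {
    "root": {"links": [{"rel": "child", "href": "collections/water_analysis"}]},
    "collections/water_analysis": {
        "links": [{"rel": "child", "href": "facets/water-basin"}]},
    "facets/water-basin": {
        "links": [{"rel": "child", "href": "facets/water-basin/basin-a"}]},
    "facets/water-basin/basin-a": {
        "links": [{"rel": "item", "href": "items/derived-1"}]},
    "items/derived-1": {
        "links": [{"rel": "derived_from", "href": "items/original-1"}]},
    "items/original-1": {
        "links": [{"rel": "related", "href": "items/derived-1"}]},
}

# Entry point -> Collection -> facet index -> facet value -> derived Item -> source.
path = walk(store, "root", ["child", "child", "child", "item", "derived_from"])
back = walk(store, "items/original-1", ["related"])
print(" -> ".join(path))
```

The same walk function serves both discovery paths: top-down from the entry point, and sideways from an original observation to its derived products.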
Conclusion: SKIM creates searchable knowledge
The SKIM component's contribution to the DaFab system, in cooperation with the DASI and RUCIO components, is not to define the scientific meaning of the AI outputs, but to make those outputs stable and usable beyond a single processing run. By packaging derived results as STAC-Items, preserving the workflow payload under clear namespacing, and ensuring that provenance and browsing remain navigable through standard links and catalogues, SKIM provides the mechanism by which AI outputs become searchable knowledge in the DaFab system.
Dimitris Xenakis (European Organization for Nuclear Research (CERN), Rucio - Scientific Data Management)