Rucio’s New Metadata Intelligence
By Dimitris Xenakis, Martin Barisits
Usability, Impact, and a New Horizon for DaFab and the Global Rucio Community
Over the past year, the DaFab project has become a catalyst for the evolution of the Rucio data management system. Although Rucio was initially designed to support the ATLAS experiment at CERN, today it serves a far wider community of scientific collaborations with complex data needs. The DaFab initiative, centered on extracting value from massive Copernicus Earth Observation archives, has pushed Rucio into new territory: beyond file cataloguing and distributed data placement, and into the realm of rich semantic metadata and powerful filtering.
The Pressure of Scale
Earth Observation data arrives in enormous volumes and with great structural diversity. DaFab’s use cases, such as field-boundary delineation for agriculture and large-scale water-anomaly detection, require access not only to petabytes of imagery but also to complex metadata describing their multi-sensor spatio-temporal footprints. As documented in DaFab’s Unified Catalogue Format Description, the metadata stack includes full STAC items: hundreds of lines of nested JSON per product, with arrays, objects, and heterogeneous types.
Rucio’s traditional metadata engine, built around simple key-value attributes and limited Boolean logic, was never designed for this level of semantic complexity. Yet DaFab’s unified catalogue, which aggregates both native Copernicus metadata and AI-generated descriptive metadata, depends on exactly these capabilities in order to support intuitive, domain-driven queries such as: find all Sentinel-2 tiles overlapping Luxembourg, acquired within a given date range, with agricultural coverage above a threshold and minimal cloud presence.
Enabling such capabilities not only transforms DaFab workflows but also stands to benefit every Rucio-based experiment, since these increasingly rely on metadata ecosystems: Belle II benefits from TimescaleDB and LLM-driven query interfaces [*]; LSST/Rubin combines Rucio metadata with Kafka-driven curation systems [*]; DUNE requires scalable metadata to support new token-based infrastructures [*]; CTAO and SKA require hierarchical metadata structures to map science products and data provenance [*]. All of these communities presented their needs at the 8th Rucio Community Workshop, reinforcing that metadata is becoming a first-class citizen in scientific data management.
Modern Metadata Management
The first major feature being developed for DaFab is a robust JSON metadata management system integrated directly into Rucio’s server-core layers. It allows users and upstream systems to retrieve, insert, update, upsert, or delete arbitrary metadata in a JSON document, while enforcing predictable semantics for both atomic and best-effort operations. This change fundamentally expands the expressiveness and flexibility of Rucio’s metadata model. Instead of treating metadata as a flat map, Rucio can now host arbitrarily structured metadata, including deeply nested objects, arrays, and mixed-type fields, which is exactly what STAC requires.
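To make the difference between atomic and best-effort semantics concrete, here is a minimal, self-contained Python sketch. It is not Rucio's implementation or API: the function names (`apply_one`, `apply_ops`), the tuple-based operation format, and the `dafab:crop_fraction` key are all hypothetical and chosen only for illustration.

```python
import copy

class MetadataError(Exception):
    """Raised when one operation cannot be applied to the document."""

ACTIONS = ("insert", "update", "upsert", "delete")
_MISSING = object()

def _lookup(container, key):
    """Return container[key], or _MISSING if the key or index is absent."""
    if isinstance(container, dict):
        return container.get(key, _MISSING)
    if isinstance(container, list) and isinstance(key, int) and 0 <= key < len(container):
        return container[key]
    return _MISSING

def apply_one(doc, action, path, value=None):
    """Apply a single operation at a nested path, modifying doc in place."""
    if action not in ACTIONS:
        raise MetadataError(f"unknown action {action!r}")
    creating = action in ("insert", "upsert")
    parent = doc
    for key in path[:-1]:
        child = _lookup(parent, key)
        if child is _MISSING:
            # Only insert/upsert may create missing intermediate objects.
            if not (creating and isinstance(parent, dict)):
                raise MetadataError(f"{action}: no such path element {key!r}")
            child = parent[key] = {}
        parent = child
    key = path[-1]
    exists = _lookup(parent, key) is not _MISSING
    if action == "insert" and exists:
        raise MetadataError(f"insert: {key!r} already exists")
    if action in ("update", "delete") and not exists:
        raise MetadataError(f"{action}: {key!r} does not exist")
    if action == "delete":
        del parent[key]
    elif exists or isinstance(parent, dict):
        parent[key] = value
    else:
        raise MetadataError(f"{action}: cannot create {key!r} here")

def apply_ops(doc, ops, atomic=True):
    """Return (resulting_document, errors) without mutating the input.

    atomic=True: the first failure discards every change.
    atomic=False: failing operations are skipped and reported.
    """
    working = copy.deepcopy(doc)
    errors = []
    for op in ops:
        try:
            apply_one(working, *op)
        except MetadataError as exc:
            if atomic:
                return doc, [str(exc)]
            errors.append(str(exc))
    return working, errors

# A tiny STAC-like item; the update below targets a key that does not exist.
item = {"properties": {"eo:cloud_cover": 12.5}, "assets": [{"href": "a.tif"}]}
ops = [
    ("upsert", ("properties", "dafab:crop_fraction"), 0.7),
    ("update", ("properties", "missing"), 1),
    ("delete", ("assets", 0, "href")),
]
atomic_doc, atomic_errors = apply_ops(item, ops, atomic=True)
best_doc, best_errors = apply_ops(item, ops, atomic=False)
```

In this sketch the atomic call hands back the original document untouched because one operation failed, while the best-effort call applies the upsert and the delete and reports the failed update.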
With flexible metadata comes the need for structure. DaFab’s metadata model depends on schemas that evolve over time, both for Copernicus-native metadata and AI-derived semantic layers. To maintain consistency, Rucio is being extended with schema versioning and server-side validation, ensuring that any metadata update respects a currently enforced schema and that historical metadata remain associated with their schema lineage.
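The idea of validating against an enforced schema while keeping older records tied to their own schema version can be sketched as follows. This is an illustrative toy under stated assumptions, not Rucio's design: the registry layout, the field names, and the `validate` and `store` helpers are hypothetical, and the type check stands in for a real schema language such as JSON Schema.

```python
# Hypothetical schema registry: version number -> required fields and types.
SCHEMAS = {
    1: {"platform": str, "cloud_cover": float},
    2: {"platform": str, "cloud_cover": float, "crop_fraction": float},
}
ENFORCED_VERSION = 2  # the schema that new writes must satisfy

def validate(metadata, version):
    """Return a list of problems found when checking metadata against a schema."""
    problems = []
    for field, expected_type in SCHEMAS[version].items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
        elif not isinstance(metadata[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems

def store(catalogue, name, metadata):
    """Validate against the enforced schema, then record metadata with its version."""
    problems = validate(metadata, ENFORCED_VERSION)
    if problems:
        raise ValueError("; ".join(problems))
    catalogue[name] = {"schema_version": ENFORCED_VERSION, "metadata": metadata}

# A record written earlier under schema version 1 keeps its lineage.
catalogue = {
    "old_tile": {
        "schema_version": 1,
        "metadata": {"platform": "sentinel-2a", "cloud_cover": 3.0},
    }
}

store(catalogue, "new_tile",
      {"platform": "sentinel-2b", "cloud_cover": 8.0, "crop_fraction": 0.6})

try:
    # Rejected: crop_fraction is required by the enforced schema.
    store(catalogue, "bad_tile", {"platform": "sentinel-2b", "cloud_cover": 8.0})
    rejected = False
except ValueError:
    rejected = True

# Historical metadata is still valid against the version it was written under.
old = catalogue["old_tile"]
old_problems = validate(old["metadata"], old["schema_version"])
```

Storing the version number next to each record is what lets older metadata remain interpretable after the enforced schema moves on.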
Turning Metadata into Discovery
With metadata in place, discovery depends on being able to query it. Developed in response to DaFab’s discovery requirements and presented at the 8th Rucio Community Workshop, Rucio’s enhanced metadata filter introduces a fully structured, composable, database-agnostic query language that supports:
- nested path traversal,
- array indexing,
- rich comparison operators,
- and arbitrary logical expressions, including multi-layered AND/OR/NOT constructs.
This represents a break from Rucio’s previous model, which only supported shallow AND/OR sets of flat metadata keys. The new filter allows complex scientific queries to be expressed precisely and executed efficiently. A user can now search for metadata where a nested field matches a value, where an array contains an element satisfying a condition, or where multiple conditions combine into a higher-level semantic rule. For DaFab, this finally enables the type of discovery that the unified metadata catalogue was designed for: selecting scenes based on spatial predicates, temporal windows, agricultural coverage percentages, cloud-free fractions, anomaly scores, or AI-generated geometries.
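The four capabilities listed above can be illustrated with a small in-memory evaluator. The expression format used here (nested tuples such as `("cmp", path, op, value)` and `("any", path, subexpression)`) is an assumption made for this sketch and is not the syntax of Rucio's filter; likewise the `classes` field in the sample item is hypothetical. A real engine would translate such expressions into database queries rather than evaluate them in Python.

```python
import operator

# Comparison operators available to "cmp" expressions.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}
_MISSING = object()

def resolve(doc, path):
    """Follow a path of dict keys and list indices; return _MISSING if absent."""
    node = doc
    for key in path:
        if isinstance(node, dict) and key in node:
            node = node[key]
        elif isinstance(node, list) and isinstance(key, int) and 0 <= key < len(node):
            node = node[key]
        else:
            return _MISSING
    return node

def matches(doc, expr):
    """Evaluate a nested filter expression against one JSON document."""
    kind = expr[0]
    if kind == "and":
        return all(matches(doc, sub) for sub in expr[1:])
    if kind == "or":
        return any(matches(doc, sub) for sub in expr[1:])
    if kind == "not":
        return not matches(doc, expr[1])
    if kind == "cmp":
        _, path, op, expected = expr
        actual = resolve(doc, path)
        if actual is _MISSING:
            return False  # design choice: a missing field never matches
        try:
            return OPS[op](actual, expected)
        except TypeError:
            return False  # incomparable types never match
    if kind == "any":
        # True if at least one element of the array satisfies the subexpression.
        _, path, sub = expr
        items = resolve(doc, path)
        return isinstance(items, list) and any(matches(item, sub) for item in items)
    raise ValueError(f"unknown expression kind: {kind!r}")

scene = {
    "bbox": [5.7, 49.4, 6.5, 50.2],
    "properties": {
        "platform": "sentinel-2a",
        "datetime": "2024-06-01T10:00:00Z",
        "eo:cloud_cover": 4.0,
    },
    "classes": [
        {"name": "cropland", "fraction": 0.62},
        {"name": "water", "fraction": 0.05},
    ],
}

query = ("and",
    ("cmp", ("properties", "platform"), "==", "sentinel-2a"),          # nested path
    ("cmp", ("bbox", 0), "<=", 6.5),                                   # array index
    ("cmp", ("properties", "datetime"), ">=", "2024-01-01T00:00:00Z"),
    ("any", ("classes",),
        ("and", ("cmp", ("name",), "==", "cropland"),
                ("cmp", ("fraction",), ">=", 0.5))),
    ("not", ("cmp", ("properties", "eo:cloud_cover"), ">", 10.0)),
)

selected = matches(scene, query)
```

Because expressions are plain nested data, they can be composed programmatically, which is the property that makes multi-layered AND/OR/NOT constructs straightforward to build and to translate for different database backends.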
Metadata as the Next Frontier
DaFab has helped turn metadata from a supporting detail into a core design axis for Rucio. With structured JSON metadata, schema governance, and a composable filtering language, Rucio is positioned not only to serve Copernicus-scale Earth Observation workflows, but to support metadata-driven discovery across a growing range of data-intensive sciences.
Dimitris Xenakis
(European Organization for Nuclear Research (CERN), Rucio - Scientific Data Management)
Martin Barisits
(European Organization for Nuclear Research (CERN), Rucio - Scientific Data Management)