Serialization

Warning

Serialization is currently in draft. Once at least one implementation is ready, we will remove this warning and release a new version of the UHI helper package.

Introduction

Histogram serialization has to cover a wide range of formats. We therefore describe the metadata structure in a JSON-like form and provide a JSON schema for it. The data (bins and/or variable edges) is stored out-of-band in a binary format that depends on the type of data file. For very small (primarily 1D) histograms, the data may also be stored inline.

The following formats are being targeted:

  • ROOT

  • HDF5

  • ZIP

Other formats can be used as well, assuming they support out-of-band data and text attributes or files for the metadata.

Caveats

This structure was based heavily on boost-histogram, but it is intended to be general and can be expanded in the future as needed. As a result, the following limitations apply:

  • Serialization followed by deserialization may change axes: an axis type may change to an equivalent but less performant one, growth status will be lost, and so on.

  • Metadata must be expressible as JSON. It should also be reasonably sized; some formats like HDF5 may limit the size of attributes to 64K.

  • Floating point errors could be incurred on conversion, as the storage format uses a stable but different representation.

  • Axis name is only part of the metadata, and is not standardized. This is due to lack of support from boost-histogram.
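
The metadata size caveat above can be checked before writing. The following is a minimal sketch, not part of UHI: the helper name `metadata_fits` is hypothetical, and the 64 KiB default is an assumption that mirrors the HDF5 attribute limit mentioned above; the real limit depends on the backend.

```python
import json


def metadata_fits(metadata: dict, limit_bytes: int = 64 * 1024) -> bool:
    """Return True if metadata is JSON-expressible and its encoding fits the limit.

    Hypothetical helper; the default limit is an assumption, not a UHI constant.
    """
    try:
        # allow_nan=False rejects NaN/Infinity, which are not valid JSON.
        encoded = json.dumps(metadata, allow_nan=False).encode("utf-8")
    except (TypeError, ValueError):
        # Raised for objects json cannot encode, or for NaN/Infinity.
        return False
    return len(encoded) <= limit_bytes
```

A writer could call such a check before storing metadata as an attribute and fall back to another location, or raise an error, when it fails.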

Design

The following axis types are supported:

  • "regular": A regularly spaced set of even bins. Boost-histogram’s “integer” axis maps to this axis as well. Has upper, lower, bins, underflow, overflow, and circular properties.

  • "variable": A continuous axis defined by bins+1 edges. Has edges, which is either an in-line list of numbers or a string pointing to an out-of-band data source. Also has underflow, overflow, and circular properties.

  • "category_int": A list of integer bins, non-continuous. Has categories, which is an in-line list of integers. Also has flow.

  • "category_str": A list of string bins. Has categories, which is an in-line list of strings. Also has flow.

  • "boolean": A true/false axis.

Axes with gaps are currently not supported.
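
To illustrate how one axis type can stand in for another, here is a minimal sketch that turns a "regular" axis into an equivalent "variable" axis with bins+1 explicit edges. The function name `regular_to_variable` is hypothetical and not part of UHI; it only shows the relationship between the two descriptions.

```python
def regular_to_variable(axis: dict) -> dict:
    """Convert a "regular" axis dict into an equivalent "variable" axis dict.

    Hypothetical helper for illustration only.
    """
    if axis["type"] != "regular":
        raise ValueError("expected a regular axis")
    lower, upper, bins = axis["lower"], axis["upper"], axis["bins"]
    if bins < 1:
        raise ValueError("need at least one bin to compute edges")

    # Linear interpolation between lower and upper gives bins + 1 edges;
    # this form returns lower and upper exactly at the two ends.
    edges = [lower * (1 - i / bins) + upper * (i / bins) for i in range(bins + 1)]

    variable = {
        "type": "variable",
        "edges": edges,
        "underflow": axis["underflow"],
        "overflow": axis["overflow"],
        "circular": axis["circular"],
    }
    if "metadata" in axis:
        variable["metadata"] = dict(axis["metadata"])
    return variable
```

The interior edges are computed in floating point, so they can differ slightly from what another library would compute for the same regular axis.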

All axes support metadata, a string-keyed dictionary of arbitrary data. Currently, only string, number, and boolean values are supported.
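
The restriction on metadata values can be approximated with a small check. This is a sketch only; `is_supported_metadata` is a hypothetical name, and schema validation remains the authoritative test.

```python
def is_supported_metadata(metadata: dict) -> bool:
    """Roughly check that keys are non-empty strings and values are str, number, or bool.

    Hypothetical helper; nested lists, dicts, and None are rejected.
    """
    return all(
        isinstance(key, str)
        and len(key) > 0
        and isinstance(value, (str, int, float, bool))
        for key, value in metadata.items()
    )
```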

The following storages are supported:

  • "int": A collection of integers. Boost-histogram’s Int64 and AtomicInt64 map to this, and sometimes Unlimited.

  • "double": A collection of 64-bit floating point values. Boost-histogram’s Double storage maps to this, and sometimes Unlimited.

  • "weighted": A collection of two arrays of 64-bit floating point values, "values" and "variances". Boost-histogram’s Weight storage maps to this.

  • "mean": A collection of three arrays of 64-bit floating point values, "counts", "values", and "variances". Boost-histogram’s Mean storage maps to this.

  • "weighted_mean": A collection of four arrays of 64-bit floating point values, "sum_of_weights", "sum_of_weights_squared", "values", and "variances". Boost-histogram’s WeightedMean storage maps to this.
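
As an illustration of how axes and storages fit together, the sketch below builds a small document in Python using the property names from the schema. The histogram name "example" and the numbers are made up. No flow bins are enabled, and each array is given one entry per bin, which is an assumption about the inline layout rather than something this section specifies.

```python
import json

# One 1D histogram under an arbitrary name, with three regular bins and
# inline "weighted" storage (illustrative values only).
document = {
    "example": {
        "axes": [
            {
                "type": "regular",
                "lower": 0.0,
                "upper": 3.0,
                "bins": 3,
                "underflow": False,
                "overflow": False,
                "circular": False,
                "metadata": {"label": "x"},
            }
        ],
        "storage": {
            "type": "weighted",
            "values": [1.0, 4.0, 2.0],
            "variances": [1.0, 4.0, 2.0],
        },
    }
}

# Serialize to JSON text; this is the metadata form described above.
text = json.dumps(document, indent=2)
```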

A library can fill the optional "writer_info" field, which is allowed anywhere a metadata field is allowed, with a key specific to that library containing library-specific metadata. There is one defined key at the Histogram level, "version", which contains the version of the library that created the histogram. Libraries should include this key when creating a histogram; it is not required for reading histograms. Histogram libraries can put custom metadata here to record provenance information or to help with same-library round trips. For example, a histogram created with boost-histogram might contain:

{
  "writer_info": {
    "boost-histogram": {
      "version": "1.0.0"
    }
  },
  ...
}
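
A writer might attach this information with a small helper such as the one sketched below. The function name `add_writer_info` and the library name "my-hist-lib" are hypothetical and used only for illustration.

```python
def add_writer_info(histogram: dict, library: str, version: str, **extra) -> dict:
    """Return a shallow copy of histogram with writer_info[library] filled in.

    Hypothetical helper; extra keyword arguments become library-specific entries.
    """
    result = dict(histogram)
    # Copy any existing writer_info so entries from other libraries are kept.
    writer_info = dict(result.get("writer_info", {}))
    writer_info[library] = {"version": version, **extra}
    result["writer_info"] = writer_info
    return result


# Usage with a made-up library name and an empty placeholder histogram.
tagged = add_writer_info(
    {"axes": [], "storage": {"type": "int", "values": []}},
    "my-hist-lib",
    "0.1.0",
)
```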

CLI/API

You can currently test a JSON file against the schema by running:

$ python -m uhi.schema some/file.json

Or with code:

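import json
from pathlib import Path

import uhi.schema

# Path to the JSON file to check
filename = Path("some/file.json")

with filename.open(encoding="utf-8") as f:
    data = json.load(f)

uhi.schema.validate(data)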

Eventually this should also be usable for JSON files inside ZIP archives, HDF5 attributes, and maybe more.
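
Until such support exists, JSON stored inside a ZIP archive can be extracted with the standard library and then handed to the validator. The sketch below assumes a member name, "histograms.json", that is purely hypothetical; the layout of the ZIP backend is not defined here.

```python
import io
import json
import zipfile


def load_json_member(archive, member: str):
    """Parse one UTF-8 JSON member of a ZIP archive (path or file-like object)."""
    with zipfile.ZipFile(archive) as zf:
        return json.loads(zf.read(member).decode("utf-8"))


# Build a tiny in-memory archive to demonstrate the round trip.
payload = {
    "example": {
        "axes": [{"type": "boolean"}],
        "storage": {"type": "int", "values": [3, 5]},
    }
}
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("histograms.json", json.dumps(payload))

data = load_json_member(buffer, "histograms.json")
# data could now be passed to uhi.schema.validate(data).
```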

For static typing, you can use uhi.typing.serialization.Histogram to get a TypedDict of the schema.

Warning

Currently, this spec describes how to prepare the metadata for one of the targeted backends. It does not yet cover backend-specific details, such as how to define and use the binary resource locator strings or how to store the data. JSON is not a target spec, just part of the ZIP spec: files that currently “pass” the tool above would eventually be valid inside a .zip file, but are not valid by themselves.

Rendered schema

Histogram

https://raw.githubusercontent.com/scikit-hep/uhi/main/src/uhi/resources/histogram.schema.json

type

object

patternProperties

  • .+

type

object

properties

  • writer_info

Information from the library that created the histogram.

type

object

patternProperties

  • .+

type

object

properties

  • version

Version of the library.

type

string

additionalProperties

True

additionalProperties

False

  • metadata

:ref:#/$defs/metadata

  • axes

A list of the axes of the histogram.

type

array

items

oneOf

:ref:#/$defs/regular_axis

:ref:#/$defs/variable_axis

:ref:#/$defs/category_str_axis

:ref:#/$defs/category_int_axis

:ref:#/$defs/boolean_axis

  • storage

The storage of the bins of the histogram.

oneOf

:ref:#/$defs/int_storage

:ref:#/$defs/double_storage

:ref:#/$defs/weighted_storage

:ref:#/$defs/mean_storage

:ref:#/$defs/weighted_mean_storage

additionalProperties

False

additionalProperties

False

$defs

  • supported_metadata

oneOf

type

string

type

number

type

boolean

  • metadata

Arbitrary metadata dictionary.

type

object

patternProperties

  • .+

:ref:#/$defs/supported_metadata

additionalProperties

False

  • writer_info

Information from the library that created the histogram.

type

object

patternProperties

  • .+

:ref:#/$defs/supported_metadata

additionalProperties

False

  • ndarray

A ND (nested) array of numbers.

type

array

items

oneOf

type

number

:ref:#/$defs/ndarray

  • data_array

oneOf

A path (similar to URI) to the floating point bin data

type

string

:ref:#/$defs/ndarray

  • regular_axis

An evenly spaced set of continuous bins.

type

object

properties

  • type

type

string

const

regular

  • lower

Lower edge of the axis.

type

number

  • upper

Upper edge of the axis.

type

number

  • bins

Number of bins in the axis.

type

integer

minimum

0

  • underflow

True if there is a bin for underflow.

type

boolean

  • overflow

True if there is a bin for overflow.

type

boolean

  • circular

True if the axis wraps around.

type

boolean

  • metadata

:ref:#/$defs/metadata

  • writer_info

:ref:#/$defs/writer_info

additionalProperties

False

  • variable_axis

A variably spaced set of continuous bins.

type

object

properties

  • type

type

string

const

variable

  • edges

oneOf

type

array

items

type

number

A path (URI?) to the edges data.

type

string

  • underflow

type

boolean

  • overflow

type

boolean

  • circular

type

boolean

  • metadata

:ref:#/$defs/metadata

  • writer_info

:ref:#/$defs/writer_info

additionalProperties

False

  • category_str_axis

A set of string categorical bins.

type

object

properties

  • type

type

string

const

category_str

  • categories

type

array

items

type

string

uniqueItems

True

  • flow

True if flow bin (at the overflow position) present.

type

boolean

  • metadata

:ref:#/$defs/metadata

  • writer_info

:ref:#/$defs/writer_info

additionalProperties

False

  • category_int_axis

A set of integer categorical bins in any order.

type

object

properties

  • type

type

string

const

category_int

  • categories

type

array

items

type

integer

uniqueItems

True

  • flow

True if flow bin (at the overflow position) present.

type

boolean

  • metadata

:ref:#/$defs/metadata

  • writer_info

:ref:#/$defs/writer_info

additionalProperties

False

  • boolean_axis

A simple true/false axis with no flow.

type

object

properties

  • type

type

string

const

boolean

  • metadata

:ref:#/$defs/metadata

  • writer_info

:ref:#/$defs/writer_info

additionalProperties

False

  • int_storage

A storage holding integer counts.

type

object

properties

  • type

type

string

const

int

  • values

:ref:#/$defs/data_array

additionalProperties

False

  • double_storage

A storage holding floating point counts.

type

object

properties

  • type

type

string

const

double

  • values

:ref:#/$defs/data_array

additionalProperties

False

  • weighted_storage

A storage holding floating point counts and variances.

type

object

properties

  • type

type

string

const

weighted

  • values

:ref:#/$defs/data_array

  • variances

:ref:#/$defs/data_array

additionalProperties

False

  • mean_storage

A storage holding ‘profile’-style floating point counts, values, and variances.

type

object

properties

  • type

type

string

const

mean

  • counts

:ref:#/$defs/data_array

  • values

:ref:#/$defs/data_array

  • variances

:ref:#/$defs/data_array

additionalProperties

False

  • weighted_mean_storage

A storage holding ‘profile’-style floating point ∑weights, ∑weights², values, and variances.

type

object

properties

  • type

type

string

const

weighted_mean

  • sum_of_weights

:ref:#/$defs/data_array

  • sum_of_weights_squared

:ref:#/$defs/data_array

  • values

:ref:#/$defs/data_array

  • variances

:ref:#/$defs/data_array

additionalProperties

False

Full schema

The full schema is below:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://raw.githubusercontent.com/scikit-hep/uhi/main/src/uhi/resources/histogram.schema.json",
  "title": "Histogram",
  "type": "object",
  "additionalProperties": false,
  "patternProperties": {
    ".+": {
      "type": "object",
      "required": ["axes", "storage"],
      "additionalProperties": false,
      "properties": {
        "writer_info": {
          "type": "object",
          "description": "Information from the library that created the histogram.",
          "additionalProperties": false,
          "patternProperties": {
            ".+": {
              "type": "object",
              "additionalProperties": true,
              "properties": {
                "version": {
                  "type": "string",
                  "description": "Version of the library."
                }
              }
            }
          }
        },
        "metadata": { "$ref": "#/$defs/metadata" },
        "axes": {
          "type": "array",
          "description": "A list of the axes of the histogram.",
          "items": {
            "oneOf": [
              { "$ref": "#/$defs/regular_axis" },
              { "$ref": "#/$defs/variable_axis" },
              { "$ref": "#/$defs/category_str_axis" },
              { "$ref": "#/$defs/category_int_axis" },
              { "$ref": "#/$defs/boolean_axis" }
            ]
          }
        },
        "storage": {
          "description": "The storage of the bins of the histogram.",
          "oneOf": [
            { "$ref": "#/$defs/int_storage" },
            { "$ref": "#/$defs/double_storage" },
            { "$ref": "#/$defs/weighted_storage" },
            { "$ref": "#/$defs/mean_storage" },
            { "$ref": "#/$defs/weighted_mean_storage" }
          ]
        }
      }
    }
  },
  "$defs": {
    "supported_metadata": {
      "oneOf": [
        { "type": "string" },
        { "type": "number" },
        { "type": "boolean" }
      ]
    },
    "metadata": {
      "type": "object",
      "description": "Arbitrary metadata dictionary.",
      "additionalProperties": false,
      "patternProperties": {
        ".+": { "$ref": "#/$defs/supported_metadata" }
      }
    },
    "writer_info": {
      "type": "object",
      "description": "Information from the library that created the histogram.",
      "additionalProperties": false,
      "patternProperties": {
        ".+": { "$ref": "#/$defs/supported_metadata" }
      }
    },
    "ndarray": {
      "type": "array",
      "items": {
        "oneOf": [{ "type": "number" }, { "$ref": "#/$defs/ndarray" }]
      },
      "description": "A ND (nested) array of numbers."
    },
    "data_array": {
      "oneOf": [
        {
          "type": "string",
          "description": "A path (similar to URI) to the floating point bin data"
        },
        {
          "$ref": "#/$defs/ndarray"
        }
      ]
    },
    "regular_axis": {
      "type": "object",
      "description": "An evenly spaced set of continuous bins.",
      "required": [
        "type",
        "lower",
        "upper",
        "bins",
        "underflow",
        "overflow",
        "circular"
      ],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "regular" },
        "lower": { "type": "number", "description": "Lower edge of the axis." },
        "upper": { "type": "number", "description": "Upper edge of the axis." },
        "bins": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of bins in the axis."
        },
        "underflow": {
          "type": "boolean",
          "description": "True if there is a bin for underflow."
        },
        "overflow": {
          "type": "boolean",
          "description": "True if there is a bin for overflow."
        },
        "circular": {
          "type": "boolean",
          "description": "True if the axis wraps around."
        },
        "metadata": { "$ref": "#/$defs/metadata" },
        "writer_info": { "$ref": "#/$defs/writer_info" }
      }
    },
    "variable_axis": {
      "type": "object",
      "description": "A variably spaced set of continuous bins.",
      "required": ["type", "edges", "underflow", "overflow", "circular"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "variable" },
        "edges": {
          "oneOf": [
            {
              "type": "array",
              "items": { "type": "number", "minItems": 2, "uniqueItems": true }
            },
            {
              "type": "string",
              "description": "A path (URI?) to the edges data."
            }
          ]
        },
        "underflow": { "type": "boolean" },
        "overflow": { "type": "boolean" },
        "circular": { "type": "boolean" },
        "metadata": { "$ref": "#/$defs/metadata" },
        "writer_info": { "$ref": "#/$defs/writer_info" }
      }
    },
    "category_str_axis": {
      "type": "object",
      "description": "A set of string categorical bins.",
      "required": ["type", "categories", "flow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "category_str" },
        "categories": {
          "type": "array",
          "items": { "type": "string" },
          "uniqueItems": true
        },
        "flow": {
          "type": "boolean",
          "description": "True if flow bin (at the overflow position) present."
        },
        "metadata": { "$ref": "#/$defs/metadata" },
        "writer_info": { "$ref": "#/$defs/writer_info" }
      }
    },
    "category_int_axis": {
      "type": "object",
      "description": "A set of integer categorical bins in any order.",
      "required": ["type", "categories", "flow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "category_int" },
        "categories": {
          "type": "array",
          "items": { "type": "integer" },
          "uniqueItems": true
        },
        "flow": {
          "type": "boolean",
          "description": "True if flow bin (at the overflow position) present."
        },
        "metadata": { "$ref": "#/$defs/metadata" },
        "writer_info": { "$ref": "#/$defs/writer_info" }
      }
    },
    "boolean_axis": {
      "type": "object",
      "description": "A simple true/false axis with no flow.",
      "required": ["type"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "boolean" },
        "metadata": { "$ref": "#/$defs/metadata" },
        "writer_info": { "$ref": "#/$defs/writer_info" }
      }
    },
    "int_storage": {
      "type": "object",
      "description": "A storage holding integer counts.",
      "required": ["type", "values"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "int" },
        "values": { "$ref": "#/$defs/data_array" }
      }
    },
    "double_storage": {
      "type": "object",
      "description": "A storage holding floating point counts.",
      "required": ["type", "values"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "double" },
        "values": { "$ref": "#/$defs/data_array" }
      }
    },
    "weighted_storage": {
      "type": "object",
      "description": "A storage holding floating point counts and variances.",
      "required": ["type", "values", "variances"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "weighted" },
        "values": { "$ref": "#/$defs/data_array" },
        "variances": { "$ref": "#/$defs/data_array" }
      }
    },
    "mean_storage": {
      "type": "object",
      "description": "A storage holding 'profile'-style floating point counts, values, and variances.",
      "required": ["type", "counts", "values", "variances"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "mean" },
        "counts": { "$ref": "#/$defs/data_array" },
        "values": { "$ref": "#/$defs/data_array" },
        "variances": { "$ref": "#/$defs/data_array" }
      }
    },
    "weighted_mean_storage": {
      "type": "object",
      "description": "A storage holding 'profile'-style floating point ∑weights, ∑weights², values, and variances.",
      "required": [
        "type",
        "sum_of_weights",
        "sum_of_weights_squared",
        "values",
        "variances"
      ],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "weighted_mean" },
        "sum_of_weights": { "$ref": "#/$defs/data_array" },
        "sum_of_weights_squared": { "$ref": "#/$defs/data_array" },
        "values": { "$ref": "#/$defs/data_array" },
        "variances": { "$ref": "#/$defs/data_array" }
      }
    }
  }
}
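
The recursive "ndarray" and "data_array" definitions in the schema can be mirrored in plain Python, which may help when reasoning about what inline data is accepted. This is a sketch with hypothetical function names. Like the schema, it does not require the nested lists to be rectangular.

```python
def is_ndarray(value) -> bool:
    """Mirror #/$defs/ndarray: a list whose items are numbers or nested ndarrays."""
    if not isinstance(value, list):
        return False
    return all(
        # bool is a subclass of int in Python, but JSON booleans are not numbers.
        (isinstance(item, (int, float)) and not isinstance(item, bool))
        or is_ndarray(item)
        for item in value
    )


def is_data_array(value) -> bool:
    """Mirror #/$defs/data_array: a path string or an inline ndarray."""
    return isinstance(value, str) or is_ndarray(value)
```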