Serialization

Warning

Serialization is currently in draft. Once at least one implementation is ready, we will remove this warning and release UHI 0.5.

Introduction

Histogram serialization has to cover a wide range of formats. As such, we describe a serialization form that expresses the metadata structure in a JSON-like way, with a JSON Schema provided. The data (bin contents and/or variable edges) is stored out-of-band in a binary format that depends on the type of data file being used. For very small (primarily 1D) histograms, the data may also be stored inline.

The following formats are being targeted:

ROOT
HDF5
ZIP

Other formats can be used as well, assuming they support out-of-band data and text attributes or files for the metadata.

Caveats

This structure is based heavily on boost-histogram, but it is intended to be general and can be expanded in the future as needed. As such, the following limitations apply:

Serialization followed by deserialization may cause axis changes. Axis types may change to an equivalent but less performant axis, growth status will be lost, etc.
Metadata must be expressible as JSON. It should also be reasonably sized; some formats like HDF5 may limit the size of attributes to 64K.
Floating point errors could be incurred on conversion, as the storage format uses a stable but different representation.
The axis name is only part of the metadata and is not standardized. This is due to a lack of support in boost-histogram.
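
The metadata caveat can be checked before writing anything. The following is a minimal sketch, not part of UHI: the helper name metadata_problems is hypothetical, and the 64K figure is only the indicative HDF5 attribute limit mentioned above, so a real backend may differ.

```python
import json

# Assumption: 64K is used only as an indicative limit, following the
# remark above that formats such as HDF5 may cap attribute sizes.
ASSUMED_LIMIT_BYTES = 64 * 1024


def metadata_problems(metadata, limit=ASSUMED_LIMIT_BYTES):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(metadata, dict):
        return ["metadata must be a dictionary (a JSON object)"]
    try:
        # allow_nan=False rejects NaN/Infinity, which are not valid JSON.
        encoded = json.dumps(metadata, allow_nan=False).encode("utf-8")
    except (TypeError, ValueError) as err:
        return [f"not expressible as JSON: {err}"]
    if len(encoded) > limit:
        return [f"encoded size {len(encoded)} bytes exceeds {limit} bytes"]
    return []
```

An empty list means the dictionary is JSON-expressible and fits under the assumed limit.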

Design

The following axis types are supported:

"regular": A set of evenly spaced, continuous bins. Boost-histogram’s “integer” axis maps to this axis as well. Has upper, lower, bins, underflow, overflow, and circular properties. circular defaults to False if not present.
"variable": A continuous axis defined by bins+1 edges. Has edges, which is either an in-line list of numbers or a string pointing to an out-of-band data source. Also has underflow, overflow, and circular properties. circular defaults to False if not present.
"category_int": A non-continuous list of integer bins. Has categories, which is an in-line list of integers. Also has a flow property.
"category_str": A list of string bins. Has categories, which is an in-line list of strings. Also has a flow property.
"boolean": A true/false axis with no flow bins.

Axes with gaps are currently not supported.

All axes support metadata, a string-keyed dictionary of arbitrary, JSON-like data.
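
As an illustration of the axis descriptions above, here is a small sketch of three axes written as plain Python dictionaries. The helper regular_edges and the "label" metadata key are hypothetical and only used for this example; the property names follow the list above.

```python
import json


def regular_edges(lower, upper, bins):
    """Edges of a regular axis: bins + 1 evenly spaced values."""
    if bins < 1:
        raise ValueError("need at least one bin to compute edges")
    width = (upper - lower) / bins
    return [lower + i * width for i in range(bins + 1)]


regular_axis = {
    "type": "regular",
    "lower": 0.0,
    "upper": 1.0,
    "bins": 4,
    "underflow": True,
    "overflow": True,
    "circular": False,  # optional; treated as False when absent
    "metadata": {"label": "x"},  # hypothetical metadata key
}

# The same binning expressed as a variable axis with in-line edges.
variable_axis = {
    "type": "variable",
    "edges": regular_edges(0.0, 1.0, 4),
    "underflow": True,
    "overflow": True,
}

category_axis = {"type": "category_str", "categories": ["a", "b"], "flow": False}

# All three are plain JSON-compatible structures.
axes_json = json.dumps([regular_axis, variable_axis, category_axis])
```

The variable axis carries bins+1 edges for the same four bins, which is the kind of equivalent but less compact form a round trip might produce.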

The following storages are supported:

"int": A collection of integers. Boost-histogram’s Int64 and AtomicInt64 map to this, and sometimes Unlimited.
"double": A collection of 64-bit floating point values. Boost-histogram’s Double storage maps to this, and sometimes Unlimited.
"weighted": A collection of two arrays of 64-bit floating point values, "values" and "variances". Boost-histogram’s Weight storage maps to this.
"mean": A collection of three arrays of 64-bit floating point values, "counts", "values", and "variances". Boost-histogram’s Mean storage maps to this.
"weighted_mean": A collection of four arrays of 64-bit floating point values, "sum_of_weights", "sum_of_weights_squared", "values", and "variances". Boost-histogram’s WeightedMean storage maps to this.

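To make the storage layout concrete, the sketch below builds an in-line "weighted" storage for a single boolean axis (two bins, no flow). The helper weighted_storage and the histogram name "example_hist" are hypothetical. It assumes the usual weighted-fill convention in which the value is the sum of weights and the variance is the sum of squared weights.

```python
import json


def weighted_storage(weights_per_bin):
    """Build an in-line "weighted" storage from the weights filled into each bin."""
    return {
        "type": "weighted",
        "data": {
            # Sum of weights per bin.
            "values": [sum(ws) for ws in weights_per_bin],
            # Sum of squared weights per bin.
            "variances": [sum(w * w for w in ws) for ws in weights_per_bin],
        },
    }


storage = weighted_storage([[1.0, 0.5], [2.0]])

histograms = {
    "example_hist": {  # hypothetical histogram name
        "metadata": {"note": "illustrative only"},
        "axes": [{"type": "boolean"}],
        "storage": storage,
    }
}

text = json.dumps(histograms, indent=2)
```

In-line data like this is only intended for very small histograms; larger ones would use a string pointing to out-of-band data instead.
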
CLI/API

You can currently test a JSON file against the schema by running:

$ python -m uhi.schema some/file.json

Or with code:

import uhi.schema

uhi.schema.validate("some/file.json")

Eventually this should also be usable for JSON stored inside ZIP files, HDF5 attributes, and maybe more.
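
To try the validator, a file first has to exist on disk. A minimal sketch of producing one with only the standard library follows; the function write_histogram_json, the file name histograms.json, and the histogram name "counts" are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_histogram_json(histograms, directory):
    """Write the metadata dictionary to <directory>/histograms.json."""
    path = Path(directory) / "histograms.json"  # hypothetical file name
    path.write_text(json.dumps(histograms, indent=2), encoding="utf-8")
    return path


histograms = {
    "counts": {  # hypothetical histogram name
        "axes": [{"type": "category_int", "categories": [1, 2, 3], "flow": False}],
        "storage": {"type": "int", "data": [4, 0, 7]},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_histogram_json(histograms, tmp)
    # While the file exists, its path could be handed to the validator
    # shown above (python -m uhi.schema, or uhi.schema.validate).
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```

The round trip through json.loads only confirms that the file is well-formed JSON; checking it against the schema is still the job of the tool above.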

Warning

Currently, this spec describes how to prepare the metadata for one of the targeted backends. It does not yet cover backend-specific details, such as how to define and use the binary resource locator strings or how to store the data. JSON is not a target spec, but just part of the ZIP spec. This means that files that currently “pass” the tool above will eventually be valid inside a .zip file, but are not valid by themselves.

Rendered schema

Histogram

https://raw.githubusercontent.com/scikit-hep/uhi/main/src/uhi/resources/histogram.schema.json
type	object
patternProperties
.+	type	object
	properties
	metadata	Arbitrary metadata dictionary.
		type	object
	axes	A list of the axes of the histogram.
		type	array
		items	oneOf	:ref:`#/$defs/regular_axis`
				:ref:`#/$defs/variable_axis`
				:ref:`#/$defs/category_str_axis`
				:ref:`#/$defs/category_int_axis`
				:ref:`#/$defs/boolean_axis`
	storage	The storage of the bins of the histogram.
		oneOf	:ref:`#/$defs/int_storage`
			:ref:`#/$defs/double_storage`
			:ref:`#/$defs/weighted_storage`
			:ref:`#/$defs/mean_storage`
			:ref:`#/$defs/weighted_mean_storage`
	additionalProperties	False
additionalProperties	False
$defs
regular_axis	An evenly spaced set of continuous bins.
	type	object
	properties
	type	type	string
		const	regular
	lower	Lower edge of the axis.
		type	number
	upper	Upper edge of the axis.
		type	number
	bins	Number of bins in the axis.
		type	integer
		minimum	0
	underflow	True if there is a bin for underflow.
		type	boolean
	overflow	True if there is a bin for overflow.
		type	boolean
	circular	True if the axis wraps around.
		type	boolean
	metadata	Arbitrary metadata dictionary.
		type	object
	additionalProperties	False
variable_axis	A variably spaced set of continuous bins.
	type	object
	properties
	type	type	string
		const	variable
	edges	oneOf	type	array
			items	type	number
			A path (URI?) to the edges data.
			type	string
	underflow	type	boolean
	overflow	type	boolean
	circular	type	boolean
	metadata	Arbitrary metadata dictionary.
		type	object
	additionalProperties	False
category_str_axis	A set of string categorical bins.
	type	object
	properties
	type	type	string
		const	category_str
	categories	type	array
		items	type	string
		uniqueItems	True
	flow	True if flow bin (at the overflow position) present.
		type	boolean
	metadata	Arbitrary metadata dictionary.
		type	object
	additionalProperties	False
category_int_axis	A set of integer categorical bins in any order.
	type	object
	properties
	type	type	string
		const	category_int
	categories	type	array
		items	type	integer
		uniqueItems	True
	flow	True if flow bin (at the overflow position) present.
		type	boolean
	metadata	Arbitrary metadata dictionary.
		type	object
	additionalProperties	False
boolean_axis	A simple true/false axis with no flow.
	type	object
	properties
	type	type	string
		const	boolean
	metadata	Arbitrary metadata dictionary.
		type	object
	additionalProperties	False
int_storage	A storage holding integer counts.
	type	object
	properties
	type	type	string
		const	int
	data	oneOf	A path (URI?) to the integer bin data.
			type	string
			type	array
			items	type	integer
	additionalProperties	False
double_storage	A storage holding floating point counts.
	type	object
	properties
	type	type	string
		const	double
	data	oneOf	A path (URI?) to the floating point bin data.
			type	string
			type	array
			items	type	number
	additionalProperties	False
weighted_storage	A storage holding floating point counts and variances.
	type	object
	properties
	type	type	string
		const	weighted
	data	oneOf	A path (URI?) to the floating point bin data; outer dimension is [value, variance]
			type	string
			type	object
			properties
			values	type	array
				items	type	number
			variances	type	array
				items	type	number
			additionalProperties	False
	additionalProperties	False
mean_storage	A storage holding ‘profile’-style floating point counts, values, and variances.
	type	object
	properties
	type	type	string
		const	mean
	data	oneOf	A path (URI?) to the floating point bin data; outer dimension is [counts, value, variance]
			type	string
			type	object
			properties
			counts	type	array
				items	type	number
			values	type	array
				items	type	number
			variances	type	array
				items	type	number
			additionalProperties	False
	additionalProperties	False
weighted_mean_storage	A storage holding ‘profile’-style floating point ∑weights, ∑weights², values, and variances.
	type	object
	properties
	type	type	string
		const	weighted_mean
	data	oneOf	A path (URI?) to the floating point bin data; outer dimension is [∑weights, ∑weights², value, variance]
			type	string
			type	object
			properties
			sum_of_weights	type	array
				items	type	number
			sum_of_weights_squared	type	array
				items	type	number
			values	type	array
				items	type	number
			variances	type	array
				items	type	number
			additionalProperties	False
	additionalProperties	False

Full schema

The full schema is below:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://raw.githubusercontent.com/scikit-hep/uhi/main/src/uhi/resources/histogram.schema.json",
  "title": "Histogram",
  "type": "object",
  "additionalProperties": false,
  "patternProperties": {
    ".+": {
      "type": "object",
      "required": ["axes", "storage"],
      "additionalProperties": false,
      "properties": {
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        },
        "axes": {
          "type": "array",
          "description": "A list of the axes of the histogram.",
          "items": {
            "oneOf": [
              { "$ref": "#/$defs/regular_axis" },
              { "$ref": "#/$defs/variable_axis" },
              { "$ref": "#/$defs/category_str_axis" },
              { "$ref": "#/$defs/category_int_axis" },
              { "$ref": "#/$defs/boolean_axis" }
            ]
          }
        },
        "storage": {
          "description": "The storage of the bins of the histogram.",
          "oneOf": [
            { "$ref": "#/$defs/int_storage" },
            { "$ref": "#/$defs/double_storage" },
            { "$ref": "#/$defs/weighted_storage" },
            { "$ref": "#/$defs/mean_storage" },
            { "$ref": "#/$defs/weighted_mean_storage" }
          ]
        }
      }
    }
  },
  "$defs": {
    "regular_axis": {
      "type": "object",
      "description": "An evenly spaced set of continuous bins.",
      "required": ["type", "lower", "upper", "bins", "underflow", "overflow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "regular" },
        "lower": { "type": "number", "description": "Lower edge of the axis." },
        "upper": { "type": "number", "description": "Upper edge of the axis." },
        "bins": {
          "type": "integer",
          "minimum": 0,
          "description": "Number of bins in the axis."
        },
        "underflow": {
          "type": "boolean",
          "description": "True if there is a bin for underflow."
        },
        "overflow": {
          "type": "boolean",
          "description": "True if there is a bin for overflow."
        },
        "circular": {
          "type": "boolean",
          "description": "True if the axis wraps around."
        },
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        }
      }
    },
    "variable_axis": {
      "type": "object",
      "description": "A variably spaced set of continuous bins.",
      "required": ["type", "edges", "underflow", "overflow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "variable" },
        "edges": {
          "oneOf": [
            {
              "type": "array",
              "items": { "type": "number" },
              "minItems": 2,
              "uniqueItems": true
            },
            {
              "type": "string",
              "description": "A path (URI?) to the edges data."
            }
          ]
        },
        "underflow": { "type": "boolean" },
        "overflow": { "type": "boolean" },
        "circular": { "type": "boolean" },
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        }
      }
    },
    "category_str_axis": {
      "type": "object",
      "description": "A set of string categorical bins.",
      "required": ["type", "categories", "flow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "category_str" },
        "categories": {
          "type": "array",
          "items": { "type": "string" },
          "uniqueItems": true
        },
        "flow": {
          "type": "boolean",
          "description": "True if flow bin (at the overflow position) present."
        },
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        }
      }
    },
    "category_int_axis": {
      "type": "object",
      "description": "A set of integer categorical bins in any order.",
      "required": ["type", "categories", "flow"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "category_int" },
        "categories": {
          "type": "array",
          "items": { "type": "integer" },
          "uniqueItems": true
        },
        "flow": {
          "type": "boolean",
          "description": "True if flow bin (at the overflow position) present."
        },
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        }
      }
    },
    "boolean_axis": {
      "type": "object",
      "description": "A simple true/false axis with no flow.",
      "required": ["type"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "boolean" },
        "metadata": {
          "type": "object",
          "description": "Arbitrary metadata dictionary."
        }
      }
    },
    "int_storage": {
      "type": "object",
      "description": "A storage holding integer counts.",
      "required": ["type", "data"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "int" },
        "data": {
          "oneOf": [
            {
              "type": "string",
              "description": "A path (URI?) to the integer bin data."
            },
            { "type": "array", "items": { "type": "integer" } }
          ]
        }
      }
    },
    "double_storage": {
      "type": "object",
      "description": "A storage holding floating point counts.",
      "required": ["type", "data"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "double" },
        "data": {
          "oneOf": [
            {
              "type": "string",
              "description": "A path (URI?) to the floating point bin data."
            },
            { "type": "array", "items": { "type": "number" } }
          ]
        }
      }
    },
    "weighted_storage": {
      "type": "object",
      "description": "A storage holding floating point counts and variances.",
      "required": ["type", "data"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "weighted" },
        "data": {
          "oneOf": [
            {
              "type": "string",
              "description": "A path (URI?) to the floating point bin data; outer dimension is [value, variance]"
            },
            {
              "type": "object",
              "required": ["values", "variances"],
              "additionalProperties": false,
              "properties": {
                "values": { "type": "array", "items": { "type": "number" } },
                "variances": { "type": "array", "items": { "type": "number" } }
              }
            }
          ]
        }
      }
    },
    "mean_storage": {
      "type": "object",
      "description": "A storage holding 'profile'-style floating point counts, values, and variances.",
      "required": ["type", "data"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "mean" },
        "data": {
          "oneOf": [
            {
              "type": "string",
              "description": "A path (URI?) to the floating point bin data; outer dimension is [counts, value, variance]"
            },
            {
              "type": "object",
              "required": ["counts", "values", "variances"],
              "additionalProperties": false,
              "properties": {
                "counts": { "type": "array", "items": { "type": "number" } },
                "values": { "type": "array", "items": { "type": "number" } },
                "variances": { "type": "array", "items": { "type": "number" } }
              }
            }
          ]
        }
      }
    },
    "weighted_mean_storage": {
      "type": "object",
      "description": "A storage holding 'profile'-style floating point ∑weights, ∑weights², values, and variances.",
      "required": ["type", "data"],
      "additionalProperties": false,
      "properties": {
        "type": { "type": "string", "const": "weighted_mean" },
        "data": {
          "oneOf": [
            {
              "type": "string",
              "description": "A path (URI?) to the floating point bin data; outer dimension is [∑weights, ∑weights², value, variance]"
            },
            {
              "type": "object",
              "required": [
                "sum_of_weights",
                "sum_of_weights_squared",
                "values",
                "variances"
              ],
              "additionalProperties": false,
              "properties": {
                "sum_of_weights": {
                  "type": "array",
                  "items": { "type": "number" }
                },
                "sum_of_weights_squared": {
                  "type": "array",
                  "items": { "type": "number" }
                },
                "values": { "type": "array", "items": { "type": "number" } },
                "variances": { "type": "array", "items": { "type": "number" } }
              }
            }
          ]
        }
      }
    }
  }
}
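
The "required" lists for the axis definitions in the schema above can be mirrored in a small standard-library check. This is only a sketch of a partial pre-check, not a replacement for schema validation; the names REQUIRED_AXIS_KEYS and missing_axis_keys are hypothetical.

```python
# Required keys per axis "type", copied from the "required" lists
# of the axis definitions in the schema.
REQUIRED_AXIS_KEYS = {
    "regular": ["type", "lower", "upper", "bins", "underflow", "overflow"],
    "variable": ["type", "edges", "underflow", "overflow"],
    "category_str": ["type", "categories", "flow"],
    "category_int": ["type", "categories", "flow"],
    "boolean": ["type"],
}


def missing_axis_keys(axis):
    """Return the required keys that are absent from an axis dictionary."""
    kind = axis.get("type")
    if kind not in REQUIRED_AXIS_KEYS:
        raise ValueError(f"unknown axis type: {kind!r}")
    return [key for key in REQUIRED_AXIS_KEYS[kind] if key not in axis]
```

For example, a regular axis that omits its flow settings reports both missing keys, while a bare boolean axis reports none. Value types and additional properties are not checked here.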