Serialization

This document summarizes the serialization efforts of OSCAR, how it works, and what our long-term vision is. Serialization broadly speaking is the process of reading and writing data. There are many reasons for this feature in OSCAR, but the main reason is communication on mathematics by mathematicians.

We implement our serialization in accordance with the MaRDI file format specification described here. Which means we use a JSON extension to serialize data.

How it works

The mechanism for saving and loading is very simple. It is implemented via two methods save and load, and works in the following manner:

julia> save("/tmp/fourtitwo.mrdi", 42);

julia> load("/tmp/fourtitwo.mrdi")
42

The filename hints to the MaRDI file format, which employs JSON. The file looks as follows:

{
  "_ns": {
    "Oscar": [
      "https://github.com/oscar-system/Oscar.jl",
      "0.14.0-DEV-8fe2abbe39890a7d3324adcba7f91812119c586a"
    ]
  },
  "_type": "Base.Int",
  "data": "42"
}

It contains the precise version of OSCAR used for this serialization. The content is "42", it represents a Base.Int, according to the _type field.

Implementation

To list and describe all implementations and encodings of all types in OSCAR is not a possible feat due to the arbitrarily deep and nested type structures available in OSCAR, we point any developer looking to understand the encodings of certain types to the OSCAR source code. All files for serialization can be found in the folder src/Serialization. The convention of the files there follows the overall structure of OSCAR, i.e. the file src/Serialization/PolyhedralGeometry.jl contains functions for serializing objects of the polyhedral geometry section.

We include a basic example of the encoding for a QQPolyRingElem so one can get a taste before delving into the source code. Here we store the polynomial x^3 + 2x + 1//2. The encoding for polynomials is to store a list of tuples, where each entry in the list represents a term of the polynomial and where the first entry of the tuple is the exponent and the second entry is the coefficient. Here we serialize a univariate polynomial so the first entries are always integers, in general this may be an array of integers. The coefficients here are elements of QQ however in general the coefficients themselves may be described as polynomials of the generators of some field extension, i.e. the second entry may again be a list of tuples and so on. The nested structure of the coefficient will depend on the description of the field extension.

{
  "_ns": {
    "Oscar": [
      "https://github.com/oscar-system/Oscar.jl",
      "1.1.0-DEV-6f7e717c759f5fc281b64f665c28f58578013c21"
    ]
  },
  "_refs": {
    "e6c5972c-4052-4408-a408-0f4f11f21e49": {
      "_type": "PolyRing",
      "data": {
        "base_ring": {
          "_type": "QQField"
        },
        "symbols": [
          "x"
        ]
      }
    }
  },
  "_type": {
    "name": "PolyRingElem",
    "params": "e6c5972c-4052-4408-a408-0f4f11f21e49"
  },
  "data": [  [      "0",      "1//2"    ],
	     [      "1",      "2"       ],
	     [      "3",      "1"       ]  ]
}

When trying to understand an encoding of a particular type in OSCAR it is always best to make a minimal example and store it. Then use a pretty printer to format the JSON, and have it close by while going through the source code.

Description of the saving and loading mechanisms

We require that any types serialized through OSCAR are registered using @register_serialization_type. This is to ensure user safety during the load process by avoiding code evaluation.

@register_serialization_typeMacro
@register_serialization_type NewType "String Representation of type" uses_id uses_params

@register_serialization_type is a macro to ensure that the string we generate matches exactly the expression passed as first argument, and does not change in unexpected ways when import/export statements are adjusted.

The last three arguments are optional and can arise in any order. Passing a string argument will override how the type is stored as a string. The last two are boolean flags. When setting uses_id the object will be stored as a reference and will be referred to throughout the serialization sessions using a UUID. This should typically only be used for types that do not have a fixed normal form for example PolyRing and MPolyRing. Using the uses_params flag will serialize the object with a more structured type description which will make the serialization more efficient see the discussion on save_type_params / load_type_params below.

source

There are three pairs of saving and loading functions that are used during serialization:

  1. save_typed_object, load_typed_object
  2. save_object, load_object
  3. save_type_params, load_type_params

save_type_object / load_type_object

For the most part these functions should not be touched, they are high level functions and are used to (de)serialize the object with its type information as well as its data. The data and type nodes are set in save_typed_object resulting in a "data branch" and "type branch". The usage of these functions can be used inside save_object / load_object and save_type_params / load_type_params. However using save_typed_object inside a save_object implementation will lead to a verbose format and should at some point be moved to save_type_params. Their implemention can be found in the main.jl file.

save_object / load_object

These functions should be the first functions to be overloaded when implementing the serialization of a new type. The functions save_data_dict and save_data_array are helpers functions that structure the serialization.

The examples show how they can be used to save data using the structure of an array or dict. Each nested call to save_data_dict or save_data_array should be called with a key that can be passed as the second parameter.

Examples
Example 1
function save_object(s::SerializerState, obj::NewType)
  save_data_array(s) do
    save_object(s, obj.1)
    save_object(s, obj.2)
   save_data_dict(s) do
      save_object(s, obj.3, :key1)
      save_object(s, obj.4, :key2)
    end
  end
end

This will result in a data format that looks like this.

[
  obj.1,
  obj.2,
  {
    "key1": obj.3,
    "key2": obj.4
  }
]

With the corresponding loading function similar to this.

function load_object(s::DeserializerState, ::Type{<:NewType})
  (obj1, obj2, obj3_4) = load_array_node(s) do (i, entry)
    if entry isa JSON3.Object
      obj3 = load_object(s, Obj3Type, :key1)
      obj4 = load_object(s, Obj3Type, :key2)
      return OtherType(obj3, obj4)
    else
      if p(entry) == c
        load_object(s, Obj1Type)
      else
        load_object(s, Obj2Type)
      end
    end
  end
  return NewType(obj1, obj2, obj3_4)
end
Example 2
function save_object(s::SerializerState, obj::NewType)
  save_data_dict(s) do
    save_object(s, obj.1, :key1)
    save_data_array(s, :key2) do
      save_object(s, obj.3)
      save_typed_object(s, obj.4) # This is ok
    end
  end
end

This will result in a data format that looks like this.

{
  "key1": obj.1,
  "key2":[
    obj.3,
    {
      "type": "Type of obj.4",
      "data": obj.4
    }
  ]
}

The corresponding loading function would look something like this.

function load_object(s::DeserializerState, ::Type{<:NewType}, params::ParamsObj)
   obj1 = load_object(s, Obj1Type, params[1], :key1)

   (obj3, obj4) = load_array_node(s, :key2) do (i, entry)
     if i == 1
       load_object(s, Obj3Type, params[2])
     else
       load_typed_object(s)
     end
   end
   return NewType(obj1, OtherType(obj3, obj4))
 end

This is ok

function save_object(s::SerializerState, obj:NewType)
  save_object(s, obj.1)
end

While this will throw an error

function save_object(s::SerializerState, obj:NewType)
  save_object(s, obj.1, :key)
end

If you insist on having a key you should use a save_data_dict.

function save_object(s::SerializerState, obj:NewType)
  save_data_dict(s) do
    save_object(s, obj.1, :key)
  end
end

function load_object(s::SerializerState, ::Type{<:NewType})
  load_node(s, :key) do x
    info = do_something(x)

    if info
      load_object(s, OtherType)
    else
      load_object(s, AnotherType)
    end
  end
end

Note for now save_typed_object must be wrapped in either a save_data_array or save_data_dict. Otherwise you will get a key override error.

save_type_params / load_type_params

The serialization mechanism stores data in the format of a tree, with the exception that some nodes may point to a shared reference. The "data branch" is anything that is a child node of a data node, whereas the "type branch" is any information that is stored in a node that is a child of a type node. Avoiding type information inside the data branch will lead to a more efficient serialization format. When the uses_params is set when registering the type with @register_serialization_type (de)serialization will use save_type_params / load_type_params to format the type information. In general we expect that implementing a save_type_params and load_type_params should not always be necessary. Many types will serialize their types in a similar fashion for example serialization of a FieldElem will use the save_type_params / load_type_params from RingElem since in both cases the only parameter needed for such types is their parent.

Serializers

The code for the different types of serializers and their states is found in the serializers.jl file. Different serializers have different use cases, the default serializer JSONSerializer is used for writting to a file. Currently the only other serializer is the IPCSerializer which at the moment is quite similar to the JSONSerializer except that it does not store the refs of any types that are registered with the uses_id flag. When using the IPCSerializer it is left up to the user to guarantee that any refs required by a process are sent prior.

Upgrades

All upgrade scripts can be found in the src/Serialization/Upgrades folder. The mechanics of upgrading are found in the main.jl file where the Oscar.upgrade function provides the core functionality. Upgrading is triggered during load when the version of the file format to be loaded is older than the current Oscar version.

upgradeFunction
upgrade(format_version::VersionNumber, dict::Dict)

Finds the first version where an upgrade can be applied and then incrementally upgrades to each intermediate version until the structure of the current version has been achieved.

source
upgrade_dataFunction
upgrade_data(upgrade::Function, s::UpgradeState, dict::Dict)

upgrade_data is a helper function that provides functionality for recursing on the tree structure. It is independent of any particular file format version and can be used in any upgrade script.

source

Upgrade Scripts

All upgrade scripts should be contained in a file named after the version they upgrade to. For example a script that upgrades to Oscar version 0.13.0 should be named 0.13.0.jl.

UpgradeScriptType
UpgradeScript(version::VersionNumber, script::Function)

Any upgrade scripts schould be created using the UpgradeScript constructor and then pushed to the upgrade_scripts_set. The name of the function is not particularly important however one should be assigned so that it can be used for recursion if necessary for upgrade. The upgrde script should be independent from any Oscar code and should relie solely on julia core functionality so to avoid conflicts with any future Oscar code deprecations.

Example

push!(upgrade_scripts_set, UpgradeScript(
  v"0.13.0",
  function upgrade_0_13_0(s::UpgradeState, dict::Dict)
      ...
  end
))
source

Challenges

This section documents the various challenges we (will) encounter while implementing this feature.

  • OSCAR is based on several subsystems, some of which already have their own serialization. We want this to be compatible, if possible in both directions.
  • Many mathematical objects need context to be understood. A polynomial needs the ring it lives in, a group element needs the surrounding group, a divisor needs the underlying variety, etc. We will need a way to store this context along the objects.
  • Context should not be stored twice: A matrix of polynomials should only store the surrounding ring once.
  • Support other data formats: It has been proposed to not only support JSON, but binary formats needed for HPC communication as well. It is unclear whether this needs a separate implementation.
  • Versioning and upgrading: Work on OSCAR will change what its objects look like. Nevertheless, we still want to be able load data written by older versions of OSCAR. For this we intend to develop an upgrade mechanism.

Another important point is the wider mathematical context of the data and code. For data associated to a publication, this context is provided by the paper.

Goals

The general goal is to make mathematical data FAIR, a goal for which we cooperate with the MaRDI project.

The ramifications of making mathematical data FAIR are manifold.

  • It becomes easier to exchange data and code with fellow mathematicians, enhancing communication and boosting research.
  • Computer experiments and new implementations require a lot of work and hence deserve to be recognized in form of a publication. Standardizing data plays an important role for this process.
  • Future generations of mathematicians will be able to reuse both data and code if we establish a FAIR culture.

External Implementations

Any external body implementing a save/load following the .mrdi format specification and using the OSCAR namespace should be sure to check validity against our schema defined here.

We make no attempt whatsoever to verify the mathematics of the file, and neither should anyone implementing a save/load. Loading should not throw a parse error if the mathematics of the file is incorrect, the file should be parsed and allow the computer algebra system to throw the error. We cannot guarantee that any file that has been manipulated by hand is still valid and should be validated against the schema. In the same way we cannot guarantee that any files created externally are valid in terms of the mathematics either, these will not lead to a parse error but instead will be handle as though the incorrect input has been passed to one of the Oscar functions.

External implementations should not be expected to read or write all possible Oscar types. It is perfectly valid for external implementations to throw parse errors when a certain file format is unexpected. For example Oscar will parse a QQFieldElem that has data value "0 0 7 // - 1 0" as -7//10, even though this is not how it is serialized. We feel we should not restrict users when deserializing to formats that may have issues deserializing the same format externally.

Allowing extensions to JSON is not recommended, this is to keep the scope of possible software that can parse the given JSON as large as possible. For example some JSON extensions allow comments in the files, Oscar cannot parse such JSONs and we recommend that any comments should be placed in the meta field.

When writing UUIDs, adhere to version four UUIDs specified by RFC 4122.