Simple binary serialization

Simple binary serialization (SBS) features:

  • schema based

  • small schema language optimized for human readability

  • schema re-usability and organization into modules

  • small number of built in types

  • polymorphic data types

  • binary serialization

  • designed to enable simple and efficient implementation

  • optimized for small encoded data size

Example of SBS schema:

module Module

Entry(K, V) = Tuple {
    key: K
    value: V
}

Collection(K) = Union {
    null: None
    bool: Entry(K, Boolean)
    int: Entry(K, Integer)
    float: Entry(K, Float)
    str: Entry(K, String)
    bytes: Entry(K, Bytes)
}

IntKeyCollection = Collection(Integer)

StrKeyCollection = Collection(String)

Schema definition

SBS schemas are written as UTF-8 encoded files with the .sbs file extension. The characters , (comma), space, \t, \r and \n are treated as white-space and are ignored. The characters (, ), {, }, :, = and the white-space characters act as delimiters between identifiers. All other valid identifiers match the regex r'[A-Za-z][A-Za-z0-9_]*'. The character # starts a comment that extends to the end of the line.
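
The lexical rules above can be illustrated with a small tokenizer. This is only a sketch written for this document, not the parser used by any SBS implementation; the name tokenize is hypothetical, and treating . as a token of its own is an assumption made so that qualified names of the form <module_name>.<type_name> can be split.

```python
import re

# One alternative per lexical class described in the text.
_TOKEN = re.compile(
    r'(?P<skip>[, \t\r\n]+|#[^\n]*)'        # white-space (comma included) and comments
    r'|(?P<delim>[(){}:=.])'                # delimiters; '.' is an assumption
    r'|(?P<ident>[A-Za-z][A-Za-z0-9_]*)')   # identifiers


def tokenize(text):
    """Split schema text into a list of delimiter and identifier tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(
                f'unexpected character {text[pos]!r} at offset {pos}')
        # White-space and comments are matched but not emitted.
        if match.lastgroup != 'skip':
            tokens.append(match.group())
        pos = match.end()
    return tokens
```

Because the comma is white-space, Tuple { key: K, value: V } and Tuple { key: K value: V } produce the same token list.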

Each file represents a single SBS schema module. The name of the module is defined by the module <name> directive, where <name> is a user-defined module name. This directive is the only mandatory part of an SBS schema and should be placed at the beginning of each .sbs file. Example of a minimal valid SBS schema:

module ModuleName

The rest of an .sbs file contains an arbitrary number of user-defined types. Each type definition is written as <new_type>(<t1> <t2> ...) = <other_type> where:

  • <new_type>

    Name of new user-defined type.

  • <t1>, <t2>, …

    Identifiers representing parametric data type arguments used in the <other_type> definition. If no arguments are used, the parentheses can be omitted.

  • <other_type>

    Another user-defined or built-in type whose encoding is used for <new_type>. User-defined types are referenced as <module_name>.<type_name>(<t1> <t2> ...). If <type_name> refers to a type defined in the same module, the <module_name>. prefix can be omitted. If the user-defined type is not parametric, the parentheses should be omitted.

Builtin data types include:

  • simple data types

    • Boolean

      Data type with two possible values representing true and false.

    • Integer

      Unconstrained signed integer value.

    • Float

      Floating point value that can be encoded with 8 bytes according to IEEE 754.

    • String

      UTF-8 encoded string value.

    • Bytes

      Array of byte values of arbitrary length.

  • composite data types

    • Array(<t>)

      Parametric data type that defines arbitrary length Array where all elements are of type defined by <t>.

    • Tuple { <entry1>: <t1>, <entry2>: <t2>, ... }

      Collection of user-defined entries where each entry has entry identifier (<entry1>, <entry2>, …) and entry type (<t1>, <t2>, …). Encoded data must contain all entries specified by type definition.

    • Union { <entry1>: <t1>, <entry2>: <t2>, ... }

      Type that can represent one of the types defined by <t1>, <t2>, … Encoded data must contain exactly one entry, identified by its entry identifier (<entry1>, <entry2>, …).

  • derived data types

    These include predefined types that can be expressed as:

    None = Tuple {}
    
    Maybe(a) = Union {
        Nothing: None
        Just: a
    }
    

Data encoding

Boolean

A Boolean value is encoded as a single byte: 0x01 for true and 0x00 for false.

Integer

Signed integer values are encoded as a variable-length byte array. The most significant bit of every byte except the last is set to 0; the last byte’s most significant bit is 1. The concatenation of the remaining seven bits of each byte is the big-endian two’s complement binary representation of the integer value.

+-----------------+-------+-----------------+
|        0        |       |        m        |
| 7 6 5 4 3 2 1 0 |       | 7 6 5 4 3 2 1 0 |
+-----------------+  ...  +-----------------+
| 0 xn ... x(n-7) |       | 1   x6 ... x0   |
+-----------------+-------+-----------------+
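
The Integer rule can be sketched in Python as follows. The code follows the description in this section literally (continuation bytes have the top bit 0, the last byte has the top bit 1); it is not taken from any SBS implementation. The function names are hypothetical, and emitting the smallest number of 7-bit groups that still preserves the sign is an assumption, since the text does not say how many groups an encoder must use.

```python
def encode_integer(value):
    """Encode a Python int as 7-bit two's complement groups, big-endian."""
    groups = []
    while True:
        groups.append(value & 0x7F)   # lowest seven bits
        value >>= 7                   # arithmetic shift keeps the sign
        sign_bit = groups[-1] & 0x40  # top bit of the group just emitted
        # Stop once the remaining bits are pure sign extension.
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            break
    groups.reverse()                  # most significant group first
    groups[-1] |= 0x80                # mark the last byte
    return bytes(groups)


def decode_integer(data, offset=0):
    """Decode one Integer; return (value, offset after the last byte)."""
    # Sign-extend from the top payload bit of the first byte.
    value = -1 if data[offset] & 0x40 else 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:               # last byte reached
            return value, offset
```

Under these assumptions 0 encodes to the single byte 0x80, -1 to 0xFF, and 123 needs two bytes (0x00 0xFB) because a lone 7-bit group with bit 6 set would read back as a negative number.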

Float

Floating point values are encoded according to IEEE 754 binary64 (double precision) format.
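
As an illustration, Python's struct module can produce the binary64 layout directly. The byte order is not stated in this section, so the big-endian format character used below is an assumption, and the function names are hypothetical.

```python
import struct


def encode_float(value):
    """Pack a float as 8 bytes of IEEE 754 binary64."""
    # '>d' selects big-endian double precision; byte order is assumed.
    return struct.pack('>d', value)


def decode_float(data, offset=0):
    """Unpack one Float; return (value, offset after the 8 bytes)."""
    (value,) = struct.unpack_from('>d', data, offset)
    return value, offset + 8
```

Every Float occupies exactly 8 bytes, so no length prefix is needed.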

Bytes

A byte array is encoded “as is”, prefixed with its byte count encoded as Integer.

String

String value is encoded as UTF-8 encoded Bytes.
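
A minimal sketch of the Bytes and String rules. The Integer helper is repeated here, with the same assumptions as the Integer section (minimal number of 7-bit groups, top bit 1 on the last byte only), so that the block runs on its own; all function names are hypothetical.

```python
def encode_integer(value):
    """Integer encoding as described in the Integer section."""
    groups = []
    while True:
        groups.append(value & 0x7F)
        value >>= 7
        sign_bit = groups[-1] & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            break
    groups.reverse()
    groups[-1] |= 0x80
    return bytes(groups)


def encode_bytes(value):
    """Length prefix (byte count as Integer) followed by the raw bytes."""
    return encode_integer(len(value)) + bytes(value)


def encode_string(value):
    """A String is the Bytes encoding of its UTF-8 representation."""
    return encode_bytes(value.encode('utf-8'))
```

Note that the prefix counts bytes, not characters: a string containing one two-byte UTF-8 character is prefixed with the Integer 2.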

Array

An Array is encoded as the concatenation of its elements’ encodings, in order. These concatenated bytes are prefixed with the array’s element count encoded as Integer.

Tuple

A Tuple is encoded as the concatenation of its elements’ encodings, in the order the elements are defined in the schema.

Union

A Union is encoded as a single element, prefixed with that element’s zero-based index encoded as Integer.
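
The three composite encodings can be sketched as small functions that take per-element encoders. This illustrates the rules above and is not the library's internal design: the function names and the (name, encoder) list used to describe a schema are hypothetical, and the Integer helper repeats the assumptions of the Integer section so the block is self-contained.

```python
def encode_integer(value):
    """Integer encoding as described in the Integer section."""
    groups = []
    while True:
        groups.append(value & 0x7F)
        value >>= 7
        sign_bit = groups[-1] & 0x40
        if (value == 0 and not sign_bit) or (value == -1 and sign_bit):
            break
    groups.reverse()
    groups[-1] |= 0x80
    return bytes(groups)


def encode_none(value):
    """None = Tuple {} has no elements, so it encodes to zero bytes."""
    return b''


def encode_array(items, encode_item):
    """Element count as Integer, then each element in order."""
    return encode_integer(len(items)) + b''.join(
        encode_item(item) for item in items)


def encode_tuple(value, entries):
    """Concatenate entry encodings in schema order (no prefix)."""
    return b''.join(encode(value[name]) for name, encode in entries)


def encode_union(value, entries):
    """Zero-based entry index as Integer, then the selected entry."""
    name, data = value
    for index, (entry_name, encode) in enumerate(entries):
        if entry_name == name:
            return encode_integer(index) + encode(data)
    raise ValueError(f'unknown union entry {name!r}')


# Maybe(Integer) described as an ordered list of (name, encoder) pairs.
MAYBE_INTEGER = [('Nothing', encode_none), ('Just', encode_integer)]
```

With these definitions, ('Nothing', None) encodes to the single index byte 0x80, while ('Just', 123) encodes to the index byte 0x81 followed by the Integer encoding of 123.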

Python implementation

Simple binary serializer

This implementation of the SBS encoder/decoder translates between SBS types and Python types according to the following translation table:

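+----------+-----------------+
| SBS type | Python type     |
+----------+-----------------+
| Boolean  | bool            |
| Integer  | int             |
| Float    | float           |
| String   | str             |
| Bytes    | bytes           |
| Array    | List[Data]      |
| Tuple    | Dict[str,Data]  |
| Union    | Tuple[str,Data] |
+----------+-----------------+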

SBS Tuple and Union types without elements are translated to None.

Example usage of SBS serializer:

import hat.sbs

repo = hat.sbs.Repository('''
    module Module

    Entry(K, V) = Tuple {
        key: K
        value: V
    }

    T = Array(Maybe(Entry(String, Integer)))
''')
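# Each Maybe value is a Union, represented as an (entry name, data) pair.
data = [
    ('Nothing', None),
    ('Just', {
        'key': 'abc',
        'value': 123
    })
]
# Round trip: encode to bytes, decode back and compare.
encoded_data = repo.encode('Module', 'T', data)
decoded_data = repo.decode('Module', 'T', encoded_data)
assert data == decoded_data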

hat.sbs.default_schemas_sbs_path

Default path to the schemas_sbs directory.

Type

pathlib.Path

class hat.sbs.Repository(*args)

Bases: object

SBS schema repository.

Supported initialization arguments:
  • string containing sbs schema

  • file path to .sbs file

  • path to a directory that is recursively searched for .sbs files

  • other repository

Parameters

args (Union[Repository, pathlib.Path, str]) –

encode(module_name, type_name, value)

Encode value.

Parameters
  • module_name (Optional[str]) –

  • type_name (str) –

  • value (Union[bool, int, float, str, bytes, List[Data], Dict[str, Data], Tuple[str, Data]]) –

Return type

bytes

decode(module_name, type_name, data)

Decode data.

Parameters
  • module_name (Optional[str]) –

  • type_name (str) –

  • data (Union[bytes, bytearray, memoryview]) –

Return type

Union[bool, int, float, str, bytes, List[Data], Dict[str, Data], Tuple[str, Data]]

to_json()

Export repository content as json serializable data.

Entire repository content is exported as json serializable data. New repository can be created from the exported content by using Repository.from_json().

Return type

Union[None, bool, int, float, str, List[Data], Dict[str, Data]]

static from_json(data)

Create new repository from content exported as json serializable data.

Creates a new repository from content of another repository that was exported by using Repository.to_json().

Parameters

data (Union[pathlib.PurePath, bool, int, float, str, bytes, List[Data], Dict[str, Data], Tuple[str, Data]]) –

Return type

Repository

hat.sbs.Data

alias of Union[bool, int, float, str, bytes, List[Data], Dict[str, Data], Tuple[str, Data]]