How fast can safe, no-compromises, schema-oriented serialization be? Well, apparently ludicrously fast…

Over the past month, we have been working hard to add Rust support to Bebop, our open-source serialization language. This is a recounting of our work and results.

A Journey into Rust

We love Rust for a lot of reasons at Rainway, but for serialization it brings a few very useful qualities to the table. There are also a few unique challenges when it comes to code generation.

There were a few goals we wanted to achieve right out of the gate:

  1. Low-copy serialization making use of lifetimes for safe, raw-data references
  2. It should be idiomatic(-ish) to integrate nicely with the existing ecosystem
  3. We have to be faster than everyone else :)

Low-copy normally comes with a lot of baggage and unsafe assumptions, but thanks to the Rust compiler, we are able to borrow large chunks of data directly from the raw buffer they came in. In C++ this would be terribly unsafe and prone to foot carnage. Yes, you can do it, and yes, people do it, but in Rust we can ensure invariants are upheld when developers use the generated code, without the need for blaring warnings that are sure to be ignored at least occasionally. The compiler conveys some of our usage expectations for us.

The first step in our Rust adventure was creating a rough implementation by hand to get a feel for what the Bebop compiler should generate, and also as a chance to write the runtime library. With this came a definition for what encoding and decoding should look like—though it took a few tries to finally settle on the traits we have now, and more conveniences still could be added.

pub trait Record<'raw>: SubRecord<'raw> {
    const OPCODE: Option<u32> = None;

    fn deserialize(raw: &'raw [u8]) -> DeResult<Self> { /* ... */ }
    fn serialize<W: Write>(&self, dest: &mut W) -> SeResult<usize> { /* ... */ }
}

pub trait SubRecord<'raw>: Sized {
    const MIN_SERIALIZED_SIZE: usize;
    const EXACT_SERIALIZED_SIZE: Option<usize> = None;

    fn serialized_size(&self) -> usize;
    fn _serialize_chained<W: Write>(&self, dest: &mut W) -> SeResult<usize>;
    fn _deserialize_chained(raw: &'raw [u8]) -> DeResult<(usize, Self)>;
}

It will also be easy to add more ways to interact with record types down the road by extending this trait with new functions and convenience callers for different situations.

The main takeaway: Record is the outward-facing serialization entrypoint, and SubRecord is how we avoid generating a ton of duplicate code. We are able to "chain" the serialization and deserialization calls, relying on the compiler to optimize out function calls where appropriate.

Rustification

There are some noteworthy differences between how the Rust runtime and generated code look compared to the other implementations.

Serialize and Deserialize Functions

In the other languages, we have external functions for encoding and decoding records. While we could have taken the same approach, in Rust the ability to be serialized or deserialized is idiomatically expressed as a capability through a trait, so that is what we have done.

Also, if desired, trait functions can still be accessed statically using Record::serialize(&my_instance, &mut dest).
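To illustrate the two call styles, here is a minimal sketch (this is a toy trait, not the real bebop API) showing that the same trait method can be invoked with method syntax or fully qualified:

```rust
// Toy stand-in for the generated Record trait (names invented).
pub trait Record {
    fn serialize(&self, dest: &mut Vec<u8>) -> usize;
}

pub struct Ping;

impl Record for Ping {
    fn serialize(&self, dest: &mut Vec<u8>) -> usize {
        dest.push(0x01); // hypothetical one-byte payload
        1
    }
}

pub fn both_styles() -> (Vec<u8>, Vec<u8>) {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    Ping.serialize(&mut a);           // method syntax
    Record::serialize(&Ping, &mut b); // fully qualified (static) syntax
    (a, b)
}
```

Both calls produce identical output; the fully qualified form is just Rust's universal function-call syntax.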

We also chose to use serialize and deserialize as the main functions rather than Encode and Decode to be more consistent with other popular serialization frameworks in the language.

Further, instead of generating code for serializing base types inline, we make use of inline functions in the runtime and the SubRecord trait to chain serialization. This makes generation a little cleaner.

Namespaces

The Rust version has no namespace support whatsoever. Fundamentally this is because Rust as a language has a module-based approach over namespaces, and it would be very odd to have additional modules created since they won’t automatically merge with others of the same name and could cause conflicts instead.

Instead, the default generation approach (using the bebop-tools crate) is to create a new mod directory and generate a mod.rs file which defines all the individually generated bebop files.

Identifiers

Many languages lack a single standard for naming conventions cough C++ cough. Rust has one, though, and to maintain parity with the rest of your program even if the Bebop schemas were written with another language in mind, the identifiers are re-cased to Rust standards.

Enums, enum values, struct names, message names, union names, and union types are PascalCase; struct and message fields are snake_case; and constants are UPPER_SNAKE_CASE.

A potential gotcha is that Rust “namespaces” enum and union variants so structs/messages defined within a union are not globally accessible in Rust as they are in other languages.

Traits

Some more work still needs to be done on this, but right now we automatically derive many of the common traits that are possible such as Clone, Debug, and PartialEq.

Traits like Hash, Eq, and Ord are top candidates for future improvements.

Where reasonable, instead of creating custom functions, we try to use traits such as TryFrom<u32> for enums.
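As a sketch of what that looks like for enums (the enum name and error type here are invented, not the actual generated code), a TryFrom<u32> impl replaces a custom conversion function:

```rust
use core::convert::TryFrom;

// Hypothetical generated enum; the real error type differs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Instrument {
    Sax = 0,
    Trumpet = 1,
    Clarinet = 2,
}

impl TryFrom<u32> for Instrument {
    type Error = u32; // sketch: return the unrecognized value

    fn try_from(value: u32) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Instrument::Sax),
            1 => Ok(Instrument::Trumpet),
            2 => Ok(Instrument::Clarinet),
            other => Err(other),
        }
    }
}
```

Using the standard trait means the enums compose with generic code that already knows about TryFrom.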

Generating Code

With our new Record definition and serialization set up for all types that schemas can be composed of, all that was left was most of the work: generating code.

Anyone who has written code generation before, especially for Rust, is probably aware how hard it is to just "throw code out there" because of all the static checks the compiler makes. These checks are amazing when you are writing code by hand or need it to be maintainable, but they cause a lot of headaches during code generation, since they force more if statements and conditional generation cases. (The benefit is that more of the AST's information is made explicit in code.)

It is for this reason, I believe, that most serialization implementations avoid borrowed types, which introduce even more syntax challenges.

The good news is even though it takes more work to generate, we are able to define static checks for code which interacts with ours to ensure it is being used correctly. Things that would be very unsafe in a language like C++ (such as having your deserialized value reference the original buffer) are safe and clearly defined in Rust.

As an aside, we did consider using macros to offload most of the work to Rust, and while down the road some of this might get shuffled into macros, the choice to go primarily with bebopc writing the code was made to be consistent with the other implementations and with how many other schema-based serializers work. How we are doing it now produces more readable generated code at the cost of readability inside bebopc itself.

Structures and Messages

Structs are pretty straightforward, but do require a little chaining work (recursive trait calls) for serialization and deserialization. Ultimately, we just iterate over all fields in the AST and then call the relevant SubRecord function for each.
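A rough sketch of that chaining (with the SubRecord trait simplified to a Vec<u8> destination, and the struct invented for illustration):

```rust
// Simplified stand-in for the runtime's SubRecord trait.
trait SubRecord {
    fn _serialize_chained(&self, dest: &mut Vec<u8>) -> usize;
}

// The runtime provides impls for base types; Bebop is little-endian.
impl SubRecord for u32 {
    fn _serialize_chained(&self, dest: &mut Vec<u8>) -> usize {
        dest.extend_from_slice(&self.to_le_bytes());
        4
    }
}

struct Point {
    x: u32,
    y: u32,
}

// What bebopc emits for a struct: one chained call per field, sizes summed.
impl SubRecord for Point {
    fn _serialize_chained(&self, dest: &mut Vec<u8>) -> usize {
        self.x._serialize_chained(dest) + self.y._serialize_chained(dest)
    }
}
```

The compiler is free to inline each chained call, so this reads as clean generated code without costing anything at runtime.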

Messages are a little more interesting, but have a great representation in Rust by wrapping all fields in Option<T>.

#[derive(Clone, Debug, PartialEq, Default)]
pub struct Song<'raw> {
    pub title: ::core::option::Option<&'raw str>,
    pub year: ::core::option::Option<u16>,
    pub performers: ::core::option::Option<::std::vec::Vec<Performer<'raw>>>,
}

Serialization of message types requires some if-let pattern matching—not too bad.
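The if-let shape looks roughly like this (heavily simplified: the real wire format also writes a leading length prefix and length-prefixes strings; field indices here follow the Song example):

```rust
struct Song {
    title: Option<String>,
    year: Option<u16>,
}

// Sketch: each present field is written as its index byte followed by its
// value; absent fields are skipped; a zero byte terminates the message.
fn serialize_song(s: &Song, dest: &mut Vec<u8>) {
    if let Some(title) = &s.title {
        dest.push(1); // field index
        dest.extend_from_slice(title.as_bytes());
    }
    if let Some(year) = s.year {
        dest.push(2);
        dest.extend_from_slice(&year.to_le_bytes());
    }
    dest.push(0); // end-of-message marker
}
```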

Deserialization is where it gets interesting because we don’t know what fields we are going to get, and each one being a different size means we have to handle them one at a time.

(You can read more on message encoding format here)

// Some code omitted for brevity
fn _deserialize_chained(raw: &'raw [u8]) -> ::bebop::DeResult<(usize, Self)> {
    let mut i = 0;
    let len = ::bebop::read_len(&raw[i..])? + ::bebop::LEN_SIZE;
    i += ::bebop::LEN_SIZE;

    if raw.len() < len {
        return Err(::bebop::DeserializeError::MoreDataExpected(len - raw.len()));
    }

    let mut _title = None;
    let mut _year = None;
    let mut _performers = None;

    while i < len {
        let di = raw[i];

        i += 1;
        match di {
            0 => {
                break;
            }
            1 => {
                let (read, value) = <&'raw str>::_deserialize_chained(&raw[i..])?;
                i += read;
                _title = Some(value)
            }
            // ...
            _ => {
                i = len;
                break;
            }
        }
    }

    Ok((i, Self { title: _title, year: _year, performers: _performers, }))
}

Unions

Unions were originally added to Bebop with Rust-like enums in mind, which made them very easy to implement. The generated code just inlines the sub-message and sub-struct definitions into the enum and then uses matching to figure out how to respond on serialization and deserialization.

It really is that straightforward. Thanks, Rust!
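For a flavor of what that generated code looks like (union name, variants, and discriminator values invented for illustration):

```rust
// A Bebop union maps naturally onto a Rust enum with the sub-record
// definitions inlined as variant payloads.
pub enum Shape {
    Circle { radius: u32 },
    Square { side: u32 },
}

// Serialization just matches on the variant to pick the discriminator
// byte, then chains into the payload (little-endian, simplified).
pub fn serialize_shape(s: &Shape, dest: &mut Vec<u8>) {
    match s {
        Shape::Circle { radius } => {
            dest.push(1); // discriminator
            dest.extend_from_slice(&radius.to_le_bytes());
        }
        Shape::Square { side } => {
            dest.push(2);
            dest.extend_from_slice(&side.to_le_bytes());
        }
    }
}
```

Deserialization is the mirror image: read the discriminator, then match on it to decode the right variant.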

Challenges with Types

Naming types in Rust is hard. If you compare the TypeName function in Rust against the other code generators, the Rust implementation is over twice the length and requires additional helper functions. Some of this is self-induced complexity to support owned/borrowed/static types, because we were originally planning to create both a borrowed and owned structure for each type (we still might, after dogfooding). Another part of it is because we have a lot of lifetimes that have to get conditionally tacked on to support our goal of low-copy deserialization.

To make it worse, once you decide to introduce a lifetime parameter, it propagates: any struct or enum containing a field that needs a lifetime also needs a lifetime. This is not hard to understand, but it turns what in our heads should have been simple pattern matching (i.e. "what name should I use") into an AST-searching problem.

A brief example of one of the worst offenders is lists. Suppose you have a struct T which is not of a fixed size (its serialized byte count depends on its content). You then have:

  • Const: &'static [T; <len>] (the 'static is implied and not actually needed in context)
  • Owned: Vec<T>
  • Borrowed: &'raw [T]

But what if it is a list of strings?

  • Const: &'static [&'static str; <len>]
  • Owned: Vec<String>
  • Borrowed: Vec<&'raw str> (you can’t use &'raw [&'raw str] because strings are not fixed-size, so you need to own a list of pointers into the raw data)

Another complexity came up to support fixed-sized structs which we can more efficiently access using index offsets from their beginning. Most of this complexity is now in the SliceWrapper which we talk about later, but it does mean we have yet another check when naming the type to see not only whether it is owned or not, but also whether it is a fixed-size type or not.

Testing Code

During the implementation of Rust support in Bebop, we wrote a lot of tests.

  • Runtime unit tests
  • Generated code compilation testing using many schemas; the compiler is the test here
  • Generated code functionality tests using a couple specific schemas and hand-written tests
  • Integration tests to ensure the binary forms are all cross-compatible between languages

There are also some Rust benchmarks:

  • Runtime benchmarks (did not turn out to be very useful at least for now)
  • Benchmarking an example schema with sample data

Between the tests and benchmarks, we now have a framework for ensuring the code works, stays working, and does not slow down.

Optimizations

Our very first, ultra-alpha implementation was not the fastest or the safest (I may have made some poor memory alignment assumptions).

Borrowing Stuff

Okay, this is kinda straightforward at the surface and also kinda obvious, but bear with me for one moment.

We want to be the fastest kid on the block, right? So we have to keep our memory allocations down and stop wasting time with intermediary copies of bits. When you serialize, you fundamentally have to write a copy, but if you can avoid writing an intermediary one, that is amazing. The same goes for deserialization.

The most obvious targets are strings and arrays so that is what we do. Why clone when you can borrow?

Fixed-sized Types and Arrays

Building on the last point: strings are easy, and arrays of primitives are not bad if they are single-byte types. But we are smart cookies here, so we knew we could go one step better and, with some alignment finagling, recognize not only fixed-sized primitives like u32, but also struct A { x: u32, y: u32 } and even struct B { x: A, y: A }.

These types are trivial to copy when there is only one, and when it is an array, we can calculate an offset easily to find a given one. Also, we are given the size of an array at the beginning of the array’s serialized form, so there is no guessing or even a need to parse the entire array at deserialization time. We can just store a pointer and jump over the data until it is called on.
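The offset arithmetic is the whole trick. A minimal sketch (assuming the element type is one of the packed, fixed-size generated types, so its in-memory size equals its serialized size):

```rust
use core::mem::size_of;

// Locate the i-th element of a serialized array of fixed-size elements
// by pure arithmetic; nothing before it needs to be parsed.
pub fn element_bytes<T>(raw: &[u8], index: usize) -> &[u8] {
    let size = size_of::<T>();
    &raw[index * size..(index + 1) * size]
}
```

This is why lazily storing a pointer to the array region works: any element is reachable in O(1) later.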

Fixed-sized types are automatically recognized using AST parsing and are tagged with the #[repr(packed)] attribute and #[derive(Copy)], in addition to getting a FixedSized trait implementation. This means they have 1-byte alignment and can more easily be cast from bytes.

(For now the packed representation is good enough but support for this is deprecated by Rust going forward, so we must revisit this.)

There was a slight mistake early on by using unsafe raw byte array casting to a fixed-sized type which entirely ignored alignment and would break in certain conditions (kinda obvious in hindsight). Realizing this mistake set us up for a new core type capable of handling these complexities. Thus, the SliceWrapper was born.

pub enum SliceWrapper<'a, T: FixedSized> {
    Raw(&'a [u8]),
    Cooked(&'a [T]),
}

SliceWrapper allows us to abstract the handling of slices which can be either a raw buffer or a non-raw (“cooked”) slice of type T when the user is trying to prepare data for serialization. This helps us get around the alignment issues and provide a single interface for borrowed arrays.

Buried in the deserialization code is an alignment check that allows us to cast the slice type when certain conditions are met, significantly reducing the work that needs to be done in some cases. However, when alignment or endianness does not permit it, we fall back to deserializing on the fly using offsets.
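The idea behind that check can be sketched like so (this is an illustration of the technique, not the actual runtime function; it assumes T is one of the packed, padding-free generated types):

```rust
use core::mem::{align_of, size_of};
use core::slice;

// If the raw bytes happen to be suitably aligned for T (trivially true
// for packed, 1-byte-aligned types), the length divides evenly, and the
// platform is little-endian like the wire format, the slice can be
// reinterpreted in place. Otherwise the caller falls back to
// offset-based, on-the-fly deserialization.
pub fn try_cast_slice<T: Copy>(raw: &[u8]) -> Option<&[T]> {
    let size = size_of::<T>();
    if cfg!(target_endian = "little")
        && size > 0
        && raw.len() % size == 0
        && (raw.as_ptr() as usize) % align_of::<T>() == 0
    {
        // SAFETY: length, alignment, and endianness are checked above;
        // T is assumed to be Copy with no padding bytes.
        Some(unsafe { slice::from_raw_parts(raw.as_ptr() as *const T, raw.len() / size) })
    } else {
        None
    }
}
```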

Later on, we might extend this pattern for Owned/Borrowed pairs like Vec<T>/SliceWrapper<T> and String/&str to prevent the need to keep a copy of data around until the record is fully serialized.

Pre-calculating Serialized Size

After some of our early optimizations, we ran some benchmarks using Criterion and found the performance of deserialization was pretty good, but serialization was slower than basically everyone else. Obviously unacceptable.

After generating a flamegraph the problem became a little more obvious: a lot of time was being spent on allocations. This didn’t make a lot of sense at first because the allocation of the destination buffers was happening outside the benchmark itself.

Well, one of the interesting challenges of taking an &mut Write type is that Write does not allow you to backtrack. Once data is written, that is it. No cursor for you!

No problem you say, right?

That is, until you account for Unions and Messages including a size (in bytes) of the data that follows, data that we don’t want to process twice. (We include this size for forwards compatibility, check out the wire format for more details.)

The naive solution had been to simply chain serialize into a buffer and then find the size of that buffer and write it before writing the buffered data.

Let that sink in for a moment… To "minimize" work, we chose to allocate intermediary buffers (writing) rather than make a second pass over the same data (reading).

Hindsight is 20/20.

With a burning desire to at least beat the performance of JSON and a newly realized sense of self-fallibility, the idea to recursively calculate the size of any given SubRecord was born. Only Message and Union types need to call it; however, when those types are nested, the current implementation, which caches nothing, is not 100% optimal. It turns out that even with that caveat, the flamegraph showed it takes a negligible amount of time compared to what the allocs required. Surprise!
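The recursive size calculation can be sketched as follows (trait and types simplified and invented for illustration; the real runtime uses the serialized_size function on SubRecord):

```rust
// Each type reports its own serialized size, so a parent Message or
// Union can write its length prefix up front without buffering.
trait SerializedSize {
    fn serialized_size(&self) -> usize;
}

impl SerializedSize for u32 {
    fn serialized_size(&self) -> usize {
        4
    }
}

impl SerializedSize for &str {
    // Strings are written as a 4-byte length prefix plus the UTF-8 bytes.
    fn serialized_size(&self) -> usize {
        4 + self.len()
    }
}

struct Track<'a> {
    id: u32,
    name: &'a str,
}

// A composite type just sums its fields, recursing as deep as needed.
impl SerializedSize for Track<'_> {
    fn serialized_size(&self) -> usize {
        self.id.serialized_size() + self.name.serialized_size()
    }
}
```

Because this only reads the in-memory data, it trades a cheap second traversal for the allocation of an intermediary buffer.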

At this point, Bebop serialization does not allocate any additional buffers, and wow, you could say it made a difference.

Benchmarks

The following benchmarks are made using Criterion and a large sample document which you can find here.

A large document was chosen to include more structure variations and include many conditions. This is not the same document that was used for the original post and therefore the absolute values are not comparable (so don’t ask me why Rust looks slower than C# because that is not discernible from this information).

Performance was measured in three parts.

  • Prepare: The act of re-packaging the data into the transmissible structure for serialization. For serde types this only matters if you would not normally store the data in this structure. For Bebop and Protobuf this is a required step.
  • Encode: Serialization of the data into a pre-allocated byte buffer.
  • Decode: Parsing of a byte-buffer into the correct structure and ensuring it is valid data.

A comparison of Bebop, Bincode, JSON, MessagePack, and Protocol Buffer encoding and decoding speed in Rust

How do I incorporate this awesome serialization system?

Glad you asked.

For Rust, the best place to start is with the bebop and bebop-tools packages which have further documentation.

For other languages (and still good for Rust too :), check out the Bebop Readme which links to lots of good resources.