Jason Thorsness

github
github icon
linkedin
linkedin icon
twitter
twitter icon
hi
4Feb 26, 24

BSON — The Gritty Details

BSON References

You’ll find the best reference for BSON at bsonspec.org and in the implementations of BSON in the MongoDB drivers, such as Go, C, and C#. You might want to pull up bsonspec.org/spec.html to reference while reading this article.

SingleStore extended BSON to support top-level value types, as detailed in the engineering blog here.

Read on for a guided tour through the BSON format itself!

The Beginning and The End

BSON starts with a 4-byte little-endian int32 representing the length of the entire document and ends with a null byte.

Thus, the smallest valid BSON according to the original spec is 0500000000. This can be seen using SingleStore Playground.

Note that in this example and others, I am specifying BSON by providing Extended JSON V2 and then converting it to BSON. When used via a MongoDB client driver and through SingleStore Kai, the data starts and remains as BSON from client to storage and no conversion is required.

SELECT HEX('{}':>JSON:>BSON);
-- length 05000000
-- end    00

The largest valid document is left unspecified in the spec, but MongoDB does not accept BSON documents larger than 16 MiB.

Negative lengths aren’t valid. Why does the spec specify a signed int32? Perhaps to aid parsers by allowing them to use ‘-1’ as a sentinel in length-handling.

Because of the length prefix, BSON documents can be concatenated together in a stream which is the format used by the .bson files generated by mongodump and consumed by mongorestore.

Top-level value types using the SingleStore extension end in a type-code byte instead of zero, and use value-specific encoding. These can be as short as a single byte, for example for the ‘null’ type code.

SELECT HEX('null':>JSON:>BSON);
-- type 0A

The Middle

After that initial 4-byte length, there are zero or more elements. An element is a type code, followed by a null-terminated key, followed by a type-code-specific value encoding. For example,

SELECT HEX('{"a":0}':>JSON:>BSON);
-- length 0C000000
-- type   10
-- key    61 00
-- value  00000000
-- end    00

Note that the null-terminated key means BSON keys cannot contain a null byte — one of the few things valid in JSON but not possible in BSON.

There are 21 BSON type codes detailed below.

01 Double — Starting Out Strong

A typical IEEE-754 double, stored in normal double byte order. A pointer into the BSON buffer at the value position can be directly interpreted as a double.

SELECT HEX('2.0':>JSON:>BSON);
-- value 0000000000000040
-- type  01

02 String — Still Good Stuff

The BSON string type is an int32 length, followed by a utf-8 buffer, followed by a null byte. The length refers to the length of the buffer plus the null byte - unlike with documents, it isn’t inclusive. UTF-8 is a great choice that has stood up over time as the format has become ubiquitious.

SELECT HEX('"abc"':>JSON:>BSON);
-- length 04000000
-- utf-8  616263
-- null   00
-- type   02

03 Document — Recursive Goodness

The value following type code 3 is a BSON document, by the same spec as the top-level document. The document bytes, like all BSON value bytes, are not sensitive to their context, so they can be copied/moved without modification. This property is true of JSON as well and might seem obvious but it’s important to make modifying BSON efficient.

For example, note that the encoded bytes of {"z":null} are the same when they are a subdocument.

SELECT HEX('{"z":null}':>JSON:>BSON);
--                08000000 0A7A00 00
SELECT HEX('{"a":{"z":null}}':>JSON:>BSON);
-- 10000000036100 08000000 0A7A00 00 00

04 Array — Uh-oh

Array is where it gets a little weird.

To quote from the spec:

Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array [‘red’, ‘blue’] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.

With these constraints, storing the key names is redundant. For long arrays of integers, it can add significant storage overhead. Why was this choice made for BSON? I don’t know. But now the BSON ecosystem is stuck with it.

SELECT HEX('[true,false,false,true]':>JSON:>BSON);
-- length   15000000
-- 0:true   08300001
-- 1:false  08310000
-- 2:false  08320000
-- 3:true   08330001
-- end      00
-- type     04

0x30 is hex ‘0’, 0x31 is hex ‘1’, etc.

05 Binary — An Advantage Over JSON

Binary is a way of storing arbitrary bytes in BSON. In JSON this would typically need to be base64-encoded and stored as a string. Like a string, it starts with a length, followed by a subtype byte, followed by the bytes. With binary the length refers to the length of the buffer, not including the sub type. There are a few “well-known” subtypes known and handled specially by some drivers. Subtype 0 is the default and most common and probably the one you want to use.

SELECT HEX('{"$binary":{"base64": "AAAABBBBCCCC","subType": "0"}}':>JSON:>BSON);
-- length   09000000
-- subtype  00
-- buffer   000 000 041 041 082 082
-- type     05

06 Undefined — Deprecated

This is one of the multiple “deprecated” types. It’s still supported by drivers. Undefined is rejected for many purposes such as comparisons in most MongoDB versions. It shouldn’t be used.

SELECT HEX('{"$undefined": true}':>JSON:>BSON);
-- type 06

07 ObjectID — MongoDB’s Unique Identifier

This is MongoDB’s format for identifiers. Rather than use some kind of standard UUID, this is a unique 12-byte format:

  1. 4-byte timestamp, representing the seconds since the Unix epoch.
  2. 5-byte random value.
  3. 3-byte incrementing counter, initialized to a random value.

ObjectIDs are automatically added to documents as the value of the _id field if that field is not already present. The timestamp is big-endian so ObjectIDs are sortable with byte comparisons.

SELECT HEX('{"$oid":"AAAAAAAABBBBBBBBBBCCCCCC"}':>JSON:>BSON);
-- timestamp AAAAAAAA
-- random    BBBBBBBBBB
-- counter   CCCCCC
-- type      07

08 Boolean — True or False

A simple type, with a byte 0 for false and 1 for true.

SELECT HEX('true':>JSON:>BSON);
-- bool 01
-- type 08

09 UTC DateTime — Another Advantage

One of the major advantages of BSON over JSON is the native ability to unambiguously store dates. This is a 64-bit integer representing the number of milliseconds since the Unix Epoch.

SELECT HEX('{"$date":"1970-01-01T00:00:00.001Z"}':>JSON:>BSON);
-- date 0100000000000000
-- type 09

Note that the default JSON serialization of DateTime performed by MongoDB drivers has a problem. Typically, ISO 8601 dates are lexically sortable. However, this does not hold true when the milliseconds are sometimes omitted and sometimes not. The default serialization omits the milliseconds when they are zero, leading to incorrectly sorting dates if you sort the strings. This is corrected in SingleStore’s BSON:>JSON conversion.

10 Null — Value Null

The BSON Null type code requires no value bytes.

SELECT HEX('null':>JSON:>BSON);
-- type 0A

Note that this value null is distinct from the “undefined” type code, and distinct from a “missing” key.

11 Regex — Uncommon

This type is made of two null-terminated strings, the pattern and the options. It’s not commonly used (this is for storing, not using, regular expressions).

SELECT HEX('{"$regularExpression":{"pattern":"abc","options":"i"}}':>JSON:>BSON);
-- pattern 61626300
-- options 6900
-- type    0B

12 DBPointer — Deprecated

This deprecated type stores the name of another collection and an ObjectID. It should not be used. This is an unusual type I think from old days when MongoDB required _id to be an ObjectID. It might not be possible to even use this type via extended JSON.

13 JavaScript — Uncommon

This is for storing JavaScript code in the database. I don’t believe this type has much purpose in modern MongoDB. It’s stored like a string but with a different type code.

SELECT HEX('{"$code":"hi"}':>JSON:>BSON);
-- length  030000000D
-- code    686900
-- type    0D

14 Symbol — Deprecated

Symbol is another type that’s just like string with a different type code.

SELECT HEX('{"$symbol":"hi"}':>JSON:>BSON);
-- length  030000000D
-- code    686900
-- type    0E

15 JavaScript with Scope — Deprecated

JavaScript with Scope is a combined string and nested document. It’s deprecated and unused.

SELECT HEX('{"$code":"hi","$scope":{"a":1}}':>JSON:>BSON);
-- length       17000000
-- code length  03000000
-- code         686900
-- scope length 0C000000
-- scope 'a'    106100010000
-- scope end    00
-- end          00
-- type         0F

This type frustrates authors of BSON serializers (or at least, me) because it complicates the nesting. Beyond objects and arrays, now the serializer has to deal with a third nesting type. For a type that is long-deprecated, the ecosystem and libraries still pay a price.

16 32-bit Integer — Simple Int32

Not much to say about this one!

SELECT HEX('1':>JSON:>BSON);
-- value 01000000
-- type  10

17 Timestamp — Uncommon

SELECT HEX('{"$timestamp": {"t": 1, "i": 2}}':>JSON:>BSON);
-- incr 02000000
-- time 01000000
-- type 11

Timestamp is a 4-byte increment followed by a 4-byte time. It is not commonly used by MongoDB clients.

18 32-bit Integer — Simple Int64

Not much to say about this one either!

SELECT HEX('{"$numberLong":"1"}':>JSON:>BSON);
-- value 0100000000000000
-- type  12

19 Decimal128 — Exact Base-10

IEEE-754-2008 provides for a 16-byte base-10 decimal type. This helps avoid certain odd behaviors with base-2 floating-point types.

SELECT HEX('{"$numberDecimal":"100.00"}':>JSON:>BSON);
-- value 10270000000000000000000000003C30
-- type  13

One oddness of this type is in how it compares to doubles. The same numbers can sometimes not be represented in base-2 and base-10. 1 can, but for example 0.1 cannot.

// returns true
db.collection.aggregate({
  $addFields: {
    a: {
      $eq: [
        NumberDecimal("1"),
        1.0
      ]
    }
  }
})

// returns false
db.collection.aggregate({
  $addFields: {
    a: {
      $eq: [
        NumberDecimal("0.1"),
        0.1
      ]
    }
  }
})

The reason for this is that in IEEE-754 double, 0.1 is actually stored as roughly the following:

0.1000000000000000055511151231257827021181583404541015625

Try it here. The decimal128 representation can store it exactly.

This is a curious novelty for users, but behind the scenes it gets tricky for implementors. Since MongoDB allows all numbers to be comparable, the comparisons between numbers of different bases need to be exact and correct (as with the above).

This type can be useful for financial applications and other applications where exact decimal arithmetic is required, but it can be slower than the other number types for some applications.

FF MinKey — Least Possible

MinKey is a special value where all values other than itself are greater than it is. It can occasionally be useful in queries.

7F MaxKey — Greatest Possible

MaxKey is the same thing, but the other way around.

And That’s It!

Many other formats are more complicated than BSON. BSON’s simplicity is one of its strengths. Despite its rough edges, it has carried the MongoDB ecosystem for over a decade and will likely continue to do so for the next.

 Top