I was curious whether MongoDB compression can store IDs efficiently when they are represented as strings instead of in a more compact binary form. So I made a benchmark and measured the compression performance of the three available compressors: zlib, snappy, and zstd.
Results for MongoDB 4.4.1 when storing 100000 random 128-bit values (e.g., UUIDs) either as binary (16 bytes) or as base-58 encoded strings (up to 22 characters); all sizes are in bytes:
| | Binary none | String none | Binary snappy | String snappy | Binary zlib | String zlib | Binary zstd | String zstd |
|---|---|---|---|---|---|---|---|---|
| size | 3100000 | 3697229 | 3100000 | 3697196 | 3100000 | 3697150 | 3100000 | 3697196 |
| count | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 | 100000 |
| avgObjSize | 31 | 36 | 31 | 36 | 31 | 36 | 31 | 36 |
| storageSize | 3645440 | 4243456 | 2404352 | 3022848 | 2142208 | 2203648 | 1892352 | 2080768 |
| totalIndexSize | 2523136 | 3325952 | 2519040 | 3330048 | 2519040 | 3334144 | 2531328 | 3330048 |
| totalSize | 6168576 | 7569408 | 4923392 | 6352896 | 4661248 | 5537792 | 4423680 | 5410816 |
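For reference, here is a minimal sketch of how the two representations compared above can be produced from the same 128-bit value. The alphabet is the Bitcoin-style base-58 alphabet, which is an assumption (the benchmark does not say which alphabet it used), and `base58_encode` / `base58_decode` are hypothetical helper names, not part of any MongoDB driver.

```python
import os

# Bitcoin-style base-58 alphabet (assumption: the benchmark may use another one).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(raw: bytes) -> str:
    """Encode bytes as base-58 by treating them as one big-endian integer."""
    n = int.from_bytes(raw, "big")
    digits = []
    while n:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    # An all-zero input has no digits; represent it with the zero symbol.
    return "".join(reversed(digits)) or ALPHABET[0]


def base58_decode(text: str, length: int = 16) -> bytes:
    """Decode a base-58 string back into a fixed-length byte string."""
    n = 0
    for ch in text:
        n = n * 58 + ALPHABET.index(ch)
    return n.to_bytes(length, "big")


value = os.urandom(16)            # binary form: always 16 bytes
text = base58_encode(value)       # string form: at most 22 characters
print(len(value), len(text), text)
```

Because 58^21 is smaller than 2^128, most random values need all 22 characters, and a small fraction fit in 21 or fewer.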
zstd compression looks really good. Moreover, it is clear that storing values as binary is more efficient than storing them as strings, even with compression, because the compressors also shrink the binary representation even though the values are random. The smallest compressed string total size (zstd, 5410816 B) is still larger than the largest compressed binary total size (snappy, 4923392 B). Do note, though, that the zlib-compressed string (5537792 B) and the zstd-compressed string (5410816 B) are smaller than uncompressed binary (6168576 B), meaning that those two algorithms can recover the storage lost to the string representation; the snappy-compressed string (6352896 B) is not. But given that they compress binary values even further, it seems there is still room for improvement in those algorithms.
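The comparisons above can be re-derived from the `totalSize` row of the table. The sketch below only does arithmetic on those published numbers; the dictionary layout and variable names are mine.

```python
# totalSize in bytes, copied from the results table (MongoDB 4.4.1, 100000 documents).
total_size = {
    ("binary", "none"): 6168576,
    ("string", "none"): 7569408,
    ("binary", "snappy"): 4923392,
    ("string", "snappy"): 6352896,
    ("binary", "zlib"): 4661248,
    ("string", "zlib"): 5537792,
    ("binary", "zstd"): 4423680,
    ("string", "zstd"): 5410816,
}

# Use uncompressed binary as the reference point for every other configuration.
baseline = total_size[("binary", "none")]

# Print configurations from smallest to largest total size.
for (representation, compressor), size in sorted(total_size.items(), key=lambda kv: kv[1]):
    ratio = size / baseline
    print(f"{representation:6} {compressor:6} {size:>8} B  {ratio:6.1%} of uncompressed binary")
```

Sorting by size makes the ordering in the paragraph easy to see: every compressed binary configuration comes before every string configuration.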
Note: Compression algorithms generally perform poorly on small inputs, and the objects here were very small (31 to 36 bytes on average). This means that the insights here cannot be generalized to larger amounts of binary (or string) data stored in MongoDB. (MongoDB does combine objects into blocks before compressing them, which alleviates this issue.)
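The small-input effect can be illustrated outside MongoDB with a rough sketch that uses Python's standard `zlib` module as a stand-in. This is an assumption-laden toy: it compresses bare values, not BSON documents, and it says nothing about how WiredTiger actually lays out or compresses its blocks. The helper names and the base-58 alphabet are my own choices.

```python
import random
import zlib

# Bitcoin-style base-58 alphabet (assumption, as the benchmark does not name one).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58(n: int) -> str:
    """Base-58 digits of a non-negative integer, without padding."""
    out = ""
    while n:
        n, rem = divmod(n, 58)
        out = ALPHABET[rem] + out
    return out or ALPHABET[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
values = [rng.getrandbits(128) for _ in range(1000)]
as_binary = [v.to_bytes(16, "big") for v in values]
as_string = [base58(v).encode("ascii") for v in values]


def one_by_one(items):
    """Total compressed size when every item is compressed separately."""
    return sum(len(zlib.compress(item)) for item in items)


def as_block(items):
    """Compressed size when all items are concatenated into one block first."""
    return len(zlib.compress(b"".join(items)))


for name, items in (("binary", as_binary), ("string", as_string)):
    raw = sum(len(item) for item in items)
    print(f"{name}: raw={raw} one_by_one={one_by_one(items)} block={as_block(items)}")
```

Compressing items one at a time makes every item larger, because each compressed stream carries its own header and checksum. Compressing the concatenated strings as one block gets below their raw size, since base-58 text uses only 58 of the 256 possible byte values.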