kstenerud The problem is that this breaks down once you try to use
SIMD instructions. I'd developed a similar kind of
approach to encoding integers (and ieee774 floats) a
couple of years ago (first byte encodes length and first
bit of data:
https://github.com/kstenerud/bonjson/blob/05b91f6fe7d6b071
86... ). It was very clever and used compiler intrinsics
to get the length in 1 instruction, so 2 instructions got
you the final value, with no branches.But testing proved
that when you move to SIMD instructions, ULEB128
(https://github.com/kstenerud/bonjson/blob/main/bonjson.md
#ty...) or sentinel values
(https://github.com/kstenerud/bonjson/blob/main/bonjson.md
#lo...) win every time because of the parallelization
opportunities.The true irony is that even SIMD text
parsing would outperform this! SIMD is that powerful.
|
> nine_k I think these are different use cases. If you talk
about SIMD, you talk about the CPU and efficient
processing of large numbers of integers. I think that
when a solution like this crops up, it's about storage
or transmission, and dense packing at the cost of
non-uniformity. It's more like time-series databases
pack numbers by delta encoding.
|
> > kstenerud The thing is, most real-world numbers will fit
within 1-3 bytes (even at 7 bits per byte), so
ultradense packing doesn't actually buy much
outside of benchmarks.I spent WAYYYYYYYY too much
time exploring this...
|
> > > nwmcsween This is like string functions, there are some
variants with just crazy SIMD when the mean
string length is ~14-20 bytes
|
> > hansvm I dunno. Varints in the wild tend to be misused,
and there are external proto schemas at work we
have to integrate with which would literally be
both faster and smaller as gzipped json. They're
misused because they have an API encouraging
misuse -- compressing scalars rather than
sequences. Varints are used because they can have
reasonable developer ergonomics while sometimes
improving computer metrics a twidge.On top of
that, for the vast majority of performance/cost
parameter spaces, you're better off both in
developer ergonomics and speed/space slapping zstd
across a flatter binary format, supposing no
better tool fits your use case better. Especially
if your messages aren't exceptionally tiny. You're
not using them in a raw DB or doing raw bulk
analysis on varints (else basically zero choices
of parameters make varints win out), so you're
transferring them somewhere and decoding them.
That decoding step, even for highly optimized
solutions like bijou64, is on par with (slightly
better than, if you have an older datacenter link)
your raw network. If you spend 1s on networking,
you spend 1s on parsing. That's a bad tradeoff
almost always, and that assumes a good varint
solution.Even when varints make sense for some set
of perf/cost parameters, it's still only for
developer ergonomics 99.9999% of the time. Even
simple changes like operating on a sequence of
values rather than a single scalar enable vastly
better CPU/space tradeoffs, and being willing to
craft a proper data layout usually offers huge
gains on top of that.It's interesting that you
pick delta encoding (or, its natural extension,
double-delta encoding often being valuable) for
time-series databases as an example. That's an
obvious case where you have a solution which is
extremely cheap in storage/network/CPU. Varints
suck comparatively, almost always.Not to rip on
them too much, especially since it's nice to have
primitives available which let you not have to do
hard thinking for literally every problem, but
they're not amazing and not a great default.
|
> > the-lazy-guy Stored and transmitted data also has to be
decoded. With modern datacenter hardware
bottleneck is often CPU rather than network or
disk (SSD). It depends on specific properties of
the data. (I used to work on search index
implementation which is about decoding and
intersecting large amounts of hit-lists; and right
SIMD-friendly varint encoding is obviously
crucial)
|
> jaen This doesn't seem particularly hard to SIMD,
especially when the CPU architecture has
"compress/expand" horizontal instructions. The first
byte fully encodes the length, which is not harder
than the continuation bits of (U)LEB128. It's a
basically a common length-prefixed encoding with an
extra subtract added in, so someone has probably
figured out an efficient algorithm.It might be
slightly more instructions than some other serial VL
(variable-length) integer codec choices, but overall I
don't think it's more difficult.The very efficient
SIMD VL codecs tend to stripe (separate) the control
and data bits, so they're in a different design space
anyway.
|
> > kstenerud It can't be done, because the next bytes are
dependent upon the first byte (which only works in
limited circumstances, and where you have constant
spacing between the values).ULEB128 works in SIMD
because there's only one dependent bit per byte,
so you can speculatively decode and then correct
later cheaply. Bijou requires you to check the
first byte and then branch based on the value
using all 8 bits in the decision matrix (to handle
branches 0-247, 248, 249, 250, 251, 252, 253, 254,
255). This absolutely DESTROYS any parallelization
opportunities.Not to mention that non-canonical
sized ints (3, 5, 6, 7) have abysmal performance
compared to unaligned 2, 4, and 8 byte reads on
modern processors.
|
> > > jaen Right, I think we have a slightly different
definition of SIMD: You mean byte-parallel, I
mean "doable with SIMD instructions". I also
didn't imply the performance would be better
than other methods...Even though decoding the
lengths must be serial (since's there's no
unambiguous way to differentiate a tag and
data byte), it's still doable within the wider
SIMD registers, so there's some theoretical
efficiency gain to be had (depending on the
shape of the data).On a general note, the
continuation bit and prefix byte forms are
equivalent, you just broadcast the prefix byte
and compare against an increasing vector to
convert it to a mask. Yeah, there's probably
more fiddly SIMD if there are multiple
prefixes in the register, but doable (it's
just not byte-parallel, you eg. unroll the
serial decode loop 8 times or whatever your
maximum output byte width is, and mask
out).Simplified: // Just maps a byte to its
position in the register
__m128i idx =
_mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,
14,15);
// Broadcast the prefix
__m128i nn =
_mm_set1_epi8((char)prefix_byte);
// Get applicable locations: prefix_byte
contains the length, if byte_pos < len, the
corresponding byte will be set
__m128i m = _mm_cmpgt_epi8(nn, idx);
// If you *really* want a high-bit mask:
m = _mm_and_si128(m,
_mm_set1_epi8((char)0x80));
|
> > > > kstenerud Yeah, sorry, I didn't say that very well.
Single value decoding of Bijou values is
of course trivial in SIMD, but the
performance benefits of SIMD come from
deterministic boundaries across a window.
ULEB128's continuation bit is fixed
position, so it's data independent. One
pmovmskb gives you every boundary in the
window.Interleaved Bijou has no such
signal (tag and payload bytes both span
0x00-0xFF), so finding the boundaries is a
dependent per-value walk with no
opportunities for parallelism.
|
> > > > > jaen There's still speculation though - if
eg. most values are of 1 or 2-byte
length, you can speculate that any
control-valued byte is actually
control. You can even do a
compensation pass to try to fix some
amount of mis-speculations, and then
bomb out if that fails.With that, it's
mostly byte-parallel (though
data-dependent as I mentioned).
|
> > > > > NooneAtAll3 would vector instruction be of any
help? (variable length simd)
|
> itishappy > The true irony is that even SIMD text parsing would
outperform this! SIMD is that powerful.Can you explain
this part a bit? I feel like intuitively (and
therefore probably incorrectly) these should have the
same difficulties.
|
i2talics Non-canonical encodings are actually quite useful for some
applications that need variable length integers. DWARF and
WASM both use LEB128.The problem is linking: a compiler
needs to emit code into independent translation units,
which contain "missing" references to symbols in other
translation units, without yet knowing where all the code
will end up in the final executable. Since we don't know
where the location of other code is yet, we don't know how
big the number representing that location is yet, which
means that we don't know how wide the variable length
encoding of that number will be. If the width changes
after linking, then we have to push around the surrounding
code to make space for the wider integer. Unfortunately,
this changes the location of all the surrounding code, so
we have to recompute all the references!The solution is to
always emit un-linked var ints in the widest possible
encoding (5 bytes for LEB128) that way when the references
are patched during linking, no code is moved around. All
integers can be converted to a non-canonical 5 byte form
that is "wasteful" but its a worthwhile tradeoff because
it solves this issue. Other integers that don't need to be
linked can be packed in a smaller var int form to save
space.
|
> __s I've often done same thing with encoding msgpack maps
while streaming in key/value pairs
|
> > i2talics Neat! It's a useful technique whenever you don't
know or want to defer knowing the size of an
integer until a later time, but need to allocate
space for it up front.I'm wary of introducing
these forced-canonical encodings by someone hyper
focused on "efficiency" and "security" without
reconsidering additional use cases.
|
stebalien I've used LEB128 (with canonicalisation) extensively
and... this looks so much nicer for most use-cases (length
prefixed, supports the full uint64 range without that
extra 10th byte).The downside is the encoding size. LEB128
quickly grows to 2 bytes, but stays at 2 bytes all the way
to 2^14. This is important if you're using these numbers
as tags/identifiers as we were in the multicodec [1]
project, or for network message lengths. bijou64 only
gives you 500 <= 2 byte numbers.[1]:
https://github.com/multiformats/multicodec
|
> Someone > I've used LEB128 (with canonicalisation) extensively
and... this looks so much nicer for most use-cases
(length prefixed, supports the full uint64 range
without that extra 10th byte)If you only want to
encode uint64 numbers LEB128 could easily be tweaked
to fit in 9 bytes in several ways:- using the offset
trick described in this article would remove
non-unique encodings (0x80 0x00 would encode 128)-
never allowing encodings longer than 9 bytes would
mean the MSB of any ninth byte would always be zero,
so you could reuse that, and store 8 bits in any ninth
byte, for a total of 7 bits in each of the first eight
bytes plus 8 in the ninth = 64Both tweaks would lose
LEB128's property that you can find where each number
starts from any byte in the stream, but the encoding
discussed here doesn't have that property either.
|
> b_fiive sup steb, this is expede's work!
|
> > stebalien Sup b5! I always look forward to new work by
expede (and n0).
|
wahern This reminded me of ISO 7816-4 BER-TLV encodings, which
uses the format defined in ISO/IEC 8825-1 (ASN.1 related
spec). Length integer values of 0-127 are encoded in 1
byte. If the high bit is set, then the first 7 bits tell
you the number of subsequent octets. So there's no
offsetting involved, making it slightly less compact, but
also dead simple.EDIT: BUT, BER-TLV does permit overlong
encodings. And I once found and reported a Yubikey 4 bug
related to this. My source code comment for the
workaround: -- The Yubikey 4 has an off-by-one bug which
-- declares tag length of 255 (for the 0x53 outer
-- tag of a certficate DO) when there are only 254
-- bytes remaining in the reply. The reply is
-- chained across two packets, but the off-by-one is
-- probably related to the over-long encoded length
-- (0x82 0x00 0xff instead of 0x81 0xff).
--
-- [snip packet captures]
--
-- Yubico's ykpiv_fetch_object function in ykpiv.c
-- (confirmed 1.4.3-1.5.0) contains a read (memmove)
-- overflow when the declared inner BER-TLV length
-- (of the 0x53 tag) is longer than what was
-- received over the wire. That makes Yubico's
-- library oblivious to the issue. Relatedly, the
-- set_length function has an off-by-one bug (length
-- < 0xff instead of length <= 0xff) which produces
-- an over-long encoded length. That doesn't by
-- itself explain why the Yubikey 4 transmits a
-- truncated logical reply unless the same code is
-- being used.
|
boricj I'm working on a C++ library at work that binds wire data
and application data through token and model layers, which
includes among other things a fair amount of
tokenizers/composers for various formats (JSON, CBOR,
BSON, CSV...).This looks neat, but if encoding/decoding
performance is important, payload size isn't and the
integer is bounded, I would just put a fixed-size integer
into the payload as-is.LEB128 (and JSON for that matter)
can encode integer values of arbitrary length. This
doesn't, which may or may not be important but it's
different.I'll admit that I do not do any cryptographic
work with my library and therefore canonical
representations aren't a huge concern in my use-cases. I
merely provide various configurable limits (max value
length, max depth, max items per collection) in an effort
to prevent infinitely long documents from hogging my
tokenizers indefinitely.
|
omoikane UTF-8 has the same issue ("overlong encoding") where
multiple representations are possible the same code point.
Someone proposed a similar tweak to remove the overlapping
ranges by adjusting the base offset for byte sequences
that are longer than 1. That was discussed
here:https://news.ycombinator.com/item?id=44456073 -
Corrected UTF-8 (2025-07-03, 54 comments)This "corrected
UTF-8" has other problems, but I thought it's interesting
how the shifted-offset idea carries over.
|
billpg I forget where I encountered it, but I've seen similar
encodings that eliminated the possibility of many possible
encodings for the same number by making the length part of
the value.Values 0-127 are a single byte, but if that
first byte has the continuation bit set, not only does
that indicate the next byte has 7 more bits to contribute,
it also moves the base up to the next window.10000000
00000000 is the only way to represent 128.10000000
10000000 00000000 is the only way to represent 16512.Does
this encoding have a name?
|
> ezwoodland According to Wikipedia, git does that:
https://en.wikipedia.org/wiki/Variable-length_quantity
#Remov...
|
> pta2002 I believe that's how the varint encoding used by
protobut works:
https://protobuf.dev/programming-guides/encoding/#vari
nts
|
> > chrismorgan > Drop continuation bits.Clearly not.
|
> > > pta2002 Indeed, I was misinterpreting the OP's
suggestion. Can't edit the comment anymore,
unfortunately.
|
> sparkie Bitcoin has a variable width encoding (`CompactSize`),
but it doesn't prevent overlong encodings - however
there are various canonicalization rules in the
Bitcoin protocol to require minimal encoding.
|
> > teo_zero UTF-8 notoriously doesn't prevent ambiguous
encoding by construction, but only prohibiting it
in the specs. It's known as overlong encoding.
It's up to the encoder/decoder to prevent,
correct, or reject it. This burden on the software
is exactly what TFA tries to eliminate with the
bijou64 format (unfortunately replacing it with
another burden: overflow check).
|
aDyslecticCrow Clever, but one thought crossed my mind;An adveserial
package can claim to have a 255 tagged integer but not
actually have any followup, tricking the payload parser
into an incorrect offset and reading straight off into
followup memory.It's a classic thing to check for when
dealing with variable length strings or binary, but it may
not cross the mind when it's hiding in the
Bijou64_decode(*buff, *cr) function.
|
> gregschlom You have the same issue with LEB128 though, right?
|
> > aDyslecticCrow LEB128 can only trick you by at most one byte,
(depending on the followup data). Bijou64 can
consistently trick you by 8 bytes.In a contrived
example of a pbuf {length:int,
payload:byte[1]}LEB128 can trick you into reading
the payload as part of the length, but then
hopefully trigger a code check against invalid
buffer read. (or one byte outside the struct if
the payload is also malicious)Binou64 can trick
you to read 7 bytes into other memory, before any
buffer size validation is done.It's then not
uncommon to log with a helpful; "buffer with
length: 26624894573377(7 bytes of stolen data) is
invalid", or just crash.It's to the point that
Bijou64_decode should perhaps take "end_adress" or
"max_read" to catch this kind of attack.(If you
dont validate a malicious pbuf, you're in for a
bad time regardless of integer format, but these
int formats add their own way to trigger a buffer
overrun despite a proper check.)
|
> > pixelesque Very likely, but isn't this post claiming that
bijou64 is safer than LEB128 for the situation of
adversarial varints?
|
conaclos This is pretty close to SQLite's varints [0][0]:
https://www.sqlite.org/src4/doc/1433690d7b/www/varint.wiki
|
> adzm I believe SQLite3 uses a somewhat different
implementation:> A variable-length integer or "varint"
is a static Huffman encoding of 64-bit twos-complement
integers that uses less space for small positive
values. A varint is between 1 and 9 bytes in length.
The varint consists of either zero or more bytes which
have the high-order bit set followed by a single byte
with the high-order bit clear, or nine bytes,
whichever is shorter. The lower seven bits of each of
the first eight bytes and all 8 bits of the ninth byte
are used to reconstruct the 64-bit twos-complement
integer. Varints are big-endian: bits taken from the
earlier byte of the varint are more significant than
bits taken from the later bytes.from
https://www.sqlite.org/fileformat2.html#varintThe one
you linked for SQLite4 (abandoned project) is probably
a better approach. I recall that the author has said
that SQLite3's varint implementation is regretful.
|
arkenflame I researched many different varint encodings for a
GraphQL-specific binary format (resulting in Argo:
https://github.com/msolomon/argo ). I ended up choosing
protobuf-style zig-zag varints, but I also found these
interesting:vu128:
https://john-millikin.com/vu128-efficient-variable-length-
in...metric/imperial varint:
https://dcreager.net/2021/03/a-better-varint/vectorizing
VByte: https://arxiv.org/abs/1503.07387
|
willtemperley Maybe someone can explain why an encoder would ever create
the padding bytes allowed in LEB128. I contributed the
parser for LEB128 in apple/swift-binary-parsing and I'm
still none the wiser. I'm genuinely mystified.
|
> scottlamb I can think of two reasons.The first is what they
describe here: as an attack. It's like why would
anyone ever overflow a buffer with shellcode.The
second is that they are implementing a spec that
requires appending a varint length-prefixed field to a
buffer but don't really care about the space
optimization, don't know the field's length when they
start appending it, and don't want to put the field
into a second, temporary buffer or slide it down into
place.
https://github.com/FFmpeg/FFmpeg/blob/468a743af1653a08
f47081... vs say my own code which does the slide:
https://github.com/scottlamb/retina/blob/6972ac4261ce7
bf5b58...
|
> > wahern Third is by accident, including by buggy code, and
the permissiveness means it's easier for bugs to
go unnoticed. See, e.g.,
https://news.ycombinator.com/item?id=48327115
|
> esrauch Let's say you are writing into a byte[] and have a
LEB128 length-prefix followed by a payload, but that
determining the length actually involves nontrivial
encoding work. For example, you have a UTF16 string
and want to write out a UTF8 string, you want to go
over the characters and write them out, but the UTF8
length is not known without doing all of that work.If
you can choose a fixed number of bytes for the length
prefix, you can skip that number, do the encoding and
find out the length, and then come back and fill in
the length-prefix after.But you actually don't know
how many bytes it will take without doing all of the
work to know the payload length (since larger payloads
take more bytes to represent the length).If you allow
overlong representation you can reserve a few bytes
and sometimes it'll just be the effective no-op bytes.
If you don't, you won't be able to.
|
> > willtemperley Thank you for solving that mystery!
|
> cornstalks It allows you to fill in padding in a buffer. For
example, all data in a buffer will be interpreted by a
downstream system, and someone pre-calculated the size
of that buffer. Rather than encode everything twice
(once to figure out the exact size needed, and a
second time to actually populate the buffer) the
buffer size was calculated using foreknowledge of how
many values would be written to that buffer and then
just pessimistically assuming all of them are max-size
so writing will never fail. Another situation is when
you're rewriting part of an already-encoded file. If
you want to change a bit of payload then using padding
bytes gives you more flexibility so you can do that
without having to do any memcpy into a new buffer.It's
uncommon but I've definitely seen it done (with media
containers like Matroska, not actually LEB128) in
extremely high-throughput systems that can't spare any
cycles.
|
> axod Maybe you want to byte align some data, or pack to a
certain size but keep compat. I think they're going to
be rare cases, but I can see it being used.
|
> layer8 The issue is that non-unique encodings are an attack
vector, because parsers may in practice behave
differently for noncanonical (or nominally invalid)
encodings.
|
> > kbolino For example, you have an envelope format that
goes: length prefix in LEB128, message, signature.
One party controls the length prefix and
signature, a different party controls the message.
The message-writing party carefully crafts the
message so that, in isolation, it appears
innocuous, but when wrapped in the envelope, the
first few bytes of the message look like
continuation of the length prefix. Best case is
the receiving party safely fails to parse the
message, worst case is the receiving party
successfully parses the message, verifies its
digital signature, and interprets it differently
than the signer did.
|
> boricj Laziness probably. Maybe there's an argument if you
want to avoid branches and just blast the integer out
in a fixed number of statements/instructions/bytes,
but that sounds a bit fringe.I happen to be guilty of
a variant of this, where I don't bother emitting a
16-bit floating point number instead of a 32-bit one
in my CBOR encoder even if it can be represented
exactly. That one is laziness.
|
> i2talics It's useful whenever you don't know the value of an
integer but would like to allocate space for it now,
and then fill in the value later. Many have mentioned
length-prefixed data, which is a good example. Another
use case is static linking. I believe LLVM uses this
when generating WASM object files.
|
> > willtemperley I think this is probably the real reason such
encodings are considered valid. The webassembly
spec is explicit about allowing valid over-wide
encodings:https://webassembly.github.io/spec/core/
binary/values.htmlMaybe a robust parser would
benefit from a strict mode, disallowing over-wide
encoding.
|
> Chaosvex You wouldn't. It's a strange argument that can be
countered with, "maybe don't do that?"
|
> > willtemperley So why does the spec allow it? Like a good
engineer I read the spec and tested against the
over-wide example encodings given.
|
> > > Chaosvex Because it's not a real standard and there is
no blessed RFC for it. The DWARF spec is as
close as you'll get and it says, "The integer
zero is a special case, consisting of a single
zero byte." So in a way, it doesn't.Either
way, a properly written decoder (and it's like
ten lines) should really not have any problems
with it. I was agreeing with you.Edit: to
clarify, I was talking about the author's
argument being strange, not yours.
|
> > > > willtemperley The WASM spec is more explicit about
over-long LEB128 encoding.Edit: a properly
written decoder is a lot more than 10
lines if you properly deal with integer
overflow and both signed and unsigned
ints.
|
nine_k In short: instead of a truly indefinite-length solution
with a signal bit on the current byte saying whether to
check the next byte, this uses a counter. Values 0x0 to
0xF7 are one-byte integers, 0xF8 to 0xFF use the upper 5
bits as a counter for the number of subsequent bytes. This
limits the maximum magnitude to slightly less than 2 ^ 264
(almost all 33-byte values), which seems to be okay for
practical computations. The proposed standard limits the
supported size to u64 though.The upsides: the size of the
integer is apparent upon reading the first byte, and every
number has exactly one canonical representation. I wish C
strings had been standardized around something similar,
instead on null termination.> ...adversarial input, which
is rarely in the test suite.This made my scratch my head.
My tests for quite pedestrian APIs often contain
adversarial input of obvious shapes. I though that for
anything security-related (like the author's project)
testing against adversarial input would be be a prominent
part.
|
> onlyrealcuzzo > I though that for anything security-related (like
the author's project) testing against adversarial
input would be be a prominent part.They might have a
different definition of adversarial than you.> My
tests for quite pedestrian APIs often contain
adversarial input of obvious shapes.This doesn't seem
like what I would call adversarial.This seems like
standard negative testing or boundary value analysis -
which I would be shocked if they didn't do.
|
harrisi I'm surprised there's no mention in the post or here about
SQLite's varint encoding. Not that it would necessarily
satisfy the constraints, but it's one of the most used
varint implementations.
|
> dgllghr It's not terribly fast. It's faster than LEB128 but
not as fast as vu128 (at least according to
https://github.com/Jiboo/varint_benchmark)
|
> > harrisi The post says the purpose of exploring this space
(which is a fun one) wasn't speed, but
representation. The speed gain was an added
value.I'm not saying SQLite's varint
implementation is ideal for every application.
It's just an implementation that is one of the
most used implementations, if not the most (I'd
bet it is by a large margin though). It just
seemed like a missed opportunity to compare it
with the implementation they landed on.EDIT: Just
wanted to add, thanks for sharing that link.
Interesting!
|
TZubiri >The check is forgotten, optimised away, or never ported.
The protocol's security property silently degrades. This
is the bug class bijou64 is designed to make impossible.
Not by adding more checks, but by removing the one that
mattered - and making the format such that, with no
canonicality check at all, the only encoding that exists
for any given value is the canonical oneHere's two passing
tests for software craftmanship:1) it looks decades into
the past
2) it looks decades into the future
|
michaelmure One nice upside of having a single way to encode a value
is fuzzing: when you work on an encoder/decoder, you can
use a fuzzer and do round-trip comparison until you find
crashes or inputs/outputs that don't match (and therefore
issues in the code). But with LEB128 for example, the
fuzzer quickly learn about those alternatives encoding and
there is not much you can do from there.
|
dzaima Kinda surprised that there's no discussion on that this
basically just does not solve the non-canonicality
problem.Forgetting to do the range check on the
first_byte==255 case and just letting it do 64-bit
wraparound is exactly as much of a plausible bug as
missing range checks on LEB128. Any test suite with the
goal of covering canonicality will trivially cover both
properly; and a programmer that implements things by
reading 7 words into the spec, saying "oh yeah I got this"
and goes to implement what seems simple, will write a
broken version of both.Perhaps the biggest benefit is just
not being associated with a format that tolerates
non-canonicality in other places (though, if bijou64 gains
traction, it'll only be a matter of time for
wraparound-check-less versions to start appearing in
places where the wraparound is fine); and I guess also it
being less annoying to implement the canonicality check,
though hopefully people writing security-sensitive
software aren't ones to skip out on correctness checks due
to annoyingness.In a sense, bijou64 could perhaps even be
more problematic - it invites not doing any range checks
for the smaller inputs because they obviously don't need
it, and so you can just forget to special-case the max
length case; whereas LEB128 makes you already care about
it at the first point it is actually LEB128.(of course,
the format does still have other benefits; enforced
canonicality is just...not one of them)
|
> restalis Your range checking requirement is just one of many
things that may or may not be necessary to someone
using this. Taking them all into account (besides the
fact that may be unfeasible, if there happen to be
some conflicting requirements) would render the
solution to be unnecessarily complex. It's better to
focus on the minimal set of viable requirements and
thus have a base design as simple as possible. For
additional requirements, just go and complicate your
design (and be the only one having to pay the costs
that come out of that complication), hopefully only
when and only for as long it makes sense to do so. For
the need you mention, some kind of container wrapper
may do, with the amount of number's words specified in
it. A good thing is that you'd be able to limit the
use of such container-wrapped numbers only to some
situations (like the exchange of data to and from
unsanitized areas).
|
juancn I like the denormalization of VLE ints (with or without
zig-zag encoding of negatives), it helps support out of
band information, such as nulls and other signals in
serialization protocols with minimal overhead.For example
you can use a denormalized zero to signal null.You can
still define a canonical encoding where denormalizations
have specific meaning or signal an error.
|
HansHamster It feels a bit unfair to say that it is faster by being
able to tell the total length from the first byte and
capping it at 64 bit, while some of the other formats can
store arbitrarily large integers.
I guess you could use another variable length encoding for
the prefix at the cost of some performance and using even
more space...
|
> petermcneeley esp when the number is capped at only 64 bits which is
quite small for some bigInt style numbers.
|
yread Wouldn't something like this also work:
https://en.wikipedia.org/wiki/Elias_omega_coding
I've used to great effect for compression
|
cantalopes I love the random hyperlink underlines on that page
|
> spiralganglion Credit to Roman Komarov who came up with the approach
[1], and Todd Matthews [2] who made the art assets.1:
https://kizu.dev/svg-linked-parameters-workaround/
2: https://www.seaofclouds.com
|
MarkusQ > This causes problems for signed dataGiven that the
context up to this point had been representation of
integers, I initially trip on this. :)
|
apitman This reminds me of the varint encoding used by QUIC, but
I've never implemented it. Anyone know the differences?
|
> raphlinus Similar, in that it encodes length in the first byte.
The differences are:* It does not require
canonicality, it allows multiple encodings of the same
value. To make things even more fun, the QUIC spec
requires shortest encoding in some uses but not
others.* It uses 2 bits rather than cutting out a
range.* It only encodes values up to 62 bits long.So,
some similarities but also some differences.[1]:
https://www.rfc-editor.org/rfc/rfc9000.html#name-varia
ble-le...
|
amluto Just a quick reminder:> This causes problems for signed
data if you ever want to do things like compression since
you need to know the exact bytes that were signed.If you
are verifying a signature by taking some logical data
structure, turning it into a byte string, and calling the
verification primitive on those bytes, you likely have a
design error. You should instead collect bytes, verify the
signature, and then parse the bytes after verifying the
signature. And remember to include enough context in those
bytes so a different message signed for a different
purpose by the same key doesn't confuse you.
|
Aardwolf > The payload is a single contiguous big-endian integerWhy
not little endian like modern CPUs?
|
dekdrop How do I get to this page from the home page?
|
jdougan Reminds me of the Xanadu Humber encoding.
|
RedShift1 This seems quite convoluted just to avoid the "0 can be
represented in more than one way" problem.
|
> bjoli Having all numbers be valid in only one way is a great
idea. So much that I believe webassembly enforced
canonical leb128, at the cost of decoding speed.And
say you have it as part of some other data. If you
want to be able to hash it by the raw memory bytes,
many different ways to represent a number becomes a
problem.
|
> matja > canonicality matters - for signatures,
content-addressing, or any kind of "two
implementations must agree on the bytes" propertyIf
you don't do this properly, you end up with things
like:
- SAML XSW attack due to XML signature wrapping
- ASN.1 BER/DER signature forgery
- Bitcoin transaction malleability attacks
|
> nine_k It allows finding out the length (and allocating
memory) after reading the first byte.
|
> ape4 Comparing a number to zero is something that's done a
lot
|
> > Chaosvex True but also not particularly relevant?
|