I’ve spent quite some time on how floats could be added to the ATProto data model. Here’s the full write-up: Floats on ATProto | vmx - the blllog.
Feel free to discuss it here or reply to my Bluesky post at @vmx.cx on Bluesky.
Hey! Thanks for the proposal. I appreciate the effort that went into this, and it’s clearly well thought through.
If my understanding is right, round-tripping requires either being schema-aware or using a custom JSON parser (for identifying “.” in numbers). I’m not saying those are unworkable, but they are a bit unfortunate. We’ll need to think through the impact/implications.
A couple meta points on process:
If just having something for floats is imperative, adding a string format for “float” is much lower lift & less of a spec change. Kinda ugly and probably makes no one happy, but sometimes that’s the sign of a good compromise.
The nice thing is that you don’t need a custom JSON parser. Most parsers already distinguish numbers with a “.” from those without.
In which time frame. Surely the sooner the better, but there isn’t really much urgency.
I’m happy to start a discussion there. Though I guess it’s again a matter of timing. I don’t want to derail things, I’m sure there are working group things with higher priority.
I thought about that. In terms of “breaking things”, I don’t think it would be more or less harmful, so let’s just do the real thing. I’m happy to spend more time on it. I guess I still haven’t fully grasped the impact on where things could break.
I also want to chime in that it is great to see this all gone through with motivating examples, good write-ups, and working code.
I’m pretty interested in this, but want to verify a few things with experiments. In particular, which popular languages reliably read and write integers and floats differently from JSON? And maybe also produce some test vectors and worked example code to demonstrate corner-cases.
I’m overall pretty hesitant to rely on JSON text encoding including ‘.0’ or not to distinguish integers from floats throughout the numerical range. At the end of the day the JSON data model only has a single numerical type (the core issue here), and relying on common behaviors doesn’t feel very robust.
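As a small experiment along these lines, here is a sketch of how one particular standard library behaves. It uses only Python's built-in `json` module; the `roundtrip` helper is a name made up for this example. It shows that Python keeps integers and floats apart when parsing, but that the textual marker it writes back is not always `.0`: for large magnitudes it switches to exponent notation. Other languages differ (JavaScript's built-in `JSON.stringify`, for instance, prints `1.0` as `1`), so this says nothing about the ecosystem as a whole.

```python
import json

def roundtrip(text: str) -> str:
    """Parse a JSON number and serialize it again (hypothetical helper)."""
    return json.dumps(json.loads(text))

# Python's json module returns int for plain digit runs and float
# for tokens that contain a fraction or an exponent.
print(type(json.loads("1")).__name__)    # int
print(type(json.loads("1.0")).__name__)  # float

# The float/int distinction survives a round trip in Python ...
print(roundtrip("1"))     # 1
print(roundtrip("1.0"))   # 1.0

# ... but the marker is an exponent, not ".0", for large values.
print(roundtrip("1e16"))  # 1e+16
```

A test-vector suite for the proposal would presumably need cases like the last one for every targeted language.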
I think it is actually relatively common for atproto implementations to need to work with un-typed / no-schema data and want to be able to pass it through multiple contexts and have it come back with the same CID. I think you have broad experience with this sort of thing, but just to lay out some of my thinking, you might want to persist record data in PostgreSQL as JSONB type (supporting rich queries), read it back later, and be able to confirm the CID version. You might get record data back from an HTTP API in JSON format, with a CID alongside it, and want to be able to confirm the CID matches the data. The ‘unknown’ Lexicon type allows stuffing arbitrary bytes in.
What I’d be kind of thinking of is to have an unambiguous JSON encoding, and then CBOR can use the obvious/efficient typed encoding. The JSON encoding could look like the existing CID link and bytes special objects:
{
"$float": 123.456
}
(the float value itself could also be in string format, or even base64 or hex encoded)
This isn’t super fun/aesthetic in raw JSON, but for the subset of application code which is schema-aware, this could all be transparently handled, the same way bytestrings or CID links are today. Though maybe a “float” string format in Lexicon would also work this way (get converted to/from floats via codegen).
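To make the idea concrete, here is a minimal sketch of how an SDK might wrap and unwrap such objects without any schema knowledge. The `$float` key comes from the example above; the function names `wrap_floats` and `unwrap_floats` are hypothetical and not part of any existing atproto library.

```python
import json

FLOAT_KEY = "$float"  # key from the example above; not a ratified spec

def wrap_floats(value):
    """Replace every Python float with a {"$float": x} object."""
    if isinstance(value, float):
        return {FLOAT_KEY: value}
    if isinstance(value, dict):
        return {k: wrap_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [wrap_floats(v) for v in value]
    return value

def unwrap_floats(value):
    """Turn {"$float": x} objects back into floats, leaving the rest as-is."""
    if isinstance(value, dict):
        if set(value) == {FLOAT_KEY}:
            # float() also covers serializers that print 1.0 as 1
            return float(value[FLOAT_KEY])
        return {k: unwrap_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [unwrap_floats(v) for v in value]
    return value

record = {"mag": 123.456, "count": 3, "samples": [1.0, 2]}
encoded = json.dumps(wrap_floats(record))
decoded = unwrap_floats(json.loads(encoded))
print(decoded == record)  # True
```

Because the type lives in the object shape rather than in the number's spelling, a value written as `{"$float": 1}` by a serializer that drops `.0` still decodes as a float.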
In addition to the “is it a float or an integer” ambiguity, we’d need to settle on how special float values are handled, and get that tested and rolled out to the entire ecosystem. I think IPLD DAG-CBOR didn’t get too far in the weeds on this, but I’ve seen a couple different deterministic CBOR flavors floating around the IETF with different rules about this (eg, are NaN and +Inf/-Inf allowed, with both signs or not, etc). There is work happening there, and it should probably be possible to settle on a spec, but I think research/consensus would be needed. I personally have lingering anxiety knowing how complex hardware support for IEEE floats is, and would want to do some more research and experimentation to confirm that things like FPU rounding mode wouldn’t come into play. It would really suck if atproto didn’t work reliably or in an interoperable way on obscure CPU instruction sets or old computers or something like that. This might just be FUD, but I’ve been bitten by softfloat, weird libc/newlib ports, things like that.
Thanks again @vmx.cx for the proposal! Chiming in here as well to say that it would be great to see this, although I understand that there’s a long backlog of protocol-related things that need to happen first.
For a couple of months now (except for when my code crashes) I’ve been posting astronomical transients from the NASA GCN onto the protocol, in a machine-readable form that science cases could hook into. This alert stream is really just a small test - in principle, astronomy has hundreds of thousands of daily alerts that we could post - but so far, I’m seeing low latency and ease of use that would make using atproto for data streaming very attractive to us in astronomy. Transients aren’t my main area of research, but I’ve reached out to a few people who do it full-time and want to collaborate on integrating atproto into more of our workflows.
The catch is that most of our data is floating point, such as data on this gravitational wave or this gamma-ray burst. After a lot of trying, the only consistently functional solution that I have right now is to stringify the entire external data, which is quite ugly - but since most fields are floats, or even arrays of floats, coercing an external schema full of floats to work with atproto becomes a massive headache.
Arrays of floats or mixed type arrays are particularly painful right now on atproto, and deserve a ‘special’ mention. For instance - should you stringify the whole thing? When it contains floats only sometimes, do you stringify always? How can you anticipate that? How can you tell what a stringified float is vs. a string that could be a number but isn’t? It quickly becomes a very big headache, centered around what is probably the most common type of data for us.
Now, it might then seem a bit like using atproto for streaming scientific data is a stretch, or a niche issue that isn’t worth solving. But I think that it is actually perfect for streaming data, and I could envision a world where all scientific metadata (or if it’s small/time-critical like our transients, the data itself) gets posted to the atproto. It just has too many advantages and beats every single other solution I’m aware of:
So, motivating post aside, I think that having floats on the protocol somehow would be a very useful goal to be able to capture the scientific ecosystem. atproto is a lot harder to sell for that purpose right now because floats don’t have first-class support, and it’s an awkward part of my sales pitch right now because it just doesn’t make atproto feel like a serious fit for scientific data. I’d be in favour of vmx’s proposal, or at least something - like a dedicated string-float type that is recognizable and could easily be encoded/decoded, maybe even at the PDS level somehow. But honestly, if there’s a way to bake support for floats into the protocol on the most basic level, then it would be fantastic long-term for being able to benefit the scientific ecosystem, and probably many other use cases too.
(P.S.: I live in Vienna - if it would be useful to have me at IETF or even just to meet at the side to chat & coordinate, then I’d be happy to)
Not derailing at all! There need to be experts and champions for different things in the protocol. Appreciate the time you’ve put in here.
I won’t be there but I’ve been trying to get someone to organize a community side event. eg pick a time during the week that doesn’t overlap with on-site IETF stuff, run some sessions during the day (floats seems like a great topic), maybe an evening meal.
@cypherhippie.bsky.social is also in Vienna and maybe has a room at the university that can be used?
I’ve been using scaled integers very successfully for both geo points (though I’d prefer to be using a “real” geo lexicon) and for recording sensor data. See “info” here, and the “observation” section in particular. This is a similar method to how protobuf stores floats, and avoids the “stringification” issue, which I looked at but dismissed as too messy and error-prone. So far, it’s worked very well.
I’ve worked with GeoJSON before and true floats are definitely a bit of a footgun there.
(They give you vastly more precision than needed, so without a custom serialiser it’s way too easy to blow up the file size with decimal digits for real-world administrative data.)
I think that the appropriate scaling factor is too app-specific to set in stone in a generic geo lexicon though, since it’s always a tradeoff with file size. A location scrobbler probably doesn’t need sub-metre precision, while for a geocaching app that can be appropriate and for mapping building outlines it’s basically a requirement.
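A rough sketch of the fixed-scale approach and the precision tradeoff discussed here. The helper names and the choice of decimal digits are assumptions made for illustration only; they are not taken from any existing lexicon.

```python
def to_scaled(value: float, digits: int) -> int:
    """Store a coordinate as an integer with `digits` decimal places (hypothetical helper)."""
    return round(value * 10**digits)

def from_scaled(stored: int, digits: int) -> float:
    """Recover the approximate coordinate from its scaled integer."""
    return stored / 10**digits

lat = 48.2082  # a latitude in Vienna, used only as sample input

# Four decimal places of a degree is roughly 10 m of latitude.
print(to_scaled(lat, 4))       # 482082
print(from_scaled(482082, 4))  # 48.2082

# Two decimal places is roughly 1 km: smaller numbers, coarser position.
print(to_scaled(lat, 2))       # 4821
print(from_scaled(4821, 2))    # 48.21
```

The number of digits has to be agreed out of band (in the lexicon or the app), which is why a single generic value is hard to pick.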
Fixed integer scale factors will work in some cases, but it becomes inefficient if you have a very big or small number. I guess an on-protocol float format that can do varying scale factors could have an integer part for the number and a scale factor, all defined in an object.
I would still rather have first-class floating point number support as this could get annoying to parse if you have many in one record, though it would work on-protocol right now. E.g., 7.536×10^(-55) could be:
{
  "$type": "community.lexicon.float",
  "num": 7536,
  "scale": -58
}
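For illustration, a minimal sketch of converting between this object and a decimal value using Python's standard `decimal` module. The `num` and `scale` fields and the `community.lexicon.float` type name come from the example above; the function names are hypothetical, and only finite values are handled.

```python
from decimal import Decimal

TYPE_NAME = "community.lexicon.float"  # from the example above; not a published lexicon

def decode_scaled(obj: dict) -> Decimal:
    """Interpret {"num": n, "scale": s} as n * 10**s, exactly."""
    return Decimal(obj["num"]).scaleb(obj["scale"])

def encode_scaled(text: str) -> dict:
    """Split a finite decimal literal into an integer and a power-of-ten scale."""
    sign, digits, exponent = Decimal(text).as_tuple()
    num = int("".join(str(d) for d in digits))
    return {"$type": TYPE_NAME, "num": -num if sign else num, "scale": exponent}

obj = encode_scaled("7.536e-55")
print(obj["num"], obj["scale"])   # 7536 -58
print(decode_scaled(obj))         # 7.536E-55
print(float(decode_scaled(obj)))  # 7.536e-55
```

Note that this is a decimal model: the object stores exactly what was written, and any rounding to an IEEE double only happens at the final `float()` call.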
I’m a bit skeptical of scaled or fractional integers. They are a different numerical model from IEEE floats, which I think is what most folks would be expecting here. They are popular in some fields (eg, representing monetary values), but those are usually distinct, and can be represented as simple integer arrays.
I’m curious to hear from proponents whether the { "$float": 1.05 } JSON encoding would be acceptable. If the goal is to have the JSON representation of record data match things like GeoJSON, it isn’t going to be one-to-one, a translation will be needed. But if the goal is to map in to programming language representations (structs, dataclasses, records, whatever), with SDKs handling the conversion, I think it works fine.
I think the string “format” option should also still be considered. That would not impact the data model or encodings at all, and be far less disruptive to implement.
Another place that data types are represented is query params in XRPC API endpoints. These basically get passed around as strings before being parsed in a lexicon-aware way, so I think the concerns for adding a float type there are simpler.
If we decide to go ahead, we’d probably want to coordinate with the DRISL folks. The current DRISL spec mentions floats a bit, though recommends minimizing their usage aggressively. They also say to simply not use NaNs etc. I think we’d need to come up with more documentation/norms around what that means: what is an encoding library supposed to do if it encounters a NaN or Inf value when encoding? When decoding? Throw an error and treat the entire record as invalid?
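One possible answer, sketched here as an assumption rather than as anything the DRISL spec states, is to fail loudly in both directions. The example uses Python's built-in `json` module only because it needs no third-party library; a CBOR codec would need an equivalent check. The function names are hypothetical.

```python
import json

def encode_strict(record) -> str:
    """Serialize, raising ValueError if any float is NaN or +/-Infinity."""
    return json.dumps(record, allow_nan=False)

def _reject_constant(name: str):
    # json.loads calls this hook for the tokens NaN, Infinity and -Infinity
    raise ValueError(f"non-finite float not allowed: {name}")

def decode_strict(text: str):
    """Parse, raising ValueError instead of accepting non-finite values."""
    return json.loads(text, parse_constant=_reject_constant)

def is_rejected(fn, arg) -> bool:
    """Small helper for the demo: True if fn(arg) raises ValueError."""
    try:
        fn(arg)
    except ValueError:
        return True
    return False

print(is_rejected(encode_strict, {"x": float("nan")}))  # True
print(is_rejected(encode_strict, {"x": float("inf")}))  # True
print(is_rejected(decode_strict, '{"x": NaN}'))         # True
print(decode_strict('{"x": 1.5}'))                      # {'x': 1.5}
```

Under this policy the whole record is treated as invalid, which is the strictest of the options raised above; whether that is the right norm is exactly the open question.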
The Hypha Coop got a grant to carefully implement the DASL/DRISL specification, and that included a test suite and compliance matrix. This includes float support. You can view it here:
Having thought on it a bit, I think the best venue for this discussion is probably the IETF working group. We do talk about the data model in the very first IETF document (the “repo” draft text). Making a change to the data model used for records is one of the lowest-level / most-core aspects of the protocol, and will need broad ecosystem and implementation buy-in, which is what the standards process is intended to enable. We are kicking off the working group in Vienna in about a month, and I expect the group to be working on the document this would fall under through the remainder of the year (at least). @emily.space and @vmx.cx, it would be great to have you participate if you are able, eg by emailing the list ahead of time and/or doing a short presentation to the WG introducing the topic. It is pretty easy to participate in the WG meetings remotely: you can apply for an automatic free waiver for the remote registration fee (and this is widely encouraged by IETF leadership). I think you could probably also be eligible for a fee waiver for in-person participation, or we (as an ecosystem) can try to find funding to cover that cost.
Something related we need to work through is consistent handling of invalid record data encoding. If this isn’t handled consistently through the ecosystem, it can result in both janky/inconsistent experiences, and abuse vectors. For example, if some implementations enforce the data model strictly (rejecting invalid records), but others do not, then some records will “show up” in some places but not others. If the record creates abusive/harassing content, and moderation tools reject the record, but some app clients don’t, it can become a moderation bypass hack. This isn’t specific to floats in the data model, but does motivate taking care and ensuring the data model is defined and implemented consistently.