I’ve spent quite some time on how floats could be added to the ATProto data model. Here’s the full write up: Floats on ATProto | vmx - the blllog.
Feel free to discuss it here or reply to my Bluesky post at @vmx.cx on Bluesky.
Hey! Thanks for the proposal. I appreciate the effort that went into this, and it’s clearly well thought through.
If my understanding is right, round-tripping requires either being schema-aware or using a custom JSON parser (for identifying “.” in numbers). I’m not saying those are unworkable, but they are a bit unfortunate. We’ll need to think through the impact/implications.
A couple meta points on process:
If just having something for floats is imperative, adding a string format for “float” is much lower lift & less of a spec change. Kinda ugly and probably makes no one happy, but sometimes that’s the sign of a good compromise.
The nice thing is that you don’t need a custom JSON parser. Usually all parsers do that already.
In which time frame? Surely the sooner the better, but there isn’t really much urgency.
I’m happy to start a discussion there. Though I guess it’s again a matter of timing. I don’t want to derail things, I’m sure there are working group things with higher priority.
I thought about that. In terms of “breaking things”, I don’t think it would be more or less harmful, so let’s just do the real thing. I’m happy to spend more time on it. I guess I still haven’t fully grasped where things could break.
I also want to chime in that it is great to see this all gone through with motivating examples, good write-ups, and working code.
I’m pretty interested in this, but want to verify a few things with experiments. In particular, which popular languages reliably read and write integers and floats differently from JSON? And maybe also produce some test vectors and worked example code to demonstrate corner-cases.
I’m overall pretty hesitant to rely on JSON text encoding including ‘.0’ or not to distinguish integers from floats throughout the numerical range. At the end of the day the JSON data model only has a single numerical type (the core issue here), and relying on common behaviors doesn’t feel very robust.
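As one data point for that kind of experiment, here is a minimal sketch of how Python's standard `json` module treats number tokens. It only covers Python; other languages may behave differently, and the helper name `number_kind` is made up for illustration.

```python
import json

def number_kind(token: str) -> str:
    """Return the Python type name that json.loads produces for a JSON number token."""
    return type(json.loads(token)).__name__

# Parsing: a fraction or an exponent yields a float, anything else an int.
for token in ["1", "1.0", "1e2", "-0", "-0.0"]:
    print(token, "->", number_kind(token))

# Serializing: small integral floats keep a ".0" suffix ...
print(json.dumps(1.0))    # 1.0
# ... but large ones switch to exponent notation without any "." in the text.
print(json.dumps(1e16))   # 1e+16
```

So even in a language that keeps the two types apart, "contains a `.`" is not quite the right test; a reader would have to look for a fraction or an exponent.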
I think it is actually relatively common for atproto implementations to need to work with un-typed / no-schema data and want to be able to pass it through multiple contexts and have it come back with the same CID. I think you have broad experience with this sort of thing, but just to lay out some of my thinking, you might want to persist record data in PostgreSQL as JSONB type (supporting rich queries), read it back later, and be able to confirm the CID version. You might get record data back from an HTTP API in JSON format, with a CID alongside it, and want to be able to confirm the CID matches the data. The ‘unknown’ Lexicon type allows stuffing arbitrary bytes in.
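To make the round-trip concern concrete, here is a rough Python sketch. It is not a real CID computation (that would hash DAG-CBOR); `fingerprint` is a stand-in that hashes a compact JSON encoding, and `collapse_integral_floats` simulates a hypothetical intermediate system with a single number type that drops the `.0`.

```python
import hashlib
import json

def fingerprint(record) -> str:
    """SHA-256 over a compact, key-sorted JSON encoding (a stand-in, not a real CID)."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def collapse_integral_floats(value):
    """Simulate a hypothetical store that turns integral floats into integers."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    if isinstance(value, dict):
        return {k: collapse_integral_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [collapse_integral_floats(v) for v in value]
    return value

original = {"magnitude": 2.0, "count": 2}
round_tripped = collapse_integral_floats(original)

# The values compare equal, yet the encoded bytes (and so the hash) differ.
print(original == round_tripped)                            # True
print(fingerprint(original) == fingerprint(round_tripped))  # False
```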
What I’d be thinking of is to have an unambiguous JSON encoding, and then CBOR can use the obvious/efficient typed encoding. The JSON encoding could look like the existing CID link and bytes special objects:
{
"$float": 123.456
}
(the float value itself could also be in string format, or even base64 or hex encoded)
This isn’t super fun/aesthetic in raw JSON, but for the subset of application code which is schema-aware, this could all be transparently handled, the same way bytestrings or CID links are today. Though maybe a “float” string format in Lexicon would also work this way (get converted to/from floats via codegen).
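A minimal Python sketch of how a schema-aware layer might hide such a wrapper from application code. The `$float` key follows the example above; it is a suggestion in this thread, not part of the atproto data model, and the function names are invented for illustration.

```python
import json

def wrap_floats(value):
    """Recursively replace Python floats with {"$float": value} objects."""
    if isinstance(value, float):
        return {"$float": value}
    if isinstance(value, dict):
        return {k: wrap_floats(v) for k, v in value.items()}
    if isinstance(value, list):
        return [wrap_floats(v) for v in value]
    return value

def unwrap_float(obj: dict):
    """object_hook for json.loads: turn {"$float": x} back into a Python float."""
    if set(obj) == {"$float"}:
        return float(obj["$float"])
    return obj

def encode(record) -> str:
    return json.dumps(wrap_floats(record), sort_keys=True)

def decode(text: str):
    return json.loads(text, object_hook=unwrap_float)

record = {"count": 2, "magnitude": 2.0, "samples": [1.5, 3]}
decoded = decode(encode(record))
print(type(decoded["count"]).__name__, type(decoded["magnitude"]).__name__)  # int float
```

With the wrapper, the type no longer depends on how a serializer prints the number: `decode('{"m": {"$float": 2}}')` still yields a float.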
In addition to the “is it a float or an integer” ambiguity, we’d need to settle on how special float values are handled, and get that tested and rolled out to the entire ecosystem. I think IPLD DAG-CBOR didn’t get too far into the weeds on this, but I’ve seen a couple different deterministic CBOR flavors floating around the IETF with different rules about this (e.g., are NaN and +Inf/-Inf allowed, with both signs or not, etc.). There is work happening there, and it should probably be possible to settle on a spec, but I think research/consensus would be needed. I personally have lingering anxiety knowing how complex hardware support for IEEE floats is, and would want to do some more research and experimentation to confirm that things like FPU rounding mode wouldn’t come into play. It would really suck if atproto didn’t work reliably or in an interoperable way on obscure CPU instruction sets or old computers or something like that. This might just be FUD, but I’ve been bitten by softfloat, weird libc/newlib ports, things like that.
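A small Python sketch of why these special values need an explicit rule. The `check_finite` policy (reject NaN and infinities) is only an assumption for illustration; nothing in this thread has settled on it.

```python
import json
import math
import struct

def float_bits(x: float) -> str:
    """Hex of the big-endian IEEE 754 binary64 encoding of x."""
    return struct.pack(">d", x).hex()

def check_finite(x: float) -> float:
    """Hypothetical policy: reject NaN and +/-Infinity before encoding."""
    if not math.isfinite(x):
        raise ValueError(f"non-finite float not allowed: {x!r}")
    return x

# 0.0 and -0.0 compare equal but have different bytes, so any
# content hash over the encoded record would differ between them.
print(0.0 == -0.0)        # True
print(float_bits(0.0))    # 0000000000000000
print(float_bits(-0.0))   # 8000000000000000

# Python's json module emits the non-standard token NaN by default;
# allow_nan=False makes it raise ValueError instead.
print(json.dumps(float("nan")))  # NaN
```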
Thanks again @vmx.cx for the proposal! Chiming in here as well to say that it would be great to see this, although I understand that there’s a long backlog of protocol-related things that need to happen first.
For a couple of months now (except for when my code crashes) I’ve been posting astronomical transients from the NASA GCN onto the protocol, in a machine-readable form that science cases could hook into. This alert stream is really just a small test - in principle, astronomy has hundreds of thousands of daily alerts that we could post - but so far, I’m seeing low latency and ease of use that would make using atproto for data streaming very attractive to us in astronomy. Transients aren’t my main area of research, but I’ve reached out to a few people who do it full-time and want to collaborate on integrating atproto into more of our workflows.
The catch is that most of our data is floating point, such as data on this gravitational wave or this gamma-ray burst. After a lot of trying, the only consistently functional solution that I have right now is to stringify the entire external data, which is quite ugly - but since most fields are floats, or even arrays of floats, coercing an external schema full of floats to work with atproto becomes a massive headache.
Arrays of floats or mixed type arrays are particularly painful right now on atproto, and deserve a ‘special’ mention. For instance - should you stringify the whole thing? When it contains floats only sometimes, do you stringify always? How can you anticipate that? How can you tell what a stringified float is vs. a string that could be a number but isn’t? It quickly becomes a very big headache, centered around what is probably the most common type of data for us.
Now, it might then seem a bit like using atproto for streaming scientific data is a stretch, or a niche issue that isn’t worth solving. But I think that it is actually perfect for streaming data, and I could envision a world where all scientific metadata (or if it’s small/time-critical like our transients, the data itself) gets posted to atproto. It just has too many advantages and beats every single other solution I’m aware of:
So, motivating post aside, I think that having floats on the protocol somehow would be a very useful goal to be able to capture the scientific ecosystem. atproto is a lot harder to sell for that purpose right now because floats don’t have first-class support, and it’s an awkward part of my sales pitch right now because it just doesn’t make atproto feel like a serious fit for scientific data. I’d be in favour of vmx’s proposal, or at least something - like a dedicated string-float type that is recognizable and could easily be encoded/decoded, maybe even at the PDS level somehow. But honestly, if there’s a way to bake support for floats into the protocol on the most basic level, then it would be fantastic long-term for being able to benefit the scientific ecosystem, and probably many other use cases too.
(P.S.: I live in Vienna - if it would be useful to have me at IETF or even just to meet at the side to chat & coordinate, then I’d be happy to)
Not derailing at all! There need to be experts and champions for different things in the protocol. Appreciate the time you’ve put in here.
I won’t be there but I’ve been trying to get someone to organize a community side event. eg pick a time during the week that doesn’t overlap with on-site IETF stuff, run some sessions during the day (floats seems like a great topic), maybe an evening meal.
@cypherhippie.bsky.social is also in Vienna and maybe has a room at the university that can be used?
I’ve been using scaled integers very successfully for both geo points (though I’d prefer to be using a “real” geo lexicon) and for recording sensor data. See “info” here, and the “observation” section in particular. This is a similar method to how protobuf stores floats, and avoids the “stringification” issue, which I looked at but dismissed as too messy and error-prone. So far, it’s worked very well.
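For readers unfamiliar with the approach, a minimal Python sketch of scaled integers. The scale factor of 10^7 (seven decimal places) and the function names are assumptions for illustration, not taken from the lexicon mentioned above.

```python
SCALE = 10_000_000  # assumed fixed scale: 7 decimal places

def to_scaled(value: float, scale: int = SCALE) -> int:
    """Store a decimal value as an integer number of 1/scale units."""
    return round(value * scale)

def from_scaled(stored: int, scale: int = SCALE) -> float:
    """Recover the approximate decimal value from its scaled integer."""
    return stored / scale

lat = to_scaled(48.2082)
print(lat)               # 482082000
print(from_scaled(lat))
```

The record itself then only contains integers, which the current data model already handles; the cost is that the scale has to be agreed on out of band (for example in the schema) and precision beyond it is lost.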