Thanks again @vmx.cx for the proposal! Chiming in here as well to say that it would be great to see this, although I understand that there’s a long backlog of protocol-related things that need to happen first.
For a couple of months now (except for when my code crashes) I’ve been posting astronomical transients from the NASA GCN onto the protocol, in a machine-readable form that science cases could hook into. This alert stream is really just a small test - in principle, astronomy has hundreds of thousands of daily alerts that we could post - but so far, I’m seeing low latency and ease of use that would make using atproto for data streaming very attractive to us in astronomy. Transients aren’t my main area of research, but I’ve reached out to a few people who do it full-time and want to collaborate on integrating atproto into more of our workflows.
The catch is that most of our data is floating point, such as data on this gravitational wave or this gamma-ray burst. After a lot of trying, the only consistently functional solution that I have right now is to stringify the entire external data, which is quite ugly - but since most fields are floats, or even arrays of floats, the alternative of coercing an external schema full of floats field by field to work with atproto becomes a massive headache.
Arrays of floats or mixed-type arrays are particularly painful right now on atproto, and deserve a ‘special’ mention. For instance - should you stringify the whole array? When an array only sometimes contains floats, do you always stringify it? How can you anticipate that? How can you tell a stringified float apart from a string that could be a number but isn’t? It quickly becomes a very big headache, centered around what is probably the most common type of data for us.
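To make that ambiguity concrete, here is a minimal Python sketch. The record and its field names (`event_id`, `snr`) are made up for illustration and are not taken from any real GCN schema, and `stringify_floats` / `naive_decode` are hypothetical helpers rather than part of any atproto library. It stringifies only the floats, then tries to recover them on the consumer side by guessing:

```python
def stringify_floats(value):
    """Recursively replace every float with its string form; leave everything else alone."""
    if isinstance(value, float):
        return repr(value)
    if isinstance(value, list):
        return [stringify_floats(v) for v in value]
    if isinstance(value, dict):
        return {k: stringify_floats(v) for k, v in value.items()}
    return value


def naive_decode(value):
    """Guess which strings used to be floats by trying to parse each one."""
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value
    if isinstance(value, list):
        return [naive_decode(v) for v in value]
    if isinstance(value, dict):
        return {k: naive_decode(v) for k, v in value.items()}
    return value


# Hypothetical record: a string ID that happens to look numeric,
# plus a mixed-type array (float, int, string).
original = {"event_id": "1234.5", "snr": [12.7, 8, "n/a"]}

encoded = stringify_floats(original)
decoded = naive_decode(encoded)

# The float in the array survives the round trip...
assert decoded["snr"] == [12.7, 8, "n/a"]
# ...but the string ID has silently turned into a float.
assert isinstance(decoded["event_id"], float)
assert decoded != original
```

Without out-of-band schema knowledge, the consumer has no way to tell that `"12.7"` was a float and `"1234.5"` was always a string.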
Now, it might then seem a bit like using atproto for streaming scientific data is a stretch, or a niche issue that isn’t worth solving. But I think that it is actually perfect for streaming data, and I could envision a world where all scientific metadata (or if it’s small/time-critical like our transients, the data itself) gets posted to atproto. It just has too many advantages and beats every single other solution I’m aware of:
- The extremely low overhead to post records as a producer makes it low-effort to adopt & run. Compare this to existing solutions (like everyone self-hosting their own Kafka stream for every single project), which require much more technical know-how and compute.
- Everything being public and a real-time stream is fantastic for data discoverability, be that among ourselves or for involving the public
- The low overhead to subscribe to events as a consumer (e.g. with Jetstream) makes it very easy to hook into event streams, and even many at the same time. (It still needs improvement, but nebra distills this into one easy-to-use Python function.)
So, motivating post aside, I think that having floats on the protocol somehow would go a long way towards capturing the scientific ecosystem. atproto is a lot harder to sell for that purpose right now because floats don’t have first-class support, and it’s an awkward part of my sales pitch because it just doesn’t make atproto feel like a serious fit for scientific data. I’d be in favour of vmx’s proposal, or at least of something like a dedicated string-float type that is recognizable and could easily be encoded/decoded, maybe even at the PDS level somehow. But honestly, if there’s a way to bake support for floats into the protocol on the most basic level, then it would be fantastic long-term for the scientific ecosystem, and probably for many other use cases too.
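As a rough illustration of what a "recognizable" string-float could look like, here is a minimal Python sketch. The `$float` tag, the `encode` / `decode` helpers, and the example record are all assumptions made up for this post; none of them are part of atproto or of vmx’s proposal. Each float is wrapped in a single-key object holding its string form, so a decoder never has to guess:

```python
import json

FLOAT_TAG = "$float"  # hypothetical tag name, not an existing atproto convention


def encode(value):
    """Wrap every float in a tagged object holding its string representation."""
    if isinstance(value, float):
        return {FLOAT_TAG: repr(value)}  # repr() round-trips Python floats exactly
    if isinstance(value, list):
        return [encode(v) for v in value]
    if isinstance(value, dict):
        return {k: encode(v) for k, v in value.items()}
    return value


def decode(value):
    """Turn tagged objects back into floats; leave all other values untouched."""
    if isinstance(value, dict):
        if set(value) == {FLOAT_TAG} and isinstance(value[FLOAT_TAG], str):
            return float(value[FLOAT_TAG])
        return {k: decode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [decode(v) for v in value]
    return value


# Hypothetical record mixing floats, ints, and numeric-looking strings.
record = {"event_id": "1234.5", "ra_deg": 197.45, "snr": [12.7, 8, "n/a"]}

wire = json.dumps(encode(record))       # contains no bare JSON floats
restored = decode(json.loads(wire))

assert restored == record               # strings stay strings, floats come back as floats
```

The trade-off is verbosity: every float becomes a small object, which adds up for large arrays, so this is only meant to show that an unambiguous encoding is possible, not that this particular shape is the right one.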
(P.S.: I live in Vienna - if it would be useful to have me at IETF, or even just to meet on the side to chat & coordinate, then I’d be happy to.)