Permissioned Data PDS Lexicons

I wanted to share a first pass of the PDS lexicons for the permissioned data protocol to get some feedback while the paint’s still wet!

You can check them out on my working branch: Comparing main...permissioned-data · bluesky-social/atproto · GitHub

They’re all under the com.atproto.space namespace: atproto/lexicons/com/atproto/space at permissioned-data · bluesky-social/atproto · GitHub

Confidence rating on Lexicons is probably like ~70-80%. Broad strokes, I think the shape is there. Though there’s probably still a couple dumb little mistakes in there or places where we’ll want to tweak the wording. And as we continue to use it to build a sample app, we’ll likely also find some functionality that’s missing.

Confidence rating on the code/impl is much lower. It might be useful for understanding the general shape, but don’t over-index on it.

I have a rough goal of having a proposal & alpha version on the PDS by end of June.

I created an alternative implementation of LtHash in Python as I read the specs, about 200 lines, and I’m releasing it under 0BSD (public-domain equivalent): tangled.org/strings/mfzx.net/3mn454avrbq22

To my knowledge, this implementation is entirely compatible with the TypeScript code:

  • both achieve commutativity (order of insertions/deletions does not affect the hash), additivity (a hash of {A, B} can be produced from adding the hash of {A} to the hash of {B} or vice versa), and subtractivity (a hash of {A} can be produced from subtracting the hash of {B} from the hash of {A, B})
  • both produce the same bytes for the empty set
  • both produce the same bytes for a given DID {b'did:plc:z72i7hdynmk6r22z27h6tvur'}
  • both produce the same bytes for the snapshot vector {b'atproto', b'space'} included in the branch
  • both have reliable round-trip conversion
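
For readers who have not seen a lattice-based set hash before, here is a minimal sketch of the properties listed above. It is not the linked implementation or the TypeScript code and is not byte-compatible with either. The lane layout (1024 lanes of 16 bits), the use of SHAKE-256 as the expansion function, and the name `LtHashSketch` are all assumptions made for illustration.

```python
import hashlib

LANES = 1024             # assumed: 1024 lanes of 16 bits each
STATE_BYTES = LANES * 2  # 2048-byte state
MOD = 1 << 16            # each lane wraps modulo 2^16


def _expand(element: bytes) -> list[int]:
    # Assumed expansion function: SHAKE-256 stretched to the full state size.
    raw = hashlib.shake_256(element).digest(STATE_BYTES)
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, STATE_BYTES, 2)]


class LtHashSketch:
    """Illustrative homomorphic set hash: lane-wise addition modulo 2^16."""

    def __init__(self, lanes=None):
        self.lanes = list(lanes) if lanes is not None else [0] * LANES

    def add(self, element: bytes) -> "LtHashSketch":
        for i, v in enumerate(_expand(element)):
            self.lanes[i] = (self.lanes[i] + v) % MOD
        return self

    def remove(self, element: bytes) -> "LtHashSketch":
        for i, v in enumerate(_expand(element)):
            self.lanes[i] = (self.lanes[i] - v) % MOD
        return self

    def combine(self, other: "LtHashSketch") -> "LtHashSketch":
        # Hash of the union of two disjoint sets is the lane-wise sum.
        return LtHashSketch((a + b) % MOD for a, b in zip(self.lanes, other.lanes))

    def to_bytes(self) -> bytes:
        return b"".join(v.to_bytes(2, "little") for v in self.lanes)

    @classmethod
    def from_bytes(cls, data: bytes) -> "LtHashSketch":
        if len(data) != STATE_BYTES:
            raise ValueError("unexpected state length")
        return cls(int.from_bytes(data[i:i + 2], "little") for i in range(0, STATE_BYTES, 2))
```

Because every operation is lane-wise modular addition or subtraction, insertion order cannot matter, and adding then removing the same element returns the state to where it started.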

If it proves to be robust and useful to anyone reading, I openly encourage you to copy it, use it, modify it, publish it elsewhere, and so on. :slight_smile:

If I’m reading this correctly all members get full read/write permissions to the space.

I think it would be worth expanding the possible roles a member can have. I can imagine a few use cases where I might want a non-public space with a list of members who only get read access and a subset who get write access as well. For example, a moderated group where new members start out as readers until they’re approved to contribute, or a community broadcast channel where a few moderators post announcements but most members are read-only. These are common enough patterns across other platforms and I’d love to see the Atmosphere be able to support them.

I also think, although there are fewer use cases, it’s worth offering write access without read access. For example, if analytics were implemented in the future, this would be a great way for AppViews to return view/interaction analytics to the creator privately. Once the data is in the creator’s own space, and because it remains in the atproto ecosystem, any app built on this analytics space could let the creator view these metrics.

What I find confusing is the space-uri format string. In most lexicons it seems to be used to identify only the space ( ats://// ), but the space-uri format allows the full URI, including the optional parts //. The lexicons don’t define what happens if you pass a full space URI as the space identifier in a call to, for example, com.atproto.space.putRecord while using different values for repo, collection and rkey.

I think there should be two distinct format strings defined: one for the space only and one for the space record?

Yes that’s right. Tho more advanced authorization semantics can be created on top of spaces by, for instance:

  • creating more than one space under the same authority using something like an Arbiter
  • encoding additional write rules in records that can then drive application logic/views

Instead of handling complex authorization rules in the protocol, we wanted to represent access using the simplest possible permission structure: a memberlist defining the access & sync perimeter. But applications can then build arbitrarily complex read/write access semantics by composing spaces & layering on additional application logic.
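
As a rough illustration of the second bullet above, here is a minimal sketch of a "broadcast channel" view layered on a flat member list. The protocol only answers "is this DID a member?"; the extra write rule (who counts as an announcer) is assumed to come from an application-defined record. The names `Post`, `broadcast_view`, and `announcers` are hypothetical and not part of any lexicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    author: str  # DID of the member who wrote the record
    text: str


def broadcast_view(
    members: set[str],      # the space's member list (access & sync perimeter)
    announcers: set[str],   # assumed: DIDs named in an app-level "roles" record
    viewer: str,
    posts: list[Post],
) -> list[Post]:
    """Return the posts an application would show to `viewer`."""
    if viewer not in members:
        return []  # outside the perimeter: nothing is visible
    # Every member can write at the protocol level, but the view only
    # surfaces records whose author also holds the app-level announcer role.
    return [p for p in posts if p.author in members and p.author in announcers]
```

In this sketch a read-only member can still technically write records into the space; the application simply does not render them, which is the trade-off of keeping roles out of the protocol.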

I’ll be getting into this more in my next Permissioned Data Diary!

Hmm that’s fair. Though URIs can always be partial. Even at-uris may just be at://did or at://did/collection.

If you were to submit a full (6-part) space URI to putRecord, I would expect the implementation to error. Lexicons can only encode so much validation in the schema itself. At some point the application has to step in and ensure the inputs actually make sense. Though I agree that this hurts the self-documenting nature of lexicons.

Could you explain why the commit signature exists?

My understanding is that the commits only prove that the repo owner (or space owner, for member lists) signed off on the ikm, which intentionally doesn’t prove anything about the contents of the repo (or member list). Anyone with this ikm and signature can craft commits for arbitrary space repos and member lists by that same repo owner (unless they change their verification key). The commits must be retrieved from a trusted source, because the commits themselves only authenticate useless data (the ikm). Why is this ikm authenticated at all?

In a public context (ATP), wide redistribution of content is possible because the commit structure provides authenticity and integrity. Any service that wants to fetch/receive a public commit can contact any Jetstream or Relay, and independently verify that what they receive was approved by the account responsible for producing the commit.

The confidentiality of commits is not a concern, as publishing a commit to any Relay(s) is traditionally interpreted as giving services the right to redistribute the commit to any other service on the network, with user-declared preferences taking priority when expressed.

If I understand it correctly, in a permissioned context (ATS), any service that wants to fetch/receive a permissioned commit would only contact the single, authoritative PDS for the account producing the commit, as indicated in the account’s DID document. The layers beneath AT (TLS, TCP) already provide authenticity and integrity, as well as the desired aspect, confidentiality, so it might seem redundant to provide an extra guarantee of only authenticity through signing only the IKM, but the authenticity provided by TLS is not the same as the authenticity provided by the verification methods in the DID document.

The authenticity provided by TLS would prove that the PDS approved the contents of the commit; the authenticity provided by keys in the DID document would prove that the account approved the contents of the commit. The commit has to include some signature information, and it’s not a good idea for the commit to be authenticated against the PDS (requires trust in the PDS-controlled TLS key), instead of being authenticated against the account (requires trust in the DID document).

The authenticity provided by the verification key shows that the account approves of the ikm, but not the meaningful content of the commit. The PDS determines the repo hash and other meaningful commit info, and it can use any valid ikm & signature from the author account.

Since the ikm & signature don’t prove any of the rest of the commit, a reader must rely on the authenticity of its TLS connection to verify that the repo hash is authentic. With the current design, the reader must also verify that the author account approved of the ikm, which effectively verifies that the author approved of a space commit with unspecified contents at an unspecified point in time, with no guarantees against reuse. (Technically this second verification proves that the author approved of an arbitrary 32-byte message at an unspecified point in time, but I don’t think that the atproto verification key signs any other 32-byte messages right now.)

We can still rely on TLS for data integrity & authentication of the server.

Ultimately, this is about retaining atproto’s data model, which roots authority in the DID, not the server. Falling back to TLS changes the security, trust & authority of the protocol.

A (non-exhaustive) example of this in action: consider a self-hoster who loses access to their domain name (maybe they forgot to renew & it expired). With location-based authority (TLS), whoever has their domain now can post as them. In atproto, this should not be possible unless that person has control of their DID or the key material referenced by the DID.

This is a good point & we should probably bind some extra information into the signature. Something like sig(sha256("atproto-space-commit-v1" + space_ref + rev + ikm))

This would prevent ikm-reuse
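
A minimal sketch of what such a bound digest could look like, under stated assumptions: the field order follows the formula in this post, but the length-prefixing of each field is an addition of this sketch (plain concatenation can be ambiguous at field boundaries), and the function names are hypothetical. The actual signing step with the account's key is left out.

```python
import hashlib

DOMAIN = b"atproto-space-commit-v1"  # domain-separation tag from the post above


def _field(data: bytes) -> bytes:
    # Assumption of this sketch: prefix each field with its length so that
    # ("ab", "c") and ("a", "bc") cannot produce the same byte stream.
    return len(data).to_bytes(4, "big") + data


def commit_signing_digest(space_ref: str, rev: str, ikm: bytes) -> bytes:
    """Digest that the account key would sign instead of the bare ikm."""
    h = hashlib.sha256()
    for part in (DOMAIN, space_ref.encode("utf-8"), rev.encode("utf-8"), ikm):
        h.update(_field(part))
    return h.digest()
```

With this shape, a signature lifted from one space or revision does not verify against another, because the digest changes whenever `space_ref` or `rev` changes even if the ikm is reused.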

+1. I think we both agree that the sigs are currently too weak to serve their intended purpose, and are demonstrating this by comparing to alternatives that lie in opposite directions (no sigs vs. stronger sigs). Stronger signatures are a good idea so long as the sacrifices to deniability don’t outweigh the benefits of a stronger tie to the author’s DID doc.

Apologies for not being more on this discussion – it’s obviously highly relevant to us, but has come at a particularly busy time for me, with some travel and life things intersecting.

As a matter of process, I’ve shared similar thoughts in the past and I’d encourage us all to slow down a little and ensure that the community has time for discussion before presenting solutions? The work that @dholms.xyz has put in is really awesome and much appreciated, but I think risks losing consensus in the “here’s the protocol, mostly baked” presentation.

Maybe I’m wrong, but my take on many of the lexicons would be slightly or possibly a lot different, and e.g. permissioned roles are fairly core to our needs with Roundabout. We can jump through hoops to make it work, of course, but I think it probably warrants further discussion outside the Bluesky team.

A couple of initial thoughts:

  • I don’t love the divergence on applyWrites/createRecord/etc. My fairly strong preference here, for a variety of reasons, is that the existing record modification methods get a space param, and retain their existing authentication semantics.
  • Perhaps I’m reading the intent of the lexicon incorrectly, and the com.atproto.space.* record modification methods are for the space itself, not for records that users are intending to make available to the space? Even in that case, though, I’d argue the existing mechanisms are fine.
  • The getMembers approach maps neatly to how we’ve modeled this stuff for Roundabout. We’re using a following list, but it’s the same thing, so that’s great!
  • We have two “members lists” in our “spaces”: regular community members and stewards (admins). Maybe that would be a lightweight way to handle the desire for different scopes?
  • For our purposes, I think I would prefer to be able to call subscribeRepos() on a PDS with an appropriate token and get updates for a space. Or, perhaps, that would maintain parity with the existing approaches. More generally/strategically, I think it could be helpful to investigate what an actual implementation would look like. Obviously either push or “private firehose/relay/jetstream” subscription semantics would work, but my sense is that there’s value in retaining api parity with the existing atproto flows. I have a bunch of tooling that would need a significant refactor to work in this proposed world, and I have a number of side projects that work great with public data and should work great with private data. I’d really prefer to avoid having parallel data-ingestion code-paths and patterns for these sorts of things.

Anyhow, hope that’s helpful! I wonder if it would be helpful to gather some folks who have private data implementations on a call to discuss some of these things before the implementation work gets too far ahead of the community’s design requirements?

Yes. It’s literally just @dholms.xyz working on this right now while @bnewbold.net is on break AFAIK and him having to do community management as well is a lot of work.

Hey bsky folks, it would be great if you could put more people’s time into facilitating community discussions. Even this space is outsourced.

If you’re already doing this behind closed doors … please do it with the doors open.

That’d be great. Perhaps @baldemo.to could help shepherd us into an open community call?

Everything to do with permissioned data is closely intertwined with community standards, so the stakeholders are largely the same.

Sounds good! I’m a bit overloaded over the next few days, but I’m happy to volunteer to pull something together. fwiw, I do think there’s a meaningful difference between community standards (governance) and private data, but agree they’re closely linked. Big exciting questions!

I also wanted to chime in with some thoughts after following the discussions around communities and the most recent data diary. I outlined my use case in #855, but I’ll briefly describe it here since it differs quite a bit from the community/group-like use cases that are being discussed a lot:

  • private personal posts to an audience, where only the owner knows the member list of the audience

  • anyone in the audience can reply and see the replies of other audience members

  • this maps to things like private facebook posts with a custom audience, google+ posts to a specific circle, or close friend stories on instagram

  • allowing PDS host choice and data portability

  • interop with apps using same lexicons, including reply and full thread sync

Allowing the member list to be fully enumerated by apps on behalf of anyone in the space as a side effect of credentialed sync via getMemberOplog / getMembers (including compaction recovery) doesn’t feel private enough for this use case. I generally support the permissioned data shape so far, but I’d need viewer-scoped member list responses for SpaceCredential holders (while owner OAuth can still see the full list for administration).

There was a suggestion in #855 to use an arbiter in front to modify how getMemberOplog, getMembers, and any recovery path that enumerates DIDs work, but I’m concerned this means the space data would not truly be portable and would always be tied to some bit of my infrastructure. It would also mean every product of this shape of personal data would have to host its own arbiter or fork the PDS implementation, with each one possibly differing slightly in implementation.

I’d like a basic option for viewer-scoped member listing at the protocol level. My suggestion would be to add a memberListSync: full | ownerOnly option on createSpace. This obviously doesn’t prevent inferring participants from replies in synced records, but it at least prevents sync from exporting the full graph through member list APIs. Any other functionality can definitely be layered on top “above the waist” as mentioned in data diary 6, but something like this bit feels fundamental for space audience privacy + interop.
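
To make the suggestion concrete, here is a minimal sketch of how a host might answer a member-list request under the two modes. Only the `memberListSync: full | ownerOnly` option comes from the post above. The `Space` shape, the `list_members` function, and the choice that a non-owner sees only their own DID in `ownerOnly` mode are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Space:
    owner: str
    members: set[str]
    member_list_sync: Literal["full", "ownerOnly"] = "full"


def list_members(space: Space, viewer: str) -> list[str]:
    """Viewer-scoped member listing (hypothetical behaviour)."""
    if viewer != space.owner and viewer not in space.members:
        raise PermissionError("viewer is not part of this space")
    if viewer == space.owner or space.member_list_sync == "full":
        return sorted(space.members)
    # Assumption: in ownerOnly mode a member can confirm their own
    # membership but cannot enumerate the rest of the audience.
    return [viewer]
```

Any oplog or recovery path that enumerates DIDs would need the same scoping for the option to be meaningful.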

I’m happy to join in on any community call!

Heya! Thanks for all the feedback :folded_hands:

I’ve been writing my permissioned data diaries for the last few months and collecting & integrating feedback from folks along the way. My goal was to explore the design & solution space in real-time in collaboration with the ecosystem. Generally, my read is that the reception has been positive! I haven’t encountered any major deal-breakers with the general shape of the solution. And most of the folks working in the private data & communities space have signalled that they’re happy with the general shape of the design. The energy from teams like Habitat & Roomy has shifted to building the higher-level abstractions on top of the lower-level data protocol.

I think the problem may be less that it’s closed & more that it’s scattered across Bluesky threads, Leaflet posts, Discourse posts, Discord channels, etc. It’s a lot to keep track of. I have a somewhat unique lens on this because part of my full-time job right now is to wrangle these discussions (and even I still miss a lot!). Would it be helpful to do weekly or bi-weekly posts to this forum that are basically “things that happened with permissioned data this week”? I could include links to discussions, prompt questions, & try to sign-post rough timelines around decisions & expectations for releases.

I really want this to be collaborative & ensure it meets folks’ use cases, but I also want to avoid getting bogged down in too much process. It’s always a balance of course. Happy to hear feedback/thoughts on how to structure that as well.

That sounds great! I would definitely join for that as well!

There is a meaningful difference between the two! But I also think a lot of the meaningful stuff is on the community standards side. The protocol is intentionally kept very simple in terms of authorization semantics & things like roles (more on that here). The work Zicklag is doing with the Arbiter (good in depth post on it here) shows the type of rich application & authorization semantics that can be built on the permissioned data protocol.

Is there any particular reason for wanting to use the existing methods? One reason for separating them is that they actually don’t retain their existing authorization semantics. Different OAuth scopes are needed for writing to a space vs to a public repo collection. Roughly, I think the “space type” and “space did” are the main axes that you can authorize on. I get into that more in this post (sorry for all the links lol :sweat_smile:).

I laid out a rough version of my thoughts on this in my Big Picture post. I think doing a stateful websocket is going to be hard to manage with all the credentials that applications will be juggling. Unless you do a Websocket-per-space. But the downside of that is that many of these spaces will be very low throughput (especially on small or self-hosted PDS distributions). I think write notifications + pull-based sync (with HTTP/2 support for connection re-use!) is basically as close as we can get to subscribeRepos while owning up to those limitations & the lack of relays. I’m very interested if folks have other ideas!
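
A toy, in-memory sketch of the "write notification + pull-based sync" pattern described above, to show why a missed notification is recoverable. Everything here is hypothetical: `SpaceHost`, `Consumer`, and the sequence-numbered oplog stand in for whatever the real endpoints and cursors turn out to be, and no networking or credentials are modelled.

```python
class SpaceHost:
    """Holds an append-only oplog and pings subscribers on each write."""

    def __init__(self):
        self.oplog = []        # list of (seq, op) tuples
        self.subscribers = []  # callables invoked with the new seq

    def write(self, op):
        seq = len(self.oplog) + 1
        self.oplog.append((seq, op))
        for notify in self.subscribers:
            notify(seq)  # lightweight "something changed" signal

    def get_oplog(self, since):
        # Pull endpoint: return every entry after the caller's cursor.
        return [entry for entry in self.oplog if entry[0] > since]


class Consumer:
    """Keeps a cursor and pulls whenever notified (or on its own schedule)."""

    def __init__(self, host):
        self.host = host
        self.cursor = 0
        self.seen = []
        host.subscribers.append(self.on_notify_write)

    def on_notify_write(self, _seq):
        # The notification carries no data that must be trusted; it only triggers a pull.
        self.pull()

    def pull(self):
        for seq, op in self.host.get_oplog(self.cursor):
            self.seen.append(op)
            self.cursor = seq
```

The design point is that the notification is only a hint: the cursor-based pull is the source of truth, so a consumer that was offline simply pulls from its last cursor when it returns.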

I also hope that dev-tooling can help paper over some of the differences between the public & permissioned protocols.

This is interesting… And I’m sorry I did totally miss your previous post!

I’ve been taking the member list for granted. But you’re right that it isn’t necessarily strictly required. I need to think through the tradeoffs of not having it.

I’ll try to hop over there & share some thoughts soon :slight_smile:

Super appreciate all the work you’ve been doing on this. I think maybe now that the shape of things is becoming more clear, there’s value in having a conversation that’s a bit more consensus-driven, versus the requirements / technical shape gathering that you’ve been doing to date?

Part of this is very much in terms of building buy-in from the various interested parties. My concern is that if your proposal doesn’t match folks’ expectations in fundamental ways, we could end up in a not-ideal situation where core PDS mechanics are begrudgingly or not-at-all adopted by the community. Speaking for myself, there are a lot of various threads at this point, and keeping track of them all is fairly tricky, which means that the (probably incorrect!) impression I have from the “outside” / limited-time view is that “@dholms.xyz is taking input under consideration and will tell everyone what the protocol will be” vs “we’re working on this together and building a consensus that @dholms.xyz / Bluesky are happy to support on ‘mainline’ PDSes”

Is there any particular reason for wanting to use the existing methods? One reason for separating them is that they actually don’t retain their existing authorization semantics. Different OAuth scopes are needed for writing to a space vs to a public repo collection. Roughly, I think the “space type” and “space did” are the main axes that you can authorize on. I get into that more in this post (sorry for all the links lol :sweat_smile:).

Yes! So, I think the authorization for writes doesn’t lie at the PDS level at all. If I’ve understood the full proposal correctly (i.e., data living on individuals’ PDSes), there’s nothing stopping anyone from writing to a space on their own hacked-up PDS, even if some PDSes enforce these constraints. Since the “authorization” lives at the intersection of community PDS, user PDS, and AppView policy, it seems kind of redundant to implement separate methods.

Error handling is a separate question, but I do worry about having to proxy an auth request to the space every time the user’s PDS wants to make a write.

For reads the situation is obviously different, and I think I could be convinced for the alternate approach, especially if it’s push-based. For that mechanism, though, I could imagine something that relied on the ‘central’ community PDS being “special.”

Specifically, having a subscribeWrite method that takes a destination as an argument and that requires sufficient group auth in order to subscribe to notifyWrite callbacks might be a more generally useful mechanism? The space PDS could (but wouldn’t have to!) call subscribeWrite on all new members, but for differently factored architectures, we could imagine a separate type of private relay server that could handle those events.

This would map it quite closely to the subscribeRepos semantics (and, I would argue, could be implemented as an internal filter over subscribeRepos with a parameterized set of auth tokens). But that gets a bit messy for a variety of reasons, so the notifyWrite mechanism does seem fine. As a side note, having a space service (vacuum vs relay because space is a vacuum, ahem) that collects all of the space’s posts would mean that the appview could expose a pds/relay subscribeRepos or jetstream-style subscribe endpoint, authz’d with a valid member token, which could be a nice simplification.

Does that make sense? Have I missed something obvious?

I just wanted to chime in with my perspective on this.

When we decided to go fully on protocol for Roomy, I read up on the permissioned data proposal, ready to make a counter proposal if necessary, and get things going to make an implementation so we could make sure that things would actually work for us.

I didn’t like the sound of things at first, but after really thinking things through I really appreciate the simplicity of it, and that it lets us extend it in rather sophisticated ways, as the space host, without losing interoperability.

@meri.garden and I mentioned some of our concerns about the proposal on Bluesky and got a prompt response to them. They were about public perm spaces and reducing read/write flags to just membership, and both points were incorporated into the latest diary (we weren’t the only ones who mentioned them).

Otherwise we have had no objection to the shape of the proposal and immediately started working on the extra layers on top that we were going to need.

So as far as how well input has been taken and the community has been involved, I’m satisfied so far. :slight_smile: I’m not sure if there are better ways to organize efforts, but I haven’t felt a need to change anything or get more attention to sticking points or anything like that.

I think possibly a lot of people in the community who are interested in private data are on the “I’ll just wait for them to come out with it” wagon, which makes sense because it takes a lot of time to figure this stuff out and keep up with things.


Anyway, as far as details around the API and how it compares to the public ATProto API, I don’t have a ton of thoughts, and maybe that does need some iterating. But I’m satisfied with the affordances laid out in the proposal so far. As long as we have private spaces of some sort that we can control membership of with our own server, and we have some way to get a “garden hose” for the community, then I think Roomy’s needs will be satisfied.

So far I appreciate that the shape of the proposal, at an abstract level, matches the division of things that ATProto has already shown to work: i.e. separating the data and the indexing. And it seems to add just enough to get privacy.
