[Proposal] AT-URI for cross-service records

tl;dr: services other than AtprotoPersonalDataServer might host records, so let’s introduce the following AT-URI syntax, with an optional service field, to reference that non-PDS-hosted data.

at://tangled_knot@did:plc:user/...
     ^^^^^^^^^^^^              ^^^
     |                         path can be any format
     |                         (NSID/rkey preferred, but not enforced)
     service name
     (defaults to "atproto_pds" when omitted)

In this proposal, I’m going to explain why we need cross-service records and propose an AT-URI syntax to reference them universally. There are many more ideas around cross-service records (e.g. do we want a universal way to sync them?), but those would be a bit off-topic here.

Context

While redesigning Tangled for true decentralization, I found that the PDS falls short in several ways. It doesn’t allow collaborative or private data, and its data format is limited too. Storing everything in the PDS doesn’t seem like a good idea.

The prior discussions the-case-for-universal-login-and-off-protocol-services and media-pds-service already pointed out similar problems pretty well. The PDS can’t fit every use case. I’ve also often seen non-PDS data servers discussed as a way to solve private/shared records.

I concluded that we will eventually have to store some data in services other than the PDS (by which I specifically mean the AtprotoPersonalDataServer-typed service registered under the #atproto_pds name). We then want a way to reference that off-protocol data from PDS records and vice versa. Some data should be owned by a user, and some should be owned by a group (e.g. repo collaborators). Because the current PDS spec doesn’t support group-owned data (and I suppose it never will), that group-owned data should be stored in a non-PDS service: a service that doesn’t have the AtprotoPersonalDataServer type and is assigned to the DID under a different service id (just like the #atproto_labeler service).
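To make the "service in the DID document" idea concrete, here is a minimal sketch of looking up a service entry by name. The `#atproto_pds` id and `AtprotoPersonalDataServer` type are the real ones; the `#tangled_knot` entry, its `TangledKnot` type, and both endpoints are assumptions made up for illustration.

```python
from typing import Optional

DEFAULT_SERVICE = "atproto_pds"

# Hypothetical DID document. Only the #atproto_pds entry follows the
# existing atproto convention; the #tangled_knot entry is invented here.
EXAMPLE_DID_DOC = {
    "id": "did:plc:example",
    "service": [
        {
            "id": "#atproto_pds",
            "type": "AtprotoPersonalDataServer",
            "serviceEndpoint": "https://pds.example.com",
        },
        {
            "id": "#tangled_knot",
            "type": "TangledKnot",  # assumed type name
            "serviceEndpoint": "https://knot.example.com",
        },
    ],
}


def find_service(did_doc: dict, name: str = DEFAULT_SERVICE) -> Optional[dict]:
    """Return the service entry whose id fragment equals `name`, or None."""
    did = did_doc.get("id", "")
    # Service ids may be relative ("#name") or absolute ("did:...#name").
    wanted = {f"#{name}", f"{did}#{name}"}
    for entry in did_doc.get("service", []):
        if entry.get("id") in wanted:
            return entry
    return None
```

Calling `find_service(EXAMPLE_DID_DOC, "tangled_knot")` returns the knot entry, while omitting the name falls back to the PDS entry.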

Use-case of cross-service records

This section explains why we cannot use the existing PDS for all user-owned data, and tries to cover the possible examples. I’m spelling this out because these cases look implementable within the existing atproto spec at first glance, but actually aren’t. You can safely skip this part if you want.

1. Group-owned data

The PDS can’t serve group-owned data because it doesn’t have the ACLs needed to allow collaborative edits. Someone could build a custom PDS for that, or put a lightweight proxy around it for the extra logic, but that’s not a correct AtprotoPersonalDataServer implementation. We shouldn’t force users onto custom PDS implementations just to use a service like Tangled. Even worse, those custom implementations might not be compatible with other Atmosphere services. This is closer to the Fediverse’s approach, where it works because an instance is both the data storage and the app serving that data.

2. Record with revision history / 3. Auto-updating records

Tangled specifically needs these. We want the full revision history of issue records. We want to create workflow runs without an explicit trigger from the user and update their status automatically. Even though these are representable in JSON/CBOR, we cannot use the PDS here.

4. JSON projection of non-JSON data

e.g. git commits, git trees, workflow status

This is the case where the actual data is in a different format but is projected as JSON. Technically, these are just auto-updating records.

5. External blob that requires its own way to fetch

e.g. workflow log stream, video stream, really large data
This is what media-pds-service addresses.

Honestly, I’m not sure we strictly need AT-URI referencing for these. Still, those streams should be owned by the user, so their unique ID ends up looking like an AT-URI:

  • where is this data stored (service)
  • who owns the data (identity)
  • which kind of data (collection)
  • exactly which data from this user (record key)
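As a rough sketch, the four components above map onto a small value type that renders to the proposed syntax. The class name, field names, and the sample collection and record key used with it are hypothetical; the only rule taken from the proposal is that the service prefix is dropped when it is the default "atproto_pds".

```python
from dataclasses import dataclass

DEFAULT_SERVICE = "atproto_pds"


@dataclass(frozen=True)
class RecordRef:
    """Hypothetical container for the four identifying components."""

    service: str     # where the data is stored
    identity: str    # who owns the data (a DID)
    collection: str  # which kind of data
    rkey: str        # exactly which data from this user

    def to_uri(self) -> str:
        # Omit the service prefix for the default PDS service, so
        # existing AT-URIs keep their current shape.
        prefix = "" if self.service == DEFAULT_SERVICE else f"{self.service}@"
        return f"at://{prefix}{self.identity}/{self.collection}/{self.rkey}"
```

For example, `RecordRef("tangled_knot", "did:plc:example", "org.tangled.pull", "abc").to_uri()` gives `at://tangled_knot@did:plc:example/org.tangled.pull/abc`.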

6. Private/E2EE data..?

I’m not trying to solve private/E2EE data here; I’m more focused on collaborative/non-JSON data. Still, allowing cross-service records seems like good groundwork for a future private/E2EE data implementation.

Cross-service referencing

So, if data is stored across multiple services, it is reasonable to have a universal way to reference it instead of custom identifier specs all over the place. We usually use AT-URIs for data in the PDS (blobs are the exception, but they are attached to records), so let’s extend the current AT-URI syntax to include the name of the service where the data is stored.

For reference,
Full AT-URI syntax:

"at://" AUTHORITY [ PATH ] [ "?" QUERY ] [ "#" FRAGMENT ]

Current blessed AT-URI syntax:

"at://" AUTHORITY [ "/" COLLECTION [ "/" RKEY ] ]

We could append ?service={service_name} to the blessed AT-URI syntax, which stays within the current full AT-URI syntax. But honestly I think using the userinfo field of the URI makes more sense:

"at://" [ SERVICE "@" ] AUTHORITY [ PATH ]

examples:

at://tangled_knot@did:plc:example/org.tangled.pull/<tid>

at://tangled_knot@did:plc:example/commit/<commit-hash>

The path can be in any format. NSID/rkey is preferred, but it is fine not to follow it if needed. When we cannot define the data schema in a lexicon, using an NSID doesn’t mean much.
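A minimal parsing sketch for the proposed `"at://" [ SERVICE "@" ] AUTHORITY [ PATH ]` form follows. The function and type names are made up for illustration, and it deliberately ignores query and fragment, which the proposed syntax does not use.

```python
from typing import NamedTuple

DEFAULT_SERVICE = "atproto_pds"


class ParsedAtUri(NamedTuple):
    service: str
    authority: str
    path: str


def parse_at_uri(uri: str) -> ParsedAtUri:
    """Split a proposed-syntax AT-URI into service, authority and path."""
    scheme = "at://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an AT-URI: {uri!r}")
    rest = uri[len(scheme):]
    # The authority part runs up to the first "/"; the path is kept opaque.
    authority_part, slash, path = rest.partition("/")
    service, at_sign, authority = authority_part.rpartition("@")
    if at_sign and not service:
        raise ValueError("empty service name before '@'")
    if not authority:
        raise ValueError("missing authority")
    return ParsedAtUri(service or DEFAULT_SERVICE, authority, slash + path)
```

Splitting on the first "/" before looking for "@" keeps a stray "@" inside the path from being mistaken for a service separator.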

Fetching off-protocol records

While #atproto_pds records are fetched with com.atproto.repo.getRecord or com.atproto.sync.getRepo, these off-protocol records might not be compatible with all of those XRPC methods. For example, we simply cannot call repo.getRecord for non-JSON data, and sync.getRepo won’t work when the underlying MST structure is changed to support revision history.
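One way to picture per-service fetching is a small registry that maps a service name to a function building the request URL. This is only a sketch under assumptions: `com.atproto.repo.getRecord` and its `repo`/`collection`/`rkey` parameters are real, but `org.example.knot.getObject`, its parameters, and the registry itself are invented for illustration.

```python
from typing import Callable, Dict
from urllib.parse import urlencode

# (service endpoint, owner DID, path) -> URL to request
UrlBuilder = Callable[[str, str, str], str]


def pds_record_url(endpoint: str, did: str, path: str) -> str:
    """Build a com.atproto.repo.getRecord URL for an NSID/rkey path."""
    collection, _, rkey = path.strip("/").partition("/")
    query = urlencode({"repo": did, "collection": collection, "rkey": rkey})
    return f"{endpoint}/xrpc/com.atproto.repo.getRecord?{query}"


def knot_object_url(endpoint: str, did: str, path: str) -> str:
    """Hypothetical custom method; this NSID does not exist anywhere."""
    query = urlencode({"owner": did, "path": path})
    return f"{endpoint}/xrpc/org.example.knot.getObject?{query}"


URL_BUILDERS: Dict[str, UrlBuilder] = {
    "atproto_pds": pds_record_url,
    "tangled_knot": knot_object_url,  # assumed service name
}


def fetch_url(service: str, endpoint: str, did: str, path: str) -> str:
    """Pick the URL builder registered for `service`."""
    try:
        builder = URL_BUILDERS[service]
    except KeyError:
        raise LookupError(f"no fetch method registered for {service!r}") from None
    return builder(endpoint, did, path)
```

The identifier stays the same across services; only the lookup from service name to fetch method differs.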

I’m not super sure about the fetching/syncing part. I think it’s better to allow custom methods for maximum flexibility, but obviously that forces each non-PDS service to build its own similar sync protocol. As that’s what Tangled/Streamplace/Germ/Roomy are already doing, I guess it’s fine..? Even if we don’t reuse the existing common XRPC methods, having a universal identifier spec for user-owned data would still be valuable.

Please share what you think about this concept!


Great write-up and thank you for the insight. The artificial limitation imposed by a single PDS per user is not practical, for many of the reasons you’ve stated above. We can’t expect a single PDS to scale to the services that will be invented in the coming years. Even if a single PDS could evolve to provide these future innovations, an app could only capture the existing user base if ALL PDS providers adopted those innovations simultaneously, and within a reasonable timeframe. That, of course, will never happen.

We’re seeing people looking for workarounds for what the URI spec already successfully accomplishes. I mentioned the need for the AT-URI to come into compliance with the URI standards in this discussion:

Adding a query parameter to the AT-URI would be compliant with URI and DID Document standards. You reference a DID service with the query ?service={service_name}. That is what we’ve adopted for NorthStar Social to be aligned with existing standards.
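As an illustration of the query-parameter form (a sketch only, not any project's actual implementation), Python's standard `urllib.parse` can read the service name without special handling. The function name and the "atproto_pds" default are assumptions carried over from the original proposal.

```python
from urllib.parse import parse_qs, urlsplit


def service_from_query(uri: str, default: str = "atproto_pds") -> str:
    """Read ?service=... from an AT-URI, falling back to `default`."""
    parts = urlsplit(uri)
    if parts.scheme != "at":
        raise ValueError(f"not an AT-URI: {uri!r}")
    values = parse_qs(parts.query).get("service", [])
    return values[0] if values else default
```

With this form, `at://did:plc:example/org.tangled.pull/abc?service=tangled_knot` yields "tangled_knot", and the same URI without a query yields the default.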

This addition of the DID document service to the AT-URI authority component would further deviate from URI (and DID document) standards. The @ is already reserved for user information (username:password) with specific implementations in URI parsers. I’d rather see the URI standard adopt DIDs in the authority with the emergence of a “DNS” for DID lookup (something greater than the PLC directory). This makes more sense since the authority can easily adopt the did:xxx:abcd.. in place of hostname/ip:port without much headache for parsers to adopt universally. Two colons vs one colon. A URI compliant workaround without such a standard would be encoding the DID colons in the AT-URI.
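To show what one generic parser does with these forms, here is a small probe using Python's standard `urllib.parse`. Other parsers may behave differently; the helper name is made up, and the URIs are the examples from this thread.

```python
from urllib.parse import quote, unquote, urlsplit


def generic_view(uri: str) -> dict:
    """Report how a generic URI parser reads the authority component."""
    parts = urlsplit(uri)
    try:
        port = parts.port
    except ValueError:
        # Everything after the first ":" is treated as a port number,
        # so "plc:example" is rejected.
        port = "invalid"
    return {"username": parts.username, "hostname": parts.hostname, "port": port}


with_service = generic_view("at://tangled_knot@did:plc:example/org.tangled.pull/abc")

# Workaround mentioned above: percent-encode the DID colons.
encoded = quote("did:plc:example", safe="")
encoded_parts = urlsplit(f"at://{encoded}/org.tangled.pull/abc")
decoded_authority = unquote(encoded_parts.netloc)
```

Here the service name lands in `username`, the hostname is cut at the first colon of the DID, and the port lookup fails, while the percent-encoded authority survives the split and decodes back to the full DID.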


Thank you for the detailed feedback! I wasn’t aware that there had already been an attempt at this, good to know!

This addition of the DID document service to the AT-URI authority component would further deviate from URI (and DID document) standards. The @ is already reserved for user information (username:password) with specific implementations in URI parsers.

I thought it would be fine and reasonable to use the reserved user information part to represent service name.

Generic URI Compliance

  • Userinfo: not currently supported, but reserved for future use. a lone @ character preceding a handle is not valid

source

Because the authority part already represents the user and we won’t have sub-users, I think it makes sense to read it as “service Foo at user did:xxx:bar”. Yes, I’m proposing to extend the current AT-URI syntax. All existing AT-URI parsers only expect the blessed form and ignore the query part anyway, so both the {service}@ prefix and the ?service={service} suffix would be a breaking change for them. I haven’t seen a real use-case for the full AT-URI syntax yet, so I thought it would be fine to update the exact spec now.

Though this is just a stylistic choice, so ?service={service} is also completely fine. I just found that using userinfo is possible and, imo, better looking.

Let’s slow down here. Boltless, are you aware of the work being done on permissioned data which would support group collaboration?

Besides read ACLs, if you’re trying to support multi user collaboration on data structures, the general thinking is that you would put an API around the records — at the application level. This shouldn’t require a modification to the PDS or any URI syntax, just coordination around the wrapping server.

For revision history, you just need to write that data as records as well.

In this case, instead of writing just the latest state of an issue to the PDS, like Paul is saying, I think you would want a separate record for each revision.

You could have a record that represents the persistent issue ID, but where the contents of the issue are actually stored in revision records, each of which links to the issue that it is a revision for.

That means that most-likely you would want an AppView that keeps track of the latest revision for any issue, so that you can easily query that from clients.
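Roughly, the shape could look like the sketch below: each revision record links to its issue, and the AppView folds the revisions into a "latest" index. All type and field names are hypothetical, and picking the latest revision by timestamp string assumes every `created_at` is an ISO 8601 UTC timestamp in the same format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class IssueRevision:
    """Hypothetical revision record pointing back at its issue."""

    issue_uri: str   # AT-URI of the persistent issue record
    created_at: str  # ISO 8601 UTC timestamp, same format everywhere
    title: str
    body: str


def latest_revisions(revisions: Iterable[IssueRevision]) -> Dict[str, IssueRevision]:
    """Index the newest revision per issue, as an AppView might."""
    latest: Dict[str, IssueRevision] = {}
    for rev in revisions:
        current = latest.get(rev.issue_uri)
        # Same-format ISO 8601 UTC strings sort chronologically.
        if current is None or rev.created_at > current.created_at:
            latest[rev.issue_uri] = rev
    return latest
```

Clients would then query the AppView for the indexed revision instead of walking every revision record themselves.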

I think this is also a job that falls to the AppView.

I think generally, if you have data that needs to be derived from some raw data, the raw data is what should go on the PDS and it’s the responsibility of the AppView most of the time to convert that to a more usable form if necessary.

For instance, I think it could possibly make sense to store git commits, etc. using CBOR bytes fields or blobs in the ATProto repo, but then have an “AppView” of sorts that would serve the normal git protocol on top of those records.

I’m not 100% sure that would work very well, and it might be better just to store the commit on a separate server, but I think generally projections should be done in an AppView.


Now I’m not necessarily saying that it wouldn’t be useful to have an AT-URI that allows us to link to other services.

I think that there are still going to be times where having other services is necessary, and having a standard way to link to them seems smart.

Adding somewhat unfamiliar patterns such as service_name@did:... does feel like maybe not the best idea to me. I’ve seen charts showing just how diverse URI parser behavior can be, and even if something is spec compliant, doing something that is not common seems like it could easily cause trouble.

Maybe that’s not the most concerning thing, but it does seem like motivation to go with the most standard thing that works.

I wasn’t aware of it! I’d seen WG Private Data by name and thought it was only about private data owned by a single user. Seems like I’ve been missing the discussion this whole time… Sorry for the noise.

the general thinking is that you would put an API around the records — at the application level

I’m not sure I understand this correctly. Which application is providing the API around the records here? An AppView can’t always have write access to an arbitrary service. Once we start designing a wrapper around the PDS, that wrapper needs full control over the user’s PDS, so we need a custom PDS implementation. Technically an app password can allow that, but it is too easy to accidentally break the connection.

Thank you for the feedback.

You could have a record that represents the persistent issue ID, but where the contents of the issue are actually stored in revision records, each of which links to the issue that it is a revision for.

Then we need a place to store those issue ID records as the single source of truth, and that cannot be the repo owner, because we want to support repo owner migration. The bucket should hold the group data here. I guess this is a pretty similar conclusion to Roomy’s case.

For instance, I think it could possibly make sense to store git commits, etc. using CBOR bytes fields or blobs in the ATProto repo, but then have an “AppView” of sorts that would serve the normal git protocol on top of those records.

Yeah, fair take. The JSON representation should always be treated as a projection, not the source.

Adding somewhat unfamiliar patterns such as service_name@did:… does feel like maybe not the best idea to me. I’ve seen charts showing just how diverse URI parser behavior can be, and even if something is spec compliant, doing something that is not common seems like it could easily cause trouble.

As I said, I thought it was fine to use that place because the official document says it is reserved for future use. I’m not super familiar with common URI parsers, but if it’s that much of a hassle to use, ?service={} seems like a solid solution.


@pfrazee.com After reading some articles from WG: Private Data, I see why it went in a different direction. It’s kinda off-topic, but I’m listing my thoughts here. I’m still catching up, so I might still be missing some things. I’m not good at explaining, so forgive me for yapping a bit.

NOTE: To clarify, I’m not representing the Tangled team’s opinion here, these are just my personal discoveries.

Tangled is a collaboration platform, and that means more than having a group. It is pretty common to have multiple authors for the same record, and sometimes it is fine to burden owners with holding all the data (e.g. we don’t really distribute git commits across each of their authors). Also, we cannot always just replace the referenced record. That approach makes sense for issue title/body changes, but not for label edits or issue state changes.

With the requirement of bucket-owned data and collaborative records, I still see off-protocol records as a more realistic solution.

Ah yeah, I think you might be coming from a similar place as Roomy.

For example, in the latest permissioned data diary:

Daniel mentions two options: colocated, where the group’s data is stored on Alice’s PDS and everybody contributes to it there, or partitioned, where the data is spread out across all contributors’ PDSes.

We at @roomy.space kind of felt like it was missing the middle ground between the two, where the group’s records are colocated, but on a dedicated community PDS as opposed to somebody’s personal PDS.

We’re feeling like that is the best option for Roomy at least, where we are thinking of trying to build a kind of community PDS.

That’s not what everybody wants. Northsky, for example, has I think a broader use-case for private data that is still shared widely and makes sense as “personal” data, not “community” data deserving its own community PDS.

My current thoughts on how to accomplish that are rough drafted here:

We’re still in the musing phase right now, but we have pretty much decided to start trying to make Roomy as ATProto native as absolutely possible, and since we need private data, that means we will probably end up experimenting with our own proposal for private data on ATProto, based on something along those UCAN lines.
