Dave Nash Private Content RFC

https://github.com/knasher/rfcs/blob/main/atproto/001-private-content.md

This RFC proposes a mechanism for private content in ATProto by modifying the data flow between Personal Data Servers (PDSes) and AppViews. Instead of transmitting private content through the Firehose, PDSes would send metadata notifications for interested AppViews to receive, which can then fetch the actual content directly from the PDS using existing OAuth authentication.

@davenash.com has written up an RFC for private content and is intending to implement / prototype:

The high level flow is a good starting point to figure out the requirements / choices (copy paste from github):

  1. Content Creation: User creates content marked as private on their PDS
  2. Metadata Broadcast: PDS sends metadata notification through Firehose (not the actual content)
  3. Authorisation Check: AppView receives metadata and determines if it holds valid access tokens with relevant permissions for the content
  4. Direct Fetch: Authorised AppView fetches actual content directly from the PDS using OAuth
  5. Content Handling: AppView processes and displays content according to privacy settings
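To make the five steps concrete, here is a minimal in-memory sketch of the flow. It is only an illustration under stated assumptions: the `Pds`, `AppView` and `MetadataNotice` classes, the token strings and the `app.example.post` collection are all hypothetical, and a real implementation would use the Firehose, XRPC and OAuth rather than direct method calls.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MetadataNotice:
    """Hypothetical firehose event: a pointer to the record, never its content."""
    repo_did: str
    collection: str
    rkey: str


@dataclass
class Pds:
    """Toy in-memory PDS; a real one would serve XRPC over HTTPS."""
    did: str
    records: dict = field(default_factory=dict)  # (collection, rkey) -> content
    grants: dict = field(default_factory=dict)   # token -> set of collections

    def create_private(self, collection, rkey, content):
        # Steps 1-2: keep the content local, hand back only the metadata notice.
        self.records[(collection, rkey)] = content
        return MetadataNotice(self.did, collection, rkey)

    def fetch(self, token, notice):
        # Step 4: release content only to a token scoped to this collection.
        if notice.collection not in self.grants.get(token, set()):
            raise PermissionError("token lacks access to this collection")
        return self.records[(notice.collection, notice.rkey)]


@dataclass
class AppView:
    tokens: dict = field(default_factory=dict)   # repo DID -> token

    def on_notice(self, notice, pds):
        # Step 3: ignore notices for repos this AppView holds no token for.
        token = self.tokens.get(notice.repo_did)
        if token is None:
            return None
        try:
            return pds.fetch(token, notice)
        except PermissionError:
            return None
```

An AppView without a token sees the notice but never the content, which is exactly the "public metadata" trade-off discussed below.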

So this has public metadata, which will be appropriate for some applications. I feel like this public metadata is probably going to be a strong yes/no depending on the use case. Would love to hear from others on this specific point on whether this is OK for them or not.

@davenash.com can you go into a little detail of what you mean by metadata? e.g. does it disclose title or just a CID or ?

Note as well that Dave is proposing a com.atproto lexicon – that’s what others should probably be using for now, for use by any app on atproto (that is, not just for your app lexicon).

  1. Followers Only: Content visible to approved followers
  2. Mentioned Users: Content visible only to mentioned users
  3. Custom Lists: Content visible to specific user-defined groups
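A rough sketch of how these three levels might be evaluated. The `Visibility` enum and `can_view` function are hypothetical names for illustration; the RFC may model this differently, and the audience sets would in practice come from follow records, mentions and lists.

```python
from enum import Enum


class Visibility(Enum):
    FOLLOWERS = "followers"
    MENTIONED = "mentioned"
    CUSTOM_LIST = "custom_list"


def can_view(viewer, visibility, followers=(), mentioned=(), list_members=()):
    """Return True if `viewer` (a DID string) is in the audience for `visibility`."""
    audiences = {
        Visibility.FOLLOWERS: followers,
        Visibility.MENTIONED: mentioned,
        Visibility.CUSTOM_LIST: list_members,
    }
    return viewer in audiences[visibility]
```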

I like how simple this is. @davenash.com are you intending to use ATProto built in lists for 3?

Alternative 1: Encrypted Firehose Content

Encrypt private content before broadcasting, with keys shared only to authorised AppViews.

  • Rejected because: Key distribution is complex and doesn’t scale well

Alternative 2: Private Content Relays

Create separate relay networks for private content.

  • Rejected because: Adds significant infrastructure complexity and cost

Oh, this is great to capture – not quite sure how we would kind of list these and discuss them over time.

And yeah, regardless of how the metadata is transferred, I think either existing relays OR the appview itself would/could be responsible for fetching the data.

I agree with Boris’ analysis here. I think for use-cases where the metadata leakage is not a major security concern, using public records for signaling is a good option.

The fact that this could work well for some cases but not for others suggests there’s some value to having flexibility in the private data schemes.

Yeah, my partial private/confidential record source proposal is pretty well aligned with what @davenash.com has been talking about and recommends. I think all of this underlines a common feature that we need which is the ability to get a record using the existing XRPC function names from a service that is not the default ATProtocol PDS.

Hey all, thanks for the responses.

@davenash.com can you go into a little detail of what you mean by metadata? e.g. does it disclose title or just a CID or ?

I envisage the metadata containing only the necessary information that the AppView would need to pull the data from the PDS. A concern I have is that while the content is private, making the metadata public is problematic from a privacy standpoint; so limiting it to only essential, non-human-readable fields makes the most sense.
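As a sketch of what "only essential, non-human-readable fields" could look like, the snippet below builds a notice from an allowlist. The field names and the `build_notice` helper are assumptions for illustration, not part of the RFC; the `at://` URI layout follows the usual AT URI form of DID, collection and record key.

```python
# Hypothetical allowlist: identifiers an AppView needs to locate the record.
ALLOWED_FIELDS = ("did", "collection", "rkey", "cid")


def build_notice(record: dict) -> dict:
    """Copy only allowlisted identifiers; titles, text, etc. never leave the PDS."""
    notice = {name: record[name] for name in ALLOWED_FIELDS}
    notice["uri"] = f"at://{record['did']}/{record['collection']}/{record['rkey']}"
    return notice


example = build_notice({
    "did": "did:example:alice",
    "collection": "app.example.post",
    "rkey": "3kabc",
    "cid": "placeholder-cid",       # placeholder, not a real CID
    "title": "My private diary",    # human-readable, must not be broadcast
})
```

Even this minimal form still reveals that a given account wrote something, in which collection, and when, so the metadata question raised above does not go away entirely.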

I like how simple this is. @davenash.com are you intending to use ATProto built in lists for 3?

Yes, built-in lists on ATProto could be used for this. We might want to consider whether these should also be privately held on the PDS, rather than their changes being transmitted publicly.

To your last point, @bmann.ca, I’m interested in hearing what folks think about how the relays could play a more active role in handling private data. My goal for the RFC (while wanting to provoke a discussion) was to try to keep the changes needed to implement this confined to the smallest number of components, but I’m also aware that my proposal has its drawbacks (such as placing more load on PDSes). A discussion on whether this could be mitigated by involving the relays in some way would definitely be welcome.

Does this mean that authorised AppViews will have access to the user’s private data?

This is not far from the current situation where Slack or Notion know everything that is happening in 90% of the global digital businesses, but I was hoping we could offer an alternative where private data can only be decrypted on Clients.

Also, what would it mean from a regulatory PoV?

Yes, Apps authorised by the user will be able to access the user’s private data.

Could you expand a little more on your question regarding regulation?

Could you expand a little more on your question regarding regulation?

If there is a way for an AppView to not be able to handle any private data in the clear, then things like GDPR would not apply.

IDK the feasibility - so no criticising, just observing!
I recently found out Peergos is exploring that concept and they also have a proposal for ATProto FYI: @peergos.org on Bluesky

so depending on the architectures being discussed, data is still ultimately owned by the user on their own PDS.

various servers and app views would auth or aggregate and have ToS of their own, but the core premise of ATProto is users own their data.

I don’t know how that nuance applies to GDPR other than app views would need to delete shared data on request.

Can you say more about how you see GDPR causing more work here @erikologic.bsky.social?

I bet you know this @bmann.ca but here is a good ref anyway:

The UK GDPR sets out seven key principles:

  • Lawfulness, fairness and transparency
  • Purpose limitation
  • Data minimisation
  • Accuracy
  • Storage limitation
  • Integrity and confidentiality (security)
  • Accountability

Also this:

Yes, as you said, a ToS that clearly states the usage of data by the AppView, as well as any potential providers that the AppView would use, including if we store/process data on external services (storage, database, O11y, analytics, etc).

Additionally, adhering to data standards and practices can be cumbersome to implement for teams that lack the time and/or experience.

It is not a blocker, given that it is standard industry practice in Web 2.0. Still, it is an overhead because it requires legal support & compliance, implementing specific techniques like PII individuation & masking, following up on data takedown requests, etc, which are known to be challenging.

I mean, in the companies I worked for (corporates in London), when implementing services, standard practice was to stay the hell away from touching unanonymised customer data because of the pain points that would bring. I saw teams losing some good time processing a data removal request for a distributed application spanning across AWS and a few more services. The most pragmatic approach would have been using crypto-shredding but it requires knowledge and proactiveness - also, it is unclear how compliant that technique is.
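For anyone unfamiliar with crypto-shredding, the idea is sketched below: encrypt each user's data with a per-user key, keep the key in one place, and "delete" by destroying the key so that every replicated copy of the ciphertext becomes unreadable. This is a toy illustration only. The `ShreddableStore` class is hypothetical, and the SHAKE-256 XOR keystream is a stand-in so the example runs with the standard library; a real system would use an authenticated cipher from a vetted library and a KMS. Whether this satisfies an erasure request is, as noted, unclear.

```python
import hashlib
import secrets


def _xor_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy stream cipher (NOT for production): XOR data with SHAKE-256 output.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class ShreddableStore:
    def __init__(self):
        self._keys = {}    # user -> key; the only thing that must be deleted
        self._blobs = []   # (user, nonce, ciphertext); may be replicated anywhere

    def put(self, user: str, plaintext: bytes) -> None:
        if user not in self._keys:
            self._keys[user] = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)  # fresh per record so keystreams differ
        ciphertext = _xor_keystream(self._keys[user], nonce, plaintext)
        self._blobs.append((user, nonce, ciphertext))

    def read_all(self, user: str) -> list:
        key = self._keys.get(user)
        if key is None:
            return []  # key shredded: ciphertext remains but cannot be decrypted
        return [_xor_keystream(key, nonce, ct)
                for owner, nonce, ct in self._blobs if owner == user]

    def shred(self, user: str) -> None:
        self._keys.pop(user, None)
```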

the core premise of ATProto is users own their data

If AppViews start caching your private data locally, does this statement still hold?

In the example I brought before from Peergos, they treat both PDS and AppViews as adversaries. Again, IDK how applicable, but I see advantages:

  • increase trust on the user side, because data leaks are basically impossible
  • increase operational efficiency by not having to put in place all the measures to support GDPR and other data protection frameworks

Do you think this is fair?

I think that some people have encryption as a hard requirement and others don’t :grin:

See “Strong urge to rename this working group to permissioned data”

Yeah, I saw that and didn’t make the connection, good point.

There is probably space for coexistence and evolution I guess…

I’m also in camp “encryption”, for what it’s worth. Private data should be private data. The permissioned access bit for me is just a way to declare who has the permissions to potentially decrypt my private data.

Also GDPR: all I can say on this is that getting 100% GDPR compliant is a pain in the ass. We’ve done it at work (and still not 100%) and that involved a not insignificant chunk of change and a few lawyers. Although this isn’t strictly related to private/permissioned data - data is data, and if you operate a PDS, you’re responsible for it whether it’s encrypted or not, I think.

Then again I’m not a lawyer so… my 2 cents.

My understanding of data laws is that the encryption they require is not the E2EE kind, and that (e.g.) if I use a cloud vm, most of those are compliant because my provider is managing encryption at rest.

The harder part of this kind of compliance is the paperwork and processes you have to put in place. Also with GDPR / ATProto, a user may wish for an app view to delete their data, not the PDS, which gets at the increased complexity of compliance.

Nope

You are responsible to clean up the AWS CloudWatch logs if you have logged PII there from your application.

A data removal request would require you to go search each entry and clean up manually. What people usually prefer to do is drop the whole log stream and lose part of the observability.
Now if you have those logs replicated in a bastioned account (which is part of the AWS Well Architected Framework) and/or you replicated them to an Elasticsearch cluster to consume from Kibana, you will have to follow up with them as well.

This is just in case you forgot to mask PII on your logs, imagine all the rest, e.g. https://event-driven.io/en/gdpr_in_event_driven_architecture/
BTW, as mentioned, it is not clear if crypto-shredding is GDPR compliant
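For reference, masking PII before it ever reaches the logs can be as small as a logging filter. The sketch below uses Python's standard `logging` module; the `MaskPiiFilter` name and the email-only regex are illustrative assumptions, and a real deployment would need patterns for whatever identifiers it actually handles.

```python
import logging
import re

# Deliberately simple email pattern; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_pii(text: str) -> str:
    return EMAIL_RE.sub("[redacted-email]", text)


class MaskPiiFilter(logging.Filter):
    """Rewrite each record so the formatted message contains no email address."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, so PII passed via %-style args is masked as well.
        record.msg = mask_pii(record.getMessage())
        record.args = ()
        return True  # keep the record, now masked
```

Attaching the filter to the handler (for example `handler.addFilter(MaskPiiFilter())`) means every destination fed by that handler only ever sees the masked message.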

I think you may be conflating my points, one is on encryption and the other is on complying with GDPR requests

  • When I use a PaaS, they encrypt the data at rest and in transit, including the VM disk, the shipping of logs, and the storage of logs. I do not have to put luks on the VM and encrypt logs myself before shipping them
  • When I get a GDPR request, of course I have to clean up my stuff. I am also supposed to propagate that notice to other data processors I use. Now, if I run a PDS, do I know all of those data processors? Has the user given permission to apps the PDS doesn’t know about? What if the user asks an app to delete their data? Should the app delete it at the PDS or is the user switching to another app and wants this app they no longer like to no longer have access… The complexities have increased with user data sovereignty.
  • Being compliant with GDPR, SOC2, HIPAA takes a lot more effort on the people processes than it does on the technical side. Went through this recently with a healthcare ai startup.

When I use a PaaS, they encrypt the data at rest and in transit, including the VM disk, the shipping of logs, and the storage of logs. I do not have to put luks on the VM and encrypt logs myself before shipping them

“For example, you can use the AWS Encryption SDK with an AWS KMS Key created and managed in AWS KMS to encrypt arbitrary data.”

Just to clarify, you previously mentioned:
“if I use a cloud vm, most of those are compliant because my provider is managing encryption at rest.”

The answer remains no. You need to know the systems you are using and understand, on a case-by-case basis, how to adhere to the regulations in place. AWS has 90% of the market and is telling you, “We are giving you the option, but this is your responsibility”. You might need to luks /var/log if your provider doesn’t give a similar option out of the box.

the other is on complying with GDPR requests, of course I have to clean up my stuff.

FYI, GDPR is not only about data removal requests.
It also dictates who can access your data and how, e.g., A guide to data security | ICO, among other things. It’s an extensive framework and extremely complicated to navigate.
e.g. in a company I worked with, we would store info (DBs, logs, etc.) in separate accounts per workload, or logically separated within the same account, including using KMS keys that are not cross-accessible. The company’s understanding of GDPR data processing was that payment service people/systems shouldn’t be able to access an insurance policy’s details, like who the insurance beneficiary is, because you don’t need to know that information in order to process payments; only a few order references would be communicated across.

Being compliant with GDPR, SOC2, HIPAA takes a lot more effort on the people processes than it does on the technical side. Went through this recently with a healthcare ai startup.

GDPR is mandatory law for any company, wherever it is based, that provides services to EU / UK customers.
SOC2 is a voluntary attestation, and HIPAA is a US law that only applies to organisations handling health data; both are quite expensive to comply with and unlikely to be chased by many ATProto experience builders until we have a clear monetisation strategy.

We shouldn’t conflate the two :slight_smile:

BTW, I don’t want this to put a hold on your work!
I arrived at a similar conclusion by a parallel route, and we need to move ahead one way or another.
I just wanted to raise that having e2ee would spare us a couple of headaches.
As mentioned, this stuff is industry standard; we kind of know the solutions already.

Does AWS not do these things for their customers? I left a long time ago for Google Cloud because the DX is superior and the costs lower.

small note for clarity, AWS has around 30%, Azure 20%, Google 10% of market share

This topic won’t hold up my work, as it is limited to the PDS and it would be on operators of PDS and Apps to deal with GDPR requests and encryption if their hosting provider does not handle it for them. The nature of the system does make for cases I have not heard before w.r.t. GDPR

Yeah, makes sense, AWS is crap, I agree.

small note for clarity, AWS has around 30%, Azure 20%, Google 10% of market share

Yeah 90% is a stupid figure, dunno where it came from, sorry

This topic won’t hold up my work, as it is limited to the PDS and it would be on operators of PDS and Apps to deal with GDPR requests and encryption if their hosting provider does not handle it for them.

I think we are violently agreeing on this one.

The nature of the system does make for cases I have not heard before w.r.t. GDPR

That’s an interesting statement. It would be beneficial to understand the implications this proposal would have with respect to current law requirements, e.g., https://g.co/gemini/share/99c3beb15004 , and provide guidelines for Private Data-enabled PDS/AppViews operators. I will be happy to propose a legal analysis within the Eurosky team, once we have a final draft (and if we manage to secure the necessary funding).

e.g. According to Gemini, PDS operators might need to get users to sign ToS:

The PDS operator must have a legally binding contract with the user (or the entity representing the user) that sets out the subject matter, duration, nature, and purpose of the processing. (Article 28(3))

If the PDS uses another service (e.g., a cloud storage provider) to handle the user’s data, it must obtain the Controller’s (user’s) prior written authorization for that third party (sub-processor).

BTW, just FYI, GDPR doesn’t end with data removal requests and encryption.
As I mentioned above, there is the data minimisation principle (Principle (c): Data minimisation | ICO) and a few more requirements.

The British GDPR-equivalent agency, the ICO, has a well-designed website that can be skimmed through in a few hours: UK GDPR guidance and resources | ICO
I went through it some years ago and haven’t had to navigate GDPR-related data for a couple of years now, luckily. There is probably plenty of stuff that I’m not connecting ATM.

Again, I want to reiterate, I’m not trying to bash your proposal. I saw your presentation in the WG, it makes sense, and we need to start from somewhere. I’m just trying to contribute with my experience. Also, I never said GDPR was a blocker, but only made an initial observation that with E2EE we would avoid a few headaches. We can sort them out.

with E2EE we would avoid a few headaches

I look at it as tradeoffs: E2EE comes with headaches too, notably key distribution. I’m not sure where you stand on the debate around here w.r.t. E2EE as a requirement; I’m hoping we can all come to common ground that both styles of system will be needed. I’m personally working on the non-E2EE system because that’s the one I need for Blebbit. We’ll have a ToS for users setting up communities that can cover our legal requirements. I also think it should not be too much of a problem to build on the “permissioned spaces” work and store ciphertext; there are some use cases that could benefit from a few records like this versus wanting to be on a fully E2EE system.

That’s an interesting statement.

Yeah, I’m not sure if / what the GDPR has around distributed or federated systems. It may put more work on the user, i.e. they may not be able to just tell the PDS to take down the data, they may also have to go to the apps they have allowed access for, because the PDS may not know and certainly won’t have a contract with all the apps (re: OAuth dynamic clients)
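One way to picture the problem: if a PDS recorded which OAuth clients a user had granted access to, it could at least tell the user (or the apps) where an erasure request needs to go. The sketch below is purely hypothetical; the `GrantLedger` class and the client ID URLs are made up for illustration, and nothing here is part of the RFC or of the existing PDS.

```python
from collections import defaultdict


class GrantLedger:
    """Hypothetical record of which clients each user has authorised."""

    def __init__(self):
        self._grants = defaultdict(set)  # user DID -> set of OAuth client IDs

    def record_grant(self, user_did: str, client_id: str) -> None:
        self._grants[user_did].add(client_id)

    def revoke(self, user_did: str, client_id: str) -> None:
        self._grants[user_did].discard(client_id)

    def erasure_targets(self, user_did: str) -> list:
        # Clients that may hold copies and would need to hear about an erasure.
        return sorted(self._grants.get(user_did, set()))
```

Note that a revoked client may still hold data it fetched earlier, so a ledger like this would probably need to keep history rather than only current grants.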

btw, highly recommend Gemini Deep Research for AI answers like the one you are seeking with the link above. It will go out and get more docs, read about ATProto, and provide a much longer and more robust response. (still within the limits of what these things can reasonably be expected to do)