The community.lexicon.preference.ai lexicon

ngerakines.me · April 22, 2026, 7:40pm

Hey everyone,

Picking up from a few weeks ago, I created a PR that I’d like comments on. I’m specifically looking to hear who would be interested in using this lexicon for interop:

github.com/lexicon-community/lexicon

Add community.lexicon.preference.ai lexicon (#72)

main ← ngerakines/community.lexicon.preference.ai

opened 06:51PM - 04 Apr 26 UTC

ngerakines

+219 -0

## Summary

- Introduces the `community.lexicon.preference.ai` lexicon for declaring user preferences regarding AI usage of their public data
- Decomposes AI usage into four distinct categories (training, inference, synthetic content generation, embedding), each with independent allow/deny controls
- Supports scoped overrides via `globalScope`, `entityScope`, and `collectionScope` so users can set account-wide defaults and carve out exceptions for specific entities or collections

## Design

Each preference is tri-state: allowed, denied, or undefined (omitted). The record at key `self` with `globalScope` establishes account-wide defaults. Additional records keyed by TID are scoped overrides that only need to declare the preferences they change; everything else falls through to the default.

Consumer resolution order:

1. Entity-scoped override matching the consumer's DID or domain
2. Collection-scoped override matching the content's NSID
3. Global default at key `self`

## Related work

- Complementary to [Bluesky Proposal 0008: User Intents for Data Reuse](https://github.com/bluesky-social/proposals/blob/main/0008-user-intents/README.md)
- [IETF AI Preferences working group](https://datatracker.ietf.org/group/aipref/about/)
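
The resolution order in the PR summary can be sketched roughly as follows. This is only an illustration under assumptions: the `PreferenceRecord` class, the `scope_type`/`scope_value` fields, and the category names are hypothetical stand-ins, not the field names defined in the lexicon itself.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical in-memory view of one preference record. The real lexicon
# defines its own field names; these are illustrative only.
@dataclass
class PreferenceRecord:
    scope_type: str                    # "global", "entity", or "collection"
    scope_value: Optional[str] = None  # DID/domain for entity, NSID for collection
    # category -> True (allowed) / False (denied); a missing key means undefined
    prefs: Dict[str, bool] = field(default_factory=dict)


def _first_declared(records: Iterable[PreferenceRecord], category: str) -> Optional[bool]:
    """Return the first explicit allow/deny for `category`, else None."""
    for rec in records:
        if category in rec.prefs:
            return rec.prefs[category]
    return None


def resolve(records: List[PreferenceRecord], category: str,
            consumer_ids: Set[str], collection_nsid: str) -> Optional[bool]:
    """Resolve one category: entity override, then collection override, then global.

    Returns True (allowed), False (denied), or None (undefined everywhere).
    """
    tiers = [
        [r for r in records if r.scope_type == "entity" and r.scope_value in consumer_ids],
        [r for r in records if r.scope_type == "collection" and r.scope_value == collection_nsid],
        [r for r in records if r.scope_type == "global"],
    ]
    for tier in tiers:
        decision = _first_declared(tier, category)
        if decision is not None:
            return decision  # an override only wins for preferences it declares
    return None


# Example data (all identifiers are made up for illustration).
example_records = [
    PreferenceRecord("global", prefs={"training": False, "inference": True}),
    PreferenceRecord("collection", "app.bsky.feed.post", {"inference": False}),
    PreferenceRecord("entity", "did:web:crawler.example", {"training": True}),
]
```

Because an override only applies to the preferences it declares, a consumer that matches the entity-scoped record above still falls through to the collection or global record for every category that record leaves undefined.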

rude1.blacksky.team · April 22, 2026, 7:41pm

Blacksky and I would like to push this through as a working group, and we are committed to adopting this in our app in an interoperable way.

taurean.bryant.land · April 22, 2026, 8:14pm

I’d like to be involved in that. I took a first shot at a lexicon for identifying as automated but that’s largely useless without some way for users to communicate their own needs. I’d love to see this handled at a level that is not focused on Bluesky-based lexicons.

sherifea.com · April 22, 2026, 9:04pm

Hey, would like to be involved for Eurosky too. Cheers!

chaosgreml.in · April 22, 2026, 9:17pm

I’d like to be involved as well! Would be helpful for Charcoal.

posth.me · April 23, 2026, 7:15am

I’d like to participate in the discussion. Liccium would implement the lexicon if it makes sense. I also left a comment on GitHub:

github.com/lexicon-community/lexicon

Comment by sposth - Add community.lexicon.preference.ai lexicon

main ← ngerakines/community.lexicon.preference.ai

As mentioned above, the work at the IETF is highly relevant – not only with regard to the vocabulary itself, but also the attachment mechanisms.

[https://datatracker.ietf.org/doc/draft-ietf-aipref-attach/](https://datatracker.ietf.org/doc/draft-ietf-aipref-attach/)
[https://datatracker.ietf.org/doc/draft-ietf-aipref-vocab/05/](https://datatracker.ietf.org/doc/draft-ietf-aipref-vocab/05/)

Following the IETF meeting in Toronto a couple of days ago, a new editor’s draft is expected soon. It will include several improvements, in particular on the discoverability of content that has been opted out in the context of search.

This is an important point. Many users may wish to opt out of AI training while still remaining discoverable in what the IETF may call “non-generative search” – meaning AI-assisted search that does not provide AI-generated summaries, synthetic answers, or other substitute outputs. In practice, this can be understood as a narrower form of the IETF Internet draft on display-based preferences mentioned already by @musicjunkieg.

The use cases around RAG and inference – where content is used by AI systems after model training – are expected to be discussed at a future IETF meeting, likely in late summer. That discussion should help clarify whether the emerging IETF vocabulary will provide meaningful value for creators and rightsholders who would like to have a say in how content is used by AI systems post-training.

Adding new AI preference expressions may be desirable; however, it should be considered whether AI model developers, AI system providers, or search engines will take them into account. I suggest a realistic approach in this regard.

A second point concerns the attachment mechanism. Should AI preferences be applied only as general account-level settings, or should they rather be attachable to individual posts and media assets? This distinction matters.

Content is frequently shared, quoted, or reposted by accounts that are not in a position to decide on rights reservations or permissions. For that reason, attaching such preferences solely at account level may create both practical and legal concerns. At Liccium, we are working on an asset-level approach in which AI preferences can be bound directly to the individual post and to the underlying media asset (blob) using ISCC fingerprints. This allows preferences to travel with the content itself, rather than depending only on the account or the platform through which it was shared.

bumblefudge.com · April 23, 2026, 8:19am

See previous discussion here

I see in Sebastian’s GitHub comment that the issue of where to express licensing and/or AI terms has come up. I believe account-level is a bit of a fig leaf, because the “unit of measure” for the web is still (and realistically will be for decades) the HTTP URL, not the at:// URI, and it is a “polite software” strategy to ask all webviews to properly crawl to, expand, and apply account-level metadata onto each display of an at:// URI…

The more extreme web3 position would be that if AI preferences and licensing terms aren’t signed by the private key corresponding to a public key in the DID Doc, you’re just inviting bad actors to impolitely ignore or strip that metadata, and you should keep these declarations not just in each URL but in each piece of verifiable data, full stop. That’s where I’m increasingly convinced ActivityPub needs to go as Client-to-Server inches its way to production, and I would suggest it here as well.

ngerakines.me · April 23, 2026, 1:25pm

Read through your comment. I don’t have any objections really. To bridge the gap between “we should” and “this is how”, can you provide some examples of what specifically you’d like changed or incorporated?

rude1.blacksky.team · April 24, 2026, 4:39pm

I am working on getting an opt-out feature delivered in my current development sprint.

I am not personally seeing any objection to adopting community.lexicon.preference.ai – from what I’m seeing folks either want to include other preference options or to expand the preferences to other record types beyond the account level.

That does not seem to preclude moving forward with this as-is, which would serve my short-term needs. These other use cases could then be addressed in ongoing conversations or follow-on efforts.

Is anyone opposed to merging the lexicon as is?

bmann.ca · April 25, 2026, 8:27am

Great call – the tradition is basically “does anyone have ‘over my dead body’ objections?” otherwise keep things moving.

pmcewen.bsky.social · April 25, 2026, 6:19pm

First, I think there are some issues with the proposal from a legal perspective: the language is very mixed as to whether it is just a suggestion about what a user prefers, or whether it’s trying to actually create some sort of legally enforceable set of constraints, or to allow users to give affirmative permission for their data to be used if a law were to require that.

Saying they are “user preferences” to me implies that it’s simply what the user prefers and is not even trying to be legally binding in any way.

However, the phrasing “These schemas allow users to declare how their public data may be used by external consumers” seems much stronger and implies that it is legally binding.

I am not a lawyer, so I can’t really offer any advice on whether this sort of thing could actually place enforceable legal obligations on external data consumers (and if it can, it almost certainly depends on the country). However, if that is the goal, it seems pretty important to have a lawyer review this and add what is likely to be many pages of additional legal statements.

For example, when it comes to defaults or situations where a user has indicated some preferences but not all, there may be a meaningful legal distinction in some jurisdictions. Perhaps users explicitly opting out confers additional legal penalties when violated, or legal use of the data requires users to take some affirmative action to consent beyond just accepting the defaults. The inability to distinguish between those cases could create problems.

If this is just a sort of for-fun set of suggestions that no one is even pretending might have any legal implications for the use of the data, then I think the language should make that explicit so that users understand this isn’t a basis for them to sue anyone.

Secondly, from the perspective of an AI builder who is trying in good faith to comply with these preferences, there is a lack of clarity that makes it extremely difficult. Most terms of service call for granting a license to the data in perpetuity to make things easy. That probably doesn’t work for this, but not having any listed time frame potentially makes it impossible to comply with. Most notably, what happens when a user changes their preferences to opt out of something that they had previously opted into?

For example, the lexicon could state clearly that a current user preference allowing their data to be used to train an AI model confers a license to train the model for 3 years from the date of access, and then the right to use the model in perpetuity after it has been trained. Shorter times obviously react more quickly to changing user preferences, but if they are too short, certain applications become much tougher, such as publishing research with an accompanying data set so that someone else can replicate the results.
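
To make the 3-year example concrete, here is a minimal sketch of how a consumer might check such a window. Nothing here comes from the lexicon: the window length is just the figure from the example, and the function names are hypothetical.

```python
from datetime import datetime, timezone

# Figure taken from the example above; the lexicon specifies no time frame.
TRAINING_WINDOW_YEARS = 3


def training_window_end(accessed_at: datetime,
                        years: int = TRAINING_WINDOW_YEARS) -> datetime:
    """Return the moment the hypothetical training license would lapse."""
    try:
        return accessed_at.replace(year=accessed_at.year + years)
    except ValueError:
        # accessed_at is Feb 29 and the target year is not a leap year
        return accessed_at.replace(year=accessed_at.year + years, day=28)


def may_start_training(accessed_at: datetime, now: datetime,
                       allowed_at_access: bool) -> bool:
    """True only if training was allowed when the data was fetched and the
    window counted from that access date has not yet closed."""
    return allowed_at_access and now < training_window_end(accessed_at)
```

Under this reading, a later opt-out stops new fetches but does not cut short a window that was opened while the preference was still "allowed". Whether that is the right trade-off is exactly the open question.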

There are plenty of bad actors who will ignore the user preferences no matter what, so if this has any worthwhile goal, it should be to allow someone to build something with AI that they can affirmatively say was done completely with the consent of the users who generated the data. Right now, I think it still falls short of that.

pmcewen.bsky.social · April 25, 2026, 6:37pm

Rereading this after posting my comment also makes me realize that in many legal contexts, it may not be the text of the lexicon documentation that matters, but the text that was shown to the user when they indicated their preferences. If every site that allows someone to change the data for this lexicon uses different language, it starts to really complicate things.

At minimum, it seems like the documentation for the lexicon needs to include the text to be shown to users when they indicate their preferences. But if that language is ever updated, since it would be shown on a network of different websites that no one can force to update, the lexicon should probably store some indication of which version of the text the user was shown, if not the entirety of the text itself.
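
One possible shape for that, sketched under assumptions: the merged lexicon has no such fields, so `noticeVersion` and `noticeSha256` below are hypothetical names, and the record is a plain dict rather than a real atproto record.

```python
import hashlib
from typing import Dict


def notice_fingerprint(text: str) -> str:
    """SHA-256 of the notice text, with whitespace runs collapsed so that
    line-wrapping differences between sites do not change the hash."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_preference_record(prefs: Dict[str, bool], notice_text: str,
                            notice_version: str) -> dict:
    """Bundle the chosen preferences with a reference to the wording the
    user actually saw (hypothetical field names)."""
    return {
        "preferences": dict(prefs),
        "noticeVersion": notice_version,
        "noticeSha256": notice_fingerprint(notice_text),
    }


def was_shown(record: dict, candidate_text: str) -> bool:
    """Check whether `candidate_text` matches the wording the record refers to."""
    return record["noticeSha256"] == notice_fingerprint(candidate_text)
```

Storing a hash rather than the full text keeps the record small, at the cost of needing the canonical notice texts published somewhere so that a hash can be matched back to its wording.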

ngerakines.me · April 25, 2026, 10:41pm

Thanks everyone. This has been merged and is now live!