Licensing terms for data in the Atmosphere

I’d like to raise a fun and interesting topic, and that’s intellectual property and content licensing.

The data that flies around the Atmosphere is all visible and public. Services in the Atmosphere are beginning to grow, and we’re starting to see lexicons and data for creative works that run longer than a microblog post. I think that the Standard.site lexicons are a great example.

But these creations lead to further questions. Who has permission to republish data that’s posted to the Atmosphere? Should we make the licenses under which data is published clearer?

Should lexicons like Standard.site include licensing fields? Should we be recommending particular licenses for publishing to open social ecosystems?

Additionally, who in the Free Culture ecosystem should we be tagging into this conversation? 😃

There was a proposal for user intent flags, à la robots.txt, that could feasibly be extended to more granular permission sets. I’m not entirely sure what the current status of this proposal is, though.

Would be nice to have this incorporated:

RSL enables multiple publishers or content owners to reference a shared set of licensing terms, such as open source or Creative Commons license frameworks, through the <standard> element. By pointing to a common license URL, websites can declare that their content is governed by the same standard agreement. When a client application (e.g., an AI company) accepts the terms of a standard license, it gains access to all content covered by RSL licenses that reference that shared URL through their <standard> element.

In addition to supporting established frameworks, the ability to easily define and implement shared licensing terms gives content owners the ability to unify their voices and negotiate collective licensing agreements with AI companies and other applications that require access to large collections of digital content, including websites, books, videos, and proprietary datasets.

They don’t make it very clear but there’s an option to express ‘AI (training) prohibited’ which is what I’d want for my content. Would’ve liked to have a slightly more nuanced prohibition that makes an exception for Digital Commons usage in research and academia though.

I can’t remember if I’ve put something out publicly, but I think media objects (blobs) in particular should usually have license and attribution metadata attached when designing lexicons.

Two metadata fields:

  • licenseUri: string, format=uri, optional. Standard/normalized URL (or short URI) for common licenses, or a pointer to a more bespoke license/policy page.
  • attribution: string, optional. More free-form field for listing a name or source of the media. Can also be a URI that links back to the original source.

These fields would go alongside things like alt-text. If there are multiple pieces of media in a record, it should be possible to attach different metadata to each (same as with alt-text).
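
A minimal sketch of how those two fields might sit next to alt-text, with separate metadata per media item. The `MediaItem` class, the snake_case attribute names, the `has_valid_license_uri` helper, and all sample values are hypothetical and only for illustration; a real lexicon would define the fields in its own schema.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class MediaItem:
    """One piece of media in a record (hypothetical shape, for illustration)."""
    blob_ref: str                      # placeholder standing in for the blob reference
    alt: Optional[str] = None          # alt-text
    license_uri: Optional[str] = None  # licenseUri: optional, format=uri
    attribution: Optional[str] = None  # attribution: optional, free-form name or URI


def has_valid_license_uri(item: MediaItem) -> bool:
    """True if licenseUri is absent (it is optional) or parses as a URI with a scheme."""
    if item.license_uri is None:
        return True
    return bool(urlparse(item.license_uri).scheme)


# Two media items in one record, each carrying its own metadata.
record_media = [
    MediaItem(
        blob_ref="example-blob-1",
        alt="A lighthouse at dusk",
        license_uri="https://creativecommons.org/licenses/by/4.0/",
        attribution="Example Photographer",
    ),
    MediaItem(blob_ref="example-blob-2", alt="A hand-drawn map"),
]
```

Keeping both fields optional means existing records without them stay valid, and the second item above simply carries no license claim.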

I wrote that proposal and still think it is the best way forward for public atproto records/repos in the general case. Media files (blobs) may need more granular metadata (see my sibling post in this discussion).

I registered a domain and created a demo for this system a while back: https://demo.user-intents.org/

I’d love to collaborate with folks on getting that project shipped and adopted in the ecosystem. There have been a bunch of AI and monetization debates which have made it hard to get a more generic system out; focusing on the non-AI parts might help move it forward.

I have a lot of thoughts and very little time right now. But some quick bullet points:

  • All of this is on uncertain legal ground right now, which makes it hard to give legal advice and harder still to give creative, optimistic legal advice.
  • I have talked to @bnewbold.net and his “less is more” instincts are fundamentally sound here. May make sense for those two simple fields to be a recommended baseline for new lexicons—it is imperfect but hard to go wrong at least starting with those two fields.
  • To the extent lexicon fields can be typed (I don’t know the answer to this?), it may be helpful to specify that the license field prefers the SPDX license names, since “it’s just a URL” often results in unparseable messes as people slap slightly mangled copies of licenses on their own web pages.
  • I suspect for some lexicons, it may be worth being specific about which fields are actually being licensed, as you might want (e.g.) identity to be tightly held but content to be broadly licensed.
  • RSL is interesting, but it is early and you probably don’t want to lock in a particular standard this early in the game. Flip side, though, I bet they’d see outreach from this community as an interesting and welcome test case.
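
To illustrate the SPDX point above, here is a rough sketch of normalizing a license field value to an SPDX identifier. The function name `to_spdx_id` and the tiny lookup table are assumptions made for this example; a real implementation would draw on the full SPDX license list.

```python
from typing import Optional

# Small illustrative table; a real implementation would use the full SPDX license list.
KNOWN_LICENSE_URLS = {
    "creativecommons.org/licenses/by/4.0": "CC-BY-4.0",
    "creativecommons.org/licenses/by-sa/4.0": "CC-BY-SA-4.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
}
KNOWN_SPDX_IDS = set(KNOWN_LICENSE_URLS.values()) | {"MIT", "Apache-2.0"}


def to_spdx_id(value: str) -> Optional[str]:
    """Map a license field value to an SPDX identifier, or None if unrecognized."""
    value = value.strip()
    if value in KNOWN_SPDX_IDS:
        return value
    # Drop the scheme, a leading "www.", and any trailing slash before the lookup.
    key = value.split("://", 1)[-1].lower().rstrip("/")
    if key.startswith("www."):
        key = key[len("www."):]
    return KNOWN_LICENSE_URLS.get(key)
```

A value that maps to `None` (for example a bespoke policy page) could still be stored as a plain URL; the sketch only shows how common licenses might be made machine-comparable.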

Hope these are some helpful notes. Will try to check back in again soon.

[For those of you who don’t know me, I’m a Licensing Guy - former Open Source Initiative board, counsel to Mozilla and WMF, currently Creative Commons board]

AI-pref is still in a limbo/derailment state at IETF, best i can tell, so i would recommend waiting to see where that data model lands rather than making something out of alignment with it. and yeah, doing this at the lexicon level probably makes sense.

i have been ruminating on how an iscc field in a lexicon would work, for content whose author(s) want its authorship, licensing, and/or ai-prefs to be “findable” (vector/perc hashing) across edits and metadata stripping… seems to be the same metadata pattern, and there too metadata at the lexicon/MASL level makes the most sense imho