Taxonomy for ATmospheric Lexicon and App Directories

Hi Folks, Dorian again. You may remember me from such poasts as the one a few weeks ago about the protocol-agnostic PDS I plan to cook up. Well I’m back as I’ve been summoned by Christian to tease you all with another one: I know there’s been recent chatter about creating an app directory (or several?), so what might that look like?

Even better: what if the metadata that drove such an app directory was on the protocol—so anybody who wanted to make their own app directory could make one? What if we used that same metadata to give the same treatment to lexicons, so ATProto app developers could browse by theme or vertical?

App Listing Lexicon

So in the first instance, I’m imagining a lexicon that can express the data that goes into something like an app store page, but—and this is the part that’s got me vibing—also a lexicon for describing the concepts that make up the app directory’s taxonomy, along with how those concepts relate to each other. The result would be a set of concepts—represented as addressable data objects on the network—with organic consensus around what those concepts mean. These would form the elementary building blocks of a shared public asset which could be used in all sorts of applications.

Standard dot site all the things

What I’m proposing here, at root, is a set of lexicons, following the example of standard.site, that primarily ladder up to real-world user goals and are intentionally designed to span across apps. There are levels of scale to consider beyond a single taxonomy for a single app directory.

What’s really great about standard.site is it’s an ad-hoc group of companies who’ve gotten together to create a very simple pair of lexicons (so far) that express what amounts, to a first approximation, to the container and individual entries of an RSS feed—except in ATproto-ese. What this affords is anything reasonably document-shaped—blog, newsletter, forum post, or even old-fashioned book or news article—to be addressable on the protocol. They list the benefits thusly:

  • People writing indexers don’t have to wrangle a whole ménagerie of formats,
  • End users can move their content between providers,
  • No single entity owns the standard,
  • Governance (I paraphrase) by rough consensus and running code.

In that vein, I am imagining, in the first instance, an app listing lexicon. Maybe crib schema.org for inspiration. App authors could publish conforming data objects, and indexers could pick them up and read them. It’s easy to imagine an ecosystem (ratings, commentary, release notes, playthroughs, et cetera…) popping up around this core entity. A feature of this, though—and we can debate the merits of this—is that the app listing would also list what lexicons the app uses.
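
To make the app listing idea a bit more concrete, here is a minimal sketch of what such a record might carry. Everything in it is an assumption for illustration: the `AppListing` field names, the `example.tracker.*` NSIDs, and the `looksLikeNsid` check (which is deliberately rough and not the real NSID grammar). No such lexicon exists yet.

```typescript
// Hypothetical shape for an app listing record. Every field name is an
// assumption, loosely inspired by schema.org/SoftwareApplication.
interface AppListing {
  name: string;
  description: string;
  url: string;
  lexicons: string[]; // NSIDs of the lexicons the app uses
}

// Rough plausibility check only; this is NOT the full NSID grammar.
function looksLikeNsid(value: string): boolean {
  const segments = value.split(".");
  return (
    segments.length >= 3 &&
    segments.every((s) => /^[A-Za-z0-9][A-Za-z0-9-]*$/.test(s))
  );
}

// Returns a list of human-readable problems; an empty list means "looks fine".
function checkListing(listing: AppListing): string[] {
  const problems: string[] = [];
  if (listing.name.trim() === "") problems.push("name is empty");
  for (const nsid of listing.lexicons) {
    if (!looksLikeNsid(nsid)) problems.push("not a plausible NSID: " + nsid);
  }
  return problems;
}

// Hypothetical listing for the Strava-like app discussed in this post.
const trackerListing: AppListing = {
  name: "Example Tracker",
  description: "Records runs, rides, and paddles.",
  url: "https://tracker.example",
  lexicons: ["example.tracker.waypoint", "example.tracker.session"],
};
```

An indexer could run something like `checkListing` on each record it picks up before deciding whether to show it.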

This is all well and good, but if you wanted to get into the app indexing business, you might ask why you should care about what lexicons the app uses. Well, for one, developers will care what lexicons an app is introducing into the ecosystem, but I’m also going to argue that it’s actually the lexicons that are the optimal place to attach a very important piece of information: the thematic category(ies) to which the lexicon—and by extension, the app—belongs.

Let’s say you have an app like Strava, but it publishes its waypoint data to the protocol. First off, that lexicon (say, latitude, longitude, timestamp, previous waypoint) would be eminently reusable for any other app that traced a path through physical space, or otherwise dealt with that data. As such, we could imagine this lexicon being categorized under “geodata” or something. Now, let’s say this app also had some kind of “session” lexicon that maybe recorded the calories you burned or something. Well, that’s unambiguously a “health and fitness” lexicon. But, if you put the knowledge that the app uses these two lexicons together, you can pinpoint a much more specific category of app: a run/bike/paddle/ski/etc tracker.
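
The waypoint-plus-session example can be sketched in a few lines. This assumes that categories hang off lexicons and that a directory keeps its own "if an app covers these tags, it is also this" rules; the NSIDs, tag names, and the `CompositeRule` shape are all hypothetical.

```typescript
// Hypothetical lexicon NSIDs mapped to hypothetical category tags.
const lexiconCategories = new Map<string, string[]>([
  ["example.tracker.waypoint", ["geodata"]],
  ["example.tracker.session", ["health-and-fitness"]],
]);

// An app whose lexicons cover every tag in `requires` also earns `implies`.
interface CompositeRule {
  requires: string[];
  implies: string;
}

const compositeRules: CompositeRule[] = [
  { requires: ["geodata", "health-and-fitness"], implies: "activity-tracker" },
];

// Collect the tags of every lexicon the app uses, then apply composite rules.
function categorizeApp(appLexicons: string[]): Set<string> {
  const tags = new Set<string>();
  for (const nsid of appLexicons) {
    for (const tag of lexiconCategories.get(nsid) ?? []) tags.add(tag);
  }
  for (const rule of compositeRules) {
    if (rule.requires.every((t) => tags.has(t))) tags.add(rule.implies);
  }
  return tags;
}
```

An app that only used the waypoint lexicon would land in "geodata" alone; using both lexicons is what lets a directory infer the narrower "activity-tracker" bucket.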

Protocol-wide tagging taxonomy

The remaining component of this proposal to discuss is the thematic categories themselves. This is an opportunity to create a public good that could have wide-ranging ramifications throughout the entire ecosystem: the concepts themselves that make up the categorization scheme should be addressable data objects that live on the protocol. For one, there’s the mundane matter of synonyms and different representations—such as different languages—of the same concept. For another, we have the opportunity to assert consensus, through use by referencing it, what a given concept means. We can do the same for collections of concepts and the ways (broader, narrower, otherwise-related, not-to-be-confused with…) they relate to one another. In a sense what I’m proposing is a sort of protocol-wide, wikified dictionary/thesaurus, with the added benefit that the objects in question are all addressable, reusable, structured data.

Reality, of course, has a heck of a lot of detail, but this is something I have some experience with—namely SKOS. SKOS is a way to encode, publish, and share concept schemes. Unlike a conventional hierarchical taxonomy, SKOS is roughly set-theoretic, and its progenitors have already gamed out how concepts—and concept schemes—interact. You see it used in things like enterprise CMS products, and representing all sorts of taxonomies in libraries, ecommerce, and beyond. Note: I’m not suggesting, per se, that we use SKOS—this is ATProto after all—I’m just underscoring that there’s a mountain of prior art to draw from.

So to recap the outcomes: at the surface level we have something that satisfies both user and developer needs (finding apps and lexicons by thematic category), and at the deeper level we have a durable public good (a consensus-driven, structured dictionary/thesaurus) that can have all sorts of applications beyond organizing apps and lexicons. Underpinning both is a third, deeper level: a lexicon for concept schemes.

This multi-tier project has a lot of ways to engage, from designing the specs themselves to authoring/curating individual concepts. I know it’s probably a little hairy so I’ll follow up with a diagram once I decide how such a thing ought to look.


I’m still reading and absorbing this, but the very first thing this is getting me excited about is what it could mean for lexicon governance. Instead of ahead-of-time decision-making that repeatedly risks drifting into something gatekeepy, this opens up governance as an exercise in collective sense-making. We can look at what kinds of lexicon usage actually emerge on the network through reference and reuse, rather than trying to decide what should exist before people build it.

What I keep coming back to in my head is that most people who create lexicons won’t be trying to participate in any kind of governance. And honestly, that’s probably the best possible starting point for lexicon governance.

Your proposal treats massively messy lexicon creation as raw material rather than a failure mode. Overlapping lexicons, unexpected reuse, and even incompatible interpretations are all signals about how people are actually trying to solve problems. A system that can observe and relate those patterns can turn that mess into something legible without asking anyone to change how they work up front.

I also really like how this shifts the burden away from lexicon authors and onto interpreters. Indexers, directories, and curators can do the work of saying “these seem related,” or “these are often used together,” or “this cluster of lexicons points at a shared user goal.” This feels much healthier socially than trying to canonize certain schemas as correct or get agreement about them first. I’m not saying this as a gotcha, but ActivityPub could never, at least at this moment in its history. ATproto could point the way towards a form of protocol governance that is interpretive and iterative rather than preventative. I want that for ActivityPub. It would be amazing to prototype and test these ideas in ATprotocol.

I’m reminded of the extensible web manifesto here, especially the idea that platforms evolve best when meaning and practice emerge through use rather than decree.

Another thing I keep noticing is how this makes adjacent work more visible without requiring coordination up front. If someone can see other lexicons operating in the same conceptual neighborhood as theirs, that creates awareness. Sometimes that leads to collaboration. Sometimes it just changes how you think about your own work. Either way, it increases the chance that people working on related problems can find each other without needing a working group, a process, or a shared decision first.

And the more I sit with this, the more I keep thinking about ActivityPub. Over there, the units people extend are ActivityStreams object types, activities, and whatever vocabularies individual implementations end up layering on top. Advancing anything new through that layer is hard in ways that aren’t just technical. Work can get stuck on unresolved issues, and when no one is positioned to move them forward, it can start to feel personal even when it isn’t. But I digress.

Wait, no! I don’t digress! Social web governance has struggled precisely because it’s often been treated as technical. What you’re outlining here points toward governance that’s socially legible and therefore socially actionable. Go it alone or go it together, lexicon governance becomes a learning activity at every scale.

Interested in whether this resonates or not for everyone else here.


I kept going to pull quotes but just like, all of this. lol

Something really basic like a taxonomy of concepts has the potential to skew heavily social, because at a technical level there’s almost nothing to it: you have an identity with some labels (synonyms, language-tagged strings—not sure how those’ll be implemented), you might have a definition, and then you have a handful of semantic relations, and the concept scheme class to yoke them together. (Again, I’m using SKOS as a template for the actual structure itself.)

I just had a call yesterday with @schlage.town and @awarm.space, and they said they could totally use such a taxonomy for their tagging infrastructure. (I also spoke to @ronentk.me and @wesleyfinck.org and it would apply to Semble as well, though ostensibly a day early, as this idea wasn’t fully formed at the time.) Essentially anything that uses tags could tap into this infrastructure.

It could be quite the phenomenon if anybody with an AT identity (what are we calling these again, @christian.bsky.social?) could push to the network:

  • a concept
  • an amendment to a concept (definitions, synonyms, translations…)
  • a semantic relation between two concepts (A broader/narrower/related to B)
  • concept schemes/collections

…and then different authorities (apps, sites, whatever) can just select from these—even using entire other concept collections from other curators as tributaries—and the ones that get used will gain currency. That would be the new thing here.

I really have to say, after using SKOS for over a decade for various things, it’s really powerful to have a concept scheme handy—not only just as a receptacle for glossaries/thesauri, but also for inferencing. I use it for all sorts of things, like resolving audiences to content. But then, with SKOS there’s no inbuilt mechanism (bracketing the Web at large) for community involvement.

Also note, I can already see how this could be used for trolling/abuse/disinformation, so governance around what subset of these objects reach the surface is going to be key.


Love this. This feels aligned with community lexicons generally at a high level, especially if they are just a resource that is available rather than something prescriptive or authoritative. I personally have had a fixation lately on how we might make the idea of reviewing different kinds of niche content with specific needs as universally valuable as possible, so this also seems valuable to me through that lens, both in terms of reusability and in terms of discoverability.


Indeed. One dimension I’ve been thinking about with respect to this proposal is how concepts (as addressable records) can provide a template for cultivating resources in common of specifically structured information. What’s attractive about a taxonomy of concepts is that a) they’re durable in many important respects, or at least can be (though they go stale in interesting ways), and b) from a strictly technical perspective, there’s barely anything to them. This makes it useful as a proving ground for more complex social dynamics around shared information resources.

(Rather, it’s an opportunity to bring to bear our several decades of internet usership—and generations of lexicography and librarianship—to imbue such a resource with everything we know we know, so we can get to smoking out the pitfalls we aren’t aware of.)

I went to draw a diagram of this business, but figured I needed to think about what was being drawn more, and ended up with this (my kingdom for robust transclusion):

Open for notes on use cases, etc.

Story time

App Developer

  • wants to advertise the existence of their app
    • uses some deployment tool to publish an “app store entry” record to the protocol
  • also not averse to promulgating lexicons they have authored to other developers
    • tool also scrapes codebase for lexicons used
      • prompts user to tag any new lexicons they have contributed

Aggregator

  • wants to create an “app store” for atproto-enabled apps
    • trawls network for “app store page” records
  • wants to categorize for easy findability
    • has paid to research audiences and tailor top-level categories
    • can just execute the transitive closure over the tagged apps and lexicons to sort them
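
The "transitive closure" step for the aggregator might look roughly like this. It is only a sketch: the concept identifiers are made up, and the `broader` map stands in for whatever broader/narrower relations the concept scheme records end up carrying.

```typescript
// Concept -> its immediately broader concepts (hypothetical identifiers).
const broader = new Map<string, string[]>([
  ["running", ["fitness"]],
  ["fitness", ["health"]],
  ["geodata", []],
]);

// Every concept reachable by following "broader" edges from `start`.
// The `seen` set also guards against accidental cycles in the data.
function broaderClosure(start: string, edges: Map<string, string[]>): Set<string> {
  const seen = new Set<string>();
  const stack: string[] = [start];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    for (const parent of edges.get(current) ?? []) {
      if (!seen.has(parent)) {
        seen.add(parent);
        stack.push(parent);
      }
    }
  }
  return seen;
}

// Sort a tagged app or lexicon into the aggregator's own top-level categories.
function topLevelBuckets(tags: string[], topLevel: Set<string>): Set<string> {
  const buckets = new Set<string>();
  for (const tag of tags) {
    if (topLevel.has(tag)) buckets.add(tag);
    broaderClosure(tag, broader).forEach((ancestor) => {
      if (topLevel.has(ancestor)) buckets.add(ancestor);
    });
  }
  return buckets;
}
```

Something tagged "running" would then show up under a top-level "health" category without anybody having tagged it "health" directly.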

Leaflet/Semble/etc user

  • has just done a post; wants to tag it to categorize it
    • one of the tags couldn’t be resolved, so a new one with that label is created and added to their PDS
    • tag is added to their personal concept scheme (?)
      • user is invited to connect the tag to others
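
The "couldn’t be resolved, so mint a new one" flow could be sketched as below. The `Concept` shape, the `example.taxonomy.concept` collection name, and the slug used in place of a real record key are all assumptions made for illustration.

```typescript
interface Concept {
  uri: string;
  prefLabel: string;
  altLabels: string[];
}

// Normalize labels so "Geo  Data " and "geo data" compare equal.
function normalize(label: string): string {
  return label.trim().toLowerCase().replace(/\s+/g, " ");
}

// Match on the preferred label or on any alternate label.
function findByLabel(label: string, concepts: Concept[]): Concept | undefined {
  const wanted = normalize(label);
  return concepts.find(
    (c) =>
      normalize(c.prefLabel) === wanted ||
      c.altLabels.some((alt) => normalize(alt) === wanted)
  );
}

// Try the user's own scheme first, then concepts known from the network.
// If nothing matches, mint a new concept under the user's identity and add
// it to their personal scheme.
function resolveOrMint(
  label: string,
  network: Concept[],
  personalScheme: Concept[],
  did: string
): { concept: Concept; minted: boolean } {
  const existing =
    findByLabel(label, personalScheme) ?? findByLabel(label, network);
  if (existing !== undefined) return { concept: existing, minted: false };

  // A slug stands in for a real record key here (assumption).
  const rkey = normalize(label).replace(/[^a-z0-9]+/g, "-");
  const concept: Concept = {
    uri: `at://${did}/example.taxonomy.concept/${rkey}`,
    prefLabel: label.trim(),
    altLabels: [],
  };
  personalScheme.push(concept);
  return { concept, minted: true };
}
```

The `minted` flag is the hook for the last step in the story: it tells the client when to invite the user to connect the new tag to others.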

Aggregator/Leaflet/Semble/etc admin

  • wants to make their content more findable; better paired to audiences
  • manages their own concept scheme for UX (and QA) purposes
    • integrates entire concept schemes from trusted curators wholesale
    • also vetoes/deltas against them
  • this is somebody’s job at the company

Random curator/librarian/lexicographer

  • runs a website (or is a moderator on some community)
  • has cultivated a reputation as the definitive authority for a given niche (e.g. a fandom)
  • ultimately views it as a social duty but would be lying if they said they didn’t enjoy the cred
  • plus has Opinions™ about it
    • uses some heretofore-unseen tool to create a concept scheme
      • pulls in concepts from all over atproto in addition to writing their own
  • this is a non-trivial volunteer effort
    • evenings and weekends tweaking the entries and topology
      • (perhaps via some out-of-band input from their own community)

Drive-by contributor

  • wants to help the community, feel useful
  • or not even necessarily; even just motivated by the affordance
    • uses whatever tool is handy to add value to an individual concept
      • add synonym
      • add translation
      • add definition
      • add example
      • add semantic relation to another concept
        • propose removal of semantic relation
          • (destructive changes should not be unilateral outside your own scope of influence)
      • propose split
      • propose merge
  • these are structured micro-interventions, analogous to fixing a typo or adding a link in Wikipedia
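
One way to picture these micro-interventions is as a small, closed set of typed changes. The sketch below is an assumption throughout (type names, field names, and the rule itself): additive changes go straight through, while removals, splits, and merges become proposals unless the author is the curator of the scope being changed.

```typescript
type RelationKind = "broader" | "narrower" | "related";

// Each micro-intervention from the list above as one variant of a union.
type Intervention =
  | { kind: "addSynonym"; concept: string; label: string }
  | { kind: "addTranslation"; concept: string; lang: string; label: string }
  | { kind: "addDefinition"; concept: string; text: string }
  | { kind: "addExample"; concept: string; text: string }
  | { kind: "addRelation"; concept: string; relation: RelationKind; target: string }
  | { kind: "removeRelation"; concept: string; relation: RelationKind; target: string }
  | { kind: "split"; concept: string; into: string[] }
  | { kind: "merge"; concept: string; other: string };

// Treating split and merge as destructive is an assumption of this sketch.
function isDestructive(i: Intervention): boolean {
  switch (i.kind) {
    case "removeRelation":
    case "split":
    case "merge":
      return true;
    default:
      return false;
  }
}

// Destructive changes are only applied directly inside your own scope;
// everywhere else they are recorded as proposals for the curator.
function disposition(
  i: Intervention,
  author: string,
  curator: string
): "apply" | "propose" {
  if (!isDestructive(i)) return "apply";
  return author === curator ? "apply" : "propose";
}
```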

Record types

App store entry

  • this I think might actually be better off considered separately
    • in particular, things like ratings and testimonials that aren’t part of this record per se but will attach to it
    • plus all the attendant social dynamics around that
  • again, look at schema.org/SoftwareApplication for inspo

Concept

  • the whole idea of having a concept as an addressable record is to create a durable identifier off of which one can hang spongier things like text labels and definitions
    • again: synonyms, translations, acronyms/initialisms, other alternate labels…
  • moreover, attach via semantic relation to (the durable addresses of) other concepts
    • again: broader, narrower, related…
      • I would also add “not to be confused with”, ie when two concepts are related by the fact that they are often mistaken for one another, but nothing else
    • (although to do this in ATProto, I would be inclined to make the semantic relations part of the concept scheme/collection record rather than a property of concepts)
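
As a rough sketch of that last point, here is what it might look like if concepts carried only labels and definitions while the scheme record carried the relations. The record shapes and field names are assumptions modeled loosely on SKOS, not an existing lexicon; as in SKOS, "A broader B" is read as "B is a broader concept than A".

```typescript
interface Label {
  value: string;
  lang?: string; // e.g. "en", "fr"
}

// The concept itself: a durable address plus the spongier bits.
interface ConceptRecord {
  prefLabel: Label;
  altLabels: Label[];
  definition?: string;
}

type RelationKind = "broader" | "narrower" | "related" | "notToBeConfusedWith";

// Relations point at the durable addresses (URIs) of concepts.
interface Relation {
  subject: string;
  kind: RelationKind;
  object: string;
}

// The scheme, not the concept, owns the relations (the assumption being sketched).
interface ConceptSchemeRecord {
  title: string;
  members: string[];
  relations: Relation[];
}

// Concepts that this scheme considers narrower than `concept`.
// "A broader B" and "B narrower A" both mean A is narrower than B.
function narrowerThan(scheme: ConceptSchemeRecord, concept: string): string[] {
  const found = new Set<string>();
  for (const r of scheme.relations) {
    if (r.kind === "broader" && r.object === concept) found.add(r.subject);
    if (r.kind === "narrower" && r.subject === concept) found.add(r.object);
  }
  const out: string[] = [];
  found.forEach((uri) => out.push(uri));
  return out.sort();
}
```

Keeping relations in the scheme means two curators can disagree about how the same concepts relate without either of them touching the concept records.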

Concept scheme/collection

  • SKOS distinguishes between a concept scheme and an arbitrary collection of concepts
    • also ordered collections
    • not clear if this distinction is strictly necessary (modulo ordering) given some kind of “tributary” mechanism
    • one weird thing I have noticed, though, is collections (in SKOS) are orthogonal to concepts
      • ie, a collection with concepts as members is a different kind of thing and set of relationships than a concept and a bunch of immediately narrower concepts
      • a collection would be useful to yoke together a bunch of concepts that were found in a particular domain but not otherwise related to one another
  • in ATProto, any concept (or collection thereof) will implicitly have an author attached to it
    • it isn’t clear then what the distinction would be between a concept scheme and an arbitrary collection, other than for parity with SKOS.
    • SKOS concept schemes do have a notion of “top concepts”, which people tend to interpret to mean hierarchically broadest, though arguably one would be better off using them as genus-level entry points (after Lakoff).

Obvious issues

  • general abuse
    • spamming
    • trolling
    • brigading
  • astroturfing/manipulation
  • even people just mechanizing parts of this process to be “helpful”
    • even bigger issue now with LLMs
    • look at GitHub PR spam for a template
      • accounts write scripts to crawl repositories and issue thousands of trivial pull requests
      • they ostensibly do this to light up their “green grid”, in turn, presumably, to impress prospective employers

Analysis

  • if anybody can add a concept to the protocol, arbitrarily many will
    • concepts are invariably going to be more durable/reusable than posts
    • this makes it an attractive target for:
      • abuse and hate,
      • spam links,
      • trolling,
      • other garbage
    • that said, a concept can only really get meaningfully surfaced two ways:
      • tagged by a post (that itself gets seen)
      • being included in a concept scheme/collection (that gets currency)
  • there are invariably going to be fewer concept schemes/collections than individual concepts
    • (at least, one would hope?)
  • the A in AT stands for “authenticated”, ie every record has a progenitor
    • knowing who minted a concept is a useful (inverse?) reputation signal
      • (inverse, ie a minter’s known bad reputation is a signal to curators to ignore the concept)
  • the (labeled) indegree of a given concept will tell you:
    • how many times it is being used to tag things
    • how many concept schemes/collections it belongs to
    • (these numbers can of course both be juked, so the inputs would have to be cleaned)
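
The labeled indegree is cheap to compute once references are in hand. In this sketch the `Reference` shape and the two edge labels are assumptions, and the only "cleaning" done is counting each distinct source once, which is a long way short of real anti-gaming measures.

```typescript
// Two kinds of inbound reference to a concept (labels are hypothetical).
type RefKind = "tag" | "membership";

interface Reference {
  source: string; // URI of the post or concept scheme doing the referencing
  kind: RefKind;
  concept: string; // URI of the concept being referenced
}

// Count inbound references per label, counting each distinct source once.
function labeledIndegree(
  refs: Reference[],
  concept: string
): Record<RefKind, number> {
  const sources: Record<RefKind, Set<string>> = {
    tag: new Set<string>(),
    membership: new Set<string>(),
  };
  for (const ref of refs) {
    if (ref.concept === concept) sources[ref.kind].add(ref.source);
  }
  return { tag: sources.tag.size, membership: sources.membership.size };
}
```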

Mitigations

  • individual concepts get legitimated by being included into concept schemes
  • if you were curating a composite concept scheme, you could curate your tributaries and effectively delegate to them
    • if they in turn use tributaries, you could veto those relationships wholesale
    • (there may need to be some negotiation process among curators for eliminating circularities)
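
A composite scheme with tributaries and vetoes could be resolved along these lines. The `Scheme` shape is hypothetical, and the sketch makes one simplifying assumption: only the root curator’s vetoes are honored, and they apply at any depth. Circular tributary relationships are tolerated here by visiting each scheme at most once rather than negotiated away.

```typescript
interface Scheme {
  id: string;
  concepts: string[]; // concepts the curator included directly
  tributaries: string[]; // ids of other schemes pulled in wholesale
  vetoedTributaries: string[]; // scheme ids this curator refuses, at any depth
  vetoedConcepts: string[]; // individual concepts this curator refuses
}

// Flatten a composite scheme into the set of concepts it surfaces.
function resolveScheme(rootId: string, schemes: Map<string, Scheme>): Set<string> {
  const result = new Set<string>();
  const root = schemes.get(rootId);
  if (root === undefined) return result;

  const blockedSchemes = new Set<string>(root.vetoedTributaries);
  const blockedConcepts = new Set<string>(root.vetoedConcepts);
  const visited = new Set<string>();
  const queue: string[] = [rootId];

  while (queue.length > 0) {
    const id = queue.shift() as string;
    // Skip vetoed schemes and anything already seen (breaks cycles).
    if (visited.has(id) || blockedSchemes.has(id)) continue;
    visited.add(id);
    const scheme = schemes.get(id);
    if (scheme === undefined) continue;
    for (const c of scheme.concepts) {
      if (!blockedConcepts.has(c)) result.add(c);
    }
    for (const t of scheme.tributaries) queue.push(t);
  }
  return result;
}
```

Vetoing a scheme also cuts off everything that is only reachable through it, which matches the "wholesale" veto described above.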

Yes! One specific thing that I intend to explore with drydown.social is letting users create their own collections of house/producer and fragrance/product records for autocomplete, and also follow other users so they can reference their data as well.

What I am hoping for is a proof of concept that people can create their own network of collections to mirror something like MusicBrainz or TheTVDB for any kind of niche interest and without that collection being concentrated under one centralized owner. That’s actually my biggest motivation now for focusing on review-related lexicons.
