Localization on AT Protocol

Thinking globally, acting locally, and on AT Protocol.

What Is Localization?

Localization is the process of, and the technology for, providing content in a locally relevant language.

Localization may be one of the most important sets of standards on the internet.

In the early days of the web, localization was pretty problematic. Most sites were simply assumed to be English, which gave English a structural advantage in computing (at a cost to the rest of the world).

This advantage lasted from 1991 to 1999, and those eight years have perpetuated a fair amount of structural inequality in the world.

In 1999, RFC 2616 was standardized, providing a way to canonically describe languages and negotiate with a server to get the right language.

The Accept-Language header lets the client state which languages it accepts.
The Content-Language header declares the language of the content the server returns.

Accept-Language takes a comma-delimited list of acceptable languages, each optionally followed by a semicolon and a desired quality weight:

Accept-Language: da, en-gb;q=0.8, en;q=0.7

Per the RFC’s example, this means: I will accept Danish first, then British English, then any other English, weighting British English at quality 0.8 and any other English at 0.7.
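As a concrete illustration, here is a minimal sketch (my own, not from the RFC) of parsing that header into weighted tags in JavaScript:

```javascript
// Minimal sketch: parse an Accept-Language header into tags sorted by
// quality weight. Tags without an explicit q default to q=1.
function parseAcceptLanguage(header) {
  return header
    .split(',')
    .map((part) => {
      const [tag, ...params] = part.trim().split(';');
      const qParam = params.find((p) => p.trim().startsWith('q='));
      const q = qParam ? parseFloat(qParam.trim().slice(2)) : 1;
      return { tag: tag.trim(), q };
    })
    .sort((a, b) => b.q - a.q);
}

console.log(parseAcceptLanguage('da, en-gb;q=0.8, en;q=0.7'));
// [ { tag: 'da', q: 1 }, { tag: 'en-gb', q: 0.8 }, { tag: 'en', q: 0.7 } ]
```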

All of this has been standard for a very long time.

Unfortunately, at present, AT Protocol does not generally provide a locale within any lexicon.

The few lexicons that currently provide a locale actually only allow a slim portion of this standard:

They allow a two-letter language code.

This is quite a bit less than ideal.

Why?

Because dialects vary wildly.

For example, there are 48 French dialects, 28 Arabic dialects, and 9 Chinese dialects. A native speaker of Mandarin may not be fluent in Cantonese.

Locales also provide key information about local relevance. For example, if I am looking for news content in en-US, I am likely to prefer it over content in en-GB.

In short, **locale really matters to billions of people.**
Unfortunately, AT Protocol does not currently support those billions of people well.

What Can We Do?

app.bsky.feed.post

The two-letter langs field in an app.bsky.feed.post is not sufficient to solve many of these problems. IMO, it could be, simply by expanding the list of acceptable options: allow any valid locale string. This is supported cleanly by JavaScript, .NET, and most programming languages worth their salt.

Ideally, it would be named “locale”, not “langs”, but that ship has likely sailed.
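If a lexicon does accept arbitrary locale strings, validating them is cheap. A minimal sketch using JavaScript’s built-in Intl API (the function name is my own):

```javascript
// Sketch: validate a BCP 47 locale string with the built-in Intl API.
// Intl.getCanonicalLocales throws a RangeError on structurally invalid tags.
function isValidLocale(tag) {
  try {
    Intl.getCanonicalLocales(tag);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidLocale('en-US')); // true
console.log(isValidLocale('not a locale!')); // false
```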

newer lexicons

Newer lexicons can easily fix this by providing a standard field.

Again, I believe this should probably be called locale, but language(s) is acceptable.

What’s important is that it accepts the full range of valid language tags.

This list is available here: List of ISO 639 language codes - Wikipedia

You can also get a local copy of this list in PowerShell by using the .NET CultureInfo class:

[System.Globalization.CultureInfo]::GetCultures('AllCultures')

app views

IMO, app views should honor the Accept-Language header and provide a Content-Language header matching the currently viewed content.

You can check the user’s preferred languages with the navigator.languages array (or navigator.language for the primary choice) in JavaScript.
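For example (a small sketch; the fallback value is an assumption for non-browser environments):

```javascript
// Browser sketch: navigator.languages is the user's ordered preference
// list; navigator.language is the first (most preferred) entry.
const preferred =
  typeof navigator !== 'undefined' && navigator.languages
    ? navigator.languages // e.g. ['en-US', 'en', 'fr']
    : ['en-US']; // assumed fallback when run outside a browser

console.log(preferred[0]); // the primary locale
```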

Why We Should Care

Overall, the AT Protocol is trying to build a brighter, better tomorrow for the internet.

The first time the internet got this wrong, it impacted the rate of adoption for the internet across the world, and hurt billions of people by making them wait for prosperity to reach them.

This structural inequality still shapes the mix of programmers and internet users to this day.

In my opinion, if we want to build a brighter tomorrow atop the AT Protocol, it is critical that we get this as right as we can, as soon as we can.

The sooner we do this right, the sooner the rest of the world can enjoy a bright blue sky!

Please do your part to make a better web for everyone (not just everyone who speaks English).

8 Likes

Oh, I’ve actually done some work with this! I should note that language subtags are supported on atproto - en-us and en-gb, for example, are fully supported! The main thing that’s lacking is a good in-lexicon format for translatable strings, which I’ve attempted to make for my project Tsunagite. Sadly, Lexicon doesn’t have a good way to represent objects with effectively arbitrary keys, so I’ve had to make it not properly validate while still describing the features human-readably. You can check out that schema here: schema/translatable.json at main · tsunagite.dev/tsunagite · tangled
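(For readers unfamiliar with the pattern: a translatable-string object typically maps locale tags to strings. The shape and helper below are a hypothetical illustration of that general pattern, not Tsunagite’s actual schema.)

```javascript
// Hypothetical shape for a translatable string: BCP 47 tags as keys.
const title = {
  'en-US': 'Hello, world',
  'fr-FR': 'Bonjour le monde',
  'zh-Hans': '你好，世界',
};

// Pick the best match from the user's preference list, with a fallback.
function pickTranslation(translations, preferred, fallback = 'en-US') {
  for (const tag of preferred) {
    if (tag in translations) return translations[tag];
  }
  return translations[fallback];
}

console.log(pickTranslation(title, ['fr-FR', 'en-US'])); // 'Bonjour le monde'
```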

3 Likes

This sounds like a place where the language could be taken into account when fetching a record, perhaps even as part of an at-uri expansion to include query params for various things, especially in read-only cases. For example, how might one share a link to a specific language, a specific version of a record, or a page of search results?

Fantastic to hear that this is technically supported at present! Let’s all do our part to socialize it!

The original prompting for this post was standard.site (which does not currently have a language/locale).

Your lexicon looks really cool!

I personally would love to help establish a great lexical standard here so we can all avoid having to re-run translations all the time (this sort of thing is a bit of death by 1000 papercuts in tech [especially when these days we’re asking LLMs to do it]).

Thanks for clarifying and providing feedback! Please keep it coming / keep on making cool tools!

2 Likes

@mrpowershell.com being a native French speaker, I wholeheartedly agree with your point. Have you contacted the standard.site folks yet about this?

The language string format isn’t limited to two letters. It’s full IETF Language Tags: Lexicon - AT Protocol

The app.bsky.feed.post lexicon does limit the langs property to a maximum of three valid languages, but considering a post’s max length is 300 graphemes, you’re probably only going to fit at most two distinct languages in a post anyway. They could increase that maximum if necessary, though.

1 Like

Thanks for clarifying! Someone else had mentioned this as well.

Two notes:

First, this is primarily a notice to all lexicons for why they should support it. Any that already do are great. Always happy to be wrong and pleasantly surprised by better support than is shown in metadata.

Second, the reason for my presumption is that I’ve never seen anyone use more than two-letter codes on Bluesky. :person_shrugging: Maybe the appView/client is removing this information and only leaving two letters. :person_shrugging:

Thank you for correcting the record. Can anyone confirm whether we need to correct the app view to make better use of it?

Yep! The standard.site people asked me to write this post.

Please tell me if I have made a good case.

( I feel slightly awkward advocating for localization while only speaking English )

1 Like

As @thisismissem.social correctly points out, any BCP 47 language tag is valid in the langs field of a Bluesky post.

Unfortunately, the Bluesky app currently only supports languages with two-letter ISO 639 codes.

There is a social-app PR open to add support for three-letter language codes, which would enable posts to be tagged in a much greater number of languages:

(h/t @psingletary.com for pointing me to this thread)

2 Likes

Fantastic News!

Thanks for pointing me to the PR.

Please give it a thumbs up (so it may hopefully gain traction).

1 Like