Tolerance of the protocol to topology partitioning

News broke just a few hours ago that the US administration is calling in its generals from around the world to report. I don’t want to discuss that news in detail, but we are living in interesting times, and at the moment any scenario is on the table.

We saw with the Ukraine-Russia war that a modern engagement between major military powers will likely include extensive damage to both sides’ infrastructure.

Cable-cutting operations already pose a significant concern, and it is likely that in the event of an escalation between NATO and Russia, or between the US/AU and China (even just increased sabre-rattling), intercontinental communication will be disrupted.

I understand it might be too early to discuss this, but I’m wondering: what will go wrong if everything goes right?
If ATProtocol becomes the backbone of Web N.0, and the network topology is disrupted, can the protocol remain resilient?

Example: if we have an AppView whose experience can only be built by recomposing the sequence of all the blobs from the various PDSes, how can that experience be resilient to losing access to a quarter of the repos? Take a collaborative document editor where each edit is stored as a delta in the author’s repo: the AppView wouldn’t be able to recompose the final document state, so it would simply be broken.
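To make the failure mode concrete, here is a toy sketch (all names and the delta format are illustrative, not part of ATProto) of an AppView recomposing a document from per-user deltas, and what happens when one repo is unreachable:

```python
# Hypothetical sketch: an AppView rebuilding a collaborative document
# from per-user edit deltas. Each delta is (sequence_number, text_op).

def recompose(repos):
    """Apply every (seq, op) delta in global sequence order.

    A repo given as None stands for one that is unreachable,
    e.g. on the far side of a network partition.
    """
    deltas = []
    for repo in repos:
        if repo is None:
            raise RuntimeError("cannot recompose: missing repo")
        deltas.extend(repo)
    doc = ""
    for _, op in sorted(deltas):  # total order by sequence number
        doc += op
    return doc

alice = [(1, "Hello"), (3, "!")]
bob   = [(2, ", world")]

print(recompose([alice, bob]))  # → Hello, world!

# With a quarter of the repos gone, the sequence has gaps and the
# final state cannot be rebuilt at all:
try:
    recompose([alice, None])
except RuntimeError as err:
    print(err)  # → cannot recompose: missing repo
```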

If you take an equivalent Web 2.0 app, GDocs, I would expect Google to have replicas, if not at the edge, at least on geographically closer nodes (e.g. US East, US West, Central EU, AU). In fact, S3 offers Cross-Region Replication as a simple option. I would expect a Web 2.0 app to have some level of resiliency, perhaps enough to keep data accessible, or even to survive autonomously in a partitioned-topology scenario for some time.

cc @robin.berjon.com I know this is your bread and butter; it would be interesting to hear your thoughts if you find time to chip in…

2 Likes

While it’s possible to rely on the PDS alone for accessing data, I’m fairly certain that most AppViews are listening to the firehose and populating a local database with a copy of the records. They would then be able to back up and distribute that copy as usual.
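A minimal sketch of that pattern (the event shape here is illustrative, not the actual firehose wire format): consume a stream of repo events and mirror the records into a local store, so the AppView keeps serving, and can be backed up, even if some PDSes become unreachable.

```python
# Local mirror an AppView might maintain from the event stream.
local_db = {}

def handle_event(event):
    """Upsert or delete a record in the local index."""
    key = (event["did"], event["collection"], event["rkey"])
    if event["action"] in ("create", "update"):
        local_db[key] = event["record"]
    elif event["action"] == "delete":
        local_db.pop(key, None)

handle_event({
    "did": "did:plc:alice",
    "collection": "app.bsky.feed.post",
    "rkey": "1",
    "action": "create",
    "record": {"text": "hi"},
})
print(len(local_db))  # → 1
```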

2 Likes

I think using things that don’t exist yet doesn’t make this a good exercise :stuck_out_tongue:

Reminder that I work with the folks who make Automerge, a local-first document store built on CRDTs, with which one could build such a thing (unlike GDocs, which uses older Operational Transform tech). So for starters, I would build a thing that isn’t broken under partitions or changes in online/offline status.
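The reason CRDTs survive partitions, shown with a toy example (this is not Automerge’s actual API, just the simplest CRDT, a grow-only set): the merge operation is commutative, associative, and idempotent, so replicas that diverged while partitioned converge once they can sync again, in any order.

```python
# Toy CRDT: a grow-only set. Merging is just set union, which is
# commutative, associative, and idempotent.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)

    def merge(self, other):
        merged = GSet()
        merged.items = self.items | other.items
        return merged

# Two replicas accept edits independently while partitioned...
a, b = GSet(), GSet()
a.add("edit-1")
b.add("edit-2")

# ...and converge to the same state in either merge order.
assert a.merge(b).items == b.merge(a).items == {"edit-1", "edit-2"}
```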

But it is a good idea to think about what resilience looks like!

Right now we have:

  • multiple relays, with a 24-48 hour (72 hour?) cache of the entire network
  • I think mostly running the JSON version, not the signed version? → having the signed version that can be verified would be good → as brookie says:

TLDR

  • get a rotation key and back up your personal repo
  • get on an independent PDS and/or know how to migrate
  • edit your PDS env values to point to multiple relays
  • run a relay, either personally or with a small group
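For the "point at multiple relays" item, a hedged example: in the reference PDS distribution the env variable that lists relay hosts to notify is (at the time of writing) `PDS_CRAWLERS`; check your PDS version’s docs before relying on the exact name, and the second hostname below is a placeholder.

```shell
# In the PDS env file: comma-separated relays to request crawls from.
# relay.example.org is a hypothetical independent relay.
PDS_CRAWLERS=https://bsky.network,https://relay.example.org
```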

Realistically, it’s pretty easy to run a relay and we can keep spawning relays. A bit like running a Tor node?

Bigger asks:

  • backup moderation, including an approach to infrastructure moderation (illegal shit, but also attacks: DDoS, spam, etc.) → this is something that a group of operators could coordinate on
  • an app view you like that does backfill and index/search (e.g. Blacksky)

Doesn’t exist yet / wild fancies:

  • a mini client / AppView that fetches from your follows’ PDSes directly as a backup
  • mobile PDS (which is some proposed R&D for WG Eurosky I think?)
  • tailscale + PDS + tailscale-jetstream???
2 Likes

Will just note that Hegseth is first and foremost a propagandist, and this is probably another one of his bizarro self-aggrandizing performative stunts.

Regardless, I have been saying for months that we should do a rigorous risk analysis that includes both technical and legal/political threats, from the US as well as from China, Russia, Iran and other attack vectors.

I’d be super interested in working on that with folks: scoping it, and especially understanding how the political and technical intersect. And of course we are already acting on a number of these fronts. To Boris’s point, there are things folks can do on a personal level, but non-technical users aren’t likely to do them unless there’s a clear and simple recipe. Which might be one of the outcomes.

4 Likes

Agreed, I’m not speculating that this changes anything in the short term.
But we are only 10-20% of the way into this US administration’s term. I used to be concerned about climate change and the world in 2050-2100; I am now interested in understanding what 2028 will look like.

I was thinking:

  • multi-PLC as a strategic feature: if somebody cuts the cables now, it’s game over for the other side of the world
  • PLC service entries to support PDS replicas

Once both are in place, we can support a complete partitioning of the network.

If we want to discuss this we also need to agree on scenarios:

  • degraded network
  • partitioned network

For the latter, how do relay replication and VPN usage compensate for the situation?

1 Like

Isn’t that the way Tangled works currently?
What happens if we base AT infra on top of it and Tangled can’t reconcile repo states correctly?
(I was trying to avoid naming names…)

I know at least one major !bsky AppView is not doing that; it only follows its users’ activity and/or caches the last 24 hrs.

Besides, this wouldn’t work with some of the current private-data proposals, where users authorise AppViews after the data was initially created.

I mean, it’s complicated stuff, and we can iterate.
We just need to keep that in mind while we try to expand the protocol features, I guess.

Tangled uses regular git for its knots, which works fine under partitioned scenarios.

Sorry, it’s probably best to just use what we have today (bsky microblog content plus the other elements of the stack) as a base case. New apps are going to make their own trade-offs about their architecture and business logic.

Maybe step back and articulate what scenario you want to go over.

Ah, I see. This isn’t the way it was explained to me. Sorry for the confusion.

Do you think this conversation couldn’t inform future app architectures?

My initial thought was cable-cutting ops at scale that would completely separate the US/CA from the EU or AU.

I understand that it might be more nuanced due to the diversity of infrastructure that comprises the backbone, so I’m happy to discuss network degradation.

I think the people building them will take their requirements and use cases into consideration and design an appropriate architecture.

ATProto isn’t a magic substrate, much as we would want it to be. Distributed-systems trade-offs are everywhere.

So yeah, if you have a concrete app and use cases we might get somewhere but I’d prefer to focus on things that people are actually building / planning to build.

I think the people building them will take their requirements and use cases into consideration and design an appropriate architecture.

I know they do, but it still is an incredibly fragile infrastructure.

It might be too random for this forum, and sorry for the detour.

This is me surveying for the UK - Portugal leg of EIG like 20 years ago.

At galley time, we would have the clients and other interesting folks at the table, mostly an intersection of UK academia and the Navy, including a couple of people from subs.

We discussed many times the criticality of undersea infrastructure, which is basically defenseless: transoceanic cables are laid along the current swings at certain depths, and their locations and routes are roughly public knowledge. Subs are equipped with cable-cutting devices; 20 years ago the insider consensus was that it would take a few weeks, maybe a month, to basically halt Europe <> America comms.

It is true we have built resilience, but there are chokepoints like straits and cable landing stations. E.g. a single vessel sinking in the Red Sea disrupted 25% of Europe-India traffic (see “Submarine Cable Security at Risk Amid Geopolitical Tensions & Limited Repair Capabilities”).

Again, sorry for the detour.
I want to clarify that this is not your random basement-dweller schizophrenia, but concerns built on good knowledge of the current state of affairs.

Here is a chat with Gemini summarising a few things and adding some (old) insider knowledge: https://g.co/gemini/share/546ae10ac395
If you are interested, have a look at the last message. TL;DR: we agreed that a mid-to-worst-case scenario would mean cutting off 90% of Europe-America traffic in a matter of a few weeks, for a long or indefinite period of time.

I mentioned before redundancy at the PLC and PDS levels.
With hindsight, I’d suggest it at every other level too.

I love this community for the energy.
There is a ceiling on what you can achieve, though, with one instance of a service running on Hetzner fsn1, or a Raspberry Pi.
I have been in a WG where somebody mentioned a gov notification system for emergencies. That is something that should be built from the ground up with regional-outage resiliency in mind.

I think this is solvable.
AWS has a Well-Architected Framework, which I studied for my certificates, that teaches you how to build resiliency against single-region and multi-region failures. The same goes for the other hyperscalers.
Something I think could be useful is to draft something similar, a white paper to introduce these concepts to self-motivated kids who are jumping on board but don’t yet have the professional experience.
Something I really want to do sooner rather than later is give them services and templated solutions to avoid these hurdles.
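One of the basic patterns those frameworks teach, sketched here with hypothetical endpoints and a placeholder fetch function: a client that reads from an ordered list of regional replicas and falls over to the next region when one is partitioned away.

```python
# Illustrative multi-region read fallback. The hostnames and the
# fetch callable are placeholders, not real services.
REPLICAS = [
    "https://us-east.example.com",
    "https://eu-central.example.com",
    "https://ap-southeast.example.com",
]

def read_with_fallback(fetch, path):
    """Try each regional replica in order; raise only if all fail."""
    last_err = None
    for base in REPLICAS:
        try:
            return fetch(base + path)
        except Exception as err:
            last_err = err  # remember the failure, try next region
    raise RuntimeError("all regions unreachable") from last_err

# Simulated outage: the first region is partitioned, the second answers.
def fake_fetch(url):
    if url.startswith("https://us-east"):
        raise ConnectionError("partitioned")
    return "ok from " + url

print(read_with_fallback(fake_fetch, "/doc"))
# → ok from https://eu-central.example.com/doc
```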

Besides, I think some quick, immediate resolution would be:

BTW, I’m also happy to discuss what scenarios you wanted to solve with multiple relays and VPN solutions.