Tolerance of the protocol to topology partitioning

News broke just a few hours ago that the US administration is calling in its generals from around the world to report. I don’t want to discuss that news in detail, but we are living in interesting times, and at the moment any scenario is on the table.

We saw with the Ukraine-Russia war that a modern engagement between major military powers will likely include extensive damage to both sides’ infrastructure.

Cable-cutting operations already pose a significant concern, and it is likely that in the event of an escalation between NATO and Russia, or between the US/AU and China (even just increased sabre-rattling), intercontinental communication will be disrupted.

I understand it might be too early to discuss this, but I’m wondering: what will go wrong if everything goes right?
If ATProtocol becomes the backbone of Web N.0, and the network topology is disrupted, can the protocol remain resilient?

Example: if we have an AppView whose experience can only be built by recomposing the sequence of all the blobs from the various PDSes, how can that experience be resilient to losing access to a quarter of the repos? Take a collaborative document editor where each edit is stored as a delta in the author’s repo: the AppView wouldn’t be able to recompose the final document state, so it would simply be broken.
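To make the failure mode concrete, here is a toy sketch (all names and the delta format are illustrative, not part of ATProto) of an AppView recomposing a document from per-user deltas, and what happens when one repo is unreachable:

```python
# Hypothetical sketch: an AppView rebuilding a collaborative document
# from per-user edit deltas. Each delta is (sequence_number, text_op).

def recompose(repos):
    """Apply every (seq, op) delta in global sequence order.

    A repo given as None stands for one that is unreachable,
    e.g. on the far side of a network partition.
    """
    deltas = []
    for repo in repos:
        if repo is None:
            raise RuntimeError("cannot recompose: missing repo")
        deltas.extend(repo)
    doc = ""
    for _, op in sorted(deltas):  # total order by sequence number
        doc += op
    return doc

alice = [(1, "Hello"), (3, "!")]
bob   = [(2, ", world")]

print(recompose([alice, bob]))  # → Hello, world!

# With a quarter of the repos gone, the sequence has gaps and the
# final state cannot be rebuilt at all:
try:
    recompose([alice, None])
except RuntimeError as err:
    print(err)  # → cannot recompose: missing repo
```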

If you take an equivalent Web 2.0 app, GDocs, I would expect Google to have replicas, if not at the edge, at least on geographically closer nodes (e.g. US East, US West, Central EU, AU). In fact, S3 offers Cross-Region Replication as a simple option. I would expect a Web 2.0 app to have some level of resiliency, perhaps enough to keep data accessible, or even to survive autonomously in a partitioned-topology scenario for some time.

cc @robin.berjon.com I know this is your bread and butter; it would be interesting to hear your thoughts if you find time to chip in…

2 Likes

While it’s possible to rely on the PDS alone for accessing data, I’m fairly certain that most AppViews are listening to the firehose and populating a local database with a copy of the records. They would then be able to back up and distribute that copy as usual.
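A minimal sketch of that pattern (the event shape here is illustrative, not the actual firehose wire format): consume a stream of repo events and mirror the records into a local store, so the AppView keeps serving, and can be backed up, even if some PDSes become unreachable.

```python
# Local mirror an AppView might maintain from the event stream.
local_db = {}

def handle_event(event):
    """Upsert or delete a record in the local index."""
    key = (event["did"], event["collection"], event["rkey"])
    if event["action"] in ("create", "update"):
        local_db[key] = event["record"]
    elif event["action"] == "delete":
        local_db.pop(key, None)

handle_event({
    "did": "did:plc:alice",
    "collection": "app.bsky.feed.post",
    "rkey": "1",
    "action": "create",
    "record": {"text": "hi"},
})
print(len(local_db))  # → 1
```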

2 Likes

I think using things that don’t exist yet doesn’t make this a good exercise :stuck_out_tongue:

Reminder that I work with the folks who make Automerge, a local-first document store built on CRDTs, with which one could build such a thing (unlike GDocs, which uses older Operational Transform tech). So for starters, I would build a thing that isn’t broken under partitions or changes in online/offline status.
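The reason CRDTs survive partitions, shown with a toy example (this is not Automerge’s actual API, just the simplest CRDT, a grow-only set): the merge operation is commutative, associative, and idempotent, so replicas that diverged while partitioned converge once they can sync again, in any order.

```python
# Toy CRDT: a grow-only set. Merging is just set union, which is
# commutative, associative, and idempotent.
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, x):
        self.items.add(x)

    def merge(self, other):
        merged = GSet()
        merged.items = self.items | other.items
        return merged

# Two replicas accept edits independently while partitioned...
a, b = GSet(), GSet()
a.add("edit-1")
b.add("edit-2")

# ...and converge to the same state in either merge order.
assert a.merge(b).items == b.merge(a).items == {"edit-1", "edit-2"}
```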

But it is a good idea to think about what resilience looks like!

Right now we have:

  • multiple relays, with a 24-48 hour (72 hour?) cache of the entire network
  • I think mostly running the JSON version, not the signed version? → having the signed version that can be verified would be good → as brookie says:

TLDR

  • get a rotation key and back up your personal repo
  • get on an independent PDS and/or know how to migrate
  • edit your PDS env values to point to multiple relays
  • run a relay, either personally or with a small group
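For the "point at multiple relays" item, a hedged example: in the reference PDS distribution the env variable that lists relay hosts to notify is (at the time of writing) `PDS_CRAWLERS`; check your PDS version’s docs before relying on the exact name, and the second hostname below is a placeholder.

```shell
# In the PDS env file: comma-separated relays to request crawls from.
# relay.example.org is a hypothetical independent relay.
PDS_CRAWLERS=https://bsky.network,https://relay.example.org
```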

Realistically, it’s pretty easy to run a relay and we can keep spawning relays. A bit like running a Tor node?

Bigger asks:

  • backup moderation, including an approach to infrastructure moderation (illegal shit, but also attacks: DDoS, spam, etc.) → this is something that a group of operators could coordinate on
  • an app view you like that does backfill and index/search (e.g. Blacksky)

Doesn’t exist yet / wild fancies:

  • a mini client / AppView that fetches from your follows’ PDSes directly as a backup
  • mobile PDS (which is some proposed R&D for WG Eurosky I think?)
  • tailscale + PDS + tailscale-jetstream???
2 Likes

Will just note that Hegseth is first and foremost a propagandist, and this is probably another one of his bizarro self-aggrandizing performative stunts.

Regardless, I have been saying for months that we should do a rigorous risk analysis that includes both technical and legal/political threats, from the US as well as from China, Russia, Iran and other attack vectors.

I’d be super interested in working on that with folks: scoping it, and especially understanding how the political and technical intersect. And of course we are already acting on a number of these fronts. To Boris’s point, there are things folks can do on a personal level, but non-technical users aren’t likely to do them unless there’s a clear and simple recipe. Which might be one of the outcomes.

4 Likes

Agreed, I’m not speculating that this changes anything in the short term.
But we are only 10-20% of the way into this US administration’s term. I used to be concerned about climate change and the world in 2050-2100; I am now interested in understanding what 2028 will look like.

I was thinking:

  • multi-PLC as a strategic feature: if somebody cuts the cables now, it’s game over for the other side of the world
  • PLC service entries to support PDS replicas

Once both are in place, we can support a complete partitioning of the network.

If we want to discuss this we also need to agree on scenarios:

  • degraded network
  • partitioned network

For the latter, how do relay replication and VPN usage compensate for the situation?

1 Like

Isn’t that the way Tangled works currently?
What happens if we base AT infra on top of it and Tangled can’t reconcile repo states correctly?
(I was trying to avoid naming names…)

I know at least one major !bsky AppView is not doing that; it only follows its users’ activity and/or caches the last 24 hrs.

Besides, this wouldn’t work with some of the current private-data proposals, where users authorise AppViews after the data was initially created.

I mean, it’s complicated stuff, and we can iterate.
We just need to keep that in mind while we try to expand the protocol features, I guess.

Tangled uses regular git for its knots, which works fine under partitioned scenarios.

Sorry, it’s probably best to just use what we have today (bsky microblog content plus the other elements of the stack) as a base case. New apps are going to make their own trade-offs about their architecture and business logic.

Maybe step back and articulate what scenario you want to go over.

Ah, I see. This isn’t the way it was explained to me. Sorry for the confusion.

Do you think this conversation couldn’t inform future app architectures?

My initial thought was cable-cutting ops at scale that would completely separate the US/CA from the EU or AU.

I understand that it might be more nuanced due to the diversity of infrastructure that comprises the backbone, so I’m happy to discuss network degradation.

I think the people building them will take their requirements and use cases into consideration and design an appropriate architecture.

ATProto isn’t a magic substrate, much as we would want it to be. Distributed-systems trade-offs are everywhere.

So yeah, if you have a concrete app and use cases we might get somewhere but I’d prefer to focus on things that people are actually building / planning to build.

I think the people building them will take their requirements and use cases into consideration and design an appropriate architecture.

I know they do, but it still is an incredibly fragile infrastructure.

It might be too random for this forum, and sorry for the detour.

This is me surveying for the UK - Portugal leg of EIG like 20 years ago.

At galley time, we would have the clients and other interesting folks at the table, mostly an intersection of UK academia and the Navy, including a couple of people from subs.

We discussed many times the criticality of undersea infrastructure, which is basically defenseless: transoceanic cables are laid along the current swings at certain depths, and their locations and routes are roughly public knowledge. Subs are equipped with cable-cutting devices; 20 years ago the insider consensus was that it would take a few weeks, maybe a month, to basically halt Europe <> America comms.

It is true we have built resilience, but there are chokepoints like straits and cable landing stations. E.g. a single vessel sinking in the Red Sea disrupted 25% of Europe-India traffic (see “Submarine Cable Security at Risk Amid Geopolitical Tensions & Limited Repair Capabilities”).

Again, sorry for the detour.
I want to clarify that this is not your random basement-dweller schizophrenia, but concerns built on good knowledge of the current state of affairs.

Here is a chat with Gemini summarising a few things and adding some (old) insider knowledge: https://g.co/gemini/share/546ae10ac395
If you are interested, have a look at the last message. TL;DR: we agreed that a mid-to-worst-case scenario would mean cutting off 90% of Europe-America traffic in a matter of a few weeks, for a long or indefinite period of time.

I mentioned before redundancy at the PLC and PDS levels.
With hindsight, I’d suggest it at every other level too.

I love this community for the energy.
There is a ceiling on what you can achieve, though, with one instance of a service running on Hetzner fsn1, or a Raspberry Pi.
I have been in a WG where somebody mentioned a gov notification system for emergencies. That is something that should be built from the ground up with regional-outage resiliency in mind.

I think this is solvable.
AWS has a Well-Architected Framework, which I studied for my certificates, that teaches you how to build resiliency against single-region and multi-region failures. The same goes for the other hyperscalers.
Something I think could be useful is to draft something similar, a white paper to introduce these concepts to self-motivated kids who are jumping on board but don’t yet have the professional experience.
Something I really want to do sooner rather than later is give them services and templated solutions to avoid these hurdles.
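One of the basic patterns those frameworks teach, sketched here with hypothetical endpoints and a placeholder fetch function: a client that reads from an ordered list of regional replicas and falls over to the next region when one is partitioned away.

```python
# Illustrative multi-region read fallback. The hostnames and the
# fetch callable are placeholders, not real services.
REPLICAS = [
    "https://us-east.example.com",
    "https://eu-central.example.com",
    "https://ap-southeast.example.com",
]

def read_with_fallback(fetch, path):
    """Try each regional replica in order; raise only if all fail."""
    last_err = None
    for base in REPLICAS:
        try:
            return fetch(base + path)
        except Exception as err:
            last_err = err  # remember the failure, try next region
    raise RuntimeError("all regions unreachable") from last_err

# Simulated outage: the first region is partitioned, the second answers.
def fake_fetch(url):
    if url.startswith("https://us-east"):
        raise ConnectionError("partitioned")
    return "ok from " + url

print(read_with_fallback(fake_fetch, "/doc"))
# → ok from https://eu-central.example.com/doc
```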

Besides, I think some quick, immediate resolution would be:

BTW, I’m also happy to discuss what scenarios you wanted to solve with multiple relays and VPN solutions.