IPFS + ATProto for Archival Purposes

Continuing the discussion from Offer to help with transcripts of Seattle 2025 ATmosphereConf:

Hey @maboa.bsky.social starting this as a new thread (and going to move your post here)

@bmann.ca @blaine.bsky.social This is tangential, but I wrote a bit about the potential of transcripts + atproto to help make archival content more resilient. I’ve not publicised it yet, but you can find it here: Resilient Archives on the AT Protocol. Here’s a heads-up that I’ve credited your work, and I’d welcome any corrections/feedback before pushing it wider. I will probably start with Working Groups > WG Private Data. I’m particularly interested in whether you think IPFS can fit into the equation. Either way, it could be a good area of discussion.

Regarding adding the rest of the content, I have a mechanism in place to pull the data from ionosphere.tv and match with atproto conference videos I downloaded from YouTube. We can use these to test the concept, while we wait for the original media to be restored.

I’ve also made some improvements to the Hyperaudio Lite Editor and aim to allow a transcript export in the ionosphere format, which may be useful moving forward.

So, I don’t know how much you’ve poked at the IPFS that already exists within ATProto.

The IPFS Foundation has sponsored the last two conferences and has been working through a number of emerging specs that include how atproto works.

Example: Matadisco: Can We Bootstrap Public Data Discovery with ATProto?

DASL https://dasl.ing/ - which has a bunch more emerging standards and

Join CID Congress for ongoing meetups CID Congress · Events Calendar

Blobs are content addressed. One could very well imagine republishing blobs across repos and layering on some search / discovery to get it delivered from different repos.
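As a rough illustration of what "content addressed" means here, the following is a minimal Python sketch that derives a CID from the bytes of a blob. It assumes a CIDv1 with the raw codec and a sha2-256 multihash, rendered as lowercase base32; the function name `blob_cid` is made up for this example and is not part of any atproto library.

```python
import base64
import hashlib

CID_VERSION = 0x01   # CIDv1
RAW_CODEC = 0x55     # multicodec code for raw bytes
SHA2_256 = 0x12      # multihash code for sha2-256


def blob_cid(data: bytes) -> str:
    """Return a CIDv1 (raw codec, sha2-256) as a base32 multibase string.

    Illustrative helper only. All four prefix values are below 0x80,
    so each one fits in a single varint byte.
    """
    digest = hashlib.sha256(data).digest()
    # Layout: <version><codec><hash code><digest length><digest>
    cid_bytes = bytes([CID_VERSION, RAW_CODEC, SHA2_256, len(digest)]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded  # "b" is the multibase prefix for lowercase base32


if __name__ == "__main__":
    original = b"conference talk video bytes"
    copy_elsewhere = bytes(original)  # the same bytes held by another repo
    print(blob_cid(original))
    print(blob_cid(original) == blob_cid(copy_elsewhere))
```

Because the identifier depends only on the bytes, the same blob republished in a different repo keeps the same CID, which is what makes the discovery layer described above plausible.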

Exports of account repos are CAR files
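For anyone who wants to poke at this, here is a small hedged sketch that only builds the request URLs for a repo export and for a single blob, using the `com.atproto.sync.getRepo` and `com.atproto.sync.getBlob` XRPC methods. The host `pds.example.com`, the DID, and the CID placeholder are invented for illustration, and the helper names are not from any SDK.

```python
from urllib.parse import urlencode


def xrpc_url(pds_host: str, method: str, **params: str) -> str:
    """Build an XRPC query URL for a given PDS host (illustrative helper)."""
    return f"https://{pds_host}/xrpc/{method}?{urlencode(params)}"


def repo_export_url(pds_host: str, did: str) -> str:
    """URL whose response body is expected to be the account repo as a CAR file."""
    return xrpc_url(pds_host, "com.atproto.sync.getRepo", did=did)


def blob_url(pds_host: str, did: str, cid: str) -> str:
    """URL for fetching one blob of an account by its CID."""
    return xrpc_url(pds_host, "com.atproto.sync.getBlob", did=did, cid=cid)


if __name__ == "__main__":
    # Hypothetical host, DID, and CID; substitute real values to try it.
    host = "pds.example.com"
    did = "did:plc:example123"
    print(repo_export_url(host, did))
    print(blob_url(host, did, "bafkrei-placeholder"))
```

The sketch stops at URL construction on purpose; actually downloading and parsing the CAR file is left out.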

etc etc

I’ll note that IPFS-the-network is not something I mentioned here. With content addressing and other specs there are different network shapes and operators might gather without the p2p network layer.

That’s super-helpful. Thanks @bmann.ca. Good to know how much of the IPFS-like work exists already and that there’s active work via Matadisco / DASL. I’ll have a read. I’ll update my post to include this work before spreading more widely 🙂

I’m not looking to get into the protocol layer myself – it’s more a means to an end. It’s good enough to know that the infrastructure is there or emerging. I actually want to build on the social side: discovery, curation, and Hyperaudio-style remixing of archival content. That’s more my area of expertise.

Quick question – I did wonder how persistent these blobs are. If a PDS disappears, does the content addressed by those CIDs exist somewhere, or is that still an open problem? I’m curious mostly because a remixing / discovery layer needs the underlying content to still be there to point at.

Short answer: it’s content addressed, so you can “prove” it’s what was posted, but it isn’t distributed beyond whatever various app views happen to cache. E.g. Bluesky might have a copy in their CDN.

@bad-example.com’s Hubble mirrors the entire network including blobs https://hubble.microcosm.blue/ — and there are more spinning up.

One can imagine user scale versions of copying blobs of the same thing in different ways.
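A hedged sketch of what a user-scale version could look like on the reading side: try several untrusted copies of a blob and accept the first one whose bytes hash to the digest embedded in the CID. It assumes base32 CIDv1 strings with the raw codec and sha2-256, and each "source" is just a hypothetical callable that returns bytes, or `None` when that copy is unavailable.

```python
import base64
import hashlib
from typing import Callable, Iterable, Optional

# Expected prefix: CIDv1, raw codec, sha2-256, 32-byte digest
CID_PREFIX = bytes([0x01, 0x55, 0x12, 0x20])


def digest_from_cid(cid: str) -> bytes:
    """Extract the sha2-256 digest from a base32 CIDv1 raw CID string."""
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 ('b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding that CIDs omit
    raw = base64.b32decode(body)
    if len(raw) != 36 or raw[:4] != CID_PREFIX:
        raise ValueError("this sketch only handles CIDv1 raw sha2-256")
    return raw[4:]


def fetch_verified(
    cid: str, sources: Iterable[Callable[[], Optional[bytes]]]
) -> Optional[bytes]:
    """Return the first copy whose sha2-256 matches the CID, else None."""
    expected = digest_from_cid(cid)
    for fetch in sources:
        data = fetch()
        if data is not None and hashlib.sha256(data).digest() == expected:
            return data
    return None
```

The point of the design is that none of the mirrors has to be trusted: a copy from a PDS, a CDN, or a full-network mirror is equally acceptable as long as the hash checks out.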

Thanks again @bmann.ca I’ve updated the post to mention that. It’s great to see the Hubble example. Shows that it is possible. Important for remixing etc that the content is retrievable.

Hello @maboa.bsky.social and thanks for the intro Boris!

First of all, great article Mark (Keeping Archives Alive: Resilience and Discovery on ATProto). I have a degree in archival science and have been working in IT for archival institutions for quite a while now (https://vangarderen.net). You nailed the definition and distinction between record integrity and provenance. I love your forward-thinking ideas about how AT Protocol and IPFS can be leveraged to support these requirements.

I live in Vancouver, Canada and am a member of the same co-working space as Boris. I’m currently in London for a conference (in the midst of a heatwave, hottest day on record since 1956). I will also attend https://dwebcamp.org and https://www.localfirstconf.com in Berlin as part of this trip, where I expect the AT Proto community will be well represented.

I just finished speaking on a panel here about creating a resilient archives network for Palestinian academics and volunteers that are documenting war atrocities and cultural erasure.

I am advising on the technical architecture, and AT Protocol and IPFS are in the mix. The intent is to make much of the collected digital media available publicly, out of the reach and control of US or other state & big tech interference, while also proving its integrity and provenance. However, the biggest blocker is that a portion of these collections needs to remain private or accessible to a limited audience. Therefore, I am following the AT Proto Private Data Working Group closely. While encryption and/or private IPFS Clusters are one approach, there is still a general mistrust of decentralized data storage. The other approach is to establish a closed peer-to-peer network of trusted nodes, using more traditional syncing/backup tools, run by trusted members of the community in geographically dispersed locations (memory institutions and data centres are actively targeted for bombings).

As a decentralizationist, I’d really like to see an AT Protocol based solution become a viable option for this network. I look forward to more discussions and knowledge sharing to get there.

I should also point out that I believe AT Protocol can be used to create public, integrity-preserving provenance audit trails for AI inference and training, a separate but related topic.

Yeah, totally-- i like to think of Atproto and IPFS mainnet as “nonrepudiable publication mechanisms”-- even if the actual pre-image is behind authN, anyone who has it can prove you published a pointer (via CID) to that preimage at a point in time, and that pointer passed through the relay and got indexed by everyone indexing that event stream/DAG. so it’s not for everyone, it’s as public as publishing to github in a lot of ways.

all that said, one of my personal hopes with DASL is that it could compose with the “blobstore” built into every PDS, such that some people might make “beefy PDSs” with per-user metering/billing and unlimited DASL storage, which would allow some did:plcs (institutional/archival ones) to broadcast their content over atproto (in various lexica, ideally) and thus deliberately create that nonrepudiable publication record, such that others could restore the preimages/blobs behind those CIDs later…

Awesome write-up @maboa.bsky.social!

I’m chomping at the bit on so many of the ideas that you outline in your post, and there are a few things that we’re working on in Roundabout that relate directly to this question of provenance in a big way.

Much more to say and so little time to say it, but I just wanted to acknowledge the really great write-up! Thanks, and looking forward to what threads you pull on here!