Offer to help with transcripts of Seattle 2025 ATmosphereConf

Sounds good. I think as a first pass at least I can grab all the data from your repo and try and ingest. For now I can use the videos from the site, although they don’t appear to be working for me. Just me? For example Protocol Governance & Hard Decentralization — ATmosphereConf 2026 cc @bmann.ca

Yeah not working. My guess is a streamplace issue which is why @blaine.bsky.social’s version also not working. I’ll ping Eli

And yeah! Lovely if you can use Blaine’s work! We do want it on the AtmosphereConf site.

Where to put transcripts so things can be everywhere is definitely a thing but we can use this as a PoC

The transcripts are all on-network! ionosphere.tv pds - the lexicons are tv.ionosphere.transcript and tv.ionosphere.streamTranscript – there are two mostly because of LLMs and size limitations on PDS records (streamTranscripts are chunked).

@maboa.bsky.social also worth looking at how I’m using panproto lenses.

In theory we could pull transcript data from GitHub. I also have a lightweight transcript viewer and editor that could be hosted on GH pages and allow for community editing, if required.

Thanks @blaine.bsky.social I’ll take a look!

Yeah, mostly atproto people will want this as records on protocol, so that they can be used everywhere and not re-ingested.

Also: TRANSLATIONS!

That being said, Github / Tangled may in fact be part of that workflow??? This is a bigger thing, I wonder how much @iame.li has thought about this / modeled it for Streamplace.

Please do keep POC’ing as makes sense for you and then we can see who else wants to get involved – appreciate your work on this so far @maboa.bsky.social

Here’s an update while we wait on the CDN fix. I used the time to dig deeper into the data model Blaine surfaced and prototype the ingestion path.

The data model turned out to be much richer than the two transcript collections suggested. Most of what I was going to ask is already there – paragraph and sentence boundaries etc all exist as annotation layers over the transcript. Speakers attach to talks directly via speakerUris, so per-talk rendering doesn’t need to map the anonymous SPEAKER_NN labels back to identities. The org.relationaltext.lens collection looks like what you meant by panproto lenses (?)

Managed to decode the timings encoding by comparing against my own JSON for the same talk:

  • startMs = anchor; each timings[i] advances a cursor by abs(value).
  • Positive = word duration; negative = silence.
  • streamTranscript variant: timings[0] = startMs (self-anchored)
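
A minimal sketch of the decoding rules listed above, in Python. The function names are hypothetical and not part of any ionosphere tooling. The meaning of the 1 sentinel and the 0 markers asked about later in this post is unknown, so the sketch does not interpret them: a 1 is treated like any other positive value and a 0 advances nothing.

```python
from typing import List, Tuple


def decode_timings(start_ms: int, timings: List[int]) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each word, following the rules above."""
    cursor = start_ms
    words = []
    for value in timings:
        if value > 0:
            # Positive value: a word lasting `value` milliseconds.
            words.append((cursor, cursor + value))
        # Negative value: silence. Zero: unexplained marker, moves nothing.
        cursor += abs(value)
    return words


def decode_stream_timings(timings: List[int]) -> List[Tuple[int, int]]:
    """streamTranscript variant: timings[0] is the anchor (startMs) itself."""
    if not timings:
        return []
    return decode_timings(timings[0], timings[1:])
```

For example, `decode_timings(1000, [200, -50, 300])` gives two words, one spanning 1000 to 1200 ms and one spanning 1250 to 1550 ms.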

To test the Ionosphere transcripts I built a converter that fetches talk + expression + segmentations + paragraph annotations + speaker records, joins by tokenIndex, and emits the same { words, paragraphs } format my renderer already uses. Single-speaker talks resolve OK. Multi-speaker talks would need the speaker mapping for per-paragraph attribution, but rendering works fine without it for now.
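
A rough sketch of the join step only, in Python, not the actual converter. It assumes the tokens, the decoded word spans and the paragraph annotations have already been fetched, and that each paragraph is an inclusive (first, last) token-index range. The function name, the input shapes and the output field names (`text`, `start`, `end`, `words`) are all hypothetical.

```python
from typing import Dict, List, Tuple


def build_render_doc(
    tokens: List[str],
    word_spans: List[Tuple[int, int]],
    paragraph_ranges: List[Tuple[int, int]],
) -> Dict[str, list]:
    """Join tokens, timings and paragraph ranges by token index."""
    # Token i pairs with word span i; zip stops at the shorter list.
    words = [
        {"text": text, "start": start, "end": end}
        for text, (start, end) in zip(tokens, word_spans)
    ]
    # Keep only paragraph ranges that fall inside the word list.
    paragraphs = [
        {"words": list(range(first, last + 1))}
        for first, last in paragraph_ranges
        if 0 <= first <= last < len(words)
    ]
    return {"words": words, "paragraphs": paragraphs}
```

Speaker attribution is left out on purpose, matching the single-speaker case described above.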

Tested end-to-end locally with an MP4 I’d created earlier. Generally it seems to work fine, although one thing I did notice was that the word timings in your records show a bit of ‘drift’ versus my Parakeet-generated JSON. For example, when I click “So that’s what I’m talking about” the audio plays “multi-stakeholder network” (which is the previous phrase in the audio). Apart from that it’s pretty much the same text in the same order, just with looser per-word alignment. My guess is this is to do with vanilla Whisper word timings, which are not super accurate. WhisperX does forced alignment with a phoneme model and produces much tighter word boundaries.
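
One way to put a number on the drift, sketched in Python. It assumes both transcripts contain the same words in the same order, so word start times can be compared position by position; real data would likely need the token sequences aligned first. The function name and the returned keys are hypothetical.

```python
from statistics import median
from typing import Dict, List


def start_time_drift(reference_ms: List[int], candidate_ms: List[int]) -> Dict[str, float]:
    """Summarise per-word start-time offsets (candidate minus reference)."""
    offsets = [c - r for r, c in zip(reference_ms, candidate_ms)]
    if not offsets:
        return {"median_ms": 0, "max_abs_ms": 0}
    return {
        "median_ms": median(offsets),
        "max_abs_ms": max(abs(o) for o in offsets),
    }
```

A median close to zero with a large maximum would point to loose per-word alignment, while a consistently non-zero median would suggest a constant offset between the two sources.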

A couple of quick questions:

  1. Curious – what do the 1 sentinel (in tv.ionosphere.transcript.timings[]) and the 0 markers (in tv.ionosphere.streamTranscript.timings[]) mean?
  2. Re translations – I imagine separate tv.ionosphere.transcript records per language pointing to the same talkUri, or something else?

I’m happy to keep producing transcripts with Parakeet/HLE and hand them over in a format you can ingest into the ionosphere lexicons (I think that’s the right term) if that’s something you’d be interested in. For now I’ll continue creating the JSON.

The good news is I think we can use @blaine.bsky.social’s atproto based format as the source of truth. We’d just need to address the timings issue, but we can move that upstream converting from the JSON I’m creating as a PoC.

On reflection – we can still import and convert all the data from ionosphere… and see how well the timings match when the videos come back online. Maybe it’s good enough! :slight_smile:

I really like the idea of atproto being the source of truth, so would love to push in this direction!

oh yes, we definitely want to end up as everything on atproto!

When permissioned data comes along then we also have an approach for private data, but still on protocol.

Sounds good! Permissioned data sounds interesting. I’ve been doing a lot of work in the oral history area, where consent is an important aspect – a narrator might be happy for their story to be studied but not broadcast, communities want to control access to their own stories etc.

“Private but still on protocol” could make atproto viable for different types of archives, not just open talks.

I’m curious – where would permissions live / be expressed – at the lexicon level?

Yes, I know some other archivists who are taking a look!

The threads around this topic are in Working Groups > WG Private Data - the extreme TLDR is Spaces + member lists, with how to do this being pluggable.

Amazing, thank you – I’ll dig into the Private Data WG. Spaces.

Also it’s good to know that there are other archivists looking into this! I’d love to hear what use cases they’re running into.

It’s Peter Van Garderen - he shares a coworking space with me

He’s traveling, I’ll poke him to intro himself here with what he’s thinking.

Great – thank you @bmann.ca. I don’t know Peter’s work but would love the intro whenever suits him, no rush. I’ve been mulling over whether something like IPFS + atproto could help with the resilience / authenticity side of archives – I’d really value their and others’ perspective. But that’s probably a post for WG Private data :slight_smile:

A post was merged into an existing topic: IPFS + ATProto for Archival Purposes