A sketch of archived web content

Started a thread here as @bmann.ca on Bluesky, in reply to @desertthunder.dev sharing his article reader skill, but very much pivoted it into my ideas around collectively archived web content on atproto.

The short idea is like Wayback Machine, but done by individuals, and stored in new permissioned data.

This is very much like what Pinboard offered – it would archive / snapshot your bookmark links, including a PDF to capture the entire look and feel of the site. This was a paid feature, and might be a part of a great paid feature for Semble in the future (I don’t speak for them here, just riffing on ideas and encouraging them to look at what people might pay for AND very much experimenting with permissioned data).

So, can we make a community lexicon for archived web content?

community.lexicon.bookmark.archive might be a spot for it. Here are some other ideas of what to include. The URL field from `community.lexicon.bookmark

  • link to a wayback machine entry – meaning, as part of bookmarking an app can ping WBM to archive, and then fetch the latest link
  • link to a blob, which might be of type PDF, markdown, text, html, json etc (or WACZ or other formats, I am not an archiving expert – for my own purposes, PDF and markdown would be most useful)
  • time of archiving
  • an optional description e.g. “snapshot of page after the story about the goose was removed from home page”
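To make the list above concrete, here is a minimal sketch of what such a record might look like when built in Python. Only the NSID `community.lexicon.bookmark.archive` comes from this post; the field names (`url`, `waybackUrl`, `blob`, `archivedAt`, `description`) and the rule that a record needs at least one of a Wayback link or a blob are assumptions for illustration, not a published lexicon. The blob is assumed to be passed in as the usual atproto blob reference object returned by a PDS upload.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# NSID floated in this thread; no such lexicon is published yet.
ARCHIVE_NSID = "community.lexicon.bookmark.archive"


@dataclass
class ArchiveRecord:
    """Hypothetical shape for one archived snapshot of a bookmarked URL."""

    url: str                            # the bookmarked page
    archived_at: datetime               # time of archiving (timezone-aware)
    wayback_url: Optional[str] = None   # link to a Wayback Machine entry
    blob: Optional[dict] = None         # blob ref (PDF, markdown, WACZ, ...)
    description: Optional[str] = None   # free-text note about the snapshot

    def to_record(self) -> dict:
        # Assumption: an archive entry with neither a link nor a blob is useless.
        if self.wayback_url is None and self.blob is None:
            raise ValueError("need a Wayback link, a blob, or both")
        if self.archived_at.tzinfo is None:
            raise ValueError("archived_at must be timezone-aware")

        stamp = self.archived_at.astimezone(timezone.utc).isoformat(timespec="seconds")
        record = {
            "$type": ARCHIVE_NSID,
            "url": self.url,
            "archivedAt": stamp.replace("+00:00", "Z"),
        }
        # Optional fields are left out entirely when unset.
        if self.wayback_url is not None:
            record["waybackUrl"] = self.wayback_url
        if self.blob is not None:
            record["blob"] = self.blob
        if self.description is not None:
            record["description"] = self.description
        return record
```

An app would call `to_record()` after pinging the Wayback Machine and/or uploading the snapshot blob, then write the result to the user's repo. Whether that repo is public or permissioned is a separate choice from the record shape.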

Wayback Machine is public, and some of these entries could be public, but I imagine permissioned archiving, especially for individuals, would be very valuable and not run into any issues around copyrighted content.

Apps could build all sorts of things around the aggregating version of such things – search, a timeline view of web pages that’s much like Wayback Machine, etc etc.

Personally, I see this also being a great resource for agents and community search engines.

Is making a community lexicon for archived web content useful? Feedback?


Crossposting some related thoughts I had prior, specifically focused on archiving records:


Possible partner in https://perma.cc/

Yeah, there are lots of services that could hook in here.

Personally, other than wayback machine, I’d love to have the archive in my own PDS so I have self-contained archives.

Or, I guess, I’m pointing at / sharing someone else’s blob? That’s more of an app-level concern: an app could index everyone, and I don’t even need to snapshot a website. Someone else has a blob stored from 3 years ago, that works for me, and I copy the blob to my PDS.

Which, yes, distributes verified / content addressed blobs between PDS hosts 🙂
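A rough sketch of the "verified" part: before copying someone else's snapshot into your own PDS, an app could recompute the blob's CID from the bytes it fetched and compare it with the CID in the other person's record. This assumes blob CIDs are CIDv1 with the raw codec and a SHA-256 multihash, base32-encoded, which is my understanding of how atproto blobs are addressed. The function names are made up for illustration.

```python
import base64
import hashlib


def blob_cid(data: bytes) -> str:
    """Compute a CIDv1 (raw codec, sha2-256) string for some blob bytes."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    # "b" is the multibase prefix for lowercase base32 without padding.
    return "b" + encoded


def verify_copied_blob(data: bytes, expected_cid: str) -> bool:
    """True if the fetched bytes really are the blob the record points at."""
    return blob_cid(data) == expected_cid
```

If the check passes, the copy in your PDS is byte-for-byte the same snapshot, no matter which host you fetched it from.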


Yeah, I have no idea what bsky is going to do here. I hope they refactor to actually store it. But are they going to store a copy, or just a pointer to the original, in which case if it is deleted, it is gone?

They will probably do the latter.

I can see value in AT-URI archiving, but it feels like a very special case. @mfzx.net any ideas how this would be represented in a lexicon differently from what I specced? Or just an at:// URI in the URI field?

The only additional fields that might be useful in the case of archiving records would be historical information about hosting, i.e. which PDS the creator of the archived record was using at the time, a reference to the commit that introduced that particular version of the record, etc. Besides that, it’d be the same as for web content, just with an AT-URI and the raw record CBOR as the archived contents.
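As a sketch of how that might look, the entry below reuses the web-archive idea but points at an AT-URI and carries the optional hosting history. All field names (`uri`, `record`, `archivedAt`, `pdsHost`, `commit`) are hypothetical, and `record_blob` is assumed to be a blob reference to the raw record CBOR that was uploaded separately.

```python
from typing import Optional


def parse_at_uri(uri: str) -> dict:
    """Split at://authority/collection/rkey into its three parts."""
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError("not an AT-URI")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected at://authority/collection/rkey")
    authority, collection, rkey = parts
    return {"authority": authority, "collection": collection, "rkey": rkey}


def archived_record_entry(
    at_uri: str,
    record_blob: dict,                 # blob ref to the raw record CBOR
    archived_at: str,                  # ISO 8601 timestamp string
    pds_host: Optional[str] = None,    # PDS the creator used at the time
    commit_cid: Optional[str] = None,  # commit that introduced this version
) -> dict:
    parse_at_uri(at_uri)  # raises ValueError if the URI is malformed
    entry = {"uri": at_uri, "record": record_blob, "archivedAt": archived_at}
    # The historical hosting fields are optional extras.
    if pds_host is not None:
        entry["pdsHost"] = pds_host
    if commit_cid is not None:
        entry["commit"] = commit_cid
    return entry
```

Apart from the two optional history fields, this is the same shape as a web-content archive entry, which matches the point that record archiving mostly needs the same data.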


There’s existing work on this that doesn’t seem to have been mentioned already:
