Proposal: Atomizer, a library of data converters and services

@erlend.sh reminded me of glue today, so let’s get this idea out there.

Thanks to the good people in the EU, incumbent platforms are required to provide some kind of data export for their users. I was able to bring my entire Google Photos library into Immich because of Google Takeout. Facebook, Instagram, Twitter, all of the big platforms offer some version of this. As we grow the Atmosphere, people moving over their social media accounts will want to bring this data with them. Starting a new account with your full post archive is a much better experience than starting with a blank slate. How might we make this easy for users?

Atomizer, a Rosetta Stone of takeout zips and lexicon

I propose a community project to build tooling and services which convert between lexicon and takeout archives.

Data Import

Accept these data takeout archives and convert them into various appropriate lexicon. Atomizer would contain a series of libraries/functions for converting common compatible data, such as:

  • Instagram → Flashes | Bluesky | Spark
  • Twitter → Bluesky | Anisota
  • Substack → Leaflet | Whitewind | PiPup

Libraries would provide standard logic for interpreting these archives and writing them to the PDS.
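To make the idea concrete, here is a minimal TypeScript sketch of what one import function might look like: it maps an Instagram takeout post onto a generic record shape. Everything here is an assumption for illustration — the type names, the fields, and the `app.example.photo.post` NSID are placeholders, not a real Atomizer, Flashes, or Instagram-export API.

```typescript
// Hypothetical shapes — all names here are assumptions, not a real API.
interface TakeoutPost {
  title?: string;
  creation_timestamp: number; // Unix seconds, as Instagram exports use
  media: { uri: string; title?: string }[];
}

interface LexiconRecord {
  $type: string;
  text: string;
  createdAt: string; // ISO 8601, as ATProto records expect
  embedPaths: string[]; // local archive paths, to be uploaded as blobs later
}

// Convert one takeout post into a generic record shape.
function convertInstagramPost(post: TakeoutPost): LexiconRecord {
  return {
    $type: "app.example.photo.post", // placeholder NSID
    text: post.title ?? post.media[0]?.title ?? "",
    createdAt: new Date(post.creation_timestamp * 1000).toISOString(),
    embedPaths: post.media.map((m) => m.uri),
  };
}
```

The real library would be a collection of small, pure functions like this, one per archive format and target lexicon, with the PDS-writing logic kept separate so the same converters work in a CLI, a browser, or a server.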

Data Conversion

The second component is for converting between similar lexicon. ATproto facilitates competition, and we have seen many emerging lexicon for the same kinds of applications, each with different trade-offs and design decisions. Luckily, public JSON files of similar schemas are not hard to translate. I propose the second aspect of Atomizer provide tooling for conversion between similar lexicon to facilitate data import and compatibility between competing apps on the Atmosphere, such as:

  • Bluesky ↔ Anisota
  • Bluesky ↔ Flashes ↔ Spark
  • Leaflet ↔ Whitewind ↔ PiPup ↔ Pckt
  • Community Bookmarks ↔ Monomark ↔ Semble

Not all conversions between similar lexicon will maintain full data integrity. Conversions that cannot preserve everything will need to warn users clearly about what gets lost. Which brings me to the third component.
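A conversion layer can make that data loss explicit instead of silent. Here is a hedged TypeScript sketch of one possible approach — a simple field map with a loss report. The field-map design and all the names are invented for illustration; real Whitewind and Leaflet lexicons differ and would need per-pair logic.

```typescript
// Generic record: a bag of fields plus a $type, loosely modeling an
// ATProto record. This is a sketch, not a real lexicon definition.
interface GenericRecord { [key: string]: unknown }

interface ConversionResult {
  record: GenericRecord;
  dropped: string[]; // source fields the target lexicon cannot represent
}

// Map a record onto a target schema, reporting every field that is lost
// so the UI can warn the user before anything is written to the PDS.
function convertWithLossReport(
  src: GenericRecord,
  fieldMap: Record<string, string>, // sourceField -> targetField
  targetType: string,
): ConversionResult {
  const out: GenericRecord = { $type: targetType };
  const dropped: string[] = [];
  for (const [key, value] of Object.entries(src)) {
    if (key === "$type") continue;
    const target = fieldMap[key];
    if (target !== undefined) out[target] = value;
    else dropped.push(key);
  }
  return { record: out, dropped };
}
```

Returning the loss report alongside the converted record, rather than just logging it, lets every consumer of the library (website, CLI, in-app import) surface the warning in its own way.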

Services

These public libraries should enable the creation of data import and conversion services. The project itself can have a flagship service on a website, allowing users to submit their takeout file and upload the converted lexicon to their PDS. Or pick a collection on your PDS and convert it to another lexicon. Want to move from Whitewind to Leaflet? Go to this website, select that conversion, click go.

Beyond a community-hosted service, the open libraries would allow any Atmosphere developers to integrate import and conversion services into their own apps. What if you could give your Instagram takeout to Flashes and they import all your posts for you? What if Pckt’s onboarding had an option to convert all your old PiPup posts? What if Blacksky let you import your Twitter archive? What if you want to set up your own online conversion service, separate from the community instance, with different technical features? You might offer a paid service which syncs between your Substack and Leaflet accounts using a cloud service. The tooling is shared infrastructure for the whole community.

What do we need?

I would love feedback on this concept overall. I am not a technical person, so thoughts on how to architect such a collection of tooling would be especially appreciated. After something like this is built, would there be community funding to operate it? Can we fund its development? Can it be built in a way that minimizes costs, such as client-side logic (thinking like PDSMoover)? Would such a community resource be useful to you as a developer? What user stories would this enable that you wish were easier?


probably the biggest problem you’ll encounter with this is the rate limit around writes — the default PDS limits work out to roughly 1,666 record creates per hour (5,000 points per hour at 3 points per create), or about 11,666 per day.

Great idea! Sounds like https://granary.io/ and/or Project Cambria: Translate your data with lenses


Yup. I’m only doing following lists for ATlast and I’ve included my own rate limits (follow up to 1k users) so users aren’t surprised they can’t like or post after following a bunch of accounts.

Is there a way to know if an account has been rate limited? That would help devs include that info in their converters and apps. I know when I got rate limited there wasn’t anything shown to me in bsky.app - just couldn’t like or create posts.

Rate Limits | Bluesky?

yes these seem like the right idea. Needs excellent UX and DX


One mushroom-hosted alt account later and some testing: Granary’s ATproto integration is quite out of date and rudimentary. Maybe a service could be built on top of it. It also seems to be just a Python library; I imagine ATdevs would want npm packages, rust crates, deno, etc. Perhaps it can provide a kickstart and we can make it more at home in the Atmosphere for our needs.

The problem is that, given the architecture of ATProto, you may cause trouble downstream if you throw everything at the firehose at once. Hence the rate limits.
To run this as a service for others, you will also run into IP-based rate limits afaik. So the solution I’m working on for Flashes is to run the import in batches on the device, carefully monitoring the rate limits in the headers.
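The batching approach described above can be sketched roughly like this in TypeScript. The header names follow the draft IETF RateLimit headers the Bluesky PDS emits (`ratelimit-remaining`, and `ratelimit-reset` as a Unix timestamp) — treat those, and the `writeOne` callback shape, as assumptions to verify against your PDS rather than a definitive implementation.

```typescript
// Write records one at a time, pausing when the server's rate-limit
// headers say we are close to the limit instead of erroring out.
async function importInBatches(
  records: object[],
  writeOne: (r: object) => Promise<Response>, // performs one repo write
  minRemaining = 10, // safety margin: pause before we actually hit zero
): Promise<void> {
  for (const record of records) {
    const res = await writeOne(record);
    // Missing headers fall back to "no limit known" / "reset now".
    const remaining = Number(res.headers.get("ratelimit-remaining") ?? Infinity);
    const reset = Number(res.headers.get("ratelimit-reset") ?? 0);
    if (remaining <= minRemaining) {
      // Sleep until the rate-limit window resets, then continue.
      const waitMs = Math.max(0, reset * 1000 - Date.now());
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

Because the loop only reacts to what the server reports, the same logic works whether the import runs on-device (as in Flashes) or in a hosted service, as long as each account's writes go through its own loop.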


That should still be feasible tho? Even if it’s all clientside and in a background process, like syncing your icloud photos. Rate limits are fair, just a design constraint. Also, depending on the service, the zip archive could be large (most social archives aren’t, but what if I’m migrating from flickr to Grain.social?).

Or, hacky idea I have no idea if it has legs, what if you restored the whole repo at once? When you migrate to another PDS, you upload the whole CAR and blobs. What if you download the CAR, assemble your new content into lexicon, reassemble the CAR, then restore the account to the same PDS?

Sounds to me like a cli would be the best fit? These takeouts are likely on your main computer, the cli would be there, you’d authorize it, point it at the takeout dir, and then let it chug along at a rate that would be guaranteed to be within rate limits. Would that be enough? Or are you set on this having some sort of (web) UI?

Yes that’s a fairly elegant solution for a certain set of users. I’ve done it for uploading an Instagram archive as Bluesky data. They are usually piecemeal scripts that do only one kind of conversion. I want a website for normies. I also want a comprehensive library so that if cli is your jam you can convert all sorts of stuff in one package.

Hmm! granary has been the conversion logic for Bridgy Fed and a few other services I run, among others by other people, for years. I’m always open to bug reports and feature requests, feel free to let me know if you have anything specific!

My take on granary was likely unfair. I tried using the demo on the site and it refused my non-mycosphere account and relied on app passwords. If y’all are using it for Bridgy I’m sure it’s more than capable.

Ah, makes sense. No worries! The demo UI is definitely dated and unloved, agreed. I’d love to get it onto OAuth at some point. The library isn’t dependent on app passwords and fully supports all PDSes, largely based on https://lexrpc.readthedocs.io/ and https://arroba.readthedocs.io/.