
Anti-scraping people are tiring.

Federation duplicates data, and given the way Masto is built, that duplication effectively backs everything up for perpetual availability.

Scraping and archiving is another side of the same coin.

You opt into having a permanent record of your digital activity when you start posting online.

@jonn Wait, I saw something the other day that said that Masto data is ephemeral. Is it permanent or not?

(I *hope* it's permanent)

@nafnlaus one query is better than a thousand words.

Stuff we interact with gets permanently cached on our instances.

Maybe big instances work differently, but it's a matter of coping with scale rather than a design choice.
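
Concretely, a sketch of the kind of query that settles it (untested; the table and column names, like `media_attachments.remote_url` and `file_file_size`, follow Mastodon's Paperclip-style schema from memory and should be checked against the actual migrations):

```python
# Sketch: how much remote media has this instance permanently cached?
import psycopg2

conn = psycopg2.connect(dbname="mastodon_production")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*), pg_size_pretty(sum(file_file_size))
        FROM media_attachments
        WHERE remote_url <> ''  -- attachments fetched from other instances
        """
    )
    count, size = cur.fetchone()
    print(f"{count} cached remote attachments taking up {size}")
```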

@jonn Hey, while I have you here, is it possible for you to check how much of the stored data is text vs. images vs. video?

(I'm also curious what the distribution of images served is (e.g. are the vast majority a small subset at any given time that could cache well?), but it's more complicated to investigate, so I won't bug you with it).

This relates to this issue: github.com/mastodon/mastodon/i
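
For reference, a hypothetical version of that breakdown, reusing the assumed schema from the sketch above (`statuses.text` for the text side, Paperclip's `file_content_type` for the media split); untested:

```python
# Sketch: total text size vs. media size grouped by MIME type family
# (image/*, video/*, ...). Same schema assumptions as above.
import psycopg2

conn = psycopg2.connect(dbname="mastodon_production")
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_size_pretty(sum(length(text))::bigint) FROM statuses")
    print("text:", cur.fetchone()[0])

    cur.execute(
        """
        SELECT split_part(file_content_type, '/', 1) AS kind,
               pg_size_pretty(sum(file_file_size))
        FROM media_attachments
        GROUP BY 1
        ORDER BY 1
        """
    )
    for kind, size in cur.fetchall():
        print(f"{kind}: {size}")
```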

@nafnlaus oh fuck, I have to increase the size of the instance or purge media files somehow. 🤔
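
(Mastodon does ship an escape hatch for exactly this case: `tootctl media remove` deletes locally cached copies of remote media older than a given number of days, e.g. `tootctl media remove --days 7`.)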

@nafnlaus the instance has been up since Autumn 2021, so for roughly a year.

@jonn It's not clear to me - what's the ratio between text, images and video?

The issue in question covers various things that can be done to simultaneously increase image quality (which we're getting complaints about) while decreasing image storage size (which is always an issue). But some of the possibilities come with complications, such as dealing with legacy clients.

@jonn Right now we're only using JPEG (and PNG, strangely, despite PNGs being massive!). Also, rather than controlling file size, we simply downscale images aggressively, even though you get a better compression/quality ratio with less downscaling and a lower quality setting. We could also support WebP (~75% of the size) and AVIF (~50% of the size), but then we'd need a caching image proxy server for legacy clients, so the question arises: what sort of ratio of cache hits might we get?
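
As a rough illustration of those ratios (a sketch, not a benchmark: the file names are made up, and AVIF support in Pillow comes from the separate pillow-avif-plugin package):

```python
# Re-encode one JPEG as WebP and AVIF, then compare on-disk sizes.
import os

from PIL import Image

img = Image.open("sample.jpg")
img.save("sample.webp", format="WEBP", quality=80)
try:
    import pillow_avif  # noqa: F401  (registers the AVIF encoder with Pillow)
    img.save("sample.avif", format="AVIF", quality=60)
except ImportError:
    pass  # pillow-avif-plugin not installed

for name in ("sample.jpg", "sample.webp", "sample.avif"):
    if os.path.exists(name):
        print(name, os.path.getsize(name), "bytes")
```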

@jonn There's also the possibility of image fingerprinting to find duplicates so you only need to store one copy (because let's face it, there's a LOT of repeated content on social media!). But it's a more complex issue.

I haven't looked into how videos are handled yet. At the least, fingerprinting / deduplication would apply there too.
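
A minimal sketch of the exact-match case, which applies equally to images and videos (the media directory path here is hypothetical):

```python
# Hash raw file bytes; byte-identical uploads collapse to one stored copy.
# This only catches exact duplicates; near-duplicates need perceptual
# hashing, which comes up next.
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen: dict[str, Path] = {}
for p in Path("media").rglob("*"):
    if p.is_file():
        digest = fingerprint(p)
        if digest in seen:
            print(f"duplicate: {p} == {seen[digest]}")
        else:
            seen[digest] = p
```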

@nafnlaus it's a bit embarrassing if we don't do fingerprinting. Maybe I need to move to a deduping FS? :)

@jonn Fingerprinting *exact* matches is pretty easy, but it gets a bit trickier when you're dealing with inexact matches. Metadata may differ, JPEG compression noise may differ, resolution may differ, etc. There are readily available hash algos that produce matching hashes despite this sort of variation, but that adds complications (like, for example, making sure you only store the highest-quality version).
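
For the inexact case, something like the imagehash library's perceptual hash could work (a sketch; the file names and the distance threshold of 5 are placeholders, not tuned values):

```python
# Perceptual hashes stay close under re-encoding, resizing, and metadata
# changes; subtracting two hashes yields their Hamming distance.
from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open("original.jpg"))
h2 = imagehash.phash(Image.open("recompressed_copy.jpg"))

if h1 - h2 <= 5:
    print("probable duplicate; keep the higher-quality copy")
```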

@jonn Honestly, though, with a mix of fingerprinting, modern image formats (with imgproxy for serving to legacy clients), proper image handling, etc., I suspect we'd get *over* an order of magnitude increase in user capacity for a given amount of disk space.

@nafnlaus but yeah, we already see that text is negligibly small compared to media. It's comparable to the size of all the thumbnails generated.

@nafnlaus ok, assuming all the videos are converted to mp4, the orders of magnitude are:

Text: hundreds of megabytes (500 MB)
Videos: thousands of megabytes (5 GB)
Images: tens of thousands of megabytes (25 GB)

@jonn Okay, very interesting! So images ARE the real culprit that needs to be fixed!

@nafnlaus to clarify — only *some* toots get federated. The exact criteria don't bother me too much; perhaps it's something like "got boosted in the last 30 minutes, unless someone from the instance has interacted with it".

Either way, it takes *normal usage* for a toot to be scraped by another instance.

@jonn yep… i think keeping everything forever is a bad default. i'd much prefer a) a sensible default of say 1 year, b) disappearing toots by default, c) the option to mark a post as persistent at creation time or later, d) automatic scraping of everything not liked / marked as persistent / bookmarked etc…

having said that, i don't know how the server works and if what i imagine as a better solution is actually any good. guess you folks will fix it sooner or later :-)

@tivasyk I'm for allowing scraping (bulk downloading by 3rd parties) and archival, not scrapping (destroying content)! :)

@jonn ah, sorry, i was thinking about that other thing :-) real sorry, i have to boost my attention circuits. coffee time!
