@jonn Wait, I saw something the other day that said that Masto data is ephemeral. Is it permanent or not?
(I *hope* it's permanent)
@nafnlaus one query is better than a thousand words.
Stuff we interact with gets permanently cached on our instances.
Maybe big instances work differently, but it's a matter of coping with scale rather than a design choice.
@jonn Hey, while I have you here, could you check how much of the stored data is text vs. images vs. video?
(I'm also curious about the distribution of images served, e.g. whether the vast majority at any given time is a small subset that would cache well, but that's more complicated to investigate, so I won't bug you with it.)
This relates to this issue: https://github.com/mastodon/mastodon/issues/20255
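For anyone who wants to run the same check, this is roughly the query I have in mind, here wrapped in Python. A sketch only: it assumes Mastodon's standard Postgres schema with Paperclip-style `media_attachments` columns (`file_content_type`, `file_file_size`), which may need adjusting.

```python
# Sketch: break down stored media by MIME type family.
# Assumes Paperclip-style columns on media_attachments
# (file_content_type, file_file_size); adjust names if yours differ.
import psycopg2

QUERY = """
SELECT split_part(file_content_type, '/', 1) AS kind,
       count(*)                            AS files,
       pg_size_pretty(sum(file_file_size)) AS total
FROM media_attachments
GROUP BY 1
ORDER BY sum(file_file_size) DESC;
"""

def media_breakdown(dsn: str = "dbname=mastodon_production") -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for kind, files, total in cur.fetchall():
            print(f"{kind:>10}: {files:>9} files, {total}")

if __name__ == "__main__":
    media_breakdown()
```

(Status text lives in the statuses table; `pg_total_relation_size('statuses')` gives a rough number for comparison.)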
@nafnlaus ok
@nafnlaus the instance has been up since Autumn 2021, so for roughly a year.
@jonn There's also the possibility of image fingerprinting to find duplicates, so you only need to store one copy (because let's face it, there's a LOT of repeated content on social media!). But it's a more complex issue.
I haven't looked into how videos are handled yet, but at the very least fingerprinting/dedup would apply there too. For the exact-match case, see the sketch below.
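The exact-match case is basically just content-addressed storage: hash the bytes, store under the digest, and identical uploads collapse to one file. A minimal sketch (the layout and names here are mine, not anything Mastodon actually ships):

```python
# Content-addressed dedup sketch: identical files hash to the same
# SHA-256 digest, so storing a duplicate becomes a no-op.
# Directory layout and names are illustrative only.
import hashlib
import shutil
from pathlib import Path

STORE = Path("media-store")

def store_file(src: Path) -> Path:
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = STORE / digest[:2] / digest[2:4] / digest
    if not dest.exists():                # duplicate? keep existing copy
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
    return dest
```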
@nafnlaus it's a bit embarrassing if we don't do fingerprinting. Maybe I need to move to a deduping FS? :)
@jonn Fingerprinting *exact* matches is pretty easy, but it gets trickier when you're dealing with inexact matches. Metadata may differ, JPEG compression noise may differ, resolution may differ, etc. There are readily available perceptual hash algorithms that are robust to this sort of variation, but they do add complications (like, for example, making sure you only store the highest-quality version).
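To make that concrete, here's roughly what the inexact case looks like with an off-the-shelf perceptual hash (the Python `imagehash` library; the distance threshold is a guess you'd have to tune):

```python
# Perceptual-hash sketch: near-duplicates (recompressed, rescaled,
# metadata-stripped) land within a small Hamming distance of each
# other, unlike cryptographic hashes. Threshold value is a guess.
import imagehash
from PIL import Image

THRESHOLD = 5  # max Hamming distance to treat two images as duplicates

def looks_duplicate(path_a: str, path_b: str) -> bool:
    h_a = imagehash.phash(Image.open(path_a))
    h_b = imagehash.phash(Image.open(path_b))
    return (h_a - h_b) <= THRESHOLD  # subtraction gives Hamming distance
```

At scale you'd index the hashes (e.g. in a BK-tree) instead of comparing pairwise, and keep the highest-quality copy as the canonical one.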
@jonn Honestly, though, with a mix of fingerprinting, modern image formats (with imgproxy for serving to legacy clients), proper image handling, etc., I suspect we'd get *over* an order of magnitude increase in user capacity for a given amount of disk space.
@jonn Right now we're only using JPEG (and PNG, strangely, despite PNGs being massive!). But also, rather than controlling file size, we're simply downscaling images aggressively, even though you get a better compression/quality tradeoff with less downscaling and a lower quality setting (CRF). We could also support WebP (~75% the size) and AVIF (~50% the size), but then we'd need a caching image proxy for legacy clients, so the question becomes what sort of cache-hit ratio we'd get.
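The legacy-client part is basically content negotiation on the Accept header, plus Vary: Accept so caches key on it (which is also what determines the hit ratio). A sketch of the core decision, not how imgproxy or Mastodon actually wire it up; Flask and the file layout here are placeholders:

```python
# Content-negotiation sketch: serve AVIF or WebP when the client's
# Accept header advertises support, otherwise fall back to JPEG.
# Flask and the media/ layout are illustrative placeholders.
from flask import Flask, request, send_file

app = Flask(__name__)

VARIANTS = [                     # best-first order
    ("image/avif", ".avif"),
    ("image/webp", ".webp"),
    ("image/jpeg", ".jpg"),      # universal fallback
]

@app.route("/media/<name>")
def serve_image(name: str):
    accept = request.headers.get("Accept", "")
    for mime, ext in VARIANTS:
        if mime in accept or mime == "image/jpeg":
            resp = send_file(f"media/{name}{ext}", mimetype=mime)
            resp.headers["Vary"] = "Accept"  # caches must key on Accept
            return resp
```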