
Anti-scraping people are tiring.

Federation duplicates data, and given the way Masto is built, that duplication effectively backs everything up for perpetual availability.

Scraping and archiving is another side of the same coin.

You opt into having a permanent record of your digital activity when you start posting online.

@jonn Wait, I saw something the other day that said that Masto data is ephemeral. Is it permanent or not?

(I *hope* it's permanent)

@nafnlaus one query is better than a thousand words.

Stuff we interact with gets permanently cached on our instances.

Maybe big instances work differently, but it's a matter of coping with scale rather than a design choice.
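
Concretely, a sketch of the kind of query that settles it (untested; the table and column names, like `media_attachments.remote_url` and `file_file_size`, follow Mastodon's Paperclip-style schema from memory and should be checked against the actual migrations):

```python
# Sketch: how much remote media has this instance permanently cached?
import psycopg2

conn = psycopg2.connect(dbname="mastodon_production")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT count(*), pg_size_pretty(sum(file_file_size))
        FROM media_attachments
        WHERE remote_url <> ''  -- attachments fetched from other instances
        """
    )
    count, size = cur.fetchone()
    print(f"{count} cached remote attachments taking up {size}")
```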

@jonn Hey, while I have you here, is it possible for you to check how much of the stored data is text vs. images vs. video?

(I'm also curious what the distribution of images served is (e.g. are the vast majority a small subset at any given time that could cache well?), but it's more complicated to investigate, so I won't bug you with it).

This relates to this issue: github.com/mastodon/mastodon/i
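
For reference, a hypothetical version of that breakdown, reusing the assumed schema from the sketch above (`statuses.text` for the text side, Paperclip's `file_content_type` for the media split); untested:

```python
# Sketch: total text size vs. media size grouped by MIME type family
# (image/*, video/*, ...). Same schema assumptions as above.
import psycopg2

conn = psycopg2.connect(dbname="mastodon_production")
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_size_pretty(sum(length(text))::bigint) FROM statuses")
    print("text:", cur.fetchone()[0])

    cur.execute(
        """
        SELECT split_part(file_content_type, '/', 1) AS kind,
               pg_size_pretty(sum(file_file_size))
        FROM media_attachments
        GROUP BY 1
        ORDER BY 1
        """
    )
    for kind, size in cur.fetchall():
        print(f"{kind}: {size}")
```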

@nafnlaus oh fuck, I have to increase the size of the instance or purge media files somehow. 🤔
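
(Mastodon does ship an escape hatch for exactly this case: `tootctl media remove` deletes locally cached copies of remote media older than a given number of days, e.g. `tootctl media remove --days 7`.)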

@nafnlaus the instance has been up since Autumn 2021, so for roughly a year.

@jonn It's not clear to me - what's the ratio between text, images and video?

The issue in question covers various things that can be done to simultaneously increase image quality (which we're getting complaints about) while decreasing image storage size (which is always an issue). But some of the possibilities come with complications, such as dealing with legacy clients.

@jonn Right now we're only using JPEG (and PNG, strangely, despite PNGs being massive!). Also, rather than controlling file size, we simply downscale images aggressively, even though you get a better compression/quality ratio with less downscaling and a lower quality setting. We could also support WebP (~75% of the size) and AVIF (~50% of the size), but then we'd need a caching image proxy server for legacy clients, so the question arises: what sort of ratio of cache hits might we get?
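
As a rough illustration of those ratios (a sketch, not a benchmark: the file names are made up, and AVIF support in Pillow comes from the separate pillow-avif-plugin package):

```python
# Re-encode one JPEG as WebP and AVIF, then compare on-disk sizes.
import os

from PIL import Image

img = Image.open("sample.jpg")
img.save("sample.webp", format="WEBP", quality=80)
try:
    import pillow_avif  # noqa: F401  (registers the AVIF encoder with Pillow)
    img.save("sample.avif", format="AVIF", quality=60)
except ImportError:
    pass  # pillow-avif-plugin not installed

for name in ("sample.jpg", "sample.webp", "sample.avif"):
    if os.path.exists(name):
        print(name, os.path.getsize(name), "bytes")
```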

@jonn There's also the possibility of image fingerprinting to find duplicates so you only need to store one copy (because let's face it, there's a LOT of repeated content on social media!). But it's a more complex issue.

I haven't looked into how videos are handled yet. At the least, fingerprinting / deduplication would apply there too.
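
A minimal sketch of the exact-match case, which applies equally to images and videos (the media directory path here is hypothetical):

```python
# Hash raw file bytes; byte-identical uploads collapse to one stored copy.
# This only catches exact duplicates; near-duplicates need perceptual
# hashing, which comes up next.
import hashlib
from pathlib import Path

def fingerprint(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen: dict[str, Path] = {}
for p in Path("media").rglob("*"):
    if p.is_file():
        digest = fingerprint(p)
        if digest in seen:
            print(f"duplicate: {p} == {seen[digest]}")
        else:
            seen[digest] = p
```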

@nafnlaus it's a bit embarrassing if we don't do fingerprinting. Maybe I need to move to a deduping FS? :)

@jonn Fingerprinting *exact* matches is pretty easy, but it gets a bit trickier when you're dealing with inexact matches. Metadata may differ, JPEG compression noise may differ, resolution may differ, etc. There are readily available hash algos that produce matching hashes despite this sort of variation, but that adds complications (like, for example, making sure you only store the highest-quality version).
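
For the inexact case, something like the imagehash library's perceptual hash could work (a sketch; the file names and the distance threshold of 5 are placeholders, not tuned values):

```python
# Perceptual hashes stay close under re-encoding, resizing, and metadata
# changes; subtracting two hashes yields their Hamming distance.
from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open("original.jpg"))
h2 = imagehash.phash(Image.open("recompressed_copy.jpg"))

if h1 - h2 <= 5:
    print("probable duplicate; keep the higher-quality copy")
```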

@jonn Honestly, though, with a mix of fingerprinting, modern image formats (with imgproxy for serving to legacy clients), proper image handling, etc., I suspect we'd get *over* an order of magnitude increase in user capacity for a given amount of disk space.

@nafnlaus but yeah, we already see that text is negligibly small compared to media. It's comparable to the size of all the thumbnails generated.

@nafnlaus ok, assuming all the videos are converted to mp4, the orders of magnitude are:

Text: hundreds of megabytes (500 MB)
Videos: thousands of megabytes (5 GB)
Images: tens of thousands of megabytes (25 GB)

@jonn Okay, very interesting! So images ARE the real culprit that needs to be fixed!

@nafnlaus to clarify — only *some* toots get federated. The exact criteria don't bother me too much; perhaps it's something like "got boosted in the last 30 minutes, unless someone from the instance has interacted with it".

Either way, it takes *normal usage* for a toot to be scraped by another instance.

@jonn yep… i think keeping everything forever is a bad default. i'd much prefer a) a sensible default of say 1 year, b) disappearing toots by default, c) the option to mark a post as persistent at creation time or later, d) automatic scraping of everything not liked / marked as persistent / bookmarked etc…

having said that, i don't know how the server works and if what i imagine as a better solution is actually any good. guess you folks will fix it sooner or later :-)

@tivasyk I'm for allowing scraping (bulk downloading by 3rd parties) and archival, not scrapping (destroying content)! :)

@jonn ah, sorry, i was thinking about that other thing :-) real sorry, i have to boost my attention circuits. coffee time!
