@jonn Wait, I saw something the other day that said that Masto data is ephemeral. Is it permanent or not?
(I *hope* it's permanent)
@nafnlaus one query is better than a thousand words.
Stuff we interact with gets permanently cached on our instances.
Maybe big instances work differently, but it's a matter of coping with scale rather than a design choice.
@jonn Hey, while I have you here, is it possible for you to check - how much of the stored data is text vs. images vs. video?
(I'm also curious about the distribution of images served - e.g., are the vast majority a small subset at any given time that would cache well? - but that's more complicated to investigate, so I won't bug you with it.)
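To sketch why that distribution matters - *if* image popularity follows a Zipf-like curve (a common assumption for web media, not something measured here), a small cache of the hottest images serves a disproportionate share of requests. A hypothetical back-of-envelope in Python:

```python
# Hypothetical sketch: assuming request frequency of the image at
# popularity rank r is proportional to 1/r**s (Zipf's law), what
# fraction of requests would a cache of the top-N images serve?

def zipf_hit_ratio(num_images: int, cache_size: int, s: float = 1.0) -> float:
    """Fraction of requests served by caching the `cache_size`
    most popular out of `num_images` images."""
    weights = [1.0 / (r ** s) for r in range(1, num_images + 1)]
    return sum(weights[:cache_size]) / sum(weights)

# e.g. with 100,000 images, caching just the top 1% of them:
ratio = zipf_hit_ratio(100_000, 1_000)
print(f"top 1% of images would serve ~{ratio:.0%} of requests")
```

Under that (unverified) assumption, the top 1% of images already covers well over half of all requests, which is exactly the regime where a caching proxy pays off.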
This relates to this issue: https://github.com/mastodon/mastodon/issues/20255
@nafnlaus ok
@nafnlaus the instance has been up since autumn 2021, so for roughly a year.
@jonn It's not clear to me - what's the ratio between text, images and video?
The issue in question covers various things that could be done to simultaneously increase image quality (which we're getting complaints about) and decrease image storage size (which is always an issue). But some of the options have complications, such as dealing with legacy clients.
@nafnlaus ah, ok, let me find videos I guess.
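For the text-vs-media breakdown, one way to get the raw numbers (a sketch, assuming I'm remembering Mastodon's schema right - `media_attachments` has `type` and `file_file_size` columns, with type codes 0 = image, 1 = gifv, 2 = video, 4 = audio):

```python
# Hypothetical sketch of summarising Mastodon media storage by type.
# On the server, a query along these lines would produce the raw rows
# (schema assumption, double-check against your Mastodon version):
#
#   SELECT type, SUM(file_file_size) FROM media_attachments GROUP BY type;
#
# Below, turning rows of (type_code, total_bytes) into percentages:

from collections import defaultdict

TYPE_NAMES = {0: "image", 1: "gifv", 2: "video", 4: "audio"}

def storage_share(rows):
    """rows: iterable of (type_code, total_bytes) -> {name: fraction}."""
    totals = defaultdict(int)
    for type_code, nbytes in rows:
        totals[TYPE_NAMES.get(type_code, "other")] += nbytes
    grand = sum(totals.values()) or 1
    return {name: nbytes / grand for name, nbytes in totals.items()}

# Made-up example numbers, just to show the shape of the answer:
shares = storage_share([(0, 80_000_000_000), (2, 40_000_000_000)])
print({k: f"{v:.0%}" for k, v in shares.items()})
```

(The byte counts above are invented for illustration - only a real query against the instance's database gives the actual ratio.)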
@jonn Right now we're only using JPEG (and, strangely, PNG, despite it being massive!). Also, rather than controlling file size, we're simply downscaling images aggressively - when you'd get a better compression/quality ratio by downscaling less and using a lower quality setting instead. We could also support WebP (~75% of the JPEG size) and AVIF (~50%), but then we'd need a caching image proxy server to transcode for legacy clients, so the question arises what sort of cache hit ratio we might get.
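Using those rough ratios (WebP ~75% of JPEG size, AVIF ~50% - ballpark figures, real savings depend on content and encoder settings), the back-of-envelope looks like this:

```python
# Back-of-envelope sketch of media storage after a format migration.
# The ratios are the rough ones from the thread, not benchmarks.

FORMAT_RATIO = {"jpeg": 1.00, "webp": 0.75, "avif": 0.50}

def storage_after_migration(total_gb: float, fmt: str,
                            legacy_cache_fraction: float = 0.0) -> float:
    """Estimated storage if originals are re-encoded as `fmt`, plus a
    transcoded-JPEG cache for legacy clients covering the hottest
    `legacy_cache_fraction` of the library (a hypothetical knob)."""
    native = total_gb * FORMAT_RATIO[fmt]
    legacy_cache = total_gb * legacy_cache_fraction * FORMAT_RATIO["jpeg"]
    return native + legacy_cache

# 500 GB of JPEG re-encoded as AVIF, keeping a JPEG proxy cache for
# the hottest 5% of images:
print(f"{storage_after_migration(500, 'avif', 0.05):.0f} GB")  # -> 275 GB
```

The point of the `legacy_cache_fraction` knob: if the hit distribution is as skewed as suspected, the legacy-client cache only costs a few percent on top of the format savings.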
@jonn Honestly, though, with a mix of fingerprinting, modern image formats (with imgproxy for serving to legacy clients), proper image handling, etc I suspect we'd get *over* an order of magnitude increase in user capacity for a given amount of disk space.
@jonn Fingerprinting *exact* matches is pretty easy; it gets a bit trickier when you're dealing with inexact matches, though. Metadata may differ, JPEG compression noise may differ, resolution may differ, etc. There are readily available hash algorithms that produce matching hashes despite that sort of variation, but it does add complications (like, for example, making sure you only store the highest-quality version).
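The core idea behind those algorithms is a perceptual hash - a tiny bit string that survives re-compression and rescaling, so near-duplicates match by Hamming distance. A minimal sketch of the "average hash" variant, with the "image" as a plain 2D list of grayscale values to keep it self-contained (real dedup would use a library like pHash or imagehash, and downscale to e.g. 8x8 first):

```python
# Minimal average-hash (aHash) sketch: one bit per pixel, set if the
# pixel is brighter than the image's mean. Small pixel-level noise
# (recompression artifacts) rarely flips bits, so near-duplicates
# land at Hamming distance 0 or close to it.

def average_hash(pixels):
    """pixels: 2D list of grayscale values (pretend it was already
    downscaled to a tiny fixed size)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

original     = [[10, 200], [220, 30]]
recompressed = [[12, 198], [224, 28]]   # same picture, noisy pixels
different    = [[200, 10], [30, 220]]   # different picture

assert hamming(average_hash(original), average_hash(recompressed)) == 0
assert hamming(average_hash(original), average_hash(different)) > 0
```

In practice you'd index hashes and treat anything under some Hamming-distance threshold as the same image - which is exactly where the "keep the highest-quality copy" bookkeeping comes in.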