Lemmy World outages

lwadmin@lemmy.world · edit-2 1 year ago


solrize@lemmy.world · 1 year ago

I have to wonder why expensive SQL queries in Lemmy operations even exist. As Lemmy scales, won’t those queries get executed more often just as part of normal operation? That would say to me that the Lemmy software needs optimization. Otherwise there will be scaling issues even if the attacks stop.

kameecoding@lemmy.world · 1 year ago

The version number is 0.18.4; that should give you a hint.

it’s entirely possible that these simply haven’t been optimized yet.

Corkyskog@sh.itjust.works · 1 year ago

The version number is 0.18.4

That’s fascinating, I would have expected it to be more like 1.18.4-beta. I thought zeros were meant for unreleased product.

grue@lemmy.world · 1 year ago

As the other reply mentioned, there are different versioning schemes, but traditionally version 1.0 means “feature complete,” and of course one would traditionally wait until feature-complete to release to the public.

But this is a Free Software project, so it’s different. You and I aren’t “the public;” we’re participants in the project. Essentially, everyone in the lemmyverse is a beta-tester – except we’re not testing a beta, or an alpha, or even some sort of “developer preview release,” for that matter. We’re testing an extremely early experiment!

DreamButt@lemmy.world · 1 year ago

Honestly it’s all arbitrary and there are several possible standards they could be following. Or they could be yoloing it since people all of a sudden started flying over (which you can’t ever really account for).

misterbassman@lemmy.world · 1 year ago

That’s exactly what is happening now. Lemmy is a very young codebase and up until very recently only had a tiny user base, so optimisation wasn’t that important.

Over the last few months the devs have been working hard to improve things, but there is a lot of ground to cover.

solrize@lemmy.world · 1 year ago

I wonder whether writing the backend in Rust was a premature optimization in its own right in that case. Lemmy can be seen as a fairly simple CRUD app whose work is mostly in the database, plus some network communications with federated instances.

AlmightySnoo 🐢🇮🇱🇺🇦@lemmy.world · 1 year ago

I wonder whether writing the backend in Rust was a premature optimization in its own right in that case.

I think that’s the case too, as it turns out the bottleneck was really the SQL queries and the DB design, not so much the programming language.

fuck_u_spez_in_particular@lemmy.world · 1 year ago

Yet it’s not optimizing prematurely…

Everyone who has to do a little bit more with databases knows that the database is often the bottleneck.

Rust is a great language not just because of its performance.

fuck_u_spez_in_particular@lemmy.world · 1 year ago

Why should writing something in Rust be a premature optimization? I don’t choose Rust because of its performance (at least that’s not the first thing that comes to my mind) but because of language ergonomics and because of its strictness, which makes maintenance much less painful.

woelkchen@lemmy.world · 1 year ago

Feel free to help out kbin which is written in PHP.

fuck_u_spez_in_particular@lemmy.world · 1 year ago

I hope that I don’t have to write PHP anymore ever in my life, so sorry, a definitive no.

AlmightySnoo 🐢🇮🇱🇺🇦@lemmy.world · 1 year ago

The devs aren’t DB experts (no harm in saying this). For example, a while ago someone spotted an SQL query where Lemmy used to filter after a huge join instead of joining after filtering. SQL experts need to help them here.
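
To make the "filter after a huge join" point concrete, here is a minimal sketch using Python's built-in sqlite3. The `post` and `comment` tables and their columns are hypothetical, not Lemmy's actual schema, and I don't know what the real query looked like. Both queries return the same rows; the first aggregates over every post before filtering, the second narrows down to one community first. Whether a planner pushes the filter down on its own depends on the database and the shape of the query.

```python
import sqlite3

# Hypothetical schema for illustration only, not Lemmy's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, community_id INTEGER);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, post_id INTEGER);
    CREATE INDEX idx_post_community ON post (community_id);
    CREATE INDEX idx_comment_post ON comment (post_id);
""")
conn.executemany("INSERT INTO post VALUES (?, ?)",
                 [(i, i % 10) for i in range(1, 101)])
conn.executemany("INSERT INTO comment VALUES (?, ?)",
                 [(i, i % 100 + 1) for i in range(1, 1001)])

# Filter AFTER the join: count comments for every post, then keep one community.
filter_after_join = """
    SELECT id, n FROM (
        SELECT p.id AS id, p.community_id AS community_id, COUNT(c.id) AS n
        FROM post p LEFT JOIN comment c ON c.post_id = p.id
        GROUP BY p.id
    ) AS counted
    WHERE community_id = ?
    ORDER BY id
"""

# Join AFTER the filter: pick the community's posts first, then join only those.
join_after_filter = """
    SELECT p.id AS id, COUNT(c.id) AS n
    FROM (SELECT id FROM post WHERE community_id = ?) AS p
    LEFT JOIN comment c ON c.post_id = p.id
    GROUP BY p.id
    ORDER BY p.id
"""

slow = conn.execute(filter_after_join, (3,)).fetchall()
fast = conn.execute(join_after_filter, (3,)).fetchall()
print(slow == fast, len(fast))
```

On a toy table like this the difference is invisible; it only matters once the joined tables are large.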

solrize@lemmy.world · edit-2 1 year ago

I’m not an expert either unfortunately, but using EXPLAIN on slow queries can go a long way.
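
As a rough illustration of what that looks like, here is a small sketch with Python's built-in sqlite3, where the command is `EXPLAIN QUERY PLAN`. In PostgreSQL, which Lemmy uses, the counterpart is `EXPLAIN` or `EXPLAIN ANALYZE`, and the output looks different. The `comment` table and the index name are made up for the example.

```python
import sqlite3

# Hypothetical table for illustration; not Lemmy's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT)"
)

def plan(sql):
    # Each plan row is (id, parent, notused, detail); detail is the readable part.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT body FROM comment WHERE post_id = 42"

# Without an index on post_id the plan is a full table scan.
before = plan(query)

# After adding an index the plan switches to an index search.
conn.execute("CREATE INDEX idx_comment_post ON comment (post_id)")
after = plan(query)

print(before)
print(after)
```

The exact wording of the plan lines varies between SQLite versions, but the first should mention a scan of `comment` and the second should mention `idx_comment_post`.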

The most demystifying documents I know of about SQL query planning are actually from SQLite. Understanding them can help figure out how to optimize SQL in general, since they explain how SQL execution engines work:

dbilitated@aussie.zone · 1 year ago

I’m pretty good with SQL… well, I used to be; I’ve been using a NoSQL db for a while now.

but is there a list somewhere of the worst queries?

I’m too busy to contribute to a project rn but I can optimise queries.

OsrsNeedsF2P@lemmy.ml · 1 year ago

Here’s one: https://github.com/LemmyNet/lemmy/issues/3845

They need all the help they can get

dbilitated@aussie.zone · 1 year ago

champ, I’ll have a look tonight after work. can’t guarantee I’ll be able to fix it but I’ll see if I understand it well enough to optimise.

KrisND@lemmy.world · edit-2 1 year ago

It sucks, but there will always be some labor-intensive queries to execute. They can be limited and restricted, though, and I’m sure they are already on top of that: caching, for example, and security controls that enforce limits like “this type of request from this IP can only happen 1x per hour” or something along those lines.

If I had to guess, without having looked at the source code yet and with the limited information provided, I’d assume it’s mass account creation, image uploading and/or exploiting how the instance syncs with the fediverse. It’s most certainly something that can be mostly prevented once the holes are found and then patched.

Also, I’m sure in the future something more efficient than SQL will be used.

solrize@lemmy.world · edit-2 1 year ago

I have to wonder what those queries actually do. Why is mass account creation a thing? Image uploading shouldn’t cause significant db activity (add a row saying where the image is, don’t put the image into a BLOB or anything like that). Syncing is no big deal either, given the quite low amount of traffic. I know that some websites use Postgres for fulltext search and I don’t know how well that works under heavy loads. I’ve mostly used Solr (solr.apache.org, thus my username) but I think that is now considered old fashioned.

PostgreSQL itself is quite performant and should be able to handle high loads once the queries and schemas are optimized, there is some caching of obvious things, etc. One antipattern I’ve noticed is pagination: saying “page=5” like Lemmy does to get to the 5th page of /all is done with an OFFSET clause, which is expensive because the database has to read and throw away that many rows before returning anything. It is better to use timestamps or other markers, like Reddit does, stored in an indexed column so the next page can be found quickly.
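
Here is a minimal sketch of the two pagination styles, using Python's built-in sqlite3 so it runs anywhere. The `post` table, its columns, and the function names are all invented for the example and are not Lemmy's. The marker version remembers the last row of the previous page and asks for rows that sort after it, so an index on `(published, id)` can jump straight to the right place.

```python
import sqlite3

# Hypothetical table for illustration; not Lemmy's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, published INTEGER, title TEXT)"
)
conn.execute("CREATE INDEX idx_post_published ON post (published, id)")
conn.executemany(
    "INSERT INTO post VALUES (?, ?, ?)",
    [(i, 1_700_000_000 + i, f"post {i}") for i in range(1, 201)],
)

PAGE_SIZE = 20

def page_by_offset(page):
    # page is 1-based; the database still walks past every skipped row.
    return conn.execute(
        "SELECT id, published FROM post "
        "ORDER BY published DESC, id DESC LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page - 1) * PAGE_SIZE),
    ).fetchall()

def page_after(marker=None):
    # marker is the (id, published) pair of the last row on the previous page.
    if marker is None:
        return conn.execute(
            "SELECT id, published FROM post "
            "ORDER BY published DESC, id DESC LIMIT ?",
            (PAGE_SIZE,),
        ).fetchall()
    last_id, last_published = marker
    return conn.execute(
        "SELECT id, published FROM post "
        "WHERE published < ? OR (published = ? AND id < ?) "
        "ORDER BY published DESC, id DESC LIMIT ?",
        (last_published, last_published, last_id, PAGE_SIZE),
    ).fetchall()

# Walk to the 5th page by following markers, then compare with OFFSET.
rows = page_after()
for _ in range(4):
    rows = page_after(rows[-1])
print(rows == page_by_offset(5))
```

The trade-off is that a marker only lets you step to the next page; you can't jump directly to "page 5" without walking there, which is why sites that paginate this way usually show "next" links instead of page numbers.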

Anyway thanks.

Dude Canáraí@lemmy.world · 1 year ago

Thank you Dudes for all your hard work!

dustout@lemmy.world · edit-2 1 year ago

There really needs to be a SQL result caching layer as well wherever possible. Even caching things for a couple of seconds in Redis would help mitigate DDoS issues. Things like counter updates could be batched in a queue to rate limit them. This is pretty basic stuff for crafting a scalable site, so it seems Lemmy really needs more experienced volunteer help on the codebase.
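
A rough sketch of both ideas in plain Python, using an in-process dictionary as a stand-in for Redis. The class names, the `front_page` key, and the `load_front_page` function are invented for illustration; nothing here is Lemmy code or a Redis API.

```python
import time
from collections import Counter

class TTLCache:
    """Keeps a computed result for a few seconds so repeat requests skip the DB."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = compute()  # the expensive query would run here
        self._entries[key] = (now + self._ttl, value)
        return value

class BatchedCounter:
    """Collects increments in memory and writes them out in one batch."""

    def __init__(self, write_batch):
        self._pending = Counter()
        self._write_batch = write_batch

    def increment(self, key, amount=1):
        self._pending[key] += amount

    def flush(self):
        # Swap in a fresh counter first so new increments are not lost.
        pending, self._pending = self._pending, Counter()
        if pending:
            self._write_batch(dict(pending))

def load_front_page():
    # Stand-in for an expensive SQL query.
    return ["post 1", "post 2"]

cache = TTLCache(ttl_seconds=2.0)
first = cache.get_or_compute("front_page", load_front_page)
second = cache.get_or_compute("front_page", load_front_page)  # served from cache

votes = BatchedCounter(write_batch=print)
votes.increment("post:1")
votes.increment("post:1")
votes.flush()  # one write carrying both increments
```

In a real deployment `flush` would run on a timer and `write_batch` would issue a single UPDATE per key, and the cache would need some thought about which results are safe to share between users.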

solrize@lemmy.world · edit-2 1 year ago

Disclaimer: I’m probably behind the times.

It wouldn’t surprise me if that type of Redis caching is still done, but it seems like an antipattern since PG has plenty of caching in its own right. In the old days, the Redis or Memcached layer was there because you just couldn’t get cost-effective servers with more than 32GB of RAM, and SSDs weren’t really a thing, so if you had 500GB of assets and didn’t want to serve from HDD, you’d use a Memcached or Redis cluster to serve from RAM. These days RAM and big servers are a lot more affordable, and SSDs decrease DB latency by a lot, so you don’t need such kludges as much.

I don’t think Wikipedia uses a Redis layer but I could be wrong. My impression is that it has a lot of squid proxies to serve statically cached pages to viewers who are not logged in. For logged in users the pages are customized and generated dynamically from mysql queries. I believe they do have a large number of mysql slave servers to distribute those queries among, though.

I’m sure Ruud knows a lot more about PG than I do, so hopefully he and the devs are on top of this stuff.

jarfil@lemmy.world · 1 year ago

The Lemmy core devs admitted long ago that they’re no SQL experts, and were asking for help. Some people have offered some, but much more is needed.