
Interestingly, "aggregate first, join later" has been the standard way of joining fact tables in BI tools for a long time. Since fact tables are typically big and also share common dimensions, multi-fact joins for drill-across are best done by first aggregating on those common dimensions, then joining on them.
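
A rough sketch of that pattern (the table and column names here are made up for illustration): aggregate each fact table down to the shared dimension grain first, then join the two much smaller result sets.

    -- Pre-aggregate each fact table on the common dimensions...
    WITH sales_agg AS (
        SELECT date_key, product_key, SUM(amount) AS total_sales
        FROM sales
        GROUP BY date_key, product_key
    ),
    returns_agg AS (
        SELECT date_key, product_key, SUM(amount) AS total_returns
        FROM returns
        GROUP BY date_key, product_key
    )
    -- ...then drill across by joining the aggregates, not the raw fact rows.
    SELECT date_key, product_key, total_sales, total_returns
    FROM sales_agg
    JOIN returns_agg USING (date_key, product_key);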

Makes you wonder how many optimizations out there feel almost second nature in one domain but have never been applied in other domains, simply because no one thought to carry them over.



Something I really appreciate about PostgreSQL is that features don't land in a release until they are rock solid.

I don't think it's that nobody thought of it for PostgreSQL - I think it's that making sure it worked completely reliably, across the entire scope of existing PostgreSQL features and to their required level of quality, took a great deal of effort.


It's not that nobody thought of it. Group pushdown has been a thing in papers for ~10 years at least, but it's hard to plan; your search space (which was already large) explodes, and it's always hard to know exactly how many rows come out of a given grouping. I have no idea how Postgres deals with these. Hopefully, they're doing something good (enough) :-)

Next up would hopefully be groupjoin, where you combine grouping and hash join into one operation if they are on the same or compatible keys (which is surprisingly often).
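
For illustration (hypothetical customers/orders tables): in a query like the one below, the grouping key is the same as the join key, so an engine with a groupjoin operator could maintain the running aggregate directly in the hash table it builds for the join, instead of joining first and grouping afterwards.

    -- Grouping key == join key (customer_id), so grouping could be folded
    -- into the hash table built for the join.
    SELECT o.customer_id, SUM(o.total) AS lifetime_value
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY o.customer_id;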


I wonder if PG will ever implement plan caching the way MSSQL does, so that optimizer speed is less of a concern and it can spend more time finding better plans rather than replanning on every execution of the same statement.


Postgres used to have plan caching inside the same session, and that was so disastrous that it was limited severely by default.

Plan caching is very much a two-edged sword; cache too aggressively, and the data or parameters may have shifted between runs so the cached plan no longer fits. Cache too little, and your hit rate is too low to be worth anything.


Not sure how that makes sense; if the stats change significantly, the cached plans would be evicted when statistics are regathered.

I believe popular connection poolers and clients attempt to do plan caching through prepared statements and by keeping the connection open.

My understanding is that it's not easy to do in PG, since connections are process-based instead of thread-based and query plans are not serializable between processes, so they cannot be shared across connections.

MSSQL has been doing statement plan caching for at least 20 years and it did stored procedure plan caching before that.
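
For reference, per-session plan caching in Postgres today goes through prepared statements: the first few executions get custom plans, after which the server may switch to a cached generic plan, and plan_cache_mode (PostgreSQL 12+) lets you force either behaviour. The table and parameter below are just for illustration.

    -- A prepared statement is the unit of plan caching within a session.
    PREPARE orders_for_customer (int) AS
        SELECT * FROM orders WHERE customer_id = $1;

    EXECUTE orders_for_customer(42);  -- early executions use custom plans;
                                      -- later ones may reuse a generic plan

    -- Override the default heuristic if it picks the wrong side of the trade-off:
    SET plan_cache_mode = force_custom_plan;    -- always replan with the actual value
    -- SET plan_cache_mode = force_generic_plan; -- always reuse the cached generic plan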


Probably quite a lot; being a specialist in multiple domains is getting more difficult.


It's not about not knowing about an optimization. The challenge is knowing when to apply it, so that it doesn't cause regressions for cases that can't benefit from it. That may be less risky in specialized systems: BI systems, for example, typically don't need to worry about regressing OLTP workloads. Postgres absolutely needs to be careful about that.

I believe that's one of the reasons it took roughly eight years (the original patch was proposed in 2017).



