Hacker News | ckennelly's comments

It depends on your STL implementation's representation of string: https://godbolt.org/z/nMYGYoWbq

* libstdc++'s string holds a pointer that, in the SSO case, points at its own internal buffer. If the moved-from string was using its SSO buffer, the moved-to string must point at its own buffer instead, so the move constructor branches to distinguish the SSO state from the heap-allocated state.

* libc++'s string move can be implemented without that branch, but the branch effectively reappears on later access to the string. For move assignment, the moved-to string also still needs to discard its old heap-allocated buffer, if it had one. A rough sketch of the branch is below.
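To make the libstdc++ case concrete, here is a minimal toy SSO string (invented names and layout, not the real libstdc++ representation) showing why the move constructor has to branch:

    #include <cstddef>
    #include <cstring>

    // Toy SSO string: data_ points either at the inline buffer_ (short
    // strings) or at a heap allocation (long strings). The names and
    // layout are invented for illustration only.
    class SsoString {
      char* data_;
      std::size_t size_;
      char buffer_[16];

     public:
      SsoString() : data_(buffer_), size_(0) { buffer_[0] = '\0'; }

      SsoString(SsoString&& other) noexcept : size_(other.size_) {
        if (other.data_ == other.buffer_) {
          // Moved-from string is in SSO mode: copy the bytes and point at
          // *our own* buffer, not at other.buffer_.
          std::memcpy(buffer_, other.buffer_, sizeof(buffer_));
          data_ = buffer_;
        } else {
          // Long string: steal the heap pointer and reset the source.
          data_ = other.data_;
          other.data_ = other.buffer_;
          other.size_ = 0;
          other.buffer_[0] = '\0';
        }
      }

      ~SsoString() {
        if (data_ != buffer_) delete[] data_;
      }
    };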


The default setting of max_ptes_none is also problematic.

On a stock kernel, it's 511. TCMalloc's docs recommend setting max_ptes_none to 0 for this reason: https://github.com/google/tcmalloc/blob/master/docs/tuning.m...
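If you want to verify the setting on a host, something like this works (a quick sketch; the sysfs path is the standard transparent hugepage location, but check it on your kernel):

    #include <cstdio>
    #include <fstream>
    #include <string>

    // Warns at startup if khugepaged's max_ptes_none is not 0 (the value
    // the TCMalloc tuning doc recommends).
    int main() {
      std::ifstream f(
          "/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none");
      std::string value;
      if (f >> value && value != "0") {
        std::fprintf(stderr, "warning: max_ptes_none is %s (want 0)\n",
                     value.c_str());
      }
      return 0;
    }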

(Disclosure: I work on TCMalloc and authored the above doc.)


As mentioned in that Stack Overflow post, though, things change again with FSRM (Fast Short Rep Mov).

While there are still startup costs, the overhead of calling a function (especially through the PLT) and taking instruction cache misses is hard to demonstrate in a microbenchmark, and rep movsb encodes far more compactly than most flavors of call. In an actual application, the "slower" but smaller implementation can often win (https://research.google/pubs/pub50338.pdf and https://research.google/pubs/pub48320.pdf).
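For a sense of scale, a rep movsb based copy is only a few bytes at the call site (x86-64 GCC/Clang inline asm; a sketch, not a drop-in memcpy replacement):

    #include <cstddef>

    // Copies n bytes with `rep movsb`. On FSRM-capable CPUs this is
    // competitive for short and medium sizes, and it avoids a PLT call
    // into a large SIMD memcpy implementation.
    inline void* repmovsb_copy(void* dst, const void* src, std::size_t n) {
      void* ret = dst;
      asm volatile("rep movsb"
                   : "+D"(dst), "+S"(src), "+c"(n)  // rdi, rsi, rcx are updated
                   :
                   : "memory");
      return ret;
    }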


https://research.google/pubs/pub50338.pdf goes into more depth on the mem* libc functions and principles for the implementations in llvm libc.


llvm-libc's memcpy is heavily size-optimized for this reason. A dedicated instruction that is fast for all cases is ideal, though.


You can still get a benefit from sized delete, even without inlining.
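For illustration, a minimal sketch of sized deallocation (Node and its members are made up; the mechanism is standard C++14 sized delete). The call site hands the allocator the size directly, so it never has to recover it from the pointer:

    #include <cstddef>
    #include <cstdio>
    #include <new>

    struct Node {
      double payload[4];

      static void* operator new(std::size_t size) { return ::operator new(size); }

      // The compiler passes sizeof(Node) at the delete site, so the
      // allocator never has to look the size up from the pointer.
      static void operator delete(void* p, std::size_t size) noexcept {
        std::printf("freeing %zu bytes\n", size);
        ::operator delete(p, size);  // global sized overload (C++14)
      }
    };

    int main() {
      Node* n = new Node;
      delete n;  // calls Node::operator delete(ptr, sizeof(Node))
    }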


Dynamic linking requires calling through the PLT to get to the implementation, so there's a data dependency on determining where the code for it is.

Independent of inlining (even with LTO, the C++ language rules largely inhibit optimizing out calls to "operator new"), the static call is far simpler.


The cost in applications really does add up.

"Profiling a warehouse-scale computer" (by S. Kanev, et. al.) showed several % of CPU usage allocating and deallocating memory. While a malloc and free does have some data dependencies, the indirect jump (by dynamically linking) is an avoidable cost on the critical path by statically linking.


Since Haswell, indirect jumps have had negligible additional cost. See, e.g., "Branch Prediction and the Performance of Interpreters - Don't Trust Folklore", https://hal.inria.fr/hal-01100647

Perhaps Spectre mitigations have changed things, but we're still talking about a fraction of a fraction. As others have said, the best way to improve malloc performance is to not use malloc. Second best is to avoid allocations in critical sections so the allocator doesn't trash your CPU's prediction buffers.
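A small sketch of that second point (hypothetical names, nothing allocator-specific): hoist the allocation out of the hot loop and reuse the buffer.

    #include <string>
    #include <vector>

    // Instead of building a fresh std::string (and hitting the allocator)
    // on every iteration, reuse one buffer across the whole loop.
    void process(const std::vector<std::string>& inputs) {
      std::string scratch;
      scratch.reserve(256);  // one allocation up front
      for (const auto& in : inputs) {
        scratch.assign(in);  // reuses existing capacity when it fits
        scratch += '\n';
        // ... do work with scratch ...
      }
    }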


This is the modern, Abseil-based version that we use in production for practically every C++ binary. It includes a number of performance optimizations (per-CPU caches, an improved fast/slow path, and a hugepage-aware backend), along with improved telemetry (a low-overhead, always-on heap profiler, for example).

gperftools includes a decade+ old copy of TCMalloc and several other things (a signal-based CPU profiler, a heap checker, etc.). The two have diverged significantly.


They're not confidence intervals, but weather.gov's forecasts include a link to the NWS office's forecast discussion (updated every few hours). This can give you a bit of insight into forecaster uncertainty and the variation between models.

To excerpt the most recent (8:14PM EDT) NYC office's discussion:

"As a result, expect most precipitation to be focused mainly during the evening hours. Hi-resolution models have been suggesting that the overnight hours could be mainly if not entirely dry. This is due to 850 hPa warm front lifting to the north by around 6z and a dry slot moving in from the SW (DC/PHI area). For now, have just lowered pops to chance. If trends in the high resolution models hold, pops will need to be lowered further if not removed for the overnight hours with future updates."

