ynik's comments | Hacker News

`s[0] == 'h'` isn't sufficient to guarantee that `s[3]` can be accessed without a segfault, so the compiler is not allowed to perform this optimization.

If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen: https://godbolt.org/z/KjdT16Kfb

(also note you got the endianness wrong in your hand-optimized version)
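
For concreteness, here is a minimal sketch of the three variants under discussion. Only the 'h' and 'e' comparisons come from the thread; the rest of the prefix ("hell", as in a "hello" check), the function names, and the little-endian assumption are illustrative:

    #include <stdint.h>
    #include <string.h>

    /* Short-circuiting: stops at the first mismatch (e.g. at '\0'), so it
       never reads past the end of a short string -- which is exactly why
       the compiler may not fuse the loads into one 32-bit read. */
    int starts_with_hell_shortcircuit(const char *s) {
        return s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l';
    }

    /* Bitwise &: all four bytes are read unconditionally, so the compiler
       is free to turn this into a single 32-bit compare -- but the caller
       must guarantee that s[0..3] are readable. */
    int starts_with_hell_bitand(const char *s) {
        return (s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l');
    }

    /* Roughly the hand-fused version, with the constant in little-endian
       byte order ('h' in the lowest byte). memcpy sidesteps alignment and
       strict-aliasing concerns. */
    int starts_with_hell_word(const char *s) {
        uint32_t w;
        memcpy(&w, s, 4);
        return w == 0x6C6C6568u;   /* bytes 'h','e','l','l' from low to high */
    }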


> If you use `&` instead of `&&` (so that all array elements are accessed unconditionally), the optimization will happen

But then you're accessing four elements of a string that could have a strlen of less than 3. If the strlen is 1, the short-circuit case saves you: s[1] will be '\0' instead of 'e', so you never access elements past the end of the string. The "optimized" version is UB for short strings.
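
To make the short-string case concrete, a sketch (the two-byte buffer and the letters after 'h'/'e' are illustrative):

    void demo(void) {
        char s[2] = "h";   /* strlen(s) == 1: the array is exactly {'h', '\0'} */

        /* Short-circuit: the second comparison sees '\0' and fails, so
           s[2] and s[3] are never read. Well-defined, evaluates to 0. */
        int ok = s[0] == 'h' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l';

        /* Bitwise &: all four reads happen unconditionally; s[2] and s[3]
           are past the end of the array, so this is undefined behavior. */
        int bad = (s[0] == 'h') & (s[1] == 'e') & (s[2] == 'l') & (s[3] == 'l');

        (void)ok; (void)bad;
    }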


Yes, so that's why the compiler can't and doesn't emit the optimized version if you write the short-circuited version - because it behaves differently for short strings.


UB doesn't exist at the processor level (well, it does, but not here). If the compiler knows the pointer is suitably aligned, it can do the transformation: an aligned 4-byte load stays within a single page, so it can't fault if the first byte is accessible.


For the compiler to know the pointer is aligned, it would have to actually be aligned, and there is no guarantee that it is.


This is fantastic, thanks! This is the approach I use in httpdito to detect the CRLFCRLF that terminates an HTTP/1.0 GET request, but I'm doing it in assembly.


Ooo, I'd never thought of using & like that. Interesting.

> (also note you got the endianness wrong in your hand-optimized version)

Doh :-)


Matt Godbolt's talk on ray tracers shows how effective that change can be. I think it was that talk, anyway.

https://www.youtube.com/watch?v=HG6c4Kwbv4I


good ol' short circuiting


C# doesn't erase generics, but there is still some type erasure happening: nullable reference types, tuple element names, and the object/dynamic distinction are all absent from .NET bytecode; they are only stored in attributes for public signatures, and are erased for local variable types.

C# also has huge amounts of syntactic sugar: `yield return` and `await` compile into huge state machines, and `fixed` statements come with problems similar to `finally` in Java (including the possibility of exponential code growth during decompilation).


Windows NT started supporting Unicode before UTF-8 was invented, back when Unicode was fundamentally 16-bit. As a result, in the Microsoft world, WCHAR meant "supports Unicode" and CHAR meant "doesn't support Unicode yet".

By the way, UTF-16 also didn't exist yet: Windows started with UCS-2. Though I think the name "UCS-2" also didn't exist yet -- AFAIK that name was only introduced in Unicode 2.0 together with UCS-4/UTF-32 and UTF-16 -- in Unicode 1.0, the 16-bit encoding was just called "Unicode", as there were no other encodings of Unicode.


> Windows NT started supporting unicode before UTF-8 was invented

That's not true; UTF-8 predates Windows NT. It's just that the jump from ASCII to UCS-2 (not even real UTF-16) was much easier and more natural, and at the time a lot of people really thought it would be enough. Java made the same mistake around the same time. I actually had the very same discussions with older die-hard Windows developers as late as 2015; for a lot of them, 2 bytes per symbol was still all you could possibly need.


> UTF-8 predates Windows NT.

Windows NT started development in 1988, and the public beta was released in July 1992, before Ken Thompson devised UTF-8 on a napkin in September 1992. Rob Pike gave a UTF-8 presentation at USENIX in January 1993.

Windows NT's general release was in July 1993, so it's not realistic to replace all the UCS-2 code with UTF-8 after January 1993 and have it ready in less than 6 months. Even Linux didn't have UTF-8 support in July 1993.


> public beta

Which, let's not forget, also meant there was already an external ecosystem developing software for it.


UTF-8 was invented in 1992 and was first published in 1993. Windows NT 3.1 had its first public demo in 1991, was scheduled for release in 1992 and was released in 1993.

Technically, UTF-8 was invented before the first Windows NT release, but they would have had to rework a nearly finished and already delayed OS.


Also keep in mind that ISO’s official answer was UTF-1, not UTF-8, and UTF-8 wasn’t formally accepted as part of the Unicode and ISO standards until 1996. And early versions of UTF-8 still allowed the full 31-bit range of the original ISO 10646 repertoire, before it was limited to the 21-bit range of UTF-16. Also, a lot of early UTF-8 implementations were actually what we now call CESU-8, or had various other infelicities (such as accepting overlong encodings, nowadays commonly rejected as a security risk). So even in 1993, I’m not sure it was yet clear that UTF-8 was going to win.


Autovectorization / unrolling can maybe still be handled with a couple of additional tests. The main problem I see with doing branch coverage on compiled machine code is inlining: instead of two tests for one branch, you now need two tests for each function that a copy of the branch was inlined into.
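
A tiny illustration of the inlining issue (the functions are hypothetical; assume the compiler inlines the helper, e.g. at -O2):

    /* One branch in the source... */
    static inline int clamp_nonneg(int x) {
        return x < 0 ? 0 : x;          /* the branch under test */
    }

    /* ...but after inlining, each caller contains its own copy of it. */
    int scaled(int x)  { return clamp_nonneg(x) * 2; }
    int shifted(int x) { return clamp_nonneg(x) + 7; }

At the source level, two tests (one negative and one non-negative input) cover both sides of the branch; in the compiled binary, the comparison exists separately inside both scaled and shifted, so machine-code branch coverage needs both inputs exercised for each caller.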


Using a second partition D: was already twice as fast at small-file writes as the system partition C:. This was on Windows 10 with both partitions using NTFS on the same SSD, and everything else on default settings.

This is because C: has additional file system filters for system integrity that are not present on other drives, and also because C: has 8.3 name compatibility enabled by default while additional NTFS partitions have it disabled (so if your filenames are longer than the 8.3 format, Windows has to write twice the number of directory entries on C:).


Only if you measure the life cycle starting from the initial release.

Windows 10 dropped out of support only 3 years after its successor (Win11) became available, whereas Windows 8.1 still had 7 more years of support after its successor (Win10) was released.

There are a lot of users who never upgrade Windows but instead just get whatever is the latest whenever they buy a new computer. If these people bought a new computer every 5 years, they were always fine in the past, but now, for the first time, they run out of support (because Win10 was "the latest" for an unusually long time).


If someone has a newer PC, they should be able to upgrade to Windows 11 for free.

If they have an older PC, then they must have already enjoyed quite some period of the 10-year Windows 10 support lifecycle.

Unless Microsoft were to go to an Apple-esque annual release cycle, regularly dropping support for older hardware, I'm not sure how they could ever manage this.


Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).


Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.

It uses Latin-1 for strings whose characters all fit in one byte, UCS-2 for strings whose code points all fit in the BMP, and UCS-4 only for strings that contain code points outside the BMP.

It would be pretty silly for them to explode all strings to 4-byte characters.


You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.


It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.


Python 3 is specified to use arrays of 8-, 16-, or 32-bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.


> code points, which need up to 24 bits

They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.


You are not testing what you think you are testing.

"let &mut a2 = &mut a;" is pattern-matching away the reference, so it's equivalent to "let a2 = a;". You're not actually casting a mutable reference to a pointer, you're casting the integer 13 to a pointer. Dereferencing that obviously produces UB.

If you fix the program ("let a2 = &mut a;"), then Miri accepts it just fine.


But adding 1 to a pointer will add sizeof(T) to the underlying value, so you actually need to reserve more than two addresses if you want to distinguish the "past-the-end" pointer for every object from NULL.

--

While it's rare nowadays to find a platform that uses something other than an all-zero bit pattern for NULL for normal pointer types, this is extremely common in C++ for pointer-to-member types: 0 is a valid value there (the first field at the start of a struct, offset 0), so the null pointer-to-member is instead represented as -1.
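
For reference, a sketch of the standard idiom that relies on the one-past-the-end pointer being valid to form and distinct from NULL (plain C, nothing platform-specific):

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 2, 3, 4};
        int *end = a + 4;   /* one past the end: may be formed and
                               compared, but not dereferenced */

        /* This only works because the implementation guarantees that
           `end` is a distinct, non-null address -- even if `a` happened
           to end up at the very top of the address space. */
        for (int *p = a; p != end; ++p)
            printf("%d\n", *p);
        return 0;
    }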


> so you actually need to reserve more than two addresses if you want to distinguish the "past-the-end" pointer for every object from NULL.

Well, yes and no. A 4-byte int cannot reside at address -4, but a char could; and no object of any type can reside at -1. So implementations need to take care that one-past-the-end addresses never compare equal to whatever happens to serve as nullptr, but this requirement only makes address -1 completely unavailable to C-native objects.


Under your interpretation, neither gcc nor clang is POSIX compliant, because in practice all of these optimizing compilers will reorder memory accesses without bothering to prove that the pointers involved are valid -- the compiler just assumes that the pointers are valid, which is justified because otherwise the program would have undefined behavior.


I am not so sure. Assuming the program does not do anything undefined is sort of the worst possible take on leaving the behavior undefined in the first place. I mean, the behavior was left undefined so that "something" could be done; the language standard just does not know what that "something" is. Hell, the compiler could do nothing, and that would make more sense.

But making optimizations that pretend it can't happen, treating that as an invariant, when the specification clearly says it could happen? That's wild, and I would argue it's out of specification.


Undefined Behavior has a very specific meaning in the C and (respectively) C++ standards. And it is this.

I think you may be misunderstanding how optimization works. It's not "AHA! Undefined Behavior, I'll poke at that!".

It's more that compilers are allowed to assume that some things just "don't happen"... and can optimize based on that. And it does make sense... if you were told to optimize a thing given certain rules, how would you do it?
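
A classic sketch of what "assume it doesn't happen" looks like in practice (not from this thread): with optimizations enabled, gcc and clang will typically delete the null check below, precisely because the earlier dereference already lets them assume `p` is non-null.

    int read_checked(int *p) {
        int v = *p;        /* if p were NULL, this dereference would be UB... */
        if (p == NULL)     /* ...so the compiler may assume p != NULL here    */
            return -1;     /* and is allowed to delete this branch entirely   */
        return v;
    }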


Actually, you are right; what I said about reordering is nonsense. The compiler will definitely reorder non-aliasing accesses. There are much weaker properties that are preserved.

