Hacker News | Aloisius's comments

Sadly, JustHTML doesn't appear to be truly passing those tests.

It looks like the code doesn't always check whether expected errors in the testsuite match the returned errors - which is rather important to ensure one isn't just incidentally getting the expected output.

So while JustHTML looks sort of right, it'll actually do things like emit errors on perfectly valid html.

Plus, the test suite isn't actually comprehensive, so if one only writes code to pass the tests, it can fail in the real world where other parsers that actually wrote against the spec wouldn't have trouble.

For instance, the html5lib-tests only tests a small number of meta charsets and as a result, JustHTML can't handle a whole slew of valid HTML5 character encodings like windows-1250 or koi8-r - which parsers like html5lib will happily handle. There's even a unit test added by the AI that ensures koi8-r doesn't work, for some reason.
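To illustrate the point about labels: a minimal sketch, using only Python's standard `codecs` module, showing that both labels resolve to a codec without any extra work. This is not JustHTML's or html5lib's code, the `resolve` helper is hypothetical, and Python's codec tables are only an approximation of the WHATWG Encoding Standard's label list.

```python
import codecs
from typing import Optional

# The two labels mentioned above.
labels = ["windows-1250", "koi8-r"]

def resolve(label: str) -> Optional[str]:
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label.strip().lower()).name
    except LookupError:
        return None

# Map each label to whatever codec Python finds for it.
resolved = {label: resolve(label) for label in labels}

# Round-trip a short Cyrillic string through KOI8-R.
sample = "Привет".encode("koi8-r")
decoded = sample.decode(resolved["koi8-r"])
```

A parser that only recognizes the handful of charsets exercised by the tests would fall back to a default here instead of decoding the bytes correctly.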


I'm not seeing 100% pass rates.

    $ uv run run_tests.py --check-errors -v

    FAILED: 8337/9404 passed (88.6%), 13 skipped
It seems the parser is creating errors even when none are expected:

    === INCOMING HTML ===
    <math><mi></mi></math>

    === EXPECTED ERRORS ===
    (none)

    === ACTUAL ERRORS ===
    (1,12): unexpected-null-character
    (1,1): expected-doctype-but-got-start-tag
    (1,11): invalid-codepoint
This "passes" because the output tree still matches the expected output, but it is clearly not correct.

The test suite also doesn't seem to be checking errors for large swaths of the html5 test suite even with --check-errors, so it's hard to say how many would pass if those were checked.
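A rough sketch of why the strictness of the comparison matters. Everything here is hypothetical (the `ParseResult` type, the `passes` helper, the mode names, and the stand-in tree string); it is not how run_tests.py is written, only the three ways a runner could judge the case quoted above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParseResult:
    """Hypothetical parser output: a serialized tree plus error codes."""
    tree: str
    errors: List[str] = field(default_factory=list)

def passes(expected_tree: str, expected_errors: List[str],
           result: ParseResult, mode: str) -> bool:
    """Compare a result against expectations at a chosen strictness."""
    if result.tree != expected_tree:
        return False
    if mode == "tree":    # tree only: errors are ignored entirely
        return True
    if mode == "count":   # only the number of errors must agree
        return len(result.errors) == len(expected_errors)
    if mode == "exact":   # the error codes themselves must agree
        return sorted(result.errors) == sorted(expected_errors)
    raise ValueError(f"unknown mode: {mode}")

# Illustrative stand-in for the expected tree of <math><mi></mi></math>.
EXPECTED_TREE = "| <math math>\n|   <math mi>"

# The tree matches, but three errors come back where none are expected.
result = ParseResult(
    tree=EXPECTED_TREE,
    errors=[
        "unexpected-null-character",
        "expected-doctype-but-got-start-tag",
        "invalid-codepoint",
    ],
)

verdicts = {
    mode: passes(EXPECTED_TREE, [], result, mode)
    for mode in ("tree", "count", "exact")
}
```

Only the tree-only comparison accepts this result; both the count and the exact comparison reject it.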


Thanks for flagging this. Found multiple errors that are now fixed:

- The quoted test comes from justhtml-tests, a custom test suite added to make sure all parts of the algorithm are tested. It is not part of html5lib-tests.

- html5lib-tests does not support control characters in tests, which is why some of the tests in justhtml-tests exist in the first place. In my test suite I have added that ability to our test runner to make sure we handle control characters correctly.

- In the INCOMING HTML block above, we are not printing control characters; they get filtered away in the terminal.

- Both the treebuilder and the tokenizer are outputting errors for the control character they found. Neither reports it at the right location (at flush instead of where found), and the errors are duplicates.

- This being my own test suite, I haven't specified the correct errors. I should. expected-doctype-but-got-start-tag is reasonable in this case.

All of the above bugs are now fixed, and the test suite is in a better shape. Thanks again!
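For readers following along, a minimal sketch of the three ideas in the list above: making control characters visible when printing, recording an error at the position where the character is found, and dropping duplicates when two layers report the same thing. All names are hypothetical and the 1-based column convention is an assumption; this is not the actual JustHTML fix.

```python
from typing import Iterable, List, Tuple

Error = Tuple[int, int, str]  # (line, column, code), both 1-based here

def visible(text: str) -> str:
    """Escape control characters so a terminal cannot swallow them."""
    return "".join(
        ch if ch == "\n" or ch.isprintable() else f"\\x{ord(ch):02x}"
        for ch in text
    )

def null_errors(text: str) -> List[Error]:
    """Record each NUL at the position where it is encountered."""
    errors: List[Error] = []
    line, col = 1, 0
    for ch in text:
        if ch == "\n":
            line, col = line + 1, 0
            continue
        col += 1
        if ch == "\x00":
            errors.append((line, col, "unexpected-null-character"))
    return errors

def merge(*sources: Iterable[Error]) -> List[Error]:
    """Combine error lists from several layers, dropping exact duplicates."""
    return sorted(set().union(*sources))

sample = "<mi>\x00</mi>\n\x00"
shown = visible(sample)

# Pretend the tokenizer and the treebuilder both reported the same errors.
tokenizer_errors = null_errors(sample)
treebuilder_errors = null_errors(sample)
merged = merge(tokenizer_errors, treebuilder_errors)
```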


Hi! The expected errors are not standardized enough for it to make sense to enable --check-errors by default. If you look at the readme, you'll see that the only thing they're checking is that the _number of errors_ is correct.

That said, the example you are pulling out does not match that either. I'll make sure to fix this bug and others like it! https://github.com/EmilStenstrom/justhtml/issues/20


run_tests.py does not appear to be checking the number of errors or the errors themselves for the tokenizer, encoding or serializer tests from html5lib-tests - which represent the majority of tests.

There's also something off about your benchmark comparison. If one runs pytest on html5lib, which uses html5lib-tests plus its own unit tests and does check if errors match exactly, the pass rate appears to be much higher than 86%:

    $ uv run pytest -v 
    17500 passed, 15885 skipped, 683 xfailed,
These numbers are inflated because html5lib-tests/tree-construction tests are run multiple times in different configurations. Many of the expected failures appear to be script tests similar to the ones JustHTML skips.
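How high the html5lib rate looks depends on the denominator. A quick sketch using only the counts quoted above; the summary line is cut off, so treating the failed count as zero is an assumption, and the `pass_rate` helper is hypothetical.

```python
def pass_rate(passed: int, failed: int = 0, xfailed: int = 0,
              skipped: int = 0, count_skipped: bool = False) -> float:
    """Fraction of tests that passed, optionally counting skips as non-passes."""
    total = passed + failed + xfailed
    if count_skipped:
        total += skipped
    return passed / total

# Counts from the pytest summary above (failed count not shown, assumed 0).
PASSED, SKIPPED, XFAILED = 17500, 15885, 683

# Expected failures count against the rate, skipped tests are ignored.
rate_excluding_skips = pass_rate(PASSED, xfailed=XFAILED)

# Skipped tests are treated as not passing.
rate_including_skips = pass_rate(PASSED, xfailed=XFAILED,
                                 skipped=SKIPPED, count_skipped=True)
```

The first figure comes out a little above 96% and the second a little above 51%, so whether skipped tests count is the whole disagreement.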

I've checked the numbers for html5lib, and they are correct. They are skipping a load of tests for many different reasons, one being that namespacing of svg/math fragments is not implemented. The 88% number listed is correct.

Excellent feedback. I'll have a look at the running of html5lib tests again.

Or add aria-hidden=true

I've had clipboard events and the clipboard API disabled in my browser to prevent websites from intercepting them for ages. I can't be the only one.


My take on that is that the very slim minority who do this are also likely to get through this very blunt hiring tool anyway.

> (it's from Latin "centrum": the R goes after the T, and there is no need whatsoever to revise that.)

Why does it matter how it was spelled in Latin? English is not Latin.

In the era of ubiquitous access to dictionaries, I'm not sure the benefits of having spelling reflect etymology rather than pronunciation outweigh the cost.


The first part of my argument is this: the word centrum still has a cognate in numerous modern languages, which use the TR letter order:

French: centre

Italian: centro

Czech: centrum # identical to Latin!

Swedish: centrum

[... numerous others ...]

The "TR" order of the letters in the "centrum" cognate is still alive in modern languages and their orthography, and so is even the "centrum" word itself.

The second part of my argument is that some contemporary dialects of English itself, like British and Canadian, use "centre"; using the "centre" spelling is a contemporary practice, and not a retrogression toward Latin.

The third part of my argument is that changing "centre" to "center" is a gratuitous change that brings no benefit; it has no redeeming value.


Spelling it center provides the significant benefit of removing foreign orthography from English, making it easier to learn to read and write.

I see no value in spelling it centre. That some other languages spell the word that way doesn't matter, as they're pronouncing it without a vowel between the t and r, which is rather different from English.

In French it's pronounced santr. In Italian it's sen-tro. In Czech it's tsen-troom. In Swedish it's sen-trum.

Languages that, like English, pronounce it with a vowel between the t and r? They spell it that way.

In Albanian it's qendër pronounced very close to rhotic English sen-ter.

In Norwegian it's senter (sen-ta) which is pretty close to non-rhotic English.

In Croatian, it's centar (sen-tar).

In Lombard it's center.

In Swedish, the other word for center (meaning a center (place) or sports position) is spelled... center.

And even Czech, which spells it centrum, changes the spelling to center in the genitive plural, to match the pronunciation.

So even if we're going to choose spelling based on other languages, there are plenty that spell it similarly to center to argue for it in English - though I would still argue that "other people are doing it" isn't a compelling argument.


According to Etymonline (i.e. Douglas Harper), quite curiously, the "center" spelling in English is actually older!

Quote:

The spelling with -re was popularized in Britain by Johnson's dictionary (following Bailey's), though -er is older and was used by Shakespeare, Milton, and Pope.

At the same time, Etymonline traces the origin to Old French (14th century) which had it as centre.

Just because Milton, Pope and Shakespeare wrote "center" doesn't mean it was a good idea. The latter couldn't spell his own name the same way twice!


It's a good idea because spelling words how they're pronounced was one of the biggest achievements of the Western world and our broken orthography impairs literacy.

Only English has this insanity, driven by people who simply don't like change, like the aesthetics of older spellings, or are closet francophiles/latinophiles like Johnson, but who try to justify it with nonsense about etymology because of how weak the personal preference argument is.


You need to retain Latin and Greek spellings for interoperability with other languages.

The problem with English is that it messed up its vowels and started changing the pronunciations.

It's very helpful to newcomers to English that a word like psychology is written in a way that is similar to theirs. But, yikes, the butchered pronunciation, /saɪˈkɑːlədʒi/, is unrecognizable.

Other languages don't have problems with old spellings. In Czech, psychologie is pronounced the way it is written, pretty much letter for letter: /psɪxologiɛ/


We really don't need to retain interoperability, but if that is truly more important than English literacy, then perhaps we should be consistent.

If we're going to argue that centre is the correct spelling because for some reason r needs to go after a consonant instead of the vowel due to etymology, then surely we should extend this:

- Filter should become filtre (Latin: filtrum)

- Trimester should be trimestre (French: trimestre, Latin: trimestris)

- Perimeter should be perimetre (Latin: perimetros)

- Diameter should be diametre (Old French: diametre, Latin: diametros)

- Chronometer, manometer and a hundred other words that end in -meter should largely be changed to -metre like chronometre, manometre, etc.

- Copper should become coppre (Latin: cuprum)

- Tiger should become tigre (Anglo-Norman: tigre, Latin: tigris)

- Cylinder should be cylindre (Middle French: cylindre, Latin: cylindrus, Ancient Greek: kulindros)

- Coriander should be coriandre (Anglo-norman: coriandre, Latin: coriandrum)

- Monster should be monstre (Old French: monstre, Latin: monstrum)

- Member should be membre (Old French: membre, Latin: membrum) - along with, of course, dismembre, castmembre, membreship, etc.

I could go on and on.

In the future, we could be entring (Old French: entrer, Latin: intro) pubs while sobre (Old French: sobre, Latin: sobrius) eagre (Old French: aigre, Latin: agrus) to drink cidre (Old French: cisdre / sidre) and get plastred (Latin: plastrum from emplastrum).

We shouldn't have to alter the spelling of more than a few thousand words to proprely (Old French: propre, Latin: proprius) retain Latin and Greek spellings (or rather, a Latin transliteration of Greek given the different alphabet).


What civilized countries are we talking about?

Because paraquat was approved for use over much of the world at one point, including countries people claim require substances be "proven safe."


> What led it to being "banned in dozens of countries all over the world, including the United Kingdom and China"?

Almost everyone who banned it did so because of acute toxicity - it requires careful handling to use safely.

Unfortunately, it was commonly used to commit suicide in many countries. In other countries, it was deaths from accidental ingestion, lung damage from unsafe handling, etc.

I don't know of any country that banned it because of a purported link to Parkinson's.


You're conflating different chemicals together.

Paraquat (what this article is about) isn't used by any of the people in the links you gave (golf courses, Dutch or Swiss farmers).



Each of which must be studied, and legislated, on its own.

All the golf courses where I live use grey water - water that would otherwise be dumped into oceans/estuaries/rivers/etc.

That's really not comparable to data centers using potable water.


Even the golf course trade association only claims 10% grey water use.

Also, you're going to be shocked, data centers can cool with grey water as well. The now-cancelled Project Blue data center near Tucson was going to build and operate a wastewater pipeline and treatment plant and give it to the city, but the shouting NIMBYs prevailed anyway. The developer now intends to use air-to-air cooling, which costs more energy.


Did they say it was efficient? The "closed loop" is only one part of the system that cycles water between the heat exchanger and the building/servers.

The second part of the system is an open loop that uses water to cool the closed loop at the heat exchanger.


They implied that DCs somehow save water because of being closed loop. The closed loop is a red herring, since the outer loop dumps potable water.

Intentionally modifying a license plate in order to prevent it from being read?

The only thing I'm shocked about is that it wasn't illegal before.


Intentionally modifying a license plate in order to prevent it from being read by a very specific privately held company's cameras that then sells that info to whomever will pay.

Exactly. I have no idea why anyone would be surprised that modifying your license plate so it cannot be read would be illegal.

The plate can still be easily read by a human, just not one of the Flock cameras.

Or toll booths…
