Well, markdown and HTML are encoding the same information, but markdown is effectively compressing the semantic information. This works well for humans, because the renderer (whether markdown or plaintext) decompresses it for us. Two line breaks, for example, “decompress” from two characters to an entire line of empty space. To an LLM, though, it’s just a string of tokens.

So consider this extreme case: suppose we take a large chunk of plaintext and compress it with something like DEFLATE (but in a tokenizer-friendly way), so that it uses 500 tokens instead of 2000. For the sake of argument, say we’ve done our best to train an LLM on these compressed samples.
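To make the thought experiment concrete, here's a rough sketch of what that compression step could look like (zlib plus base85 is just my stand-in for "DEFLATE in a tokenizer-friendly way", and I'm counting characters rather than tokens, but the idea is the same):

```python
# Sketch of the thought experiment: DEFLATE-compress some plaintext and
# re-encode it as printable text so a tokenizer could ingest it.
# zlib + base85 is an arbitrary stand-in; the exact scheme doesn't matter.
import base64
import zlib

plaintext = "The quick brown fox jumps over the lazy dog. " * 50

compressed = zlib.compress(plaintext.encode("utf-8"), 9)
printable = base64.b85encode(compressed).decode("ascii")

print(len(plaintext), "characters of plaintext")
print(len(printable), "characters after DEFLATE + base85")
# The compressed form is much shorter (how much shorter depends on the text),
# but it's an opaque byte soup: a model reading it would have to spend its
# computation undoing DEFLATE before it could get at the underlying words.
```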

Is that going to work well? After all, we’ve got the same information in a quarter of the tokens. I think the answer is pretty obviously “no”. Not only does the model get a fraction of the time and space in which to process the information, it’s also forced to waste much of that computation on decompressing the data.



I think one big difference with DEFLATE, and most other standard compression algorithms, is that they're dictionary-based. So by compressing in this way you're really messing with the locality of tokens, in a way that is likely unrelated to the semantics of what you're compressing.

For example, adding a repeated word in a completely different part of the document could change the dictionary and, with it, the entirety of the compressed text. That's not the case with the "compression" offered by converting HTML to Markdown: it more or less preserves locality and removes information that is largely semantically meaningless (e.g. nested `div`s used for styling). Of course, this is really just conjecture on my part, but I think HTML-to-Markdown conversion is likely to work well. It would certainly be interesting to have a good benchmark for this.
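As a toy illustration of the locality point (my own example, with zlib standing in for DEFLATE): a small edit near the top of a document leaves the plaintext almost entirely shared, but almost none of the compressed bytes downstream of the edit survive.

```python
# Compare the original document against one with a small edit at the top.
import zlib

body = "\n".join(f"item {i}: some moderately repetitive body text" for i in range(200))
doc_a = "Introduction\n" + body
doc_b = "Introduction, plus one extra clause\n" + body  # edit at the top only

comp_a = zlib.compress(doc_a.encode("utf-8"))
comp_b = zlib.compress(doc_b.encode("utf-8"))

def common_suffix_len(x, y):
    """Length of the longest identical trailing run shared by two sequences."""
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

print("shared plaintext tail: ", common_suffix_len(doc_a, doc_b), "chars")   # thousands
print("shared compressed tail:", common_suffix_len(comp_a, comp_b), "bytes")  # usually ~0
# The early edit shifts the DEFLATE bitstream (and the trailing checksum), so
# the compressed bytes differ all the way to the end, even though the text
# there is untouched. Converting HTML to Markdown, by contrast, rewrites each
# element in place, so a local change stays local.
```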


Absolutely. I'm just making a more general point that "the same information in fewer tokens" does not mean "more comprehensible to an LLM". And we have more practical evidence that that's not the case, like the recent "Let's Think Dot by Dot" paper, which found that you can get many of the benefits of chain-of-thought simply by adding filler tokens to your context (provided the model is trained to deal with filler tokens). For that matter, chain-of-thought itself is an example of increasing the token-to-information ratio, and it generally improves LLM performance.

That's not to say that I think that converting to markdown is pointless or particularly harmful. Reducing tokens is useful for other reasons; it reduces cost, makes generation faster, and gives you more room in the context window to cram information into. And markdown is a nice choice because it's more comprehensible to humans, which is a win for debuggability.

I just don't think you can justifiably claim, without specific research to back it up, that markdown is more comprehensible to LLMs than HTML.

https://arxiv.org/abs/2404.15758


I think it's a reasonable claim. But I would agree that it's worthy of more detailed investigation.



