Well, markdown and HTML are encoding the same information, but markdown is effectively compressing the semantic information. This works well for humans, because the renderer (whether markdown or plaintext) decompresses it for us. Two line breaks, for example, “decompress” from two characters to an entire line of empty space. To an LLM, though, it’s just a string of tokens.

So consider this extreme case: suppose we take a large chunk of plaintext and compress it with something like DEFLATE (but in a tokenizer-friendly way), so that it uses 500 tokens instead of 2000. For the sake of argument, say we’ve done our best to train an LLM on these compressed samples.
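To make the thought experiment concrete, here's a rough sketch of what that compression step could look like (zlib plus base85 is just my stand-in for "DEFLATE in a tokenizer-friendly way", and I'm counting characters rather than tokens, but the idea is the same):

```python
# Sketch of the thought experiment: DEFLATE-compress some plaintext and
# re-encode it as printable text so a tokenizer could ingest it.
# zlib + base85 is an arbitrary stand-in; the exact scheme doesn't matter.
import base64
import zlib

plaintext = "The quick brown fox jumps over the lazy dog. " * 50

compressed = zlib.compress(plaintext.encode("utf-8"), 9)
printable = base64.b85encode(compressed).decode("ascii")

print(len(plaintext), "characters of plaintext")
print(len(printable), "characters after DEFLATE + base85")
# The compressed form is much shorter (how much shorter depends on the text),
# but it's an opaque byte soup: a model reading it would have to spend its
# computation undoing DEFLATE before it could get at the underlying words.
```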

Is that going to work well? After all, we’ve got the same information in a quarter of the tokens. I think the answer is pretty obviously “no”. Not only does the model get a fraction of the time and space in which to process the information, it’s also forced to waste much of that computation on decompressing the data.



I think one big difference with DEFLATE, and most other standard compression algorithms, is that they're dictionary-based. So by compressing in this way you're really messing with the locality of tokens, in a way that is likely unrelated to the semantics of what you're compressing.

For example, adding a repeated word in a completely different part of the document could change the dictionary and, with it, the entirety of the compressed text. That's not the case with the "compression" offered by converting HTML to Markdown: it more or less preserves locality and removes information that is largely semantically meaningless (e.g. nested `div`s used for styling). Of course, this is really just conjecture on my part, but I think HTML-to-Markdown conversion is likely to work well. It would certainly be interesting to have a good benchmark for this.
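As a toy illustration of the locality point (my own example, with zlib standing in for DEFLATE): a small edit near the top of a document leaves the plaintext almost entirely shared, but almost none of the compressed bytes downstream of the edit survive.

```python
# Compare the original document against one with a small edit at the top.
import zlib

body = "\n".join(f"item {i}: some moderately repetitive body text" for i in range(200))
doc_a = "Introduction\n" + body
doc_b = "Introduction, plus one extra clause\n" + body  # edit at the top only

comp_a = zlib.compress(doc_a.encode("utf-8"))
comp_b = zlib.compress(doc_b.encode("utf-8"))

def common_suffix_len(x, y):
    """Length of the longest identical trailing run shared by two sequences."""
    n = 0
    while n < min(len(x), len(y)) and x[-1 - n] == y[-1 - n]:
        n += 1
    return n

print("shared plaintext tail: ", common_suffix_len(doc_a, doc_b), "chars")   # thousands
print("shared compressed tail:", common_suffix_len(comp_a, comp_b), "bytes")  # usually ~0
# The early edit shifts the DEFLATE bitstream (and the trailing checksum), so
# the compressed bytes differ all the way to the end, even though the text
# there is untouched. Converting HTML to Markdown, by contrast, rewrites each
# element in place, so a local change stays local.
```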


Absolutely. I'm just making a more general point that "the same information in fewer tokens" does not mean "more comprehensible to an LLM". And we have more practical evidence that that's not the case, like the recent "Let's Think Dot by Dot" paper, which found that you can get many of the benefits of chain-of-thought simply by adding filler tokens to your context (provided the model is trained to deal with filler tokens). For that matter, chain-of-thought itself is an example of increasing the token-to-information ratio, and it generally improves LLM performance.

That's not to say that I think that converting to markdown is pointless or particularly harmful. Reducing tokens is useful for other reasons; it reduces cost, makes generation faster, and gives you more room in the context window to cram information into. And markdown is a nice choice because it's more comprehensible to humans, which is a win for debuggability.

I just don't think you can justifiably claim, without specific research to back it up, that markdown is more comprehensible to LLMs than HTML.

https://arxiv.org/abs/2404.15758


I think it's a reasonable claim. But I would agree that it's worthy of more detailed investigation.



