Invalid SMILES are beneficial rather than detrimental to chemical language models

eesmith · on March 30, 2024

Despite being a co-author of the DeepSMILES paper, my interest has been in applying generative methods to fuzz test SMILES parsers, so I never really got into the whole SELFIES vs. (Deep)SMILES debate.

Instead, I came here for a bit of sniping. It's a bit of fun to date when the research started by looking at its data sets. The paper uses "ChEMBL (version 28)"; ChEMBL 28 came out in 2021, and hasn't been 'latest' for nearly three years.

blackbear_ · on March 30, 2024

I really don't see how the experiments support the main claim of the paper. The only thing they changed was how molecules are sampled from the model after training; nothing about the model itself was changed. They also didn't try the most trivial solution of removing low-probability SELFIES and seeing if that makes any difference. Fishy.
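
The control suggested above (drop the lowest-probability samples, then compare the remaining set) could look roughly like the sketch below. This is not the paper's method. It assumes the sampler can report a total log-likelihood for each generated string; the function name `filter_by_likelihood`, the `keep_fraction` parameter, and the toy data are all hypothetical.

```python
import math

def filter_by_likelihood(samples, keep_fraction=0.9):
    """Keep the highest-likelihood fraction of sampled strings.

    samples: list of (string, log_prob) pairs, where log_prob is the
    model's total log-likelihood for that string (assumed to be
    available from the sampler).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    # Rank from most to least likely, then keep the top fraction.
    ranked = sorted(samples, key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Toy SELFIES strings with made-up log-likelihoods, for illustration only.
samples = [
    ("[C][C][O]", -2.1),
    ("[C][=O]", -3.4),
    ("[C][N][C]", -9.8),
    ("[O][O][O][O]", -15.2),
]
kept = filter_by_likelihood(samples, keep_fraction=0.5)
```

One would then compute the same distribution-level metrics on `kept` and on the unfiltered samples to see whether the filtering changes anything.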

hiddencost · on March 30, 2024

Or it's just some weird property of the specific training and inference setting the author was working with.

The epistemic claims the author is making aren't supported by the work. The work is interesting for other reasons, of course.