It’s possible that the model architecture influences how effectively pretrained weights can be reused. For example, CNNs might be a good fit for this, since the early layers act as the feature extractor; you could keep those, scrap the decoder, and simply retrain that part (a sketch of what I mean is below).
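Roughly the pattern I have in mind, as a minimal sketch assuming PyTorch/torchvision are available (the resnet18 backbone and NUM_CLASSES value are just illustrative placeholders):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a pretrained CNN; its convolutional layers act as the feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pretrained feature-extraction layers.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with a fresh one for the new task
# (NUM_CLASSES is a placeholder for the target task's label count).
NUM_CLASSES = 10
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the new head's parameters are updated during training.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```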
I can’t say whether the same would work with a Transformer architecture, but I would guess some portions could potentially be reused, since there is still an encoder/feature-extraction portion (see the sketch below).
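As a hedged sketch of what reusing the encoder portion might look like, assuming the Hugging Face transformers library is installed (the model name and linear head are illustrative, not a definitive recipe):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

# Load a pretrained Transformer encoder (model name is just an example choice).
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Freeze the pretrained encoder weights.
for param in encoder.parameters():
    param.requires_grad = False

# New head for the downstream task; hidden size comes from the encoder config.
NUM_CLASSES = 2
head = nn.Linear(encoder.config.hidden_size, NUM_CLASSES)

# Only the new head is optimized, mirroring the CNN example above.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```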
If you’re reusing weights from an existing model, then it seems it becomes more of a “fine-tuning” exercise than training a novel foundation model.