Great and interesting work! The pretrained CLIP image encoder used for image variation expects 224x224 inputs. However, the images you used are 256x256. How were you able to use the pretrained image encoder?
The issue was opened by KenChen701 and has received 2 comments. It is a question about input resolution rather than a feature request or bug report, and based on the content it appears to be resolved.
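One common way to feed 256x256 images to an encoder pretrained at 224x224 is to apply CLIP-style preprocessing first: resize the shorter side to 224, center-crop to 224x224, and normalize with CLIP's published RGB mean and standard deviation. This is not necessarily what the authors did. The sketch below is an assumption-laden illustration: it uses NumPy only, substitutes bilinear interpolation for the bicubic resize in CLIP's reference pipeline, and `resize_bilinear` and `preprocess_for_clip` are hypothetical helper names.

```python
import numpy as np

# CLIP's published normalization constants (RGB order).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)


def resize_bilinear(img: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear resize of an HxWxC float array using half-pixel centers.

    Assumption: bilinear stands in for the bicubic resize CLIP uses.
    """
    in_h, in_w = img.shape[:2]
    # Map each output pixel center back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]  # vertical weights, shape (out_h, 1, 1)
    wx = (xs - x0)[None, :, None]  # horizontal weights, shape (1, out_w, 1)
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def preprocess_for_clip(img_uint8: np.ndarray, size: int = 224) -> np.ndarray:
    """Map an HxWx3 uint8 RGB image to a normalized 3 x size x size array."""
    img = img_uint8.astype(np.float32) / 255.0
    h, w = img.shape[:2]
    # Resize so the shorter side equals `size`, keeping the aspect ratio.
    scale = size / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    img = resize_bilinear(img, new_h, new_w)
    # Center-crop to size x size (a no-op for square inputs).
    top = (new_h - size) // 2
    left = (new_w - size) // 2
    img = img[top:top + size, left:left + size]
    # Normalize per channel, then move channels first (CHW layout).
    img = (img - CLIP_MEAN) / CLIP_STD
    return img.transpose(2, 0, 1).astype(np.float32)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
x = preprocess_for_clip(image)
print(x.shape)  # (3, 224, 224)
```

For a 256x256 input the scale factor is 224/256 = 0.875, so the resize alone yields 224x224 and the crop does nothing; non-square inputs are cropped around the center. An alternative that keeps the full 256x256 resolution would be to adapt the encoder itself (for example, its positional embeddings), which this sketch does not attempt.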