From the paper: `..in contrast to using the global CLIP image embeddings employed by image variation methods, here we adopt the local CLIP image embeddings right before the global pooling layer, for more fine-grained human semantics encoding.` I have a few questions:

1. How does the generated human look when using the global CLIP embedding compared to the local embedding? Do you have any images to share?
2. Given that a different embedding is fed into SD, does that mean you effectively retrain SD rather than finetune it?
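
For reference, here is a minimal sketch (not from the paper's code) of how I understand the two embedding types, using the Hugging Face `transformers` CLIP vision model; the checkpoint name `openai/clip-vit-large-patch14` and the image path are my own placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Placeholder checkpoint and reference image (assumptions, not from the paper).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision_model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("person.png")  # hypothetical reference human image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_model(**inputs)

# "Global" CLIP image embedding: a single pooled vector per image,
# shape (1, 768) for ViT-L/14 -- what image variation methods condition on.
global_embed = outputs.image_embeds

# "Local" CLIP image embeddings: the per-token features before global pooling,
# shape (1, 257, 1024) for ViT-L/14 at 224x224 (1 CLS token + 16x16 patches),
# which preserve spatial detail of the person.
local_embeds = outputs.last_hidden_state
```

Is my understanding correct that the local embeddings are these per-patch tokens, so the cross-attention sees a sequence of tokens rather than one pooled vector?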