scvreadgo committed
Commit 6c887fa · verified · 1 Parent(s): 274305b

Update README.md

Files changed (1): README.md (+1 -1)
README.md CHANGED
@@ -30,7 +30,7 @@ This repository provides the official models for the following paper:
  [**Xin Lu**](https://scholar.google.com/citations?user=mFC0wp8AAAAJ)&nbsp;&nbsp;&nbsp;&nbsp;
  [**Min Jin Chong**](https://scholar.google.com/citations?user=9y0SUtEAAAAJ&hl=en)<br />
 
- **Email**: sun8878232@gmail.com
+ **Email**: zichuan.liu@bytedance.com
 
  > **Abstract:** *Recent progress in text-to-image (T2I) diffusion models has greatly improved image quality and flexibility. However, a major challenge in personalized generation remains: preserving the subject’s identity (ID) while allowing diverse visual changes. We address this with a new framework for ID-preserving image generation. Instead of relying on adapter modules to inject identity features into pre-trained models, we propose a unified multi-modal encoding strategy that jointly captures identity and text information. Our method, called FaceCLIP, learns a shared embedding space for facial identity and textual semantics. Given a reference face image and a text prompt, FaceCLIP produces a joint representation that guides the generative model to synthesize images consistent with both the subject’s identity and the prompt. To train FaceCLIP, we introduce a multi-modal alignment loss that aligns features across face, text, and image domains. We then integrate FaceCLIP with existing UNet and Diffusion Transformer (DiT) architectures, forming a complete synthesis pipeline FaceCLIP-x. Compared to existing ID-preserving approaches, our method produces more photorealistic portraits with better identity retention and text alignment. Extensive experiments demonstrate that FaceCLIP-x outperforms prior methods in both qualitative and quantitative evaluations.*