Introduction
We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation. It is designed to address core challenges prevalent in current leading models: multilingual text rendering, photorealism, deployment efficiency, and developer accessibility.
Key Features
- Exceptional Efficiency and Performance: With only 6B parameters, LongCat-Image surpasses open-source models several times its size across multiple benchmarks, demonstrating the potential of efficient model design.
- Powerful Chinese Text Rendering: LongCat-Image renders common Chinese characters with greater accuracy and stability than existing SOTA open-source models, and achieves industry-leading coverage of the Chinese character set.
- Remarkable Photorealism: Through an innovative data strategy and training framework, LongCat-Image achieves remarkable photorealism in generated images.
Showcase
Quick Start
Installation
Clone the repo:
```shell
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image
```
Install dependencies:
```shell
# Create the conda environment
conda create -n longcat-image python=3.10
conda activate longcat-image

# Install the requirements and the package itself (editable mode)
pip install -r requirements.txt
python setup.py develop
```
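As a quick sanity check, you can import the pipeline class used in the example below to confirm the editable install succeeded:

```shell
python -c "from longcat_image.pipelines import LongCatImagePipeline; print('LongCat-Image import OK')"
```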
Run Text-to-Image Generation
Leveraging a stronger LLM for prompt refinement can further enhance image generation quality; a minimal sketch of this pattern follows the example below. Please refer to inference_t2i.py for detailed usage instructions.
Special Handling for Text Rendering
For both Text-to-Image and Image Editing tasks involving text generation, you must enclose the target text within single or double quotation marks (both English '...' / "..." and Chinese ‘...’ / “...” styles are supported).
Reasoning: The model uses a specialized character-level encoding strategy for quoted content. Without explicit quotation marks, this mechanism is not triggered, which severely degrades text rendering quality.
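For example, a minimal illustration (these prompts are invented for this note, not taken from inference_t2i.py):

```python
# Quoted span: the character-level encoding is applied to "Grand Opening",
# so the sign text is rendered faithfully.
prompt_quoted = 'A vintage storefront poster with the title "Grand Opening" in bold red letters'

# No quotes: the mechanism is not triggered, and the words Grand Opening
# will likely be rendered inaccurately.
prompt_unquoted = 'A vintage storefront poster with the title Grand Opening in bold red letters'
```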
```python
import torch
from transformers import AutoProcessor

from longcat_image.models import LongCatImageTransformer2DModel
from longcat_image.pipelines import LongCatImagePipeline

device = torch.device('cuda')
checkpoint_dir = './weights/LongCat-Image'

# Load the text processor and the diffusion transformer from the checkpoint.
text_processor = AutoProcessor.from_pretrained(checkpoint_dir, subfolder='tokenizer')
transformer = LongCatImageTransformer2DModel.from_pretrained(
    checkpoint_dir, subfolder='transformer',
    torch_dtype=torch.bfloat16, use_safetensors=True
).to(device)

pipe = LongCatImagePipeline.from_pretrained(
    checkpoint_dir,
    transformer=transformer,
    text_processor=text_processor
)

# pipe.to(device, torch.bfloat16)  # Uncomment on high-VRAM devices for faster inference
pipe.enable_model_cpu_offload()  # Offload to CPU to save VRAM (~17 GB required); slower, but prevents OOM

# Example prompt in Chinese. Translation: "A young Asian woman in a black
# knit sweater with a white necklace. Her hands rest on her knees, her
# expression serene. The background is a rough brick wall; soft afternoon
# sunlight falls gently on her, creating a quiet, warm atmosphere. A
# medium-distance shot highlights her expression and the details of her
# clothing. Soft light on her face accentuates her features and the texture
# of her jewelry, adding depth and warmth to the frame. The composition is
# clean; the brick texture and the play of sunlight complement each other,
# underscoring her elegance and poise."
prompt = '一个年轻的亚裔女性，身穿黑色针织衫，搭配白色项链。她的双手放在膝盖上，表情恬静。背景是一堵粗糙的砖墙，午后的阳光温柔地洒在她身上，营造出一种宁静而温馨的氛围。镜头采用中距离视角，突出她的神态和服饰的细节。光线柔和地打在她的脸上，强调她的五官和饰品的质感，增加画面的层次感与亲和力。整个画面构图简洁，砖墙的纹理与阳光的光影效果相得益彰，突显出人物的优雅与从容。'
image = pipe(
    prompt,
    height=768,
    width=1344,
    guidance_scale=4.5,
    num_inference_steps=50,
    num_images_per_prompt=1,
    generator=torch.Generator("cpu").manual_seed(43),
    enable_cfg_renorm=True,
    enable_prompt_rewrite=True  # Reuse the text encoder as a built-in prompt rewriter
).images[0]
image.save('./t2i_example.png')
```
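As noted above, a stronger external LLM can refine prompts before generation, independently of the built-in enable_prompt_rewrite option. Below is a minimal sketch of that pattern, assuming a generic chat_model with a generate() method; both are placeholders, not part of the LongCat-Image API, so see inference_t2i.py for the official flow.

```python
# Hypothetical prompt-refinement helper: chat_model and its generate()
# method stand in for whatever LLM client you use.
def refine_prompt(user_prompt: str, chat_model) -> str:
    instruction = (
        "Rewrite the following image prompt, adding concrete details about "
        "subject, lighting, composition, and style. Keep any quoted text "
        "('...' or \"...\") exactly as written, since quoted spans are "
        "rendered verbatim. Return only the rewritten prompt."
    )
    return chat_model.generate(f"{instruction}\n\n{user_prompt}")

# Usage sketch:
# refined = refine_prompt('A cozy cafe with a sign saying "Open Daily"', my_llm)
# image = pipe(refined, height=768, width=1344, guidance_scale=4.5,
#              num_inference_steps=50,
#              generator=torch.Generator("cpu").manual_seed(43)).images[0]
```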