LongCat-Image

Introduction

We introduce LongCat-Image, a pioneering open-source, bilingual (Chinese-English) foundation model for image generation. It is designed to address core challenges prevalent in current leading models: multilingual text rendering, photorealism, deployment efficiency, and developer accessibility.

LongCat-Image Generation Examples

Key Features

  • 🌟 Exceptional Efficiency and Performance: With only 6B parameters, LongCat-Image surpasses numerous open-source models several times its size across multiple benchmarks, demonstrating the potential of efficient model design.
  • 🌟 Powerful Chinese Text Rendering: LongCat-Image renders common Chinese characters with greater accuracy and stability than existing SOTA open-source models, and achieves industry-leading coverage of the Chinese dictionary.
  • 🌟 Remarkable Photorealism: Through an innovative data strategy and training framework, LongCat-Image achieves remarkable photorealism in generated images.

🎨 Showcase

LongCat-Image Generation Examples

Quick Start

Installation

Clone the repo:

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Image
cd LongCat-Image

Install dependencies:

# create conda environment
conda create -n longcat-image python=3.10
conda activate longcat-image

# install other requirements
pip install -r requirements.txt
python setup.py develop

Run Text-to-Image Generation

Leveraging a stronger LLM for prompt refinement can further enhance image generation quality. Please refer to inference_t2i.py for detailed usage instructions.
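The refinement step can be sketched as a plain text-in/text-out wrapper. The function and template below are illustrative stand-ins, not the LongCat-Image API; the real entry point is in inference_t2i.py, and the `enable_prompt_rewrite` flag shown later provides a built-in alternative.

```python
from typing import Callable

# Hypothetical rewrite instruction; the actual template used by the project may differ.
REWRITE_TEMPLATE = (
    "Rewrite this image prompt with concrete details about subject, "
    "lighting, composition and style, keeping any quoted text verbatim:\n{prompt}"
)

def refine_prompt(prompt: str, llm: Callable[[str], str]) -> str:
    """Pass the user prompt through an external LLM before generation.

    `llm` is any text-in/text-out callable (an API client, a local model, ...).
    """
    return llm(REWRITE_TEMPLATE.format(prompt=prompt))

# With a trivial stand-in "LLM" that just echoes the last line of its input:
print(refine_prompt("a cat", lambda p: p.splitlines()[-1]))  # -> a cat
```

In practice you would plug in a real chat model as `llm` and feed its output to the pipeline in place of the raw prompt.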

๐Ÿ“ Special Handling for Text Rendering

For both Text-to-Image and Image Editing tasks involving text generation, you must enclose the target text within single or double quotation marks (both English '...' / "..." and Chinese ‘...’ / “...” styles are supported).

Reasoning: The model applies a specialized character-level encoding strategy to quoted content only. Without explicit quotation marks this mechanism is never triggered, which severely degrades text rendering quality.
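To make the rule concrete, the helper below shows which parts of a prompt would qualify for the character-level treatment. It is a purely illustrative sketch of the quoting convention; the function name and logic are not part of the LongCat-Image API.

```python
import re

# The four quote styles the model card says are supported:
# ASCII single/double and Chinese (curly) single/double.
QUOTE_PAIRS = [("'", "'"), ('"', '"'), ('\u2018', '\u2019'), ('\u201c', '\u201d')]

def quoted_spans(prompt: str) -> list[str]:
    """Return the text segments enclosed in supported quotation marks."""
    spans = []
    for open_q, close_q in QUOTE_PAIRS:
        spans += re.findall(re.escape(open_q) + r"(.+?)" + re.escape(close_q), prompt)
    return spans

# 'OPEN' is quoted, so it would be encoded character-by-character;
# the unquoted words go through the normal text encoder.
print(quoted_spans('A neon sign that says "OPEN" above the door'))  # -> ['OPEN']
```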

import torch
from transformers import AutoProcessor
from longcat_image.models import LongCatImageTransformer2DModel
from longcat_image.pipelines import LongCatImagePipeline

device = torch.device('cuda')
checkpoint_dir = './weights/LongCat-Image'

text_processor = AutoProcessor.from_pretrained(checkpoint_dir, subfolder='tokenizer')
transformer = LongCatImageTransformer2DModel.from_pretrained(
    checkpoint_dir, subfolder='transformer',
    torch_dtype=torch.bfloat16, use_safetensors=True).to(device)

pipe = LongCatImagePipeline.from_pretrained(
    checkpoint_dir,
    transformer=transformer,
    text_processor=text_processor
)
# pipe.to(device, torch.bfloat16)  # Uncomment for high VRAM devices (Faster inference)
pipe.enable_model_cpu_offload()  # Offload to CPU to save VRAM (requires ~17 GB); slower but prevents OOM

# Prompt (in Chinese): "A young Asian woman wearing a yellow knit top with a white necklace.
# Her hands rest on her knees, her expression serene. The background is a rough brick wall,
# with warm afternoon sunlight falling on her, creating a calm, cozy atmosphere. A medium
# shot highlights her expression and the details of her outfit; soft light on her face
# accentuates her features and the texture of her jewelry, adding depth and warmth. The
# composition is clean, with the brick texture and the play of sunlight complementing each
# other to bring out her elegance and poise."
prompt = '一个年轻的亚裔女性，身穿黄色针织衫，搭配白色项链。她的双手放在膝盖上，表情恬静。背景是一堵粗糙的砖墙，午后的阳光温暖地洒在她身上，营造出一种宁静而温馨的氛围。镜头采用中距离视角，突出她的神态和服饰的细节。光线柔和地打在她的脸上，强调她的五官和饰品的质感，增加画面的层次感与亲和力。整个画面构图简洁，砖墙的纹理与阳光的光影效果相得益彰，突显出人物的优雅与从容。'

image = pipe(
    prompt,
    height=768,
    width=1344,
    guidance_scale=4.5,
    num_inference_steps=50,
    num_images_per_prompt=1,
    generator=torch.Generator("cpu").manual_seed(43),
    enable_cfg_renorm=True,
    enable_prompt_rewrite=True # Reusing the text encoder as a built-in prompt rewriter
).images[0]
image.save('./t2i_example.png')
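The call above fixes height=768 and width=1344. For other aspect ratios, a small helper like the following can snap a target area and ratio to valid values. This is an illustrative sketch, assuming the pipeline expects dimensions divisible by 16, as most latent diffusion pipelines do; that divisibility requirement is an assumption, not documented LongCat-Image behavior.

```python
def snap_resolution(aspect_ratio: float,
                    target_area: int = 768 * 1344,
                    multiple: int = 16) -> tuple[int, int]:
    """Pick (height, width) near target_area with the given width/height ratio,
    rounded to a multiple that latent diffusion pipelines typically require."""
    height = (target_area / aspect_ratio) ** 0.5
    width = height * aspect_ratio
    snap = lambda v: max(multiple, int(round(v / multiple)) * multiple)
    return snap(height), snap(width)

print(snap_resolution(1344 / 768))  # -> (768, 1344)
```

The returned pair can be passed straight to the pipeline as `height=` and `width=`.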