Text-to-image based object customization, aiming to generate images with the same identity (ID) as objects of interest in accordance with text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to provide the diffusion UNet with as much of the extracted ID information as possible without damaging text editability during generation, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into the text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations, enabling high-fidelity generation with respect to both the identity and the text descriptions. To validate our approach and advance research on general object customization, we create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker delivers remarkable performance in general object customization and outperforms specialized methods in their corresponding tasks. Code and dataset will be released soon.
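As a rough illustration of the general ID extraction idea, the PyTorch-style sketch below masks the object of interest and feeds it through an ensemble of self-supervised encoders, then splits the fused features into a global semantic ID token and local ID detail tokens. The encoder interface (a shared patch grid and an embed_dim attribute), the feature dimensions, and the mean-pooling used for the global token are illustrative assumptions, not the exact design used in AnyMaker.

# Hypothetical sketch of general ID extraction with an ensemble of
# self-supervised encoders; names and shapes are assumptions.
import torch
import torch.nn as nn

class GeneralIDExtractor(nn.Module):
    def __init__(self, encoders, out_dim=768):
        super().__init__()
        # `encoders`: frozen self-supervised backbones (e.g. DINOv2-style),
        # each mapping an image to patch tokens of shape (B, N, C_i) and
        # assumed to share the same patch grid size N.
        self.encoders = nn.ModuleList(encoders)
        in_dim = sum(e.embed_dim for e in encoders)
        self.proj = nn.Linear(in_dim, out_dim)  # fuse the ensemble features

    def forward(self, ref_image, mask):
        # Keep only the object of interest before encoding.
        masked = ref_image * mask
        feats = [enc(masked) for enc in self.encoders]   # each (B, N, C_i)
        tokens = self.proj(torch.cat(feats, dim=-1))      # (B, N, out_dim)
        global_id = tokens.mean(dim=1, keepdim=True)      # (B, 1, out_dim) semantic-level ID
        local_id = tokens                                  # (B, N, out_dim) detail-level ID
        return global_id, local_id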
Given a reference image and a text prompt, our goal is to generate images that share the same ID as the object of interest in the reference image, while modifying non-ID elements such as motions and backgrounds in accordance with the text prompt. To this end, we propose AnyMaker, an innovative zero-shot text-to-image customization framework for general objects. First, we utilize the general ID extractor to acquire sufficient ID information from the reference image together with a segmentation mask of the object of interest. Next, we leverage the dual-level ID injection to endow the diffusion model with ID information at the global and local levels respectively, without damaging the text editing ability of the model. Additionally, we design a novel ID-aware decoupling module to distinguish ID information from the coupled non-ID elements in the object representation, for higher ID fidelity and better text editing ability.
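The following minimal sketch shows how the dual-level ID injection could be wired into one UNet attention block: the global semantic ID token is concatenated with the text embeddings for the existing text cross-attention, while the local ID tokens enter through a newly added cross-attention whose output is gated by a learnable scale. Module names, dimensions, and the zero-initialized gate are assumptions for illustration and do not reflect the released implementation.

# Hypothetical dual-level ID injection inside one UNet attention block.
import torch
import torch.nn as nn

class DualLevelIDBlock(nn.Module):
    def __init__(self, dim=320, ctx_dim=768, heads=8):
        super().__init__()
        # Original text cross-attention of the diffusion UNet (kept as-is).
        self.text_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        # Newly added cross-attention dedicated to local-level ID details.
        self.id_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.id_scale = nn.Parameter(torch.zeros(1))  # start with no local-ID influence

    def forward(self, hidden, text_emb, global_id, local_id):
        # Global-level injection: the semantic ID token joins the text context.
        context = torch.cat([text_emb, global_id], dim=1)
        out, _ = self.text_attn(hidden, context, context)
        # Local-level injection: ID details enter via the extra cross-attention.
        id_out, _ = self.id_attn(hidden, local_id, local_id)
        return hidden + out + self.id_scale * id_out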
Our AnyMaker is capable of generating high-ID-fidelity images given a single reference image, while simultaneously editing non-ID elements like posture or background in accordance with text prompts.
Given a piece of clothing, AnyMaker can generate images of it worn on a person.
Our AnyMaker can generate diverse images under the guidance of text prompts, while maintaining the same identity as the object of interest in the reference image, thereby enabling the creation of a cohesive narrative.
If the object category in the text prompt differs from that in the reference image, AnyMaker can merge the two and form a new ID.
AnyMaker can generate multiple ID-consistent images with diverse non-ID elements such as motions and orientations.
AnyMaker exhibits outstanding high-quality customization capabilities for general objects, and even beats task-specialized methods in their specific domains, such as human customization and virtual try-on, in terms of ID fidelity and text editability.
To promote research on general object customization, we construct the first large-scale general ID dataset, named the Multi-Category ID-Consistent (MC-IDC) dataset. Our dataset consists of approximately 315,000 samples in total with more than 10,000 categories, covering various types such as human faces, animals, clothes, human-made tools, etc. Each sample consists of a reference image, a segmentation mask of the object of interest in the reference image, a target image, and a text caption of the target image. The reference image with its segmentation mask provides ID information, the text caption of the target image offers semantic-level guidance for generation, and the target image serves as the ground truth.
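For concreteness, a minimal loader for one MC-IDC sample might look like the sketch below, which returns the four components of each sample (reference image, segmentation mask, target image, caption). The directory layout and file naming here are assumptions, not the released dataset format.

# Hypothetical per-sample loader for MC-IDC; paths are illustrative only.
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class MCIDCDataset(Dataset):
    def __init__(self, root):
        self.root = Path(root)
        self.ids = sorted(p.stem for p in (self.root / "captions").glob("*.txt"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        sid = self.ids[idx]
        return {
            # ID information: reference image plus object segmentation mask
            "reference": Image.open(self.root / "reference" / f"{sid}.jpg").convert("RGB"),
            "mask": Image.open(self.root / "mask" / f"{sid}.png").convert("L"),
            # Semantic-level guidance and ground truth
            "caption": (self.root / "captions" / f"{sid}.txt").read_text().strip(),
            "target": Image.open(self.root / "target" / f"{sid}.jpg").convert("RGB"),
        }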
@article{kong2024anymaker,
title={AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection},
author={Kong, Lingjie and Wu, Kai and Hu, Xiaobin and Han, Wenhui and Peng, Jinlong and Xu, Chengming and Luo, Donghao and Zhang, Jiangning and Wang, Chengjie and Fu, Yanwei},
journal={arXiv preprint arXiv:2406.11643},
year={2024}
}