Special article | Visual expression in the era of intelligent creation: Research

  • 2024-06-06

Abstract:

This paper provides a comprehensive and systematic review of the research progress in the field of controllable image synthesis, from the classification and evaluation system to existing challenges and future research directions. First, several representative deep generative models are introduced in detail. Then, based on the different control modes, existing controllable image synthesis methods are divided into three categories: conditional controllable image synthesis methods, GAN-based inversion controllable image synthesis methods, and causal controllable image synthesis methods. Finally, several open issues and future development directions in the field of controllable image synthesis in the era of intelligent creativity are discussed.

Keywords:

Intelligent Creativity; Controllable Image Synthesis; Generative Models; Causal Representation Learning

0 Introduction

Image synthesis is a challenging field in computer vision and graphics, widely applied in scenarios such as image generation, image-to-image translation, and image editing. It aims to generate target images containing specific expected content by learning a mapping from a source domain (e.g., images, text, labels, or even noise) to images. In the era of rapidly developing large-scale generative models, various advanced generative models (such as Generative Adversarial Networks, Variational Autoencoders, flow models, Transformers, and diffusion models) provide unprecedented opportunities for digital art creation. Although generating high-resolution, high-fidelity, and diverse artistic images remains the main goal in the field of digital art, the controllability of generated images has become a direction highly valued by artists. Taking artistic creation as an example, the controllability of digital art images is reflected in many aspects, such as image style, artistic elements, and color matching. Such controllability is usually achieved by introducing additional information into the generative model, which can be text input, such as "generate an abstract painting with a sense of future technology," or image input, such as style reference images, sketches, or edge maps. The core idea of controllable image synthesis is to let users intuitively guide the desired image content during generation or editing, emphasizing the user's ability to control content, object position and orientation, background, and other aspects more precisely. For example, when users want to transform a real image into a Van Gogh-style image, they can do so by providing Van Gogh's paintings or a brief description (e.g., "Van Gogh style"). This controllability enables artists to focus on producing the desired image and provides creators with more powerful and precise creative tools, which not only enriches the means of digital art creation but also meets the demand for personalized and customized art, promoting innovation and development in the field of digital art. It is worth noting that controllable image synthesis can be regarded as focused image generation, concentrating on specific parts of the generated image.


According to the different control modes, existing controllable image synthesis methods can be divided into three categories: conditional controllable image synthesis methods, GAN inversion-based controllable image synthesis methods, and causal controllable image synthesis methods. Figure 1 shows these three controllable image generation approaches from a probabilistic perspective.

Conditional controllable image synthesis methods aim to guide image generation with a set of given prior information (such as attribute labels, text descriptions, semantic segmentation maps, keypoints, speech, physiological signals, etc.). Deep generative models are the most popular paradigm for implementing such methods, especially Variational Autoencoders, GANs, diffusion models, and Transformers. Conditional controllable image synthesis methods generate new images through the joint distribution of known images and their corresponding condition labels. Formally, they model the conditional probability P(X | Y) by learning the joint distribution P(X, Y) of images X and their corresponding condition labels Y in the training dataset. Taking digital art images as an example, users can generate art images in a desired style by specifying the condition Y (e.g., "abstract style"). However, such methods may fail on multivariate problems, because in real-world scenarios there may be complex associations and dependencies between observed classes and unobserved variables. When we set conditions to control the observed classes, the unobserved variables may change due to their relationship with the observed classes, leading to unwanted changes in the generated images. In short, conditional models may not fully capture the interactions of all factors in complex scenes, making fine-grained controllable image synthesis difficult.
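To make the conditional formulation concrete, the following is a minimal sketch of a class-conditional generator in PyTorch; the network shape, embedding size, and class count are illustrative assumptions rather than details from any specific method surveyed here.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Minimal class-conditional generator: approximates P(X | Y) by feeding
    a label embedding alongside the noise vector (hypothetical sizes)."""
    def __init__(self, z_dim=128, n_classes=10, img_dim=64 * 64 * 3):
        super().__init__()
        self.label_emb = nn.Embedding(n_classes, z_dim)  # condition Y -> vector
        self.net = nn.Sequential(
            nn.Linear(z_dim * 2, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, img_dim), nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, y):
        # Concatenate noise with the embedded condition so the generator
        # learns the joint structure of (X, Y) during adversarial training.
        h = torch.cat([z, self.label_emb(y)], dim=1)
        return self.net(h).view(-1, 3, 64, 64)

# Usage: sample images under condition y (e.g., a style class index).
G = ConditionalGenerator()
z = torch.randn(4, 128)
y = torch.randint(0, 10, (4,))
imgs = G(z, y)  # shape (4, 3, 64, 64)
```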

Unlike conditional controllable image synthesis methods, GAN inversion-based methods start from the perspective of representation learning: an encoder extracts the latent representation (latent code) of an image, and pre-trained attribute classifiers or other simple image statistics are then used to find semantically meaningful directions in the latent space, thereby achieving control over specific parts of the image. Returning to the aforementioned art images, GAN inversion-based methods can change the style, color, shape, size, etc. of specific target objects in an art image by moving the latent code of the input image along the semantic directions learned in the latent space.

Similar to conditional controllable image synthesis methods, GAN inversion-based methods can model the conditional distribution P(X | do(Y), Z) by learning the joint distribution P(X, Y, Z), where Z denotes the latent representation, which is usually unobservable. By changing the latent code Z, users can alter specific attributes or parts of the image by optimizing P(X | Z). It is worth noting that learning disentangled representations is a special case of GAN inversion-based methods, which cleanly separates the underlying structure of the image into disjoint parts; in other words, each dimension of the latent code Z represents a single part or attribute of the image.
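As an illustration of editing through Z, the following sketch traverses a single dimension of a disentangled latent code with a pre-trained generator; the generator handle `G` and the meaning of the chosen dimension are hypothetical assumptions.

```python
import torch

# Assume `G` is any pre-trained generator mapping latent codes to images,
# and that dimension `k` of the (disentangled) code controls one attribute.
@torch.no_grad()
def traverse_dimension(G, z, k, values):
    """Vary one latent dimension while keeping the rest fixed,
    yielding a series of images that change a single attribute."""
    images = []
    for v in values:
        z_edit = z.clone()
        z_edit[:, k] = v  # modify a single coordinate of Z
        images.append(G(z_edit))
    return images

# Usage (hypothetical): sweep dimension 3 from -3 to 3 in 7 steps.
# frames = traverse_dimension(G, torch.randn(1, 128), k=3,
#                             values=torch.linspace(-3, 3, 7))
```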

The two methods above rest on the assumption that the training and test distributions match. However, due to data selection bias, models are prone to learning unstable spurious correlations, which harms the diversity of generated images. Consider a generative model that aims to generate images with artistic styles from input text descriptions. Because the artist's works are correlated with specific text descriptions in the training data, the model may exhibit spurious correlations at test time; that is, it may come to believe that specific words or themes in the text are tied to a particular artistic style. The generated images may then be biased too strongly toward one artist's style, ignoring the multiple possible style elements in the user's input text. Solving this problem requires considering a set of distributions, each associated with a possible operation, to better balance diversity. Causal controllable image synthesis methods explore the causal relationships between actions and the expected image entities, learning implicit causal representations within the image generation mechanism to handle these distributions. Such methods introduce a new operator do(Y) into the statistical model to represent an action, indicating an intervention on Y rather than an observation, so the optimization goal becomes P(X | do(Y)). These causal controllable generative models can learn causal relationships by guiding the generative model to intervene on image parts or attributes, simulate operations, and eliminate spurious correlations. In this way, users can directly manipulate the variable Y to control entities in the image, with an explicit understanding of how different entities affect one another. To help readers gain a systematic understanding of the progress of controllable image synthesis in digital art, this paper comprehensively reviews and discusses the controllable image synthesis methods proposed in recent years.

1 Controllable Image Synthesis

1.1 Conditional Controllable Image Synthesis Methods

Conditional image synthesis models the conditional distribution of images given prior information to achieve controllable synthesis or editing of input images. According to the modality of the input prior, conditional controllable image synthesis methods can be divided into five categories:

(1) Label control, including class labels, semantic segmentation maps, image layouts, scene graphs, etc.;

(2) Visual control, such as sketch images, grayscale images, edge images, low-resolution images, or partial image blocks;

(3) Text control, which synthesizes corresponding images given a text description, also known as text-to-image synthesis (see the sketch after this list);

(4) Audio control, that is, different sound signals, including human speech, animal sounds, vehicle noises, etc.;

(5) Multimodal control, which uses two or more of the above four types of modal information.
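
As a concrete example of text control, the following is a minimal text-to-image sketch using the Hugging Face diffusers library; the model identifier and generation arguments are reasonable assumptions at the time of writing, not prescriptions from this survey.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pre-trained text-to-image diffusion model (assumed model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The text prompt is the condition Y guiding synthesis.
prompt = "an abstract painting with a sense of future technology"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("abstract_futuristic.png")
```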

Figure 2 Classification of controllable image synthesis methods

Methods based on label control typically rely on given image attributes, image layouts, semantic segmentation masks, or scene graphs to provide control information for image synthesis. However, these methods often require additional labeled data or paired training images, and collecting and annotating such paired data is very difficult and time-consuming, which greatly limits their development. Visual control further facilitates interactive operation and precise manipulation during image synthesis, thanks to its inherent ability to convey spatial and structural details. Unlike visual control, text control offers a more flexible way to express and interpret visual concepts, bringing greater creativity and diversity to image synthesis. However, the potential ambiguity of text descriptions makes the generated images unpredictable. For example, given a vague description such as "a futuristic cityscape," the generative model may struggle to understand and render the specific details of the futuristic feel, producing images that do not meet user expectations. Audio control methods face similar issues. Therefore, to synthesize high-quality artistic images that are precisely controllable by integrating the advantages of multiple modalities of conditional information, many controllable image synthesis methods based on multimodal conditions have emerged. These methods combine various control conditions, such as edge maps + text descriptions, semantic segmentation maps + text descriptions, human pose + layout, and semantic segmentation maps + sketches, to guide the synthesis process more accurately. For example, ControlNet supports text prompts together with additional input conditions, such as edge maps, segmentation maps, and keypoints, to precisely control image synthesis; GLIGEN uses gated self-attention layers to process conditions, injecting new conditional information (such as bounding boxes) into a pre-trained model to improve quality and controllability; and there are also methods that can generate combinations of output forms such as language, images, videos, or audio from different input modalities.
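
To illustrate multimodal control in practice, the following is a minimal sketch of conditioning a diffusion model on both a text prompt and an edge map via ControlNet, using the diffusers library; the model identifiers, file names, and preprocessing choices are assumptions for demonstration.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract Canny edges from a reference image to serve as the visual condition.
reference = np.array(Image.open("reference.png").convert("RGB"))
gray = cv2.cvtColor(reference, cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load a ControlNet trained on Canny edges and attach it to a base model
# (assumed model IDs).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

# The text prompt and the edge map jointly steer the synthesis.
prompt = "an oil painting of a city street, Van Gogh style"
image = pipe(prompt, image=edge_image, num_inference_steps=30).images[0]
image.save("controlled_output.png")
```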

1.2 Controllable Image Synthesis Methods Based on GAN Inversion

Figure 2 (b) illustrates the pipeline of GAN inversion-based methods, which comprises three key modules: obtaining the latent code of a real image, finding meaningful directions in the GAN space (including the latent space and the parameter space), and achieving controllable image generation.

(1) Latent Code Acquisition. Existing methods for latent code acquisition fall roughly into three categories: optimization-based methods, encoder-based methods, and hybrid methods. Optimization-based methods obtain the optimal latent code by minimizing the difference between the given image and the reconstructed image, which can achieve high reconstruction quality. However, this optimization problem is highly non-convex and prone to local optima, so optimizing a single latent code cannot faithfully reconstruct arbitrary images. Encoder-based methods obtain the latent code of a real image by learning an additional encoder, which is more convenient but makes high-fidelity reconstruction difficult. Therefore, many methods combine the two (referred to as hybrid methods): an encoder first produces an initial latent code, which is then optimized to obtain a latent code that accurately reconstructs the source image. This greatly eases the optimization step by providing a good initialization while preserving reconstruction quality.
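A minimal sketch of optimization-based inversion is shown below: the latent code is treated as a trainable parameter and fitted to a target image by gradient descent. The generator handle `G`, loss choice, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def invert_image(G, target, z_dim=512, steps=500, lr=0.05, z_init=None):
    """Optimization-based GAN inversion: fit a latent code z so that
    G(z) reconstructs `target`. Passing `z_init` (e.g., from an encoder)
    turns this into the hybrid scheme described above."""
    z = (z_init.clone() if z_init is not None
         else torch.randn(1, z_dim)).requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = G(z)
        # Pixel reconstruction loss; practical systems usually add a
        # perceptual (e.g., LPIPS) term for sharper results.
        loss = F.mse_loss(recon, target)
        loss.backward()
        opt.step()
    return z.detach()
```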

(2) GAN Space Exploration. Depending on whether supervision is used, existing latent space exploration methods can be divided into supervised and unsupervised methods. Supervised methods usually sample a large number of latent codes at random and use a pre-trained generator to synthesize a set of images, building a labeled dataset for training classifiers in the latent space. For example, InterFaceGAN trains a separate support vector machine for each binary attribute to obtain a linear hyperplane in the latent space, and then uses the hyperplane to manipulate that image attribute. However, such methods rely on predefined classifiers, limiting the flexibility of image editing. Unsupervised methods have also achieved exciting results. For example, Harkonen et al. (2021) use principal component analysis to find important directions in the GAN latent space; Shen et al. (2021) proposed a closed-form factorization algorithm that discovers semantics in the latent space by directly decomposing the weights of the pre-trained generator, finding semantically meaningful editing directions without data sampling or model training. However, such methods often cannot achieve high-precision image editing and struggle to support real-time interactive image synthesis.
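The following sketch illustrates the closed-form factorization idea: taking the top eigenvectors of the generator's first projection weight as candidate editing directions. How the weight is extracted is generator-specific, and which attribute each direction controls must be inspected manually; all names here are illustrative.

```python
import torch

def closed_form_directions(weight, k=5):
    """Given the weight matrix W of the generator layer that first maps
    the latent code (shape: out_features x z_dim), return the top-k
    eigenvectors of W^T W as candidate semantic editing directions."""
    eigvals, eigvecs = torch.linalg.eigh(weight.T @ weight)
    order = torch.argsort(eigvals, descending=True)
    return eigvecs[:, order[:k]].T  # shape (k, z_dim)

@torch.no_grad()
def edit(G, z, direction, alpha=3.0):
    """Move a latent code along a discovered direction and re-synthesize."""
    return G(z + alpha * direction.unsqueeze(0))

# Usage (hypothetical): W = G.first_layer.weight
# dirs = closed_form_directions(W)
# edited = edit(G, torch.randn(1, W.shape[1]), dirs[0])
```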

(3) Controllable Image Synthesis. Existing methods usually feed the edited latent code (the latent code shifted along a learned semantic direction) into an off-the-shelf, well-trained generator to obtain high-resolution, high-fidelity synthetic images. Common pre-trained generators include BigGAN, PGGAN, StyleGAN, StyleGAN2, etc. It is worth noting that relying on pre-trained generators limits the expressive power of GAN inversion-based methods, so the generated images can lack diversity.

1.3 Causal Controllable Image Synthesis Methods

Causal controllable synthesis methods aim to generate more plausible images by modeling the causal relationships among the attributes of an image. These methods acknowledge the interdependence of image attributes, thereby producing more consistent attribute variations and enhanced controllability. Taking artistic image generation as an example, the attributes of an artistic image are clearly not independent of one another: colors, brushstrokes, and canvas textures all reflect different emotions and artistic styles. Conditional controllable methods and GAN inversion-based methods often assume that attributes are mutually independent, which leads to implausible variations in the generated images. Causal controllable image generation methods, by contrast, model the causal relationships between attributes and support causal intervention operations and counterfactual image generation. For instance, in a portrait, a smile causes an open mouth and narrowed eyes, that is, eye shape ← smile → mouth shape. In causal controllable image generation, changing the smile (the cause attribute) will change the mouth and eye shapes (the effect attributes); conversely, changing the mouth or eyes (the effect attributes) will not change the smile (the cause attribute). Figure 2 (c) shows two typical causal controllable image synthesis methods.

Based on whether a causal graph or causal ordering is given as a model prior, existing causal controllable generation methods can be divided into two types: methods based on causal priors and methods based on causal representation learning. The former use a given causal graph to learn a causal generative model for causal controllable image synthesis, such as CausalGAN and DEAR. These methods rely on expert knowledge to specify the causal graph in advance, yet many real-world causal relationships are difficult for humans to define. The latter use causal representation learning to learn the causal relationships among the latent representations of images from data, such as CausalVAE and CCIG. These methods not only achieve causal controllable image generation but also allow intervention operations on the learned latent codes to generate counterfactual images. However, their performance depends heavily on the quality of the learned causal graph; in other words, if the learned causal relationships among the latent representations are poor, the plausibility of the synthesized images will suffer.
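To make the asymmetry of intervention concrete, the following toy structural causal model encodes the portrait example above (eye shape ← smile → mouth shape); the linear mechanisms and noise scales are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_smile=None, do_mouth=None):
    """Toy SCM: smile causes mouth openness and eye narrowing.
    A `do_*` argument overrides that variable's mechanism (an intervention)."""
    smile = rng.normal(0, 1, n) if do_smile is None else np.full(n, do_smile)
    mouth = 0.8 * smile + 0.1 * rng.normal(0, 1, n)  # effect of smile
    if do_mouth is not None:
        mouth = np.full(n, do_mouth)                 # do(mouth): cut the edge
    eyes = -0.6 * smile + 0.1 * rng.normal(0, 1, n)  # effect of smile
    return smile, mouth, eyes

# Intervening on the cause changes its effects...
s, m, e = sample(10000, do_smile=2.0)
print(m.mean(), e.mean())   # mouth opens (~1.6), eyes narrow (~-1.2)

# ...but intervening on an effect leaves the cause untouched.
s, m, e = sample(10000, do_mouth=2.0)
print(s.mean())             # smile stays at its natural mean (~0)
```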

2 Open Questions and Future Directions

Although controllable image synthesis methods have made significant progress in the era of intelligent creativity and have shown good performance, there are still many challenges in practical applications.

Limited model scalability. Due to the diversity of artistic image types and the differing data distributions across datasets, controllable image synthesis models usually need to be trained separately on each dataset, wasting substantial computing resources. One way to address this is to train more general foundation models, increasing data volume, enriching image categories, and reducing data distribution bias, thereby improving the scalability of the foundation model.

Lack of a unified image quality evaluation metric. Although many image quality metrics exist for assessing synthetic images, such as SSIM and PSNR, these metrics typically require access to a source image, which is very difficult to obtain for synthesized images. Other metrics, such as FID and IS, can evaluate the fidelity and diversity of generated images, but they hardly quantify whether a synthesized artistic image meets the user's expectations. Therefore, most synthesis methods still rely on subjective evaluation to assess image quality. Designing a unified image editing quality evaluation system that evaluates synthetic images more objectively and comprehensively remains a major challenge for the future.
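
For reference, the following sketch computes the reference-based metrics (SSIM, PSNR) and the distribution-based FID mentioned above; it assumes the scikit-image and torchmetrics packages, and the random images and small sample counts are placeholders for illustration only.

```python
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity
from torchmetrics.image.fid import FrechetInceptionDistance

# Reference-based metrics need a source image (often unavailable in practice).
source = np.random.rand(256, 256, 3)   # placeholder source image
synth = np.random.rand(256, 256, 3)    # placeholder synthesized image
print("PSNR:", peak_signal_noise_ratio(source, synth, data_range=1.0))
print("SSIM:", structural_similarity(source, synth, channel_axis=-1,
                                     data_range=1.0))

# FID compares feature distributions of real vs. generated image sets,
# so no per-image source is required (uint8 tensors in NCHW layout).
fid = FrechetInceptionDistance(feature=64)
real = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (16, 3, 299, 299), dtype=torch.uint8)
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())
```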

Multimodal controllable image generation. Existing controllable artistic image generation methods usually design a specific method for each control modality (such as text control or voice control), and most can use only one modality; few methods can combine multiple control modalities for image synthesis at the same time. How to integrate multiple control modalities into a unified framework to achieve more flexible controllable artistic image synthesis is a direction worth studying in the future. Achieving this goal requires building a large-scale multimodal dataset with annotations spanning multiple modalities (semantic segmentation masks, text descriptions, voice descriptions, sketches, depth maps, etc.).

Ethical issues and risks. With the development of AI image generation technology, concerns about the potential misuse of generated images are growing, for example, forged artworks, copyright disputes, privacy, and cultural sensitivity. In addition, controllable image synthesis is a highly data-driven task, so models trained on large-scale but homogeneous datasets may amplify the biases in those datasets, bringing ethical risks. Researchers are therefore actively studying the visual manipulations produced by large-scale models in order to distinguish generated images from real ones and trace them back to their source models. Corresponding policies and ethical guidelines should also be formulated to ensure the responsible application of AIGC artistic image generation technology.

(References omitted)

Huang Shanshan

Doctoral candidate at Chongqing University. Main research directions include causal representation learning and image generation.

Liu Li

Professor at Chongqing University; selected into the 2022 AI 2000 list of the world's most influential scholars in artificial intelligence. Main research directions include causal analysis and human-computer interaction.
