1 Introduction
There have been recent advances in computer vision and graphics that enable photorealistic images to be created. However, it still requires considerable skill or effort to create a pixel-level detailed image from scratch. Deep generative models, such as generative adversarial networks (GANs) [12] and variational autoencoders (VAEs) [21, 43], have recently emerged as powerful models to alleviate this difficulty. Although these models make it possible to generate various images with high fidelity by changing (e.g., randomly sampling) latent variables in the generator or decoder input, creating the desired image still remains a painstaking process because the naive formulation does not impose any structure on the latent variables; as a result, they may be used by the generator or decoder in a highly entangled manner. This causes difficulty in interpreting the "meaning" of the individual variables and in controlling image generation by operating each one.
When we create an image from scratch, we typically select and narrow a target to paint in a coarse-to-fine manner. For example, when we create an image of a face with glasses, we first roughly consider the type of glasses, e.g., transparent glasses/sunglasses, then define the details, e.g., thin/thick-rimmed glasses or small/big sunglasses. To use a deep generative model as a supporter for creating an image, we believe that such a hierarchically interpretable representation is the key to obtaining the image one has in mind.
These facts motivated us to address the problem of how to derive hierarchically interpretable representations in a deep generative model. To solve this problem, we propose the decision tree latent controller GAN (DTLC-GAN), an extension of the GAN that can learn hierarchically interpretable representations without relying on detailed supervision. Figure 1 shows examples of image generation under control using the DTLC-GAN. If semantic features are represented in a hierarchically disentangled manner, we can approach a target image gradually and interactively.
To impose a hierarchical inclusion structure on latent variables, we incorporate a new architecture called the DTLC into the generator input. The DTLC has a multiple-layer tree structure in which the ON or OFF of the child node codes is controlled by the parent node codes. By using this architecture hierarchically, we can obtain a latent space in which the lower layer codes are selectively used depending on the higher layer codes.
Regarding the problem of making the latent codes capture salient semantic features of images in a hierarchically disentangled manner in the DTLC, the main difficulty is that we need to discover representations disentangled in the following three stages: (1) disentanglement between the control target (e.g., glasses) and unrelated factors (e.g., identity); (2) coarse-to-fine disentanglement between layers, i.e., the higher layer codes capture rough categories, while the lower layer ones capture detailed categories; and (3) inner-layer disentanglement to control semantic features independently, i.e., when one code captures a semantic feature (e.g., thin glasses), another one captures a different semantic feature (e.g., thick glasses).
A possible solution would be to collect detailed annotations, the amount of which is large enough to solve the problems in a fully supervised manner. However, this approach incurs high annotation costs. Even if we had enough human resources, defining the detailed categories would remain a nontrivial task. The latter problem is also addressed in the field of research concerned with attribute representations [36, 58] and is still an open issue. This motivated us to tackle a challenging condition in which hierarchically interpretable representations need to be learned without relying on detailed annotations. Under this condition, it is not trivial to solve all three of the above disentanglement problems at the same time because they are not independent from each other but are interrelated. To mitigate these problems, we propose a hierarchical conditional mutual information regularization (HCMI), which is an extension of MI [4] and conditional MI (CMI) [18] to hierarchical conditional settings, and optimize it with a curriculum learning [3] method that we newly define. This makes it possible to discover hierarchically interpretable representations in a layer-by-layer manner on the basis of information gain using only a single DTLC-GAN model. This is noteworthy because we can learn expressive representations without a large increase in calculation cost. Figure 2 shows typical examples on CIFAR-10, where expressive representations, i.e., categories, are learned in a weakly supervised (i.e., only 10 class labels are supervised) manner. We evaluated our DTLC-GAN on various datasets, i.e., MNIST, CIFAR-10, Tiny ImageNet, 3D Faces, and CelebA, and confirmed that it can learn hierarchically interpretable representations in either unsupervised or weakly supervised settings. Furthermore, we applied our DTLC-GAN to image-retrieval tasks and showed its effectiveness in representation learning.
Contributions:
Our contributions are summarized as follows. (1) We derive a novel functionality in a deep generative model, which enables semantic features of an image to be controlled in a coarse-to-fine manner. (2) To obtain this functionality, we incorporate a new architecture called the DTLC into a GAN, which imposes a hierarchical inclusion structure on latent variables. (3) We propose a regularization called the HCMI and optimize it with a curriculum learning method that we newly define. This makes it possible to learn hierarchically disentangled representations using only a single DTLC-GAN model without relying on detailed supervision. (4) We evaluated our DTLC-GAN on various datasets and confirmed its effectiveness in image-generation and image-retrieval tasks. We provide supplementary materials, including demo videos, at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/dtlcgan/.
2 Related Work
Deep Generative Models:
In computer vision and machine learning, generative image modeling is a fundamental problem. Recently, there has been a significant breakthrough due to the emergence of deep generative models. These models roughly fall into two approaches: deterministic and stochastic. On the basis of deterministic approaches, Dosovitskiy et al. [8] proposed a deconvolution network that generates 3D objects, and Reed et al. [42] and Yang et al. [57] proposed networks that approximate functions for image synthesis. There are three major models based on stochastic approaches. One is the VAE [21, 43], which is formulated as a probabilistic graphical model and optimized by maximizing a variational lower bound on the data likelihood. Another is the autoregressive model [49], which breaks the data distribution into a series of conditional distributions and uses neural networks to model them. The other is the GAN [12], which is composed of generator and discriminator networks. The generator is optimized to fool the discriminator, while the discriminator is optimized to distinguish between real and generated data. This min-max optimization makes the training procedure unstable, but several techniques [1, 2, 14, 31, 38, 45, 61] have recently been proposed to stabilize it. All these models have pros and cons. In this paper, we take a stochastic approach, particularly focusing on a GAN, and propose an extension to it because it has flexibility in latent variable design. Extension to other models remains a promising area for future work.
Disentangled Representation Learning:
In the study of stochastic deep generative models, there have been attempts to learn disentangled representations similar to ours. Most studies addressed the problem in supervised settings and incorporated supervision into the networks. For example, attribute/class labels [33, 35, 48, 50, 55, 60], text descriptions [30, 40, 59], and object location descriptions [39, 41] are used as supervision. To reduce the annotation cost, extensions to semi-supervised settings have also recently been proposed [20, 45, 46]. The advantage of these settings is that disentangled representations can be explicitly learned following the supervision; however, the limitation is that learnable representations are restricted to the supervision. To overcome this limitation, weakly supervised [18, 23, 29, 32] and unsupervised [4] models have recently been proposed, which discover meaningful hidden representations without relying on detailed annotations; however, these models are limited to discovering one-layer hidden representations, whereas the DTLC-GAN enables multi-layer hidden representations to be learned. We further discuss the relationship to previous GANs in Section 4.4.
Hierarchical Representation:
The other related topic is hierarchical representation. Previous studies have decomposed an image in various ways. The LAPGAN [6] and StackGAN [59] deconstruct an image by repeatedly downsampling it, SGAN [52] decomposes the generative process into structure and style, VGAN [51] decomposes a video into foreground and background, and SGAN [15] learns multi-level representations in the feature spaces of intermediate layers. Other studies [11, 13, 16, 24, 56] used recursive structures to draw images in a step-by-step manner. The main difference from these studies is that they derive hierarchical representations in a pixel space or feature space to improve the fidelity of an image, while we derive them in a latent space to improve the interpretability and controllability of latent codes. More recently, Zhao et al. [62] proposed an extension of the VAE called the VLAE to learn multi-layer hierarchical representations in a latent space similar to ours; however, the type of hierarchy is different from ours. They learn representations that are semantically independent among layers, whereas we learn those in which lower layer codes are correlated with higher layer codes in a decision-tree manner. We argue that such a representation is necessary to learn category-specific features and control image generation in a select-and-narrow manner.
3 Background: GAN
A GAN [12] is a framework for training a generative model using a min-max game. The goal is to learn a generative distribution P_G(x) that matches the real data distribution P_data(x). It consists of two networks: a generator G that transforms noise z ~ P_z(z) into the data space as x = G(z), and a discriminator D that assigns probability p = D(x) when x is a sample from P_data and probability 1 − p when it is a sample from P_G. P_z(z) is a prior on z. D and G play a two-player min-max game with the following binary cross-entropy:

L_GAN(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 − D(G(z)))].   (1)
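As a concrete reference, the objective in Eq. (1) can be evaluated with a few lines of numpy; the discriminator outputs below are hypothetical toy values, not outputs of a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def gan_loss(d_real, d_fake):
    """Value of the GAN binary cross-entropy in Eq. (1):
    E[log D(x)] + E[log(1 - D(G(z)))].
    D maximizes this value; G minimizes it.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Hypothetical discriminator outputs (probabilities in (0, 1)).
d_real = rng.uniform(0.6, 0.99, size=100)  # D fairly confident on real samples
d_fake = rng.uniform(0.01, 0.4, size=100)  # D fairly confident on generated samples
loss = gan_loss(d_real, d_fake)
```

Since both terms are logarithms of probabilities, the value is always negative; training D pushes it toward 0, while training G pushes it down.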
D attempts to find the binary classifier discriminating between true and generated data by maximizing this loss, whereas G attempts to generate data indistinguishable from the true data by minimizing this loss.
4 DTLC-GAN
4.1 DTLC
In the naive GAN, latent variables are sampled from an unconditional prior and do not have any structural constraints. As a result, they may be used by the generator in a highly entangled manner, causing difficulty in interpreting the "meaning" of the individual variables and in controlling image generation by operating each one. Motivated by this fact, we incorporate the DTLC into the generator input to impose a hierarchical inclusion structure on latent variables.
Notation:
In the DTLC-GAN, the latent variables are decomposed into multiple levels. We first decompose the latent variables into two parts: ĉ, which is a latent code derived from an L-layer DTLC and will target hierarchically interpretable semantic features, and z, which is a source of incompressible noise that covers factors not represented by ĉ. To derive ĉ, the DTLC has a multiple-layer tree structure and is composed of layer codes c_1, …, c_L. In each layer, c_l is decomposed into N_l node codes c_l^1, …, c_l^{N_l}. To impose a hierarchical inclusion relationship between the l-th and (l+1)-th layers, the i-th parent node code c_l^i is associated with k_{l+1} child node codes c_{l+1}^{k_{l+1}(i−1)+1}, …, c_{l+1}^{k_{l+1}i}, where N_{l+1} = k_{l+1}N_l. By this definition, ĉ is calculated as the concatenation [ĉ_1, …, ĉ_L] of the per-layer codes after the hierarchical selection described below.
We can use both discrete and continuous variables as node codes, but for simplicity, we treat the case in which the parent node codes are discrete and the lowest layer codes are either discrete or continuous. In this case, a parent node code c_l^i is represented as a k_{l+1}-dimensional one-hot vector, and each dimension c_l^i(j) is associated with one child node code c_{l+1}^{k_{l+1}(i−1)+j}.
Sampling Scheme:
In the training phase, we sample latent codes as follows. We illustrate a sampling example in Figure 3.

1. We sample each parent node code c_l^i from a uniform categorical distribution. We sample the lowest layer codes c_L^i in a similar manner in the discrete case, while we sample them from a uniform distribution in the continuous case. Note that, if we have supervision for c_1, we can directly use it instead of sampling.

2. To impose a hierarchical inclusion structure, we sample ĉ_{l+1} from the conditional prior P(c_{l+1} | ĉ_l), where ĉ_1 = c_1. We do this with the following process:

ĉ_{l+1}^{k_{l+1}(i−1)+j} = ĉ_l^i(j) · c_{l+1}^{k_{l+1}(i−1)+j},   (2)

where ĉ_l^i(j) is the j-th dimension of ĉ_l^i. This equation means that a parent node code acts as a child node selector controlling the ON or OFF of a child node code.

3. By executing Step 2 recursively from the highest layer to the lowest layer, we can sample ĉ = [ĉ_1, …, ĉ_L] with L-layer hierarchical inclusion constraints. We add it to the generator input and use it with z to generate an image: x = G(ĉ, z).
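The sampling steps above can be sketched in numpy as follows; the layer sizes ([3, 2, 2]) and the one-hot child blocks are illustrative assumptions, not the exact configuration used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dtlc(ks, rng):
    """Sample a hierarchical latent code from a DTLC.

    ks[0] is the number of first-layer dimensions; ks[l] (l >= 1) is the
    number of child codes per parent dimension in the next layer. Each
    parent dimension acts as an ON/OFF selector for its k children
    (Eq. (2)): a child block is kept only when its parent dimension is 1.
    """
    # Layer 1: one-hot categorical code.
    c = np.zeros(ks[0])
    c[rng.integers(ks[0])] = 1.0
    layers = [c]
    for k in ks[1:]:
        parent = layers[-1]
        child = np.zeros(parent.size * k)
        for i, on in enumerate(parent):
            if on == 1.0:  # parent dimension selected: sample its k children
                block = np.zeros(k)
                block[rng.integers(k)] = 1.0
                child[i * k:(i + 1) * k] = block
        layers.append(child)
    return np.concatenate(layers)

c_hat = sample_dtlc([3, 2, 2], rng)  # 3 + 6 + 12 = 21 dimensions
z = rng.normal(size=64)              # incompressible noise
generator_input = np.concatenate([c_hat, z])
```

Only one path from the root to a leaf is active, so exactly one dimension per layer is ON; all other child blocks stay zero.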
4.2 HCMI
The DTLC imposes a hierarchical inclusion structure on latent variables; however, its constraints are not sufficient to correlate latent variables with semantic features of images. To solve this problem without relying on detailed supervision, we propose a hierarchical conditional mutual information regularization (HCMI), which is an extension of MI [4] and conditional MI (CMI) [18] to hierarchical conditional settings. In particular, we use different types of regularization for the second to L-th layers, which have parent node codes, and for the first layer, which does not.
Regularization for Second to L-th Layer:
In this case, we need to discover semantic features in a hierarchically restricted manner; therefore, we maximize the mutual information between an (l+1)-th-layer child node code c_{l+1}^j and the image x conditioned on the l-th-layer parent node code ĉ_l^i: I(c_{l+1}^j; x | ĉ_l^i). For simplicity, we denote c_{l+1}^j and ĉ_l^i as c and ĉ, respectively. In practice, exact calculation of this mutual information is difficult because it requires calculation of the intractable posterior P(c | x, ĉ). Therefore, following previous studies [4, 18], we instead calculate its lower bound using an auxiliary distribution Q(c | x, ĉ) approximating P(c | x, ĉ):

I(c; x | ĉ) ≥ E_{c~P(c|ĉ), x~G(ĉ,z)}[log Q(c | x, ĉ)] + H(c | ĉ).   (3)

For simplicity, we fix the distribution of c and treat H(c | ĉ) as a constant. In practice, Q is parametrized as a neural network, and we denote the network for c_{l+1}^j as Q_{l+1}^j. Thus, the final objective function is written as

L_HCMI(G, Q_{l+1}^j) = E_{c~P(c|ĉ), x~G(ĉ,z)}[log Q_{l+1}^j(c | x, ĉ)].   (4)

G and Q_{l+1}^j attempt to discover the specific semantic features that correlate with c in terms of conditional information gain by maximizing this objective. We calculate L_HCMI for every child node code. We denote the summation over the (l+1)-th layer as L^{l+1}_HCMI and use this objective with a trade-off parameter λ_{l+1}.
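A minimal sketch of how the lower bound in Eq. (4) could be estimated on a batch, assuming a softmax auxiliary network Q; the probabilities below are hypothetical toy values:

```python
import numpy as np

def hcmi_lower_bound(q_probs, c_onehot):
    """Monte Carlo estimate of the variational lower bound
    E[log Q(c | x, c_parent)]; the entropy term H(c | c_parent)
    is constant and dropped, as in Eq. (4).

    q_probs:  (batch, k) softmax outputs of the auxiliary network Q
              for one child node, evaluated on generated images.
    c_onehot: (batch, k) one-hot child codes that were fed to G.
    """
    picked = np.sum(q_probs * c_onehot, axis=1)  # Q's probability of the true code
    return np.mean(np.log(picked + 1e-12))

# Hypothetical toy batch: Q recovers the sampled code fairly well.
q_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
c_onehot = np.array([[1, 0], [0, 1], [1, 0]])
bound = hcmi_lower_bound(q_probs, c_onehot)
```

The bound grows as Q assigns more probability to the code that was actually fed to G, which is exactly what maximizing Eq. (4) rewards.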
Regularization for First Layer:
The above regularization is useful for codes that have parent node codes; however, the first-layer codes do not have those; thus, we instead use a different regularization for them. Fortunately, this single-layer case has been addressed in previous studies [4, 35], and we use one of them depending on the supervision setting. In an unsupervised setting, we use the MI [4] written as

L_MI(G, Q_1) = E_{c_1~P(c_1), x~G(ĉ,z)}[log Q_1(c_1 | x)].   (5)

In a weakly supervised setting, we use an auxiliary classifier regularization (AC) [35] written as

L_AC(G, Q_1) = E_{c_1~P(c_1), x~G(ĉ,z)}[log Q_1(c_1 | x)] + E_{(x,c_1)~P_data(x,c_1)}[log Q_1(c_1 | x)].   (6)

Note that the first term is the same as L_MI, and the added second term acts as a supervision regularization. We use these objectives with a trade-off parameter λ_1.
Full Objective:
Our full objective is written as

L(D, G, Q) = L_GAN(D, G) − λ_1 L_MI/AC(G, Q_1) − Σ_{l=2}^{L} λ_l L^l_HCMI(G, Q_l),   (7)

where L_MI/AC is L_MI in the unsupervised setting and L_AC in the weakly supervised setting. This is minimized for G and Q and maximized for D.
4.3 Curriculum Learning
The HCMI works well when the higher layer codes are already known; however, we assume a condition in which detailed annotations are not provided in advance. As a result, the network may confuse inter-layer and inner-layer disentanglement at the beginning of training. To mitigate this problem, we developed a curriculum learning method. In particular, we define a curriculum for regularization and sampling. We illustrate an example of the proposed curriculum learning method in Figure 4.
Curriculum for Regularization:
As a curriculum for regularization, we do not use the whole regularization in Equation (7) at the same time; instead, we add the regularization terms from the highest layer to the lowest layer in turn according to the training phase. In an unsupervised setting, we first learn with λ_1 L_MI and then add λ_2 L^2_HCMI, …, λ_L L^L_HCMI in turn. In a weakly supervised setting, we first learn with λ_1 L_AC and λ_2 L^2_HCMI and then add λ_3 L^3_HCMI, …, λ_L L^L_HCMI in turn. We use different curricula between these two settings because in a weakly supervised setting, we already know the first-layer codes; thus, we can start from learning the second-layer codes.
Curriculum for Sampling:
In learning the higher layer codes, instability caused by random sampling of the lower layer codes can degrade the learning performance. Motivated by this fact, we define a curriculum for sampling. In particular, in learning the higher layer codes, we fix the lower layer codes and set them to their average values, e.g., each dimension of a k-dimensional discrete code is set to 1/k, and a continuous code is set to 0.
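A simplified sketch of this sampling curriculum (the one-hot-per-layer simplification and the layer sizes are assumptions for illustration; the actual codes are masked hierarchically as in Section 4.1):

```python
import numpy as np

def curriculum_codes(ks, active_layer, rng):
    """Sample codes for curriculum learning: layers up to `active_layer`
    are sampled normally, while deeper layers are fixed to their average
    value (1/k per dimension for a k-way discrete block) to avoid the
    instability caused by random sampling of not-yet-learned codes.
    A continuous code would analogously be fixed to 0."""
    layers, n = [], 1
    for l, k in enumerate(ks, start=1):
        n *= k  # number of node-code dimensions in this layer
        if l <= active_layer:
            c = np.zeros(n)
            c[rng.integers(n)] = 1.0  # simplified: one-hot over the layer
        else:
            c = np.full(n, 1.0 / k)   # average value of a k-dim one-hot block
        layers.append(c)
    return layers

rng = np.random.default_rng(0)
layers = curriculum_codes([3, 2, 2], active_layer=1, rng=rng)
```

Here only the first layer varies between samples; the second and third layers contribute a constant input, so their gradients do not disturb learning of the first-layer code.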
4.4 Relationship to Previous GANs
The DTLC-GAN is a general framework, and we can see it as a natural extension of previous GANs. We summarize this relationship in Table 1. In particular, the InfoGAN [4] and CFGAN [18] are highly related to the DTLC-GAN in terms of discovering hidden representations on the basis of information gain; however, they are limited to learning one-layer hidden representations. We developed the DTLC-GAN, HCMI, and curriculum learning method to overcome this limitation. (Strictly speaking, the CFGAN is formulated as an extension of the CGAN, while the weakly supervised DTLC-GAN is formulated as an extension of the AC-GAN. Therefore, these two models do not have exactly the same architecture; however, they share a similar motivation.)
5 Implementation
We designed the network architectures and training scheme on the basis of techniques introduced for the InfoGAN [4]. D and Q share all convolutional layers, and one fully connected layer is added to the final layer for Q. This means that the difference in calculation cost between the GAN and DTLC-GAN is negligibly small. For a discrete code, we represent Q as a softmax nonlinearity. For a continuous code, we treat Q as a factored Gaussian.
In most of the experiments, we used typical DCGAN models [38] and did not use state-of-the-art GAN training techniques, to evaluate whether the DTLC-GAN works well without relying on such techniques. However, our contributions are orthogonal to these techniques; therefore, we can easily improve image quality by incorporating them into the DTLC-GAN. To demonstrate this, we also tested the DTLC-WGAN-GP (our DTLC-GAN with the WGAN-GP ResNet [14]), as discussed in Section 6.3. The details of the experimental setup are given in Section B of the appendix.
6 Experiments
We conducted experiments on various datasets, i.e., MNIST [26], CIFAR-10 [22], Tiny ImageNet [44], 3D Faces [37], and CelebA [27], to evaluate the effectiveness and generality of the DTLC-GAN. (Due to limited space, we provide only the important results in this main text. Please refer to the appendix for more results.) We first used the MNIST and CIFAR-10 datasets, which are widely used in this field, to analyze the DTLC-GAN qualitatively and quantitatively. In particular, we evaluated the DTLC-GAN in an unsupervised setting on the MNIST dataset and in a weakly supervised setting on the CIFAR-10 dataset (Sections 6.1 and 6.2, respectively). We tested the DTLC-WGAN-GP on the CIFAR-10 and Tiny ImageNet datasets to demonstrate that our contributions are orthogonal to state-of-the-art GAN training techniques (Section 6.3). We used the 3D Faces dataset to evaluate the effectiveness of the DTLC-GAN with continuous codes (Section 6.4) and evaluated it on image-retrieval tasks using the CelebA dataset (Section 6.5). Hereafter, we denote the DTLC-GAN with an L-layer DTLC as the DTLC-GAN with depth L, and its weakly supervised variant analogously.
6.1 Unsupervised Representation Learning
We first analyzed the DTLC-GAN in unsupervised settings on the MNIST dataset, which consists of images of handwritten digits and contains 60,000 training and 10,000 test samples.
Representation Comparison:
To confirm the effectiveness of hierarchical representation learning, we compared the DTLC-GAN with models in which the dimensions of the latent codes given to the generator are the same but not hierarchical. To represent our DTLC-GAN, we used a two-layer DTLC-GAN. In this model, ĉ, the dimension of which is 20, is given to the generator. For comparison, we used two models in which the latent code dimensions are also 20 but not hierarchical: one InfoGAN with a single code and another InfoGAN with two codes. We show the results in Figure 5. In (c), the two-layer DTLC-GAN succeeded in learning hierarchically interpretable representations (in the first layer, digits, and in the second layer, details of each digit). In (a), the InfoGAN with a single code succeeded in learning disentangled representations; however, they were learned in a flat relationship; thus, it was not trivial to estimate the higher concept (e.g., digits) from them. In (b1) and (b2), the InfoGAN with two codes failed to learn interpretable representations. We argue that this is because the two codes struggle with each other to represent digit types. To clarify this limitation, we also conducted experiments on simulated data. See Section A.1 of the appendix for details.
Ablation Study on Curriculum Learning:
To analyze the effectiveness of the proposed curriculum learning method, we conducted an ablation study. For quantitative evaluation, we measured the inter-category diversity of generated images on the basis of structural similarity (SSIM) [53], which is a well-characterized perceptual similarity metric. This is an ad-hoc measure; however, recent studies [18, 35] showed that an SSIM-based measure is useful for evaluating the diversity of images generated with a GAN. Note that evaluating the quality of deep generative models is not trivial and is still an open issue due to the variety of probabilistic criteria [47]. To evaluate the l-th layer inter-category diversity, we measured the SSIM scores between pairs of images that are sampled from the same noise and higher layer codes but random l-th and lower layer codes. We calculated the scores for 50,000 randomly sampled pairs of images and took the average. A smaller value indicates larger diversity. We show changes in the mean SSIM scores through learning and sample images generated with varying latent codes per layer in Figure 6. From these results, the DTLC-GAN with the full curriculum succeeded in making the higher layer codes obtain higher diversity and the lower layer codes obtain lower diversity, while the others failed. We argue that this is because the latter cannot avoid confusion between inter-layer and inner-layer disentanglement. The qualitative results also support this fact. We also show sample images for all categories in Figures 14–16 of the appendix.
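To illustrate the measure, the following sketch computes a mean-SSIM diversity score with a simplified global SSIM (the study uses the standard windowed SSIM [53]; this single-window variant is an assumption for brevity):

```python
import numpy as np

def ssim_global(x, y, L=1.0):
    """Simplified single-window SSIM between two images with values in [0, 1].
    c1, c2 are the usual SSIM stabilizing constants for dynamic range L."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def diversity_score(image_pairs):
    """Mean SSIM over image pairs generated from the same noise and higher
    layer codes but random lower layer codes; smaller means larger diversity."""
    return float(np.mean([ssim_global(a, b) for a, b in image_pairs]))

rng = np.random.default_rng(0)
identical = [(img, img) for img in [rng.random((8, 8)) for _ in range(5)]]
random_pairs = [(rng.random((8, 8)), rng.random((8, 8))) for _ in range(5)]
```

Identical pairs give a score of 1 (no diversity), while unrelated images give a much lower score, matching the interpretation that smaller mean SSIM means larger inter-category diversity.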
6.2 Weakly Supervised Representation Learning
We next analyzed the DTLC-GAN in weakly supervised settings (i.e., only class labels are supervised) on the CIFAR-10 dataset, which consists of 10 classes of images and contains 5,000 training and 1,000 test samples per class.
Ablation Study on Curriculum Learning:
We conducted an ablation study to evaluate the effectiveness of the proposed curriculum learning method in weakly supervised settings. We show changes in mean SSIM scores through learning and sample images generated with varying latent codes per layer in Figure 7. We can see the same tendency as in Figure 6. These results indicate that the proposed curriculum learning method is indispensable even in weakly supervised settings. We show sample images for all categories in Figures 17–19 of the appendix. We also conducted preference tests to analyze visual interpretability. See Section A.2 of the appendix for details.
Quantitative Evaluation:
An important concern is whether our extension degrades image quality. To address this concern, we evaluated the DTLC-GAN on three metrics: inception score [45], adversarial accuracy [56], and adversarial divergence [56]. (The latter two metrics require pairs of generated images and class labels to train a classifier. In our settings, a conditional generator is learned; thus, we directly used it to generate an image with a class label. We used a classifier whose architecture is similar to that of the discriminator except for the output layer.) We list the results in Table 2. We compared a GAN, the AC-GAN, and DTLC-GANs of increasing depth. For fair comparison, we used the same network architecture and training scheme except for the extended parts. The inception scores are not state-of-the-art, but in this comparison, the DTLC-GAN improved upon the GAN and was comparable to the AC-GAN. The adversarial accuracy and adversarial divergence scores are state-of-the-art, and the DTLC-GAN improved upon the AC-GAN. These results are noteworthy because they indicate that we can obtain expressive representations using the DTLC-GAN without concern for image-quality degradation.
Model            | Inception Score | Adversarial Accuracy | Adversarial Divergence
GAN              | 7.09 ± 0.09     | –                    | –
AC-GAN           | 7.41 ± 0.06     | 50.99 ± 0.55         | 2.07 ± 0.02
DTLC-GAN         | 7.39 ± 0.03     | 55.10 ± 0.48         | 1.82 ± 0.03
DTLC-GAN         | 7.35 ± 0.09     | 55.20 ± 0.47         | 1.95 ± 0.05
DTLC-GAN         | 7.46 ± 0.06     | 56.19 ± 0.36         | 1.93 ± 0.05
DTLC-GAN         | 7.51 ± 0.06     | 58.87 ± 0.52         | 1.83 ± 0.04
Real Images      | 11.24 ± 0.12    | 85.77 ± 0.22         | 0
State-of-the-Art | 8.59 ± 0.12†    | 44.22 ± 0.08‡        | 5.57 ± 0.06‡
6.3 Combination with WGAN-GP
Another concern is whether our contributions are orthogonal to state-of-the-art GAN training techniques. To demonstrate this, we tested the DTLC-WGAN-GP in three cases: CIFAR-10 (unsupervised/weakly supervised) and Tiny ImageNet (unsupervised). (Tiny ImageNet is a tiny version of the ImageNet dataset containing 200 classes with 500 images each. To shorten the training time, we downsized the images.) The number of categories was the same as in the models used in Table 2. We list the results in Table 3. Interestingly, in all cases, the scores improved as the layers became deeper, and the DTLC-WGAN-GPs achieved state-of-the-art performance. We show generated image samples in Figures 20–22 of the appendix.
Model            | CIFAR-10 (Unsupervised) | CIFAR-10 (Supervised) | Tiny ImageNet (Unsupervised)
WGAN-GP          | 7.86 ± .07†             | –                     | 8.33 ± .11
AC/Info-WGAN-GP  | 7.97 ± .09              | 8.42 ± .10†           | 8.33 ± .10
DTLC-WGAN-GP     | 8.03 ± .12              | 8.44 ± .10            | 8.34 ± .08
DTLC-WGAN-GP     | 8.15 ± .08              | 8.56 ± .07            | 8.41 ± .10
DTLC-WGAN-GP     | 8.22 ± .11              | 8.80 ± .08            | 8.51 ± .08
State-of-the-Art | 7.86 ± .07†             | 8.59 ± .12‡           | –
6.4 Extension to Continuous Codes
To analyze the DTLC-GAN with continuous codes, we evaluated it on the 3D Faces dataset, which consists of faces generated from a 3D face model and contains 240,000 samples. We compared three models: an InfoGAN with five continuous codes (used in the InfoGAN study [4]), an InfoGAN with one categorical code and one continuous code, and a DTLC-GAN with one categorical code in the first layer and five continuous codes in the second layer. We show example results in Figure 8. In the InfoGANs (a, b), the individual codes tend to represent independent and exclusive semantic features because they have a flat relationship, while in the DTLC-GAN (c), we can learn category-specific (in this case, pose-specific) semantic features conditioned on the higher layer codes.
6.5 Application to Image Retrieval
One possible application of the DTLC-GAN is to use hierarchically interpretable representations for image retrieval. To confirm this, we used the CelebA dataset, which consists of photographs of faces and contains 180,000 training and 20,000 test samples. To search for an image hierarchically, we measure the L2 distance between query and database images on the basis of the latent codes predicted with the auxiliary functions Q. Figure 9 shows the results of bangs-based, glasses-based, and smiling-based image retrieval. For evaluation, we used the test set of the CelebA dataset. We trained the DTLC-GAN so that hierarchical representations are learned only for the attribute-presence state. (We provide generated image samples in Figures 24–26 of the appendix.) These results indicate that as the layer becomes deeper, images in which the attribute details match more closely can be retrieved. For quantitative evaluation, we measured the SSIM score between query and database images for the attribute-specific areas [18] defined in Figure 10. We summarize the scores in Table 4. These results indicate that as the layer becomes deeper, the concordance rate of the attribute-specific areas increases.

Codes Used    | Bangs | Glasses | Smiling
Up to layer 1 | 0.150 | 0.189   | 0.274
Up to layer 2 | 0.194 | 0.256   | 0.294
Up to layer 3 | 0.211 | 0.265   | 0.326
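A minimal sketch of this hierarchical retrieval scheme; the code dimensions and the `retrieve` helper are hypothetical, assuming per-layer codes predicted by the auxiliary functions Q:

```python
import numpy as np

def retrieve(query_codes, db_codes, depth, top_k=3):
    """Rank database images by L2 distance between predicted latent codes,
    using only the codes of the first `depth` layers (concatenated).
    query_codes / db_codes: lists of per-layer code arrays."""
    q = np.concatenate(query_codes[:depth])
    dists = [np.linalg.norm(q - np.concatenate(c[:depth])) for c in db_codes]
    return np.argsort(dists)[:top_k]

# Hypothetical predicted codes (layer 1: 2-dim, layer 2: 4-dim).
query = [np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])]
db = [
    [np.array([1.0, 0.0]), np.array([0.0, 1.0, 0.0, 0.0])],  # matches both layers
    [np.array([1.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])],  # matches layer 1 only
    [np.array([0.0, 1.0]), np.array([0.0, 0.0, 1.0, 0.0])],  # matches neither
]
ranking = retrieve(query, db, depth=2)
```

Increasing `depth` incorporates finer-grained codes into the distance, which is what makes retrieved images match attribute details more closely as the layer becomes deeper.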
7 Discussion and Conclusions
We proposed an extension of the GAN called the DTLC-GAN to learn hierarchically interpretable representations. To develop it, we introduced the DTLC to impose a hierarchical inclusion structure on latent variables and proposed the HCMI and a curriculum learning method to discover salient semantic features in a layer-by-layer manner using only a single DTLC-GAN model without relying on detailed supervision. Experiments showed promising results, indicating that the DTLC-GAN is well suited for learning hierarchically interpretable representations. The DTLC-GAN is a general model, and possible future work includes applying it to other models, such as encoder-decoder models [7, 9, 21, 25, 43], and using it as a latent hierarchical structure discovery tool for high-dimensional data.
References
 [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
 [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
 [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.
 [5] H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, and A. Courville. Modulating early visual processing by language. In NIPS, 2017.
 [6] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
 [7] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. In ICLR, 2017.

 [8] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
 [9] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.
 [10] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style. In ICLR, 2017.

 [11] S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, and G. E. Hinton. Attend, infer, repeat: Fast scene understanding with generative models. In NIPS, 2016.
 [12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

 [13] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
 [14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of Wasserstein GANs. In NIPS, 2017.
 [15] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
 [16] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. In ICLR Workshop, 2016.
 [17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [18] T. Kaneko, K. Hiramatsu, and K. Kashino. Generative attribute controller with conditional filtered generative adversarial networks. In CVPR, 2017.
 [19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [20] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. In NIPS, 2014.
 [21] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
 [22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009.
 [23] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
 [24] H. Kwak and B.-T. Zhang. Generating images part by part with composite generative adversarial networks. arXiv preprint arXiv:1607.05387, 2016.
 [25] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
 [26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278–2324, 1998.
 [27] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
 [28] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop, 2013.
 [29] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. In NIPS, 2016.
 [30] E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
 [31] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley. Least squares generative adversarial networks. In ICCV, 2017.
 [32] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 2016.
 [33] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [34] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
 [35] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.
 [36] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
 [37] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3D face model for pose and illumination invariant face recognition. In AVSS, 2009.
 [38] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 [39] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
 [40] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In ICML, 2016.
 [41] S. Reed, A. van den Oord, N. Kalchbrenner, V. Bapst, M. Botvinick, and N. de Freitas. Generating interpretable images with controllable structure. In ICLR Workshop, 2017.
 [42] S. Reed, Y. Zhang, Y. Zhang, and H. Lee. Deep visual analogymaking. In NIPS, 2015.
 [43] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
 [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
 [45] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
 [46] J. T. Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, 2016.
 [47] L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. In ICLR, 2016.
 [48] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for poseinvariant face recognition. In CVPR, 2017.
 [49] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
 [50] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
 [51] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
 [52] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
 [53] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on IP, 13(4):600–612, 2004.
 [54] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. In ICML Workshop, 2015.
 [55] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
 [56] J. Yang, A. Kannan, D. Batra, and D. Parikh. LR-GAN: Layered recursive generative adversarial networks for image generation. In ICLR, 2017.
 [57] J. Yang, S. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.
 [58] A. Yu and K. Grauman. Just noticeable differences in visual attributes. In ICCV, 2015.
 [59] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
 [60] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In CVPR, 2017.
 [61] J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2017.
 [62] S. Zhao, J. Song, and S. Ermon. Learning hierarchical features from generative models. In ICML, 2017.
Appendix A Additional Analysis
A.1 Representation Comparison on Simulated Data
To clarify the limitations of the InfoGANs compared in Figure 5, we conducted experiments on simulated data. In particular, we used simulated data that are hierarchically sampled in 2D space and have ten global categories and two local categories. When sampling a data point, we first randomly selected a global position from ten candidates equally spaced around a circle. We then randomly selected a local position from two candidates rotated by a fixed angle in the clockwise and anticlockwise directions from the global position. Finally, we sampled the data point from a Gaussian distribution centered at this local position.
We compared models similar to those compared in Figure 5. As the proposed model, we used the DTLC-GAN, whose hierarchical codes (20 dimensions in total) are given to the generator. For comparison, we used two models whose latent code dimensions are also 20 but not hierarchical: one InfoGAN with a single code and another InfoGAN with two codes. For the DTLC-GAN, we also compared variants with and without curriculum learning.
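The sampling procedure above can be sketched as follows. The circle radius, rotation offset, and noise standard deviation are not specified above, so the constants below are illustrative placeholders only:

```python
import numpy as np

def sample_hierarchical_2d(n, radius=1.0, offset=np.pi / 20, sigma=0.02, seed=0):
    """Sample 2D points with ten global and two local categories.

    radius, offset, and sigma are hypothetical placeholder values,
    not the ones used in the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 10, size=n)            # global category: position on a circle
    s = rng.choice([-1, 1], size=n)            # local category: CW or CCW rotation
    theta = 2 * np.pi * g / 10 + s * offset    # rotate away from the global position
    centers = radius * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Gaussian noise around the chosen local position
    return centers + sigma * rng.normal(size=(n, 2)), g, s

points, g, s = sample_hierarchical_2d(1000)
```

Points drawn this way form ten global clusters, each split into two nearby local clusters, which is exactly the two-level structure the DTLC-GAN is expected to recover.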
We show the results in Figure 11. The results indicate that the InfoGANs (b, c) and the DTLC-GAN without curriculum learning (d) tend to yield unbalanced or non-hierarchical clustering. In contrast, the DTLC-GAN with curriculum learning (e) succeeds in capturing the hierarchical structure: the first-layer codes captured the ten global positions, whereas the second-layer codes captured the two local positions for each global position.
A.2 Visual Interpretability Analysis
To clarify the benefit of the learned representations, we conducted two XAB tests. For each test, we compared the fourth-layer models (DTLC-GANs or DTLC-WGAN-GPs) with and without curriculum learning.

Test I: Difference Interpretability Analysis
To confirm whether the latent codes are more interpretable than the unconstrained noise variables, we compared the generated images (X) with images generated from latent variables in which one dimension of the noise was changed (A) and one dimension of a code was changed (B). The changed dimension was randomly chosen in each case. We asked participants which difference was more interpretable, or whether the two were even.
Test II: Semantic Similarity Analysis
To confirm whether the latent codes are hierarchically interpretable, we compared the generated images (X) with images generated from latent variables in which one dimension of one layer's code was varied (A) and one dimension of another layer's code was varied (B). For each case, we fixed the higher-layer codes. The changed dimension was randomly chosen, and the lower-layer codes were also randomly chosen. We asked participants which set was semantically more similar to X, or whether the two were even.
To eliminate bias from individual samples, we showed 25 samples at the same time. To eliminate bias from the order of stimuli, the order (AB or BA) was randomly selected. The user interfaces are shown in Figure 12.
We summarize the results in Table 5. Parts (a) and (b) list the results of tests I and II, respectively, for the DTLC-GAN used in the experiments discussed in Figure 7. The results of test I indicate that the latent codes are more interpretable than the noise variables regardless of curriculum learning. We argue that this is because the noise has no structural constraints and may therefore be used by the generator in a highly entangled manner. The results of test II indicate that representations learned with curriculum learning are hierarchically categorized in a semantically better way than those learned without it. These results support the effectiveness of the proposed curriculum learning method.
A.3 Unsupervised Learning on Complex Datasets
Although in Section 6.1 we mainly analyzed unsupervised settings on the relatively simple MNIST dataset, hierarchical representations can also be learned in an unsupervised manner on more complex datasets. In this case, however, the learning targets depend on the initialization because such datasets can be categorized in various ways. We illustrate this in Figure 13. We also evaluated the DTLC-WGAN-GP in unsupervised settings on the CIFAR-10 and Tiny ImageNet datasets; see Section 6.3 for details.
Table 5: XAB test results. Values are percentages of collected answers; A, even, and B denote the three answer choices in each XAB test.

Model          | A          | even       | B
W/o curriculum | 0.0        | 1.0 ± 1.0  | 99.0 ± 1.0
W/ curriculum  | 0.0        | 1.0 ± 1.0  | 99.0 ± 1.0
*Number of collected answers is 400.
(a) Test I for DTLC-GAN on CIFAR-10

Model          | A          | even       | B
W/o curriculum | 22.4 ± 3.9 | 41.3 ± 4.6 | 36.2 ± 4.5
W/ curriculum  | 3.6 ± 1.7  | 17.8 ± 3.5 | 78.7 ± 3.8
*Number of collected answers is 450.
(b) Test II for DTLC-GAN on CIFAR-10

Model          | A          | even       | B
W/o curriculum | 18.0 ± 4.4 | 31.3 ± 5.3 | 50.7 ± 5.7
W/ curriculum  | 4.7 ± 2.4  | 12.0 ± 3.7 | 83.3 ± 4.2
*Number of collected answers is 300.
(c) Test II for DTLC-WGAN-GP on CIFAR-10

Model          | A          | even       | B
W/o curriculum | 21.7 ± 4.7 | 38.3 ± 5.5 | 40.0 ± 5.6
W/ curriculum  | 17.0 ± 4.3 | 24.0 ± 4.9 | 59.0 ± 5.6
*Number of collected answers is 300.
(d) Test II for DTLC-WGAN-GP on CIFAR-10

Model          | A          | even       | B
W/o curriculum | 13.2 ± 4.2 | 53.6 ± 6.2 | 33.2 ± 5.9
W/ curriculum  | 2.4 ± 1.9  | 17.2 ± 4.7 | 80.4 ± 5.0
*Number of collected answers is 250.
(e) Test II for DTLC-WGAN-GP on Tiny ImageNet
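The error margins in the XAB test tables above are consistent with 95% normal-approximation binomial confidence intervals computed from the listed numbers of collected answers. A minimal check, using percentages and answer counts taken directly from the tables:

```python
import math

def ci95(p_percent, n):
    """Half-width (in percent) of a 95% normal-approximation binomial CI."""
    p = p_percent / 100.0
    return 1.96 * math.sqrt(p * (1 - p) / n) * 100.0

# Test II for DTLC-GAN, w/ curriculum, 450 collected answers:
print(round(ci95(78.7, 450), 1))  # 3.8, matching the reported margin
print(round(ci95(17.8, 450), 1))  # 3.5
# Test I for DTLC-GAN, 400 collected answers:
print(round(ci95(99.0, 400), 1))  # 1.0
```

Nearly all reported margins match this computation to one decimal place, which suggests the tables report 95% confidence intervals rather than plain standard errors; this interpretation is our inference, not stated in the text.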
Appendix B Details on Experimental Setup
In this section, we describe the network architectures and training scheme for each dataset. We designed the network architectures and training scheme on the basis of techniques introduced for the InfoGAN [4]. The discriminator and the auxiliary function share all convolutional layers (Conv.), and one fully connected layer (FC.) is added to the final layer for the auxiliary function. This means that the difference in calculation cost between the GAN and the DTLC-GAN is negligibly small. For discrete codes, we represented the auxiliary distribution with a softmax nonlinearity. For continuous codes, we parameterized the auxiliary distribution as a factored Gaussian.
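As a sketch of this head parameterization, the discriminator output and the auxiliary outputs can branch off shared features as below. The feature and code dimensions here are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 1024-d shared features, one 10-way discrete code,
# two continuous codes.
feat_dim, k_disc, k_cont = 1024, 10, 2
W_d = rng.normal(scale=0.02, size=(feat_dim, 1))                # discriminator head
W_q_disc = rng.normal(scale=0.02, size=(feat_dim, k_disc))      # auxiliary head, discrete
W_q_cont = rng.normal(scale=0.02, size=(feat_dim, 2 * k_cont))  # auxiliary head, Gaussian

feat = rng.normal(size=(8, feat_dim))      # stand-in for shared conv features
d_out = 1 / (1 + np.exp(-feat @ W_d))      # real/fake probability
q_disc = softmax(feat @ W_q_disc)          # softmax posterior over the discrete code
mu, log_var = np.split(feat @ W_q_cont, 2, axis=1)  # factored Gaussian parameters
```

Because only the small FC heads differ, the extra cost over a plain GAN discriminator is negligible, as the text notes.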
In most of the experiments, we designed the network architectures and training scheme on the basis of the techniques introduced for the DCGAN [38] and did not use state-of-the-art GAN training techniques, in order to evaluate whether the DTLC-GAN works well without relying on them. To downscale and upscale, we used convolutions (Conv.) and backward convolutions, i.e., fractionally strided convolutions, respectively, each with stride 2. As activation functions, we used rectified linear units (ReLUs) [34] in the generator and leaky rectified linear units (LReLUs) [28, 54] in the discriminator. We applied batch normalization (BNorm) [17] to all layers except the generator output layer and the discriminator input layer. We trained the networks using the Adam optimizer [19] with a fixed minibatch size, separate learning rates for the generator and discriminator, and a fixed momentum term. To demonstrate that our contributions are orthogonal to state-of-the-art GAN training techniques, we also tested the DTLC-WGAN-GP (our DTLC-GAN combined with the WGAN-GP ResNet [14]) discussed in Section 6.3, using network architectures and a training scheme similar to those of the WGAN-GP ResNet except for the extended parts.
The details for each dataset are given below.
B.1 MNIST
The DTLC-GAN network architectures for the MNIST dataset, used for the experiments discussed in Section 6.1, are shown in Table 6. As preprocessing, we normalized the pixel values. In the generator output layer, we used the sigmoid function. We used a DTLC-GAN that has one discrete code in the first layer and discrete codes in each lower layer, and we added noise variables to the generator input. The tradeoff parameters were set to 0.1. We trained the networks in unsupervised settings. As a curriculum, we added the regularization and hierarchical sampling for each layer after a fixed number of iterations.
B.2 CIFAR-10
The DTLC-GAN network architectures for the CIFAR-10 dataset, used for the experiments discussed in Section 6.2, are shown in Table 7. As preprocessing, we normalized the pixel values. In the generator output layer, we used the tanh function. We used a DTLC-GAN that has one ten-dimensional discrete code in the first layer and discrete codes in each lower layer, and we added noise variables to the generator input. We used supervision (i.e., class labels) for the first-layer code. The tradeoff parameters were set to 1. We trained the networks in weakly supervised settings. As a curriculum, we added the regularization and hierarchical sampling for each layer after a fixed number of iterations.
B.3 DTLC-WGAN-GP
The DTLC-WGAN-GP network architectures for the CIFAR-10 and Tiny ImageNet datasets, used for the experiments discussed in Section 6.3, are similar to the WGAN-GP ResNet used in a previous paper [14], except for the extended parts. We used a DTLC-WGAN-GP that has one ten-dimensional discrete code in the first layer and discrete codes in each lower layer. Following the AC-WGAN-GP ResNet implementation [14], we used conditional batch normalization (CBN) [5, 10] to condition the generator on the codes. CBN has two parameters, i.e., a gain parameter and a bias parameter, for each category. As a curriculum for sampling, when learning the higher-layer codes we used CBN parameters averaged over those of the related lower-layer node codes.
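A minimal sketch of CBN and of the parameter-averaging curriculum, assuming integer category ids and per-category gain/bias tables (the shapes and the children list are illustrative):

```python
import numpy as np

def cbn(x, codes, gamma, beta, eps=1e-5):
    """Conditional batch normalization.

    x: (N, C, H, W) feature maps; codes: (N,) integer category ids;
    gamma, beta: (K, C) per-category gain and bias parameters.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Each sample selects the gain/bias of its own category.
    return gamma[codes][:, :, None, None] * x_hat + beta[codes][:, :, None, None]

def parent_params(gamma, beta, children):
    """Curriculum: a parent node uses the average of its children's CBN parameters."""
    return gamma[children].mean(axis=0), beta[children].mean(axis=0)

x = np.random.default_rng(1).normal(size=(8, 3, 5, 5))
codes = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y = cbn(x, codes, np.ones((4, 3)), np.zeros((4, 3)))
```

With unit gain and zero bias, `cbn` reduces to ordinary batch normalization; the conditioning enters only through the per-category affine parameters.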
In unsupervised settings, we sampled the first-layer code from a categorical distribution. The tradeoff parameters were set to 1. We trained the networks for a fixed number of iterations. As a curriculum, we added the regularization and hierarchical sampling for each layer after a fixed number of iterations.
In weakly supervised settings, we used supervision (i.e., class labels) for the first-layer code. The tradeoff parameters were set to 1. We trained the networks for a fixed number of iterations. As a curriculum, we added the regularization and hierarchical sampling for each layer after a fixed number of iterations.
B.4 3D Faces
The DTLC-GAN network architectures for the 3D Faces dataset, used for the experiments discussed in Section 6.4, are shown in Table 8. As preprocessing, we normalized the pixel values. In the generator output layer, we used the sigmoid function. We used a DTLC-GAN that has one discrete code in the first layer and five continuous codes in the second layer, and we added noise variables to the generator input. The tradeoff parameters were set to 1. We trained the networks in unsupervised settings. As a curriculum, we added the regularization and hierarchical sampling after a fixed number of iterations.
B.5 CelebA
The DTLC-GAN network architectures for the CelebA dataset, used for the experiments discussed in Section 6.5, are shown in Table 9. As preprocessing, we normalized the pixel values. In the generator output layer, we used the tanh function. We used a DTLC-GAN in which hierarchical representations are learned only for the attribute-presence state; accordingly, this model has one two-dimensional discrete code in the first layer and discrete codes in each lower layer. We added noise variables to the generator input and used supervision (i.e., an attribute label) for the first-layer code. The tradeoff parameters were set to 1, 0.1, and 0.04 for bangs, glasses, and smiling, respectively. We trained the networks in weakly supervised settings. As a curriculum, we added the regularization and hierarchical sampling for each layer after a fixed number of iterations.
B.6 Simulated Data
The DTLC-GAN network architectures for the simulated data, used for the experiments discussed in Section A.1, are shown in Table 10. As preprocessing, we scaled the discriminator input by a factor of 4. We used a DTLC-GAN that has one discrete code in the first layer and ten discrete codes in the second layer, and we added noise variables to the generator input. The tradeoff parameters were set to 1. We trained the networks using the Adam optimizer with a minibatch of size 512. The learning rate was set to 0.0001 for both the generator and the discriminator, and the momentum term was set to 0.5. We trained the networks in unsupervised settings. As a curriculum, we added the regularization and hierarchical sampling after a fixed number of iterations.

Table 6: DTLC-GAN network architectures for the MNIST dataset.

Generator
Input: latent variables (noise and codes)
1024 FC., BNorm, ReLU
FC., BNorm, ReLU
64 Conv., BNorm, ReLU
1 Conv., Sigmoid

Discriminator / Auxiliary Function
Input: 1-channel gray image
64 Conv., LReLU
128 Conv., BNorm, LReLU
1024 FC., BNorm, LReLU
FC. output for the discriminator
128 FC., BNorm, LReLU
FC. output for the auxiliary function

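The convolution kernel sizes are not specified in Table 6. Assuming the standard DCGAN choice of 4×4 kernels with stride 2 and padding 1 (an assumption following [38]) and 28×28 MNIST inputs, the two strided conv layers reduce the feature map to 7×7 before the 1024-unit FC layer:

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Spatial output size of a strided convolution (hypothetical 4x4/stride-2/pad-1)."""
    return (size + 2 * pad - kernel) // stride + 1

sizes = [28]
for _ in (64, 128):  # the 64- and 128-channel conv layers in Table 6
    sizes.append(conv_out(sizes[-1]))
print(sizes)  # [28, 14, 7]
```

The generator mirrors this path with backward (fractionally strided) convolutions, upscaling 7 → 14 → 28.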

Table 7: DTLC-GAN network architectures for the CIFAR-10 dataset.

Generator
Input: latent variables (noise and codes)
FC., BNorm, ReLU
256 Conv., BNorm, ReLU
128 Conv., BNorm, ReLU
64 Conv., BNorm, ReLU
3 Conv., Tanh

Discriminator / Auxiliary Function
Input: 3-channel color image
64 Conv., LReLU, Dropout
128 Conv., BNorm, LReLU, Dropout
128 Conv., BNorm, LReLU, Dropout
256 Conv., BNorm, LReLU, Dropout
256 Conv., BNorm, LReLU, Dropout
512 Conv., BNorm, LReLU, Dropout
512 Conv., BNorm, LReLU, Dropout
FC. output for the discriminator
128 FC., BNorm, LReLU, Dropout
FC. output for the auxiliary function


Table 8: DTLC-GAN network architectures for the 3D Faces dataset.

Generator
Input: latent variables (noise and codes)
1024 FC., BNorm, ReLU
FC., BNorm, ReLU
64 Conv., BNorm, ReLU
1 Conv., Sigmoid

Discriminator / Auxiliary Function
Input: 1-channel gray image
64 Conv., LReLU
128 Conv., BNorm, LReLU
1024 FC., BNorm, LReLU
FC. output for the discriminator
128 FC., BNorm, LReLU
FC. output for the auxiliary function


Table 9: DTLC-GAN network architectures for the CelebA dataset.

Generator
Input: latent variables (noise and codes)
FC., BNorm, ReLU
256 Conv., BNorm, ReLU
128 Conv., BNorm, ReLU
64 Conv., BNorm, ReLU
3 Conv., Tanh

Discriminator / Auxiliary Function
Input: 3-channel color image
64 Conv., LReLU
128 Conv., BNorm, LReLU
256 Conv., BNorm, LReLU
512 Conv., BNorm, LReLU
FC. output for the discriminator
128 FC., BNorm, LReLU
FC. output for the auxiliary function


Table 10: DTLC-GAN network architectures for the simulated data.

Generator
Input: latent variables (noise and codes)
128 FC., ReLU
128 FC., ReLU
2 FC.

Discriminator / Auxiliary Function
Input: 2D simulated data (scaled by a factor of 4)
128 FC., ReLU
128 FC., ReLU
FC. output for the discriminator
128 FC., ReLU
FC. output for the auxiliary function

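The MLP generator in Table 10 is small enough to write out directly. The input dimension and the random weights below are illustrative stand-ins (a trained model would learn the weights adversarially):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Generator from Table 10: input -> 128 FC + ReLU -> 128 FC + ReLU -> 2 FC.
in_dim = 22  # hypothetical size of the concatenated noise and codes
W1, b1 = rng.normal(scale=0.1, size=(in_dim, 128)), np.zeros(128)
W2, b2 = rng.normal(scale=0.1, size=(128, 128)), np.zeros(128)
W3, b3 = rng.normal(scale=0.1, size=(128, 2)), np.zeros(2)

def generator(z):
    h = relu(z @ W1 + b1)
    h = relu(h @ W2 + b2)
    return h @ W3 + b3  # a 2D point per input

samples = generator(rng.normal(size=(16, in_dim)))
```

The discriminator/auxiliary network in Table 10 has the same 128-unit MLP shape, with the final FC layers replaced by the discriminator and auxiliary-function outputs.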