A Deep Factorization of Style and Structure in Fonts

Nikita Srivatsan 1  Jonathan T. Barron 2  Dan Klein 3  Taylor Berg-Kirkpatrick 4

1 Language Technologies Institute, Carnegie Mellon University, [email protected]
2 Google Research, [email protected]
3 Computer Science Division, University of California, Berkeley, [email protected]
4 Computer Science and Engineering, University of California, San Diego, [email protected]

Figure 1: Example fonts from the Capitals64 dataset. The task of font reconstruction involves generating missing glyphs from partially observed novel fonts.

Abstract

We propose a deep factorization model for typographic analysis that disentangles content from style. Specifically, a variational inference procedure factors each training glyph into the combination of a character-specific content embedding and a latent font-specific style variable. The underlying generative model combines these factors through an asymmetric transpose convolutional process to generate the image of the glyph itself. When trained on corpora of fonts, our model learns a manifold over font styles that can be used to analyze or reconstruct new, unseen fonts. On the task of reconstructing missing glyphs from an unknown font given only a small number of observations, our model outperforms both a strong nearest neighbors baseline and a state-of-the-art discriminative model from prior work.

1 Introduction

One of the most visible attributes of digital language data is its typography. A font makes use of unique stylistic features in a visually consistent manner across a broad set of characters while preserving the structure of each underlying form in order to be human readable, as shown in Figure 1. Modeling these stylistic attributes and how they compose with underlying character structure could aid typographic analysis and even allow for automatic generation of novel fonts. Further, the variability of these stylistic features presents a challenge for optical character recognition systems, which typically presume a library of known fonts. In the case of historical document recognition, for example, this problem is more pronounced due to the wide range of lost, ancestral fonts present in such data (Berg-Kirkpatrick et al., 2013; Berg-Kirkpatrick and Klein, 2014). Models that capture this wide stylistic variation of glyph images may eventually be useful for improving optical character recognition on unknown fonts.

In this work we present a probabilistic latent variable model capable of disentangling stylistic features of fonts from the underlying structure of each character. Our model represents the style of each font as a vector-valued latent variable, and parameterizes the structure of each character as a learned embedding. Critically, each style latent variable is shared by all characters within a font, while character embeddings are shared by characters of the same type across all fonts. Thus, our approach is related to a long literature on using tensor factorization as a method for disentangling style and content (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000; Vasilescu and Terzopoulos, 2002; Tang et al., 2013) and to recent deep tensor factorization techniques (Xue et al., 2017).

Inspired by neural methods' ability to disentangle loosely coupled phenomena in other domains, including both language and vision (Hu et al., 2017; Yang et al., 2017; Gatys et al., 2016; Zhu et al., 2017), we parameterize the distribution that combines style and structure in order to generate
glyph images as a transpose convolutional neural decoder (Dumoulin and Visin, 2016). Further, the decoder is fed character embeddings early on in the process, while the font latent variables directly parameterize the convolution filters. This architecture biases the model to capture the asymmetric process by which structure and style combine to produce an observed glyph.

We evaluate our learned representations on the task of font reconstruction. After being trained on a set of observed fonts, the system reconstructs missing glyphs in a set of previously unseen fonts, conditioned on a small observed subset of glyph images. Under our generative model, font reconstruction can be performed via posterior inference. Since the posterior is intractable, we demonstrate how a variational inference procedure can be used to perform both learning and accurate font reconstruction. In experiments, we find that our proposed latent variable model is able to substantially outperform both a strong nearest-neighbors baseline as well as a state-of-the-art discriminative system on a standard dataset for font reconstruction. Further, in qualitative analysis, we demonstrate how the learned latent space can be used to interpolate between fonts, hinting at the practicality of more creative applications.

    2 Related Work

Discussion of computational style and content separation in fonts dates at least as far back as the writings of Hofstadter (1983, 1995). Some prior work has tackled this problem through the use of bilinear factorization models (Freeman and Tenenbaum, 1997; Tenenbaum and Freeman, 2000), while others have used discriminative neural models (Zhang et al., 2018, 2020) and adversarial training techniques (Azadi et al., 2018). In contrast, we propose a deep probabilistic approach that combines aspects of both these lines of past work. Further, while some prior approaches to modeling fonts rely on stroke or topological representations of observed glyphs (Campbell and Kautz, 2014; Phan et al., 2015; Suveeranont and Igarashi, 2010), ours directly models pixel values in rasterized glyph representations and allows us to more easily generalize to fonts with variable glyph topologies.

Finally, while we focus our evaluation on font reconstruction, our approach has an important relationship with style transfer, a framing made explicit by Zhang et al. (2018, 2020), as the goal of our analysis is to learn a smooth manifold of font styles that allows for stylistic inference given a small sample of glyphs. However, many other style transfer tasks in the language domain (Shen et al., 2017) suffer from ambiguity surrounding the underlying division between style and semantic content. By contrast, in this setting the distinction is clearly defined, with content (i.e. the character) observed as a categorical label denoting the coarse overall shape of a glyph, and style (i.e. the font) explaining lower-level visual features such as boldness, texture, and serifs. The modeling approach taken here might inform work on more complex domains where the division is less clear.

    3 Font Reconstruction

We can view a collection of fonts as a matrix, X, where each column corresponds to a particular character type, and each row corresponds to a specific font. Each entry in the matrix, x_ij, is an image of the glyph for character i in the style of a font j, which we observe as a 64 × 64 grayscale image as shown in Figure 1. In a real world setting, the equivalent matrix would naturally have missing entries wherever the encoding for a character type in a font is undefined. In general, not all fonts contain renderings of all possible character types; many will only support one particular language or alphabet and leave out uncommon symbols. Further, for many commercial applications, only the small subset of characters that appears in a specific advertisement or promotional message will have been designed by the artist, so the majority of glyphs are missing. As a result, we may wish to have models that can infer these missing glyphs, a task referred to as font reconstruction.
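For concreteness, this data layout can be sketched as a partially observed tensor together with an observation mask; the array names and the number of fonts below are illustrative, not taken from the paper.

```python
# Illustrative sketch of the font matrix X with missing entries (not the paper's code).
import numpy as np

I, J = 26, 100                                    # character types x fonts (J arbitrary here)
X = np.zeros((I, J, 64, 64), dtype=np.float32)    # glyph images x_ij
observed = np.zeros((I, J), dtype=bool)           # True where a glyph is available

# During training all glyphs of a training font are observed; at test time only a
# small random subset of each new font is revealed.
test_font = 0
revealed = np.random.choice(I, size=4, replace=False)
observed[revealed, test_font] = True
```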

Following recent prior work (Azadi et al., 2018), we define the task setup as follows: During training we have access to a large collection of observed fonts for the complete character set. At test time we are required to predict the missing glyph images in a collection of previously unseen fonts with the same character set. Each test font will contain observable glyph images for a small randomized subset of the character set. Based on the style of this subset, the model must reconstruct glyphs for the rest of the character set.

Font reconstruction can be thought of as a form of matrix completion; given various observations in both a particular row and column, we wish to

Figure 2: Depiction of the generative process of our model. Each observed glyph image is generated conditioned on the latent variable of the corresponding font and the embedding parameter of the corresponding character type. For a more detailed description of the decoder architecture and hyperparameters, see Appendix A.

reconstruct the element at their intersection. Alternatively we can view it as a few-shot style transfer task, in that we want to apply the characteristic attributes of a new font (e.g. serifs, italicization, drop-shadow) to a letter using a small number of examples to infer those attributes. Past work on font reconstruction has focused on discriminative techniques. For example Azadi et al. (2018) used an adversarial network to directly predict held out glyphs conditioned on observed glyphs. By contrast, we propose a generative approach using a deep latent variable model. Under our approach fonts are generated based on an unobserved style embedding, which we can perform inference over given any number of observations.

    4 Model

Figure 2 depicts our model's generative process. Given a collection of images of glyphs consisting of I character types across J fonts, our model hypothesizes a separation of character-specific structural attributes and font-specific stylistic attributes into two different representations. Since all characters are observed in at least one font, each character type is represented as an embedding vector which is part of the model's parameterization. In contrast, only a subset of fonts is observed during training and our model will be expected to generalize to reconstructing unseen fonts at test time. Thus, our representation of each font is treated as a vector-valued latent variable rather than a deterministic embedding.

More specifically, for each font in the collection, a font embedding variable, z_j ∈ R^k, is sampled from a fixed multivariate Gaussian prior, p(z_j) = N(0, I_k). Next, each glyph image, x_ij, is generated independently, conditioned on the corresponding font variable, z_j, and a character-specific parameter vector, e_i ∈ R^k, which we refer to as a character embedding. Thus, glyphs of the same character type share a character embedding, while glyphs of the same font share a font variable. A corpus of I ∗ J glyphs is modeled with only I character embeddings and J font variables, as seen in the left half of Figure 2. This modeling approach can be thought of as a form of deep matrix factorization, where the content at any given cell is purely a function of the vector representations of the corresponding row and column. We denote the full corpus of glyphs as a matrix X = ((x_11, ..., x_1J), ..., (x_I1, ..., x_IJ)) and denote the corresponding character embeddings as E = (e_1, ..., e_I) and font variables as Z = (z_1, ..., z_J).

Under our model, the probability distribution over each image, conditioned on z_j and e_i, is parameterized by a neural network, described in the next section and depicted in Figure 2. We denote this decoder distribution as p(x_ij | z_j; e_i, φ), and let φ represent parameters, shared by all glyphs, that govern how font variables and character embeddings combine to produce glyph images. In contrast, the character embedding parameters, e_i, which feed into the decoder, are only shared by glyphs of the same character type. The font variables, z_j, are unobserved during training and will be inferred at test time. The joint probability under our model is given by:

p(X, Z; E, φ) = ∏_{i,j} p(x_ij | z_j; e_i, φ) p(z_j)
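A minimal sketch of this generative story, assuming a `decoder` callable that maps (z_j, e_i) to the per-pixel output parameters described in Section 4.1; all names here are placeholders rather than the authors' code.

```python
# Sketch of the generative process for one font (hypothetical names).
import torch

I, k = 26, 32
E = torch.nn.Embedding(I, k)              # character embeddings e_i (model parameters)

def generate_font(decoder, j=None):
    z_j = torch.randn(k)                  # z_j ~ N(0, I_k), shared by all glyphs of font j
    glyph_params = []
    for i in range(I):                    # each glyph is generated conditionally independently
        e_i = E(torch.tensor(i))
        glyph_params.append(decoder(z_j, e_i))   # parameters of p(x_ij | z_j; e_i, phi)
    return torch.stack(glyph_params)      # shape (I, 64, 64)
```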

    4.1 Decoder Architecture

One way to encourage the model to learn disentangled representations of style and content is by choosing an architecture that introduces helpful inductive bias. For this domain, we can think of the character type as specifying the overall shape of the image, and the font style as influencing the finer details; we formulate our decoder with this difference in mind. We hypothesize that a glyph can be modeled in terms of a low-resolution character representation to which a complex operator specific to that font has been applied.

The success of transpose convolutional layers at generating realistic images suggests a natural way to apply this intuition. A transpose convolution¹ is a convolution performed on an undecimated input (i.e. with zeros inserted in between pixels in alternating rows and columns), resulting in an upscaled output. Transpose convolutional architectures generally start with a low resolution input which is passed through several such layers, iteratively increasing the resolution and decreasing the number of channels until the final output dimensions are reached. We note that the asymmetry between the coarse input and the convolutional filters closely aligns with the desired inductive biases, and therefore use this framework as a starting point for our architecture.

Broadly speaking, our architecture represents the underlying shape that defines the specific character type (but not the font) as coarse-grained information that therefore enters the transpose convolutional process early on. In contrast, the stylistic content that specifies attributes of the specific font (such as serifs, drop shadow, texture) is represented as finer-grained information that enters into the decoder at a later stage, by parameterizing filters, as shown in the right half of Figure 2. Specifically we form our decoder as follows: first the character embedding is projected to a low resolution matrix with a large number of channels. Following that, we apply several transpose convolutional layers which increase the resolution, and reduce the number of channels. Critically, the convolutional filter at each step is not a learned parameter of the model, but rather the output of a small multilayer perceptron whose input is the font latent variable z. Between these transpose convolutions, we insert vanilla convolutional layers to fine-tune following the increase in resolution.

¹ Sometimes erroneously referred to as a "deconvolution".

Overall, the decoder consists of four blocks, where each block contains a transpose convolution, which upscales the previous layer and reduces the number of channels by a factor of two, followed by two convolutional layers. Each (transpose) convolution is followed by an instance norm and a ReLU activation. The convolution filters all have a kernel size of 5 × 5. The character embedding is reshaped via a two-layer MLP into an 8 × 8 × 256 tensor before being fed into the decoder. The final 64 × 64 dimensional output layer is treated as a grid of parameters which defines the output distribution on pixels. We describe the specifics of this distribution in the next section.
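The defining feature of this decoder is that the transpose-convolution filters are not free parameters but are predicted from the font latent variable by a small MLP. The sketch below shows one such block under assumed channel sizes; for brevity a single MLP emits both kernel and bias, and the exact ordering of normalization and activation may differ slightly from the configuration in Appendix A.

```python
# Sketch of one decoder block whose transpose-conv filters come from an MLP on z
# (hypernetwork-style); channel sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FontConditionedUpsample(nn.Module):
    def __init__(self, z_dim, in_ch, out_ch, ksize=5):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, ksize
        n_weights = in_ch * out_ch * ksize * ksize
        # MLP mapping the font variable z to the transpose-conv kernel and bias
        self.hyper = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(),
                                   nn.Linear(128, n_weights + out_ch))
        self.norm = nn.InstanceNorm2d(out_ch)

    def forward(self, h, z):
        # h: (num_glyphs_in_font, in_ch, H, W) activations for ONE font; z: (z_dim,)
        out = self.hyper(z)
        n = self.in_ch * self.out_ch * self.k ** 2
        w = out[:n].view(self.in_ch, self.out_ch, self.k, self.k)
        b = out[n:]
        # stride 2, padding 2, output_padding 1 doubles the spatial resolution (cf. T_{i,2,1})
        h = F.conv_transpose2d(h, w, bias=b, stride=2, padding=2, output_padding=1)
        return F.relu(self.norm(h))
```

Because the filters are regenerated from z for every font, all glyphs of a font pass through the same style-specific operator, while the coarse input they share comes from the character embedding.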

    4.2 Projected Loss

The conventional approach for computing loss on image observations is to use an independent output distribution, typically a Gaussian, on each pixel's intensity. However, deliberate analysis of the statistics of natural images has shown that images are not well-described in terms of statistically independent pixels, but are instead better modeled in terms of edges (Field, 1987; Huang and Mumford, 1999). It has also been demonstrated that images of text have similar statistical distributions as natural images (Melmer et al., 2013). Following this insight, as our reconstruction loss we use a heavy-tailed (leptokurtotic) distribution placed on a transformed representation of the image, similar to the approach of Barron (2019). Modeling the statistics of font glyphs in this fashion results in sharper samples, while modeling independent pixels with a Gaussian distribution results in blurry, oversmoothed results.

More specifically, we adopt one of the strategies employed in Barron (2019), and transform image observations using the orthonormal variant of the 2-Dimensional Discrete Cosine Transform (2-D DCT-II) (Ahmed et al., 1974), which we

Figure 3: Depiction of the computation graph of the amortized variational lower bound (for simplicity, only one font is shown). The encoder approximates the generative model's true posterior over the font style latent variables given the observations. It remains insensitive to the number of observations by pooling high-level features across glyphs. For a more specific description of the encoder architecture details, see Appendix A.

denote as f : R^{64×64} → R^{64×64} for our 64 × 64 dimensional image observations. We transform both the observed glyph image and the corresponding grid of parameters produced by our decoder before computing the observation's likelihood.

This procedure projects observed images onto a grid of orthogonal bases comprised of shifted and scaled 2-dimensional cosine functions. Because the DCT-II is orthonormal, this transformation is volume-preserving, and so likelihoods computed in the projected space correspond to valid measurements in the original pixel domain.

Note that because the DCT-II is simply a rotation in our vector space, imposing a normal distribution in this transformed space should have little effect (ignoring the scaling induced by the diagonal of the covariance matrix of the Gaussian distribution) as Euclidean distance is preserved under rotations. For this reason we impose a heavy-tailed distribution in this transformed space, specifically a Cauchy distribution. This gives the following probability density function

g(x; x̂, γ) = 1 / ( πγ ( 1 + ( (f(x) − f(x̂)) / γ )² ) )

where x is an observed glyph, x̂ is the location parameter grid output by our decoder, and γ is a hyperparameter which we set to γ = 0.001.

The Cauchy distribution accurately captures the heavy-tailed structure of the edges in natural images. Intuitively, it models the fact that images tend to be mostly smooth, with a small amount of non-smooth variation in the form of edges. Computing this heavy-tailed loss over the frequency decomposition provided by the DCT-II instead of the raw pixel values encourages the decoder to generate sharper images without needing either an adversarial discriminator or a vectorized representation of the characters during training. Note that while our training loss is computed in DCT-II space, at test time we treat the raw grid of parameter outputs x̂ as the glyph reconstruction.
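A sketch of this projected loss, assuming γ = 0.001 as above; the orthonormal DCT-II matrix is built explicitly so the example is self-contained, and `x_hat` stands for the decoder's location parameter grid.

```python
# Sketch of the Cauchy negative log-likelihood computed on 2-D DCT-II coefficients.
import numpy as np

def dct_matrix(n=64):
    # Orthonormal DCT-II basis: D @ x applies a 1-D DCT-II to a length-n signal.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def cauchy_nll(x, x_hat, gamma=0.001):
    D = dct_matrix(x.shape[0])
    fx, fx_hat = D @ x @ D.T, D @ x_hat @ D.T   # 2-D DCT-II of image and decoder output
    r = (fx - fx_hat) / gamma
    return np.sum(np.log(np.pi * gamma * (1.0 + r ** 2)))   # negative log g(x; x_hat, gamma)
```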

    5 Learning and Inference

Note that in our training setting, the font variables Z are completely unobserved, and we must induce their manifold with learning. As our model is generative, we wish to optimize the marginal probability of just the observed X with respect to the model parameters E and φ:

p(X; E, φ) = ∫_Z p(X, Z; E, φ) dZ

However, the integral over Z is computationally intractable, given that the complex relationship between Z and X does not permit a closed form solution. Related latent variable models such as Variational Autoencoders (VAEs) (Kingma and Welling, 2014) with intractable marginals have successfully performed learning by optimizing a variational lower bound on the log marginal likelihood. This surrogate objective, called the evidence lower bound (ELBO), introduces a variational approximation, q(Z|X) = ∏_j q(z_j | x_1j, ..., x_Ij), to the model's true posterior, p(Z|X). Our model's ELBO is as follows:

ELBO = Σ_j E_q[ log p(x_1j, ..., x_Ij | z_j) ] − KL( q(z_j | x_1j, ..., x_Ij) || p(z_j) )

where the approximation q is parameterized via a neural encoder network. This lower bound can be optimized by stochastic gradient ascent if q is a Gaussian, via the reparameterization trick described in (Kingma and Welling, 2014; Rezende et al., 2014) to sample the expectation under q while still permitting backpropagation.
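A sketch of the reparameterized objective for a single font, with hypothetical helper names (`decoder_log_prob` stands in for log p(x_ij | z_j; e_i, φ)); the KL term uses the closed form for a diagonal Gaussian against the standard normal prior.

```python
# Sketch of a one-sample Monte Carlo estimate of the ELBO for one font.
import torch

def elbo_one_font(decoder_log_prob, mu, logvar, observed_glyphs, char_embeddings):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * logvar) * eps
    # Reconstruction term: sum of log p(x_ij | z_j; e_i, phi) over observed glyphs
    rec = sum(decoder_log_prob(x, z, e) for x, e in zip(observed_glyphs, char_embeddings))
    # KL(q(z_j | x_1j..x_Ij) || N(0, I)), closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec - kl
```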

Practically speaking, a key property which we desire is the ability to perform consistent inference over z given a variable number of observed glyphs in a font. We address this in two ways: through the architecture of our encoder, and through a special masking process in training, both of which are shown in Figure 3.

5.1 Posterior Approximation

Observation Subsampling: To get reconstructions from only partially observed fonts at test time, the encoder network must be able to infer z_j from any subset of (x_1j, ..., x_Ij). One approach for achieving robustness to the number of observations is through the training procedure. Specifically when computing the approximate posterior for a particular font in our training corpus, we mask out a randomly selected subset of the characters before passing them to the encoder. This incentivizes the encoder to produce reasonable estimates without becoming too reliant on the features extracted from any one particular character, which more closely matches the setup at test time.

Encoder Architecture: Another way to encourage this robustness is through inductive bias in the encoder architecture. Specifically we use a convolutional neural network which takes in a batch of characters from a single font, concatenated with their respective character type embedding. Following the final convolutional layer, we perform an elementwise max operation across the batch, reducing to a single vector representation for the entire font which we pass through further fully-connected layers to obtain the output parameters of q as shown in Figure 3. By including this accumulation across the elements of the batch, we combine the features obtained from each character in a manner that is largely invariant to the total number and types of characters observed. This provides an inductive bias that encourages the model to extract similar features from each character type, which should therefore represent stylistic as opposed to structural properties.

Overall, the encoder over each glyph consists of three blocks, where each block consists of a convolution followed by a max pool with a stride of two, an instance norm (Ulyanov et al., 2016), and a ReLU. The activations are then pooled across the characters via an elementwise max into a single vector, which is then passed through four fully-connected layers, before predicting the parameters of the Gaussian approximate posterior.

Reconstruction via Inference: At test time, we pass an observed subset of a new font to our encoder in order to estimate the posterior over z_j, and take the mean of that distribution as the inferred font representation. We then pass this encoding to the decoder along with the full set of character embeddings E in order to produce reconstructions of every glyph in the font.
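A sketch of the pooling and posterior head described above, with illustrative layer sizes; the full stack of convolutional and fully-connected layers is listed in Appendix A.

```python
# Sketch of the inference network's glyph pooling (hypothetical names): per-glyph
# features are reduced with an elementwise max over however many glyphs are
# observed, so the posterior parameters do not depend on their number.
import torch
import torch.nn as nn

class FontPosterior(nn.Module):
    def __init__(self, feat_dim=1024, z_dim=32):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2 * z_dim))   # mean and log-variance

    def forward(self, glyph_feats):
        # glyph_feats: (num_observed_glyphs, feat_dim) from the convolutional trunk
        pooled, _ = glyph_feats.max(dim=0)       # elementwise max across observed glyphs
        mu, logvar = self.head(pooled).chunk(2, dim=-1)
        return mu, logvar
```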

    6 Experiments

We now provide an overview of the specifics of the dataset and training procedure, and describe our experimental setup and baselines.

    6.1 Data

We compare our model against baseline systems at font reconstruction on the Capitals64 dataset (Azadi et al., 2018), which contains the 26 capital letters of the English alphabet as grayscale 64 × 64 pixel images across 10,682 fonts. These are broken down into training, dev, and test splits of 7649, 1473, and 1560 fonts respectively.

Upon manual inspection of the dataset, it is apparent that several fonts have an almost visually indistinguishable nearest neighbor, making the reconstruction task trivial using a naive algorithm (or an overfit model with high capacity) for those particular font archetypes. Because these datapoints are less informative with respect to a model's ability to generalize to previously unseen styles, we additionally evaluate on a second test set designed to avoid this redundancy. Specifically, we choose the 10% of test fonts that have maximal L2 distance from their closest equivalent in the training set, which we call "Test Hard".
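A sketch of how such a hard subset can be selected, assuming the distance to the closest training font is measured by L2 over all glyph pixels (the paper does not spell out this detail); names are illustrative.

```python
# Sketch of "Test Hard" selection: keep the 10% of test fonts whose nearest
# training font (by L2 over all glyphs) is farthest away.
import numpy as np

def hard_subset(test_fonts, train_fonts, frac=0.10):
    flat_test = test_fonts.reshape(len(test_fonts), -1)
    flat_train = train_fonts.reshape(len(train_fonts), -1)
    # distance from each test font to its closest training font
    d_min = np.array([np.min(((flat_train - f) ** 2).sum(axis=1)) for f in flat_test])
    k = int(frac * len(test_fonts))
    return np.argsort(-d_min)[:k]          # indices of the hardest test fonts
```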

                       Test Full                          Test Hard
Observations      1        2        4        8        1        2        4        8
NN             483.13   424.49   386.81   363.97   880.22   814.67   761.29   735.18
GlyphNet       669.36   533.97   455.23   416.65   935.01   813.50   718.02   653.57
Ours (FC)      353.63   316.47   293.67   281.89   596.57   556.21   527.50   513.25
Ours (Conv)    352.07   300.46   271.03   254.92   615.87   556.03   511.05   489.58

Table 1: L2 reconstruction per glyph by number of observed characters. "Full" includes the entire test set while "Hard" is measured only over the 10% of test fonts with the highest L2 distance from the closest font in train.

    6.2 Baselines

As stated previously, many fonts fall into visually similar archetypes. Based on this property, we use a nearest neighbors algorithm for our first baseline. Given a partially observed font at test time, this approach simply "reconstructs" by searching the training set for the font with the lowest L2 distance over the observed characters, and copying its glyphs verbatim for the missing characters.
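A sketch of this baseline with hypothetical array names:

```python
# Sketch of the nearest-neighbors baseline: find the training font closest in L2
# over the observed glyph slots and copy its remaining glyphs verbatim.
import numpy as np

def nn_reconstruct(train_fonts, observed, observed_idx):
    # train_fonts: (num_train, 26, 64, 64); observed: (len(observed_idx), 64, 64)
    candidates = train_fonts[:, observed_idx]                     # same glyph slots
    dists = ((candidates - observed[None]) ** 2).sum(axis=(1, 2, 3))
    return train_fonts[np.argmin(dists)]                          # copy all 26 glyphs
```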

For our second comparison, we use the GlyphNet model from Azadi et al. (2018). This approach is based on a generative adversarial network, which uses discriminators to encourage the model to generate outputs that are difficult to distinguish from those in the training set. We test from the publicly available epoch 400 checkpoint, with modifications to the evaluation script to match the setup described above.

We also perform an ablation using fully-connected instead of convolutional layers. For more architecture details see Appendix A.

    6.3 Training Details

We train our model to maximize the expected log likelihood using the Adam optimization algorithm (Kingma and Ba, 2015) with a step size of 10^−5 (default settings otherwise), and perform early stopping based on the approximate log likelihood on a hard subset of dev selected by the process described earlier. To encourage robustness in the encoder, we randomly drop out glyphs during training with a probability of 0.7 (rejecting samples where all characters in a font are dropped). All experiments are run with a dimensionality of 32 for the character embeddings and font latent variables. Our implementation² is built in PyTorch (Paszke et al., 2017) version 1.1.0.

² https://bitbucket.org/NikitaSrivatsan/DeepFactorizationFontsEMNLP19
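A sketch of the glyph dropout described above (drop probability 0.7, resampling when every glyph of a font would be dropped); the function name is illustrative.

```python
# Sketch of per-font observation masking used during training.
import torch

def sample_observation_mask(num_chars=26, p_drop=0.7):
    while True:
        keep = torch.rand(num_chars) >= p_drop
        if keep.any():                      # reject masks that drop every character
            return keep                     # True = glyph shown to the encoder
```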

    7 Results

We now present quantitative results from our experiments in both automated and human annotated metrics, and offer qualitative analysis of reconstructions and the learned font manifold.

7.1 Quantitative Evaluation

Automatic Evaluation: We show font reconstruction results for our system against nearest neighbors and GlyphNet in Table 1. Each model is given a random subsample of glyphs from each test font (we measure at 1, 2, 4, and 8 observed characters), with their character labels. We measure the average L2 distance between the image reconstructions for the unobserved characters and the ground truth, after scaling intensities to [0, 1].

Our system achieves the best performance for both the overall and hard subset of test for all numbers of observed glyphs. Nearest neighbors provides a strong baseline on the full test set, even outperforming GlyphNet. However it performs much worse on the hard subset. This makes sense as we expect nearest neighbors to do extremely well on any test fonts that have a close equivalent in train, but suffer in fidelity on less traditional styles. GlyphNet similarly performs worse on test hard, which could reflect the missing modes problem of GANs failing to capture the full diversity of the data distribution (Che et al., 2016; Tolstikhin et al., 2017). The fully-connected ablation is also competitive, although we see that the convolutional architecture is better able to infer style from larger numbers of observations. On the hard test set, the fully-connected network even outperforms the convolutional system when only one observation is present, perhaps indicating that its lower-capacity architecture better generalizes from very limited data.


Figure 4: Reconstructions of partially observed fonts in the hard subset from our model, GlyphNet, and nearest neighbors. Given images of glyphs for 'A' and 'B' in each font, we visualize reconstructions of the remaining characters. Fonts are chosen such that the L2 loss of our model on these examples closely matches the average loss over the full evaluation set.

Human Evaluation: To measure how consistent these perceptual differences are, we also perform a human evaluation of our model's reconstructions against GlyphNet using Amazon Mechanical Turk (AMT). In our setup, turkers were asked to compare the output of our model against the output of GlyphNet for a single font given one observed character, which they were also shown. Turkers selected a ranking based on which reconstructed font best matched the style of the observed character, and a separate ranking based on which was more realistic. On the full test set (1560 fonts, with 5 turkers assigned to each font), humans preferred our system over GlyphNet 81.3% and 81.8% of the time for style and realism respectively. We found that on average 76% of turkers shown a given pair of reconstructions selected the same ranking as each other on both criteria, suggesting high annotator agreement.

    7.2 Qualitative Analysis

In order to fully understand the comparative behavior of these systems, we also qualitatively compare the reconstruction output of these systems to analyze their various failure modes, showing examples in Figure 4. We generally find that our approach tends to produce reconstructions that,

Figure 5: t-SNE projection of latent font variables inferred for the full training set, colored by k-means clustering with k = 10. The glyph for the letter "A" for each centroid is shown in overlay.

while occasionally blurry at the edges, are generally faithful at reproducing the principal stylistic features of the font. For example, we see that for font (1) in Figure 4, we match not only the overall shape of the letters, but also the drop shadow and to an extent the texture within the lettering, while GlyphNet does not produce fully enclosed letters or match the texture. The output of nearest neighbors, while well-formed, does not respect the style of the font as closely, as it fails to find a font in training that matches these stylistic properties. In font (2) the systems all produce a form of gothic lettering, but the output of GlyphNet is again lacking in certain details, and nearest neighbors makes subtle but noticeable changes to the shape of the letters. In the final example (3) we even see that our system appears to attempt to replicate the pixelated outline, while nearest neighbors ignores this subtlety. GlyphNet is in this case somewhat inconsistent, doing reasonably well on some letters, but much worse on others. Overall, nearest neighbors will necessarily output well-formed glyphs, but with lower fidelity to the style, particularly on more unique fonts. While GlyphNet does pick up on some subtle features, our model tends to produce the most coherent output on harder fonts.

7.3 Analysis of Learned Manifold

Since our model attempts to learn a smooth manifold over the latent style, we can also perform interpolation between the inferred font representations, something which is not directly possible using either of the baselines. In this analysis, we

Figure 6: Interpolation between font variants from the same font family, showing smoothness of the latent manifold. Linear combinations of the embedded fonts correspond to outputs that lie intuitively "in between".


    Figure 7: t-SNE projection of latent font variables inferred on Google Fonts, colored by weight and category.

take two fonts from the same font family, which differ along one property, and pass them through our encoder to obtain the latent font variable for each. We then interpolate between these values, passing the result at various steps into our decoder to produce new fonts that exist in between the observations. In Figure 6 we see how our model can apply serifs, italicization, and boldness gradually while leaving the font unchanged in other respects. This demonstrates that our manifold is smooth and interpretable, not just at the points corresponding to those in our dataset. This could be leveraged to modify existing fonts with respect to a particular attribute to generate novel fonts efficiently.
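A sketch of this interpolation procedure with hypothetical encoder and decoder handles:

```python
# Sketch of latent interpolation between two fonts: decode convex combinations of
# the two inferred style vectors (posterior means).
import torch

def interpolate_fonts(encoder_mean, decoder, glyphs_a, glyphs_b, char_embeddings, steps=5):
    z_a, z_b = encoder_mean(glyphs_a), encoder_mean(glyphs_b)   # inferred font variables
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1 - alpha) * z_a + alpha * z_b
        frames.append(torch.stack([decoder(z, e) for e in char_embeddings]))
    return frames                                               # list of (26, 64, 64) tensors
```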

Beyond looking at the quality of reconstructions, we also wish to analyze properties of the latent space learned by our model. To do this, we use our encoder to infer latent font variables z for each of the fonts in our training data, and use a t-SNE projection (Van Der Maaten, 2013) to plot them in 2D, shown in Figure 5. Since the broader, long-distance groupings may not be preserved by this transformation, we perform a k-means clustering (Lloyd, 1982) with k = 10 on the high-dimensional representations to visualize these high-level groupings. Additionally, we display a sample glyph for the centroid for each cluster. We see that while the clusters are not completely separated, the centroids generally correspond to common font styles, and are largely adjacent to or overlapping with those with similar stylistic properties; for example script, handwritten, and gothic styles are very close together.
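A sketch of this analysis, assuming scikit-learn's TSNE and KMeans implementations stand in for the specific Barnes-Hut t-SNE and k-means variants cited above:

```python
# Sketch of the manifold visualization: cluster the inferred font variables in the
# original space, then project them to 2-D for plotting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def visualize_latents(Z):                   # Z: (num_fonts, 32) inferred font variables
    labels = KMeans(n_clusters=10, random_state=0).fit_predict(Z)   # high-dimensional clusters
    coords = TSNE(n_components=2).fit_transform(Z)                  # 2-D projection
    return coords, labels
```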

To analyze how well our latent embeddings correspond to human defined notions of font style, we also run our system on the Google Fonts³ dataset, which, despite containing fewer fonts and less diversity in style, lists metadata including numerical weight and category (e.g. serif, handwriting, monospace). In Figure 7 we show t-SNE projections of latent font variables from our model trained on Google Fonts, colored accordingly. We see that the embeddings do generally cluster by weight as well as category, suggesting that our model is learning latent information consistent with how humans perceive font style.

³ https://github.com/google/fonts


8 Conclusion

We presented a latent variable model of glyphs which learns disentangled representations of the structural properties of underlying characters from stylistic features of the font. We evaluated our model on the task of font reconstruction and showed that it outperformed both a strong nearest neighbors baseline and prior work based on GANs, especially for fonts highly dissimilar to any instance in the training set. In future work, it may be worth extending this model to learn a latent manifold on content as well as style, which could allow for reconstruction of previously unseen character types, or generalization to other domains where the notion of content is higher dimensional.

    Acknowledgments

This project is funded in part by the NSF under grant 1618044 and by the NEH under grant HAA-256044-17. Special thanks to Si Wu for help constructing the Google Fonts figures.

References

Nasir Ahmed, T. Natarajan, and Kamisetty R. Rao. 1974. Discrete cosine transform. IEEE Transactions on Computers.

Samaneh Azadi, Matthew Fisher, Vladimir G. Kim, Zhaowen Wang, Eli Shechtman, and Trevor Darrell. 2018. Multi-content GAN for few-shot font style transfer. CVPR.

Jonathan T. Barron. 2019. A general and adaptive robust loss function. CVPR.

Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. ACL.

Taylor Berg-Kirkpatrick and Dan Klein. 2014. Improved typesetting models for historical OCR. ACL.

Neill D. F. Campbell and Jan Kautz. 2014. Learning a manifold of fonts. ACM TOG.

Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. 2016. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136.

Vincent Dumoulin and Francesco Visin. 2016. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

David J. Field. 1987. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A.

William T. Freeman and Joshua B. Tenenbaum. 1997. Learning bilinear models for two-factor problems in vision. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 554-560. IEEE.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. CVPR.

Douglas R. Hofstadter. 1983. Metamagical themas. Scientific American, 248(5):16-E18.

Douglas R. Hofstadter. 1995. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. Basic Books.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Controllable text generation. arXiv preprint arXiv:1703.00955.

Jinggang Huang and David Mumford. 1999. Statistics of natural images and models. CVPR.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. ICLR.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. ICLR.

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE Transactions on Information Theory.

Tamara Melmer, Seyed Ali Amirshahi, Michael Koch, Joachim Denzler, and Christoph Redies. 2013. From regular text to artistic writing and artworks: Fourier statistics of images with low and high aesthetic appeal. Frontiers in Human Neuroscience.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Huy Quoc Phan, Hongbo Fu, and Antoni B. Chan. 2015. Flexyfont: Learning transferring rules for flexible typeface synthesis. In Computer Graphics Forum, volume 34, pages 245-256. Wiley Online Library.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. ICML.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. NIPS.

Rapee Suveeranont and Takeo Igarashi. 2010. Example-based automatic font generation. In International Symposium on Smart Graphics, pages 127-138. Springer.

Yichuan Tang, Ruslan Salakhutdinov, and Geoffrey Hinton. 2013. Tensor analyzers. In International Conference on Machine Learning, pages 163-171.

Joshua B. Tenenbaum and William T. Freeman. 2000. Separating style and content with bilinear models. Neural Computation, 12(6):1247-1283.

Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. 2017. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems, pages 5424-5433.

Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. 2016. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

Laurens Van Der Maaten. 2013. Barnes-Hut-SNE. arXiv preprint arXiv:1301.3342.

M. Alex O. Vasilescu and Demetri Terzopoulos. 2002. Multilinear analysis of image ensembles: TensorFaces. In European Conference on Computer Vision, pages 447-460. Springer.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. IJCAI.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. ICML.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2018. Separating style and content for generalized style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8447-8455.

Yexun Zhang, Ya Zhang, and Wenbin Cai. 2020. A unified framework for generalizable style transfer: Style and content separation. IEEE Transactions on Image Processing, 29:4085-4098.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV.

    A Appendix

    A.1 Architecture Notation

We now provide further details on the specific architectures used to parameterize our model and inference network. The following abbreviations are used to represent various components:

• Fi : fully-connected layer with i hidden units
• R : ReLU activation
• M : batch max pool
• S : 2 × 2 spatial max pool
• Ci : convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of 1, dilation of 1
• I : instance normalization
• Ti,j,k : transpose convolution with i filters of 5 × 5, 2 pixel zero-padding, stride of j, k pixel output padding, dilation of 1, where kernel and bias are the output of an MLP (described below)
• H : reshape to 26 × 256 × 8 × 8

A.2 Network Architecture

Our fully-connected encoder is:
F128-R-F128-R-F128-R-F1024-R-M-F128-R-F128-R-F128-R-F64

Our convolutional encoder is:
C64-S-I-R-C128-S-I-R-C256-S-I-R-F1024-M-R-F128-R-F128-R-F128-R-F64

Our fully-connected decoder is:
F128-R-F128-R-F128-R-F128-R-F128-R-F4096

Our transpose convolutional decoder is:
F128-R-F16384-R-H-T256,2,1-R-C256-I-R-C256-I-R-T128,2,1-R-C128-I-R-C128-I-R-T64,2,1-I-R-C64-I-R-C64-I-R-T32,1,0-I-R-C32-I-R-C1

The MLP used to compute a transpose convolutional parameter of size j is:
F128-R-Fj
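As a reading aid, the block notation above can be translated into standard PyTorch modules. The sketch below does this for the convolutional encoder trunk (C64-S-I-R-C128-S-I-R-C256-S-I-R); the input channel count is a placeholder, since the exact way each glyph is concatenated with its character type embedding is not specified here.

```python
# Sketch translating the encoder notation into PyTorch (not the authors' code);
# e.g. "C64-S-I-R" = Conv(64 filters, 5x5, pad 2) -> 2x2 max pool -> InstanceNorm -> ReLU.
import torch.nn as nn

def C(in_ch, out_ch):          # Ci with the settings listed above
    return nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2, stride=1)

in_ch = 2   # placeholder: glyph image plus character-embedding channel(s)
conv_encoder_trunk = nn.Sequential(
    C(in_ch, 64), nn.MaxPool2d(2), nn.InstanceNorm2d(64), nn.ReLU(),    # C64-S-I-R
    C(64, 128), nn.MaxPool2d(2), nn.InstanceNorm2d(128), nn.ReLU(),     # C128-S-I-R
    C(128, 256), nn.MaxPool2d(2), nn.InstanceNorm2d(256), nn.ReLU(),    # C256-S-I-R
)
# After this trunk each 64x64 glyph becomes a 256 x 8 x 8 feature map; the F1024
# layer, the batch max pool M, and the remaining fully-connected layers then
# produce the parameters of the Gaussian approximate posterior.
```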