Quantum machine learning: the what, why and how?
Roman Krems, University of British Columbia
Quantum machine learning
I. Machine learning implemented on quantum devices, e.g.
• Quantum neural networks
• Classical ML trained by optimization using quantum devices (such as D-Wave's annealer)

II. Machine learning for solving quantum problems
• Predicting the properties of complex quantum systems by ML
Goal of Quantum Machine Learning
QML seeks to exploit quantum devices in order to design:
• Algorithms that could be trained faster
• Algorithms that yield more accurate results
• Algorithms that process big data faster

Here, 'faster' and 'more accurate' are by comparison with classical ML algorithms.
Example 1: Quantum annealing
Quantum annealing
Say we have a Hamiltonian

H = \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z

with given b_i and w_{i,j}.

Question: what is the ground state of the system described by this Hamiltonian?
D-Wave's quantum annealer

H = A(t) \sum_i \sigma_i^x - B(t) \left[ \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z \right]

At the start of the anneal, A(t = 0) \gg B(t = 0) \approx 0; at the end, A(t) \approx 0 \ll B(t).

b_i and w_{i,j} are tunable parameters.
M. H. Amin, 2015, `Searching for quantum speed-up in quasistatic quantum annealers’
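To make the annealing target concrete, here is a minimal classical sketch (not part of the original slides) that brute-forces the ground state of a small instance of the Ising Hamiltonian above; the biases, couplings, and system size are arbitrary illustrative choices, and exhaustive enumeration is exactly the scaling the annealer is meant to avoid.

```python
import itertools
import numpy as np

# Small illustrative Ising problem: H = sum_i b_i s_i + sum_{i<j} w_ij s_i s_j, with s_i = +/-1.
rng = np.random.default_rng(0)
n = 8                                           # number of spins (small enough to enumerate)
b = rng.uniform(-1, 1, size=n)                  # biases b_i
w = np.triu(rng.uniform(-1, 1, (n, n)), k=1)    # couplings w_ij (upper triangle, i < j)

def energy(s: np.ndarray) -> float:
    """Classical Ising energy of a spin configuration s in {-1, +1}^n."""
    return float(b @ s + s @ w @ s)

# Exhaustive search over all 2^n configurations.
best_s, best_e = None, np.inf
for bits in itertools.product([-1, 1], repeat=n):
    s = np.array(bits)
    e = energy(s)
    if e < best_e:
        best_s, best_e = s, e

print("ground-state energy:", best_e)
print("ground-state configuration:", best_s)
```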
Example 2: Quantum Boltzmann machines
Boltzmann machine
A type of neural network
[FIG. 1: A graphical representation of two types of Boltzmann machines. (a) A 4-layer deep restricted Boltzmann machine (dRBM) where each black circle represents a hidden or visible unit and each edge represents a non-zero weight for the corresponding interaction. The output layer is often treated as a visible layer to provide a classification of the data input in the visible units at the bottom of the graph. (b) An example of a 5-unit full Boltzmann machine (BM). The hidden and visible units no longer occupy distinct layers due to connections between units of the same type.]
where N_train is the size of the training set, x_train is the set of training vectors, and \lambda is an L2-regularization term to combat overfitting. The derivative of O_ML with respect to the weights is

\frac{\partial O_{ML}}{\partial w_{i,j}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} - \lambda w_{i,j},   (4)

where the brackets denote the expectation values over the data and model for the BM. The remaining derivatives take a similar form [5].
Computing these gradients directly from (1) and (4) is exponentially hard in n_v and n_h; thus, classical approaches resort to approximations such as contrastive divergence [3, 5-8]. Unfortunately, contrastive divergence does not provide the gradient of any true objective function [9], it is known to lead to suboptimal solutions [10-12], it is not guaranteed to converge in the presence of certain regularization functions [9], and it cannot be used directly to train a full Boltzmann machine. We show that quantum computation provides a much better framework for deep learning and illustrate this by providing efficient alternatives to these methods that are elementary to analyze, accelerate the learning process and lead to better models for the training data.
GEQS Algorithm
We propose two quantum algorithms: Gradient Estimation via Quantum Sampling (GEQS) and Gradient Estimation via Quantum Amplitude Estimation (GEQAE). These algorithms prepare a coherent analog of the Gibbs state for Boltzmann machines and then draw samples from the resultant state to compute the expectation values in (4). Formal descriptions of the algorithms are given in the appendix. Existing algorithms for preparing these states [13-17] tend not to be efficient for machine learning applications or do not offer clear evidence of a quantum speedup. The inefficiency of [15, 17] is a consequence of the uniform initial state having small overlap with the Gibbs state. The complexities of these prior algorithms, along with our own, are given in Table I.

Our algorithms address this problem by using a non-uniform prior distribution for the probabilities of each configuration, which is motivated by the fact that we know a priori from the weights and the biases that certain configurations will be less likely than others. We obtain this distribution by using a mean-field (MF) approximation to the configuration probabilities. This approximation is classically efficient and typically provides a good approximation to the Gibbs states observed in practical machine learning problems [7, 18, 19]. Our algorithms exploit this prior knowledge to refine the Gibbs state from copies of the MF state. This allows the Gibbs distribution to be prepared efficiently and exactly if the two states are sufficiently close.

The MF approximation, Q(v, h), is defined to be the product distribution that minimizes the Kullback-Leibler divergence KL(Q||P). The fact that it is a product distribution means that it can be efficiently computed and also [...]
N. Wiebe, A. Kapoor, K. M. Svore, 2015, `Quantum Deep Learning'
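As context for the mean-field prior mentioned in the excerpt, the sketch below (not from the paper) runs a naive mean-field fixed-point iteration for a Boltzmann distribution over ±1 units with energy E(z) = -b·z - ½ zᵀWz; the resulting product distribution is the kind of classically cheap approximation the GEQS/GEQAE algorithms start from. The convention for W (symmetric, zero diagonal), the damping factor, and the random parameters are illustrative assumptions.

```python
import numpy as np

def mean_field(b: np.ndarray, W: np.ndarray, n_iter: int = 200, damping: float = 0.5):
    """Naive mean-field magnetizations m_i = <z_i> for P(z) ∝ exp(b·z + 0.5 zᵀWz), z_i = ±1.

    W is assumed symmetric with zero diagonal. Returns m and the product-distribution
    parameters q_i(z_i = +1) = (1 + m_i) / 2.
    """
    m = np.zeros_like(b)
    for _ in range(n_iter):
        m_new = np.tanh(b + W @ m)               # mean-field self-consistency condition
        m = damping * m_new + (1 - damping) * m  # damped update for stability
    q_plus = (1 + m) / 2
    return m, q_plus

# Tiny illustrative model
rng = np.random.default_rng(1)
n = 6
W = rng.normal(scale=0.3, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.5, size=n)

m, q_plus = mean_field(b, W)
print("mean-field magnetizations:", np.round(m, 3))
print("q_i(z_i = +1):", np.round(q_plus, 3))
```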
Boltzmann machine
v \Rightarrow [v_1, v_2, ..., v_V]   (visible units)
h \Rightarrow [h_1, h_2, ..., h_H]   (hidden units)
v_i = \pm 1,  h_i = \pm 1,  z = (v, h)

Energy:  E(v, h) = -\sum_i b_i z_i - \sum_{i,j} w_{i,j} z_i z_j

Probability to observe state v of the visible units:  P_v = \sum_h e^{-E(v,h)} / Z

Partition function:  Z = \sum_z e^{-E(v,h)}
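A minimal sketch (not in the slides) of these definitions for a tiny Boltzmann machine, computing E(v, h), Z, and P_v by explicit enumeration; the sizes and random parameters are illustrative, and brute-force enumeration is only feasible for a handful of units.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h = 3, 2                                     # visible and hidden units, small enough to enumerate
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)                   # biases b_i
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)  # couplings w_ij, i < j

def energy(z: np.ndarray) -> float:
    """E(v, h) = -sum_i b_i z_i - sum_{i<j} w_ij z_i z_j with z = (v, h)."""
    return float(-b @ z - z @ w @ z)

# Partition function Z = sum over all joint configurations z = (v, h)
configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
Z = sum(np.exp(-energy(z)) for z in configs)

def p_visible(v: np.ndarray) -> float:
    """P_v = sum_h exp(-E(v, h)) / Z, summing over all hidden configurations."""
    total = 0.0
    for hidden in itertools.product([-1, 1], repeat=n_h):
        total += np.exp(-energy(np.concatenate([v, hidden])))
    return total / Z

v = np.array([1, -1, 1])
print("P_v for v =", v, "is", p_visible(v))
```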
How to train a Boltzmann machine

Let's say you have a certain number of state vectors (each with dimension equal to the number of visible units).

The goal is to find a Boltzmann distribution in which the training vectors have high probability.

Thus, we want to maximize the following function:

L \propto \sum_{v \in \text{training set}} \log P_v,   with   P_v = \sum_h e^{-E(v,h)} / Z

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}
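A hedged sketch (not from the slides) of this gradient for a toy model: the data term averages v_i h_j over training vectors with h distributed according to P(h|v), and the model term averages v_i h_j over the full Boltzmann distribution; both are evaluated here by brute force so the two terms in ∂L/∂w_ij can be compared directly. All sizes and data are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_v, n_h = 3, 2
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    return float(-b @ z - z @ w @ z)

v_configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n_v)]
h_configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n_h)]
joint = [(v, h) for v in v_configs for h in h_configs]
weights = np.array([np.exp(-energy(np.concatenate([v, h]))) for v, h in joint])
P_joint = weights / weights.sum()                        # P(v, h)

# Model term: <v_i h_j> under the Boltzmann distribution
model_term = np.zeros((n_v, n_h))
for p, (v, h) in zip(P_joint, joint):
    model_term += p * np.outer(v, h)

# Data term: <v_i h_j> with v drawn from the training set and h ~ P(h | v)
training_set = [np.array([1, -1, 1]), np.array([1, 1, -1])]   # illustrative data
data_term = np.zeros((n_v, n_h))
for v in training_set:
    ph = np.array([np.exp(-energy(np.concatenate([v, h]))) for h in h_configs])
    ph /= ph.sum()                                       # conditional distribution P(h | v)
    data_term += sum(p * np.outer(v, h) for p, h in zip(ph, h_configs)) / len(training_set)

grad = data_term - model_term                            # ∝ dL/dw_ij (visible-hidden block)
print("gradient for visible-hidden weights:\n", np.round(grad, 3))
```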
How to train a Boltzmann machine

Another way to formulate the problem. Let's say you have a certain distribution of state vectors; let's denote it by P^+_v.

The goal is to find a Boltzmann machine where P_v = P^+_v.

Thus, we want to maximize the following function:

L \propto \sum_v P^+_v \log P_v,   with   P_v = \sum_h e^{-E(v,h)} / Z
How to use a Boltzmann machine for discrimination
G. Hinton, 2010, `A practical guide to training restricted Boltzmann machines'
Let’s say you have several classes of data. How to discriminate between them?
• Include a 'label' of the class as a visible unit
• Train the BM
• Try the test vector with all the different class labels
• The one with the lowest free energy is the most likely class
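A minimal sketch of this free-energy comparison (not from the slides). It follows the convention of Hinton's practical guide, with binary 0/1 units and the standard restricted-BM free energy F(v) = -a·v - Σ_j log(1 + exp(c_j + (Wᵀv)_j)), which differs slightly from the ±1 convention used elsewhere in this deck; the parameters here are untrained and purely illustrative.

```python
import numpy as np

def free_energy(v, a, c, W):
    """RBM free energy for binary 0/1 units: F(v) = -a·v - sum_j log(1 + exp(c_j + (W^T v)_j))."""
    return float(-a @ v - np.sum(np.logaddexp(0.0, c + W.T @ v)))

def classify(x, n_labels, a, c, W):
    """Append each candidate one-hot label to the feature vector and pick the lowest free energy."""
    energies = []
    for label in range(n_labels):
        one_hot = np.zeros(n_labels)
        one_hot[label] = 1.0
        v = np.concatenate([x, one_hot])     # label units are part of the visible layer
        energies.append(free_energy(v, a, c, W))
    return int(np.argmin(energies)), energies

# Illustrative (untrained) parameters: 4 feature units + 3 label units, 5 hidden units
rng = np.random.default_rng(4)
n_feat, n_labels, n_hid = 4, 3, 5
a = rng.normal(size=n_feat + n_labels)
c = rng.normal(size=n_hid)
W = rng.normal(scale=0.1, size=(n_feat + n_labels, n_hid))

label, energies = classify(np.array([1.0, 0.0, 1.0, 1.0]), n_labels, a, c, W)
print("predicted class:", label, "free energies:", np.round(energies, 3))
```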
The goal is to compute:

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}

For restricted Boltzmann machines (with bipartite graph structure), the first term is easy to compute using:

P(h_j = 1 | v) = \mathrm{sigm}\left( b_j + \sum_i w_{i,j} v_i \right)

However, the second term is hard to compute!

Run quantum annealing N times and compute

\langle v_i h_j \rangle_{model} = \frac{1}{N} \sum_n v_i^n h_j^n
Training a general Boltzmann machine is difficult!

From Wikipedia: "the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths"
Quantum computing can help!
N. Wiebe, A. Kapoor, K. M. Svore, 2015, `Quantum Deep Learning'
Quantum sampling for training a Boltzmann machine
M. H. Amin, 2015, `Searching for quantum speed-up in quasistatic quantum annealers’
Energy:  E(v, h) = -\sum_i b_i z_i - \sum_{i,j} w_{i,j} z_i z_j

Quantum system:  H = -\sum_i b_i \sigma_i^z - \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z

D-Wave's annealer:

H = A(t) \sum_i \sigma_i^x - B(t) \left[ \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z \right]

At the start of the anneal, A(t = 0) \gg B(t = 0) \approx 0; at the end, A(t) \approx 0 \ll B(t).

b_i and w_{i,j} are tunable parameters.
Quantum sampling for training a Boltzmann machine
S. H. Adachi, M. P. Henderson, 2015, `Application of quantum annealing to training deep neural network’
The goal is to compute:

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}

For restricted Boltzmann machines (with bipartite graph structure), the first term is easy to compute using:

P(h_j = 1 | v) = \mathrm{sigm}\left( b_j + \sum_i w_{i,j} v_i \right)

However, the second term is hard to compute!

Run quantum annealing N times and compute

\langle v_i h_j \rangle_{model} = \frac{1}{N} \sum_n v_i^n h_j^n
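The slides propose estimating the model expectation from annealer samples. The sketch below (not from the slides) mimics that estimator classically: it draws N samples from the exact Boltzmann distribution of a tiny model (standing in for annealer read-outs, which are assumed here to be Boltzmann-distributed) and averages v_i h_j. No actual D-Wave API is used; all parameters are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n_v, n_h = 3, 2
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    return float(-b @ z - z @ w @ z)

# Exact Boltzmann distribution over all configurations (stand-in for annealer output statistics)
configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
probs = np.array([np.exp(-energy(z)) for z in configs])
probs /= probs.sum()

# Draw N "read-outs" and form the estimate <v_i h_j>_model ≈ (1/N) sum_n v_i^n h_j^n
N = 1000
draws = rng.choice(len(configs), size=N, p=probs)
estimate = np.zeros((n_v, n_h))
for idx in draws:
    z = configs[idx]
    estimate += np.outer(z[:n_v], z[n_v:])
estimate /= N

# Exact value for comparison
exact = sum(p * np.outer(z[:n_v], z[n_v:]) for p, z in zip(probs, configs))
print("sampled estimate:\n", np.round(estimate, 3))
print("exact <v_i h_j>_model:\n", np.round(exact, 3))
```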
Combine machine learning with quantum dynamics calculations to solve problems generally considered infeasible
Complexity of many-body quantum systems

[Diagram: schematic map of many-body complexity in the plane of Disorder vs Interactions, marking Anderson localization, weak localization, classical chaos, many-body localization (???), and non-Markovian environments; only a corner of this plane is amenable to theoretical study.]
Machine Learning for extrapolation of quantum properties

Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)

[Diagram: Hamiltonian parameters spanning Phase I, Phase II and Phase III.]

Can Phase I be used to predict the properties of Phases II and III?
Gaussian Process = Gaussian Random Function

[Figure: random functions drawn from a Gaussian process, plotted against the input variable.]
Gaussian Process Regression

[Figures: training points of a dynamical observable versus the input variable; random functions drawn from the Gaussian process; and random functions constrained to pass through the training points.]
Gaussian Process Regression – how does it work?

Observable:  \Omega = \Omega(x)  with  x = (x_1, x_2, \cdots, x_q)^\top  the Hamiltonian parameters.

Step 1. Calculate the observable at a few points of this multi-dimensional space.
Gaussian Process Regression – how does it work?

Step 2. Estimate correlations between the calculated points.

Assume some mathematical form for the correlations, e.g.,

R(x, x') = \prod_{i=1}^{q} \left( 1 + \sqrt{5}\,|x_i - x'_i|\,\omega_i + \frac{5 (x_i - x'_i)^2 \omega_i^2}{3} \right) \exp\left( -\sqrt{5}\,|x_i - x'_i|\,\omega_i \right),

and find the maximum likelihood estimates of the unknown coefficients.
Gaussian Process Regression – how does it work?

The mean is modelled as

\mu(x) = \sum_{j=1}^{s} h_j(x) \beta_j = h(x)^\top \beta

with

R(x, x') = \prod_{i=1}^{q} \left( 1 + \sqrt{5}\,|x_i - x'_i|\,\omega_i + \frac{5 (x_i - x'_i)^2 \omega_i^2}{3} \right) \exp\left( -\sqrt{5}\,|x_i - x'_i|\,\omega_i \right).

We do this by maximizing the log-likelihood function:

\log L(\omega | Y^n) = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right]

where

A = \begin{pmatrix} 1 & R(x_1, x_2) & \cdots & R(x_1, x_n) \\ R(x_2, x_1) & 1 & & \vdots \\ \vdots & & \ddots & \\ R(x_n, x_1) & \cdots & & 1 \end{pmatrix}

\sigma^2(\omega) = \frac{1}{n} \left( Y^n - I\beta \right)^\top A^{-1} \left( Y^n - I\beta \right)

\beta(\omega) = \left( I^\top A^{-1} I \right)^{-1} I^\top A^{-1} Y^n
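To make Steps 1 and 2 concrete, here is a minimal, self-contained sketch (not from the slides) of Gaussian process regression with the Matérn-type correlation function above and a constant mean: it builds the correlation matrix on a handful of training points, evaluates β and σ² from the closed-form expressions, and predicts the mean at new points. The hyperparameters ω are fixed by hand here; in practice they would be found by maximizing the log-likelihood (e.g. with a numerical optimizer), and the toy target function is purely illustrative.

```python
import numpy as np

def matern_corr(x1, x2, omega):
    """R(x, x') = prod_i (1 + sqrt(5)|d_i|w_i + 5 d_i^2 w_i^2 / 3) exp(-sqrt(5)|d_i|w_i)."""
    d = np.abs(np.atleast_1d(x1) - np.atleast_1d(x2)) * omega
    return float(np.prod((1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)))

def gp_fit_predict(X, y, X_star, omega, nugget=1e-10):
    """Kriging with constant mean: beta = (1ᵀA⁻¹1)⁻¹ 1ᵀA⁻¹y, mu(x*) = beta + r*ᵀ A⁻¹ (y - 1·beta)."""
    n = len(X)
    A = np.array([[matern_corr(X[i], X[j], omega) for j in range(n)] for i in range(n)])
    A += nugget * np.eye(n)                       # numerical stabilization
    ones = np.ones(n)
    A_inv_y = np.linalg.solve(A, y)
    A_inv_1 = np.linalg.solve(A, ones)
    beta = (ones @ A_inv_y) / (ones @ A_inv_1)    # generalized least-squares constant mean
    resid = np.linalg.solve(A, y - beta * ones)
    sigma2 = (y - beta * ones) @ resid / n        # sigma^2(omega) from the slide
    preds = []
    for xs in X_star:
        r = np.array([matern_corr(xs, X[i], omega) for i in range(n)])
        preds.append(beta + r @ resid)
    return np.array(preds), beta, sigma2

# Illustrative 1D example: a few "calculated points" of a smooth observable
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.5])
y = np.sin(X) + 0.1 * X                           # toy observable
X_star = np.linspace(-5, 5, 9)
mu, beta, sigma2 = gp_fit_predict(X, y, X_star, omega=np.array([1.0]))
print("constant mean beta =", round(beta, 3), " sigma^2 =", round(sigma2, 4))
print("predictions:", np.round(mu, 3))
```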
Gaussian Process Regression – how does it work?

Step 2. Estimate correlations, using the kernel R(x, x') above.

• Any simple function works for interpolation

[Figure: random functions constrained to pass through the training points.]

• Extrapolation is harder!
• Necessary to find kernels capable of extrapolating physically!
Gaussian Process Regression – kernel optimization
How to find the best kernel?
... electron and phonon dispersions, and V_{e-ph} is the electron-phonon coupling. We choose V_{e-ph} to be a combination of two qualitatively different terms V_{e-ph} = \alpha H_1 + \beta H_2, where

H_1 = \sum_{k,q} \frac{2i}{\sqrt{N}} \left[ \sin(k+q) - \sin(k) \right] c^{\dagger}_{k+q} c_k \left( b^{\dagger}_{-q} + b_q \right)   (3)

describes the Su-Schrieffer-Heeger (SSH) [35] electron-phonon coupling, and

H_2 = \sum_{k,q} \frac{2i}{\sqrt{N}} \sin(q)\, c^{\dagger}_{k+q} c_k \left( b^{\dagger}_{-q} + b_q \right)   (4)

is the breathing-mode model [36]. The ground state band of the model (2) represents polarons known to exhibit two sharp transitions as the ratio \alpha/\beta increases from zero to large values [37]. At \alpha = 0, the model (2) describes breathing-mode polarons, which have no sharp transitions [38]. At \beta = 0, the model (2) describes SSH polarons, which exhibit one sharp transition in the polaron phase diagram [35]. At these transitions, the ground state momentum of the polaron changes abruptly, as shown in Figure 2 (left). Our goal is to develop a ML method that, using properties from one of the phases, can predict the existence or absence of other phases.
Method. We use Gaussian Process (GP) regression as the prediction method [41], described in detail in the Supplemental Material [42]. The goal of the prediction is to infer an unknown function f(·) given some inputs x_i and outputs y_i. The assumption is that y_i = f(x_i). The function f is generally multidimensional so x_i is a vector.

GPs do not infer a single function f(·), but rather a distribution over functions, p(f|X, y), where X is a vector of all known x_i and y is a vector of the corresponding values y_i. This distribution is assumed to be normal. The joint Gaussian distribution of random variables f(x_i) is characterized by a mean \mu(x) and a covariance matrix K(·, ·). The matrix elements of the covariance K_{i,j} are specified by a kernel function k(x_i, x_j) that quantifies the similarity relation between the properties of the system at two points x_i and x_j in the multi-dimensional space.

Prediction at x_* is done by computing the conditional distribution of f(x_*) given y and X. The mean of the conditional distribution is [41]

\mu(x_*) = \sum_i d(x_*, x_i) y_i = \sum_i \alpha_i k(x_*, x_i)   (5)

where \alpha = K^{-1} y and d = K(x_*, X)^\top K(X, X)^{-1}. The predicted mean \mu(x_*) can be viewed as a linear combination of the training data y_i or as a linear combination of the kernels connecting all training points x_i and the point x_*, where the prediction is made. In order to train a GP model, one must choose an analytical representation for the kernel function.

It is – in principle – possible to use Eq. (5) for both interpolation and extrapolation. However, the kernel function is found by analyzing a given set of data in order to produce accurate interpolation results. There is no guarantee that the same kernel function works for extrapolation outside the range of available data.
[FIG. 1. Schematic diagram of the kernel construction method employed to develop a Gaussian Process model with extrapolation power. At each iteration, the kernel with the highest Bayesian information criterion (11) is selected. The search proceeds from no kernel to the simple kernels RBF, MAT, RQ, LIN, and then to combinations such as RQ × LIN, RQ + MAT, RQ + RBF, RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT.]
This is exemplified by the fact that the choice of k(x_*, x_i) in Eq. (5) is not unique. For example, k can be approximated by any of the following functions:

k_{LIN}(x_i, x_j) = x_i^\top x_j   (6)

k_{RBF}(x_i, x_j) = \exp\left( -\frac{1}{2} r^2(x_i, x_j) \right)   (7)

k_{MAT}(x_i, x_j) = \left( 1 + \sqrt{5 r^2(x_i, x_j)} + \frac{5}{3} r^2(x_i, x_j) \right) \exp\left( -\sqrt{5 r^2(x_i, x_j)} \right)   (8)

k_{RQ}(x_i, x_j) = \left( 1 + \frac{|x_i - x_j|^2}{2 \alpha \ell^2} \right)^{-\alpha}   (9)

where r^2(x_i, x_j) = (x_i - x_j)^\top \times M \times (x_i - x_j) and M is a diagonal matrix with different length-scales \ell_d for each dimension of x_i. The length-scale parameters \ell_d, \ell and \alpha are the free parameters. We describe them collectively by \theta. A GP is trained by finding the estimate of \theta (denoted by \hat{\theta}) that maximizes the logarithm of the marginal likelihood function:

\log p(y | X, \theta, M_i) = -\frac{1}{2} y^\top K^{-1} y - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi   (10)

For solving an interpolation problem, it is sufficient to choose any simple kernel (6)-(9). The efficiency of the interpolation depends on the kernel, but in the limit of a large number of training points y_i, any simple kernel function produces accurate results [41]. Eq. (5) shows that extrapolation is clearly sensitive to the particular choice of the kernel function. A possible solution to this problem is to use more complex functions for the kernels.
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Gaussian Process Regression – kernel optimization
[Figure: greedy kernel search tree — No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]

BIC = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right] - \frac{1}{2} M \log n
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
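The greedy, BIC-guided kernel search in the diagram can be mimicked classically. The sketch below (not from the paper) scores candidate kernels by log marginal likelihood penalized by (1/2) M log n, where M here is simply the number of base kernels in the expression (a stand-in for the parameter count in the formula above), and grows the kernel by adding or multiplying one base kernel per iteration. Hyperparameters are fixed rather than re-optimized at each step, and all data are illustrative.

```python
import numpy as np

# Base kernels with fixed illustrative hyperparameters (the real method optimizes these
# by maximum likelihood at every step; keeping them fixed is a simplification).
def k_lin(xi, xj):  return float(np.dot(xi, xj))
def k_rbf(xi, xj):  return float(np.exp(-0.5 * np.sum((xi - xj) ** 2)))
def k_rq(xi, xj):   return float((1 + np.sum((xi - xj) ** 2) / 2.0) ** -1.0)
def k_mat(xi, xj):
    r = np.sqrt(5 * np.sum((xi - xj) ** 2))
    return float((1 + r + r ** 2 / 3) * np.exp(-r))

BASE = {"LIN": k_lin, "RBF": k_rbf, "RQ": k_rq, "MAT": k_mat}

def gram(kernel, X, jitter=1e-6):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return K + jitter * np.eye(len(X))

def log_marginal_likelihood(kernel, X, y):
    K = gram(kernel, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi))

def bic(kernel, X, y, n_terms):
    # BIC-style score: log-likelihood penalized by (1/2) * (number of kernel terms) * log n
    return log_marginal_likelihood(kernel, X, y) - 0.5 * n_terms * np.log(len(y))

def greedy_kernel_search(X, y, depth=3):
    best_name, best_kernel, best_score, n_terms = None, None, -np.inf, 0
    for _ in range(depth):
        candidates = {}
        for name, k in BASE.items():
            if best_kernel is None:
                candidates[name] = k
            else:
                kb = best_kernel
                candidates[f"({best_name}) + {name}"] = lambda a, b, k=k, kb=kb: kb(a, b) + k(a, b)
                candidates[f"({best_name}) x {name}"] = lambda a, b, k=k, kb=kb: kb(a, b) * k(a, b)
        scored = {name: bic(k, X, y, n_terms + 1) for name, k in candidates.items()}
        name, score = max(scored.items(), key=lambda item: item[1])
        if score <= best_score:
            break                        # no candidate improves the criterion; stop
        best_name, best_kernel, best_score, n_terms = name, candidates[name], score, n_terms + 1
    return best_name, best_score

# Illustrative data: smooth trend plus oscillation in 1D
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 0.5 * X[:, 0] + np.sin(2 * X[:, 0])
print(greedy_kernel_search(X, y))
```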
Extrapolation across phase transitions
Goal: extrapolate quantum properties across phase transitions
How? Can’t use properties that diverge at the transitions
Idea: find the quantity that varies smoothly across the transition, “learn it”, extrapolate it, and derive the discontinuous properties from it
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Extrapolation across phase transitions

Heisenberg model:

\hat{H} = -\frac{J}{2} \sum_{i,j} \vec{S}_i \cdot \vec{S}_j

Free energy density at temperature T:

f(T, m) \approx \frac{1}{2} \left( 1 - \frac{T_c}{T} \right) m^2 + \frac{1}{12} \left( \frac{T_c}{T} \right)^3 m^4

BIC = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right] - \frac{1}{2} M \log n
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
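A small illustrative sketch (not from the slides) of the idea: the free energy density f(T, m) above is smooth in T and m, yet the equilibrium magnetization obtained by minimizing it over m switches from zero above T_c to nonzero below T_c. Learning and extrapolating the smooth f, then minimizing it, recovers the non-analytic order parameter; T_c and the grids are arbitrary illustrative values.

```python
import numpy as np

def free_energy_density(T, m, Tc=1.0):
    """Landau-type free energy density f(T, m) from the slide (mean-field Heisenberg form)."""
    return 0.5 * (1 - Tc / T) * m**2 + (1.0 / 12.0) * (Tc / T) ** 3 * m**4

def equilibrium_magnetization(T, Tc=1.0):
    """Minimize f(T, m) over m on a grid; the minimizer is the equilibrium order parameter."""
    m_grid = np.linspace(0.0, 2.0, 2001)
    f_vals = free_energy_density(T, m_grid, Tc)
    return m_grid[np.argmin(f_vals)]

for T in [0.6, 0.8, 0.95, 1.0, 1.1, 1.3]:
    print(f"T = {T:4.2f}   m_eq = {equilibrium_magnetization(T):.3f}")
# Below Tc = 1 the minimizer is nonzero (ordered phase); above Tc it is zero,
# even though f(T, m) itself varies smoothly with T.
```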
Extrapolation across phase transitions

[Figures: Gaussian process extrapolation results across the polaron phase transitions, using kernels built up by the greedy search: No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]

Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Machine Learning for extrapolation
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
• Kernel complexity matters!
• It is difficult to build models with kernels that are very complex
• Can kernels be optimized using a quantum computing device?

[Figure: kernel search tree — No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]