Quantum machine learning: the what, why and how?
Roman Krems, University of British Columbia
Quantum machine learning
I. Machine learning implemented on quantum devices, e.g.
• Quantum neural networks
• Classical ML trained by optimization using quantum devices (such as D-Wave's annealer)

II. Machine learning for solving quantum problems
• Predicting the properties of complex quantum systems by ML
Goal of Quantum Machine Learning
QML seeks to exploit quantum devices in order to design:
• Algorithms that could be trained faster
• Algorithms that yield more accurate results
• Algorithms that process big data faster

Here, 'faster' and 'more accurate' are by comparison with classical ML algorithms.
Example 1: Quantum annealing
Quantum annealing
Say we have a Hamiltonian

H = \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z

with given b_i and w_{i,j}.

Question: what is the ground state of the system described by this Hamiltonian?
D-Wave's quantum annealer

H = A(t) \sum_i \sigma_i^x - B(t) \left[ \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z \right]

At the start of the anneal, A(t = 0) \gg B(t = 0) \approx 0; at the end, A(t) \approx 0 \ll B(t).

b_i and w_{i,j} are tunable parameters.
M. H. Amin, 2015, `Searching for quantum speed-up in quasistatic quantum annealers’
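To make the annealing target concrete, here is a minimal classical sketch (not part of the original slides) that brute-forces the ground state of a small instance of the Ising Hamiltonian above; the biases, couplings, and system size are arbitrary illustrative choices, and exhaustive enumeration is exactly the scaling the annealer is meant to avoid.

```python
import itertools
import numpy as np

# Small illustrative Ising problem: H = sum_i b_i s_i + sum_{i<j} w_ij s_i s_j, with s_i = +/-1.
rng = np.random.default_rng(0)
n = 8                                           # number of spins (small enough to enumerate)
b = rng.uniform(-1, 1, size=n)                  # biases b_i
w = np.triu(rng.uniform(-1, 1, (n, n)), k=1)    # couplings w_ij (upper triangle, i < j)

def energy(s: np.ndarray) -> float:
    """Classical Ising energy of a spin configuration s in {-1, +1}^n."""
    return float(b @ s + s @ w @ s)

# Exhaustive search over all 2^n configurations.
best_s, best_e = None, np.inf
for bits in itertools.product([-1, 1], repeat=n):
    s = np.array(bits)
    e = energy(s)
    if e < best_e:
        best_s, best_e = s, e

print("ground-state energy:", best_e)
print("ground-state configuration:", best_s)
```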
Example 2: Quantum Boltzmann machines
Boltzmann machine
A type of neural network
[FIG. 1: A graphical representation of two types of Boltzmann machines. (a) A 4-layer deep restricted Boltzmann machine (dRBM) where each black circle represents a hidden or visible unit and each edge represents a non-zero weight for the corresponding interaction. The output layer is often treated as a visible layer to provide a classification of the data input in the visible units at the bottom of the graph. (b) An example of a 5-unit full Boltzmann machine (BM). The hidden and visible units no longer occupy distinct layers due to connections between units of the same type.]
where N_train is the size of the training set, x_train is the set of training vectors, and \lambda is an L2-regularization term to combat overfitting. The derivative of O_ML with respect to the weights is

\frac{\partial O_{ML}}{\partial w_{i,j}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} - \lambda w_{i,j},   (4)

where the brackets denote the expectation values over the data and model for the BM. The remaining derivatives take a similar form [5].
Computing these gradients directly from (1) and (4) is exponentially hard in n_v and n_h; thus, classical approaches resort to approximations such as contrastive divergence [3, 5-8]. Unfortunately, contrastive divergence does not provide the gradient of any true objective function [9], it is known to lead to suboptimal solutions [10-12], it is not guaranteed to converge in the presence of certain regularization functions [9], and it cannot be used directly to train a full Boltzmann machine. We show that quantum computation provides a much better framework for deep learning and illustrate this by providing efficient alternatives to these methods that are elementary to analyze, accelerate the learning process and lead to better models for the training data.
GEQS Algorithm
We propose two quantum algorithms: Gradient Estimation via Quantum Sampling (GEQS) and Gradient Estimation via Quantum Amplitude Estimation (GEQAE). These algorithms prepare a coherent analog of the Gibbs state for Boltzmann machines and then draw samples from the resultant state to compute the expectation values in (4). Formal descriptions of the algorithms are given in the appendix. Existing algorithms for preparing these states [13-17] tend not to be efficient for machine learning applications or do not offer clear evidence of a quantum speedup. The inefficiency of [15, 17] is a consequence of the uniform initial state having small overlap with the Gibbs state. The complexities of these prior algorithms, along with our own, are given in Table I.

Our algorithms address this problem by using a non-uniform prior distribution for the probabilities of each configuration, which is motivated by the fact that we know a priori from the weights and the biases that certain configurations will be less likely than others. We obtain this distribution by using a mean-field (MF) approximation to the configuration probabilities. This approximation is classically efficient and typically provides a good approximation to the Gibbs states observed in practical machine learning problems [7, 18, 19]. Our algorithms exploit this prior knowledge to refine the Gibbs state from copies of the MF state. This allows the Gibbs distribution to be prepared efficiently and exactly if the two states are sufficiently close.

The MF approximation, Q(v, h), is defined to be the product distribution that minimizes the Kullback-Leibler divergence KL(Q||P). The fact that it is a product distribution means that it can be efficiently computed and also [...]
N. Wiebe, A. Kapoor, K. M. Svore, 2015, `Quantum Deep Learning'
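As context for the mean-field prior mentioned in the excerpt, the sketch below (not from the paper) runs a naive mean-field fixed-point iteration for a Boltzmann distribution over ±1 units with energy E(z) = -b·z - ½ zᵀWz; the resulting product distribution is the kind of classically cheap approximation the GEQS/GEQAE algorithms start from. The convention for W (symmetric, zero diagonal), the damping factor, and the random parameters are illustrative assumptions.

```python
import numpy as np

def mean_field(b: np.ndarray, W: np.ndarray, n_iter: int = 200, damping: float = 0.5):
    """Naive mean-field magnetizations m_i = <z_i> for P(z) ∝ exp(b·z + 0.5 zᵀWz), z_i = ±1.

    W is assumed symmetric with zero diagonal. Returns m and the product-distribution
    parameters q_i(z_i = +1) = (1 + m_i) / 2.
    """
    m = np.zeros_like(b)
    for _ in range(n_iter):
        m_new = np.tanh(b + W @ m)               # mean-field self-consistency condition
        m = damping * m_new + (1 - damping) * m  # damped update for stability
    q_plus = (1 + m) / 2
    return m, q_plus

# Tiny illustrative model
rng = np.random.default_rng(1)
n = 6
W = rng.normal(scale=0.3, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.5, size=n)

m, q_plus = mean_field(b, W)
print("mean-field magnetizations:", np.round(m, 3))
print("q_i(z_i = +1):", np.round(q_plus, 3))
```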
Boltzmann machine
v \Rightarrow [v_1, v_2, ..., v_V]   (visible units)
h \Rightarrow [h_1, h_2, ..., h_H]   (hidden units)
v_i = \pm 1,  h_i = \pm 1,  z = (v, h)

Energy:  E(v, h) = -\sum_i b_i z_i - \sum_{i,j} w_{i,j} z_i z_j

Probability to observe state v of the visible units:  P_v = \sum_h e^{-E(v,h)} / Z

Partition function:  Z = \sum_z e^{-E(v,h)}
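A minimal sketch (not in the slides) of these definitions for a tiny Boltzmann machine, computing E(v, h), Z, and P_v by explicit enumeration; the sizes and random parameters are illustrative, and brute-force enumeration is only feasible for a handful of units.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_v, n_h = 3, 2                                     # visible and hidden units, small enough to enumerate
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)                   # biases b_i
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)  # couplings w_ij, i < j

def energy(z: np.ndarray) -> float:
    """E(v, h) = -sum_i b_i z_i - sum_{i<j} w_ij z_i z_j with z = (v, h)."""
    return float(-b @ z - z @ w @ z)

# Partition function Z = sum over all joint configurations z = (v, h)
configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
Z = sum(np.exp(-energy(z)) for z in configs)

def p_visible(v: np.ndarray) -> float:
    """P_v = sum_h exp(-E(v, h)) / Z, summing over all hidden configurations."""
    total = 0.0
    for hidden in itertools.product([-1, 1], repeat=n_h):
        total += np.exp(-energy(np.concatenate([v, hidden])))
    return total / Z

v = np.array([1, -1, 1])
print("P_v for v =", v, "is", p_visible(v))
```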
How to train a Boltzmann machine

Let's say you have a certain number of state vectors (each with dimension equal to the number of visible units).

The goal is to find a Boltzmann distribution in which the training vectors have high probability.

Thus, we want to maximize the following function:

L \propto \sum_{v \in \text{training set}} \log P_v,   with   P_v = \sum_h e^{-E(v,h)} / Z

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}
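A hedged sketch (not from the slides) of this gradient for a toy model: the data term averages v_i h_j over training vectors with h distributed according to P(h|v), and the model term averages v_i h_j over the full Boltzmann distribution; both are evaluated here by brute force so the two terms in ∂L/∂w_ij can be compared directly. All sizes and data are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n_v, n_h = 3, 2
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    return float(-b @ z - z @ w @ z)

v_configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n_v)]
h_configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n_h)]
joint = [(v, h) for v in v_configs for h in h_configs]
weights = np.array([np.exp(-energy(np.concatenate([v, h]))) for v, h in joint])
P_joint = weights / weights.sum()                        # P(v, h)

# Model term: <v_i h_j> under the Boltzmann distribution
model_term = np.zeros((n_v, n_h))
for p, (v, h) in zip(P_joint, joint):
    model_term += p * np.outer(v, h)

# Data term: <v_i h_j> with v drawn from the training set and h ~ P(h | v)
training_set = [np.array([1, -1, 1]), np.array([1, 1, -1])]   # illustrative data
data_term = np.zeros((n_v, n_h))
for v in training_set:
    ph = np.array([np.exp(-energy(np.concatenate([v, h]))) for h in h_configs])
    ph /= ph.sum()                                       # conditional distribution P(h | v)
    data_term += sum(p * np.outer(v, h) for p, h in zip(ph, h_configs)) / len(training_set)

grad = data_term - model_term                            # ∝ dL/dw_ij (visible-hidden block)
print("gradient for visible-hidden weights:\n", np.round(grad, 3))
```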
How to train a Boltzmann machine

Another way to formulate the problem. Let's say you have a certain distribution of state vectors; let's denote it by P^+_v.

The goal is to find a Boltzmann machine where P_v = P^+_v.

Thus, we want to maximize the following function:

L \propto \sum_v P^+_v \log P_v,   with   P_v = \sum_h e^{-E(v,h)} / Z
How to use a Boltzmann machine for discrimination
G. Hinton, 2010, `A practical guide to training restricted Boltzmann machines'
Let’s say you have several classes of data. How to discriminate between them?
• Include a 'label' of the class as a visible unit
• Train the BM
• Try the test vector with all the different class labels
• The one with the lowest free energy is the most likely class
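A minimal sketch of this free-energy comparison (not from the slides). It follows the convention of Hinton's practical guide, with binary 0/1 units and the standard restricted-BM free energy F(v) = -a·v - Σ_j log(1 + exp(c_j + (Wᵀv)_j)), which differs slightly from the ±1 convention used elsewhere in this deck; the parameters here are untrained and purely illustrative.

```python
import numpy as np

def free_energy(v, a, c, W):
    """RBM free energy for binary 0/1 units: F(v) = -a·v - sum_j log(1 + exp(c_j + (W^T v)_j))."""
    return float(-a @ v - np.sum(np.logaddexp(0.0, c + W.T @ v)))

def classify(x, n_labels, a, c, W):
    """Append each candidate one-hot label to the feature vector and pick the lowest free energy."""
    energies = []
    for label in range(n_labels):
        one_hot = np.zeros(n_labels)
        one_hot[label] = 1.0
        v = np.concatenate([x, one_hot])     # label units are part of the visible layer
        energies.append(free_energy(v, a, c, W))
    return int(np.argmin(energies)), energies

# Illustrative (untrained) parameters: 4 feature units + 3 label units, 5 hidden units
rng = np.random.default_rng(4)
n_feat, n_labels, n_hid = 4, 3, 5
a = rng.normal(size=n_feat + n_labels)
c = rng.normal(size=n_hid)
W = rng.normal(scale=0.1, size=(n_feat + n_labels, n_hid))

label, energies = classify(np.array([1.0, 0.0, 1.0, 1.0]), n_labels, a, c, W)
print("predicted class:", label, "free energies:", np.round(energies, 3))
```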
The goal is to compute:

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}

For restricted Boltzmann machines (with bipartite graph structure), the first term is easy to compute using:

P(h_j = 1 | v) = \mathrm{sigm}\left( b_j + \sum_i w_{i,j} v_i \right)

However, the second term is hard to compute!

Run quantum annealing N times and compute

\langle v_i h_j \rangle_{model} = \frac{1}{N} \sum_n v_i^n h_j^n
Training a general Boltzmann machine is difficult!

From Wikipedia: "the time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths"
Quantum computing can help!
N. Wiebe, A. Kapoor, K. M. Svore, 2015, `Quantum Deep Learning'
Quantum sampling for training a Boltzmann machine
M. H. Amin, 2015, `Searching for quantum speed-up in quasistatic quantum annealers’
Energy:  E(v, h) = -\sum_i b_i z_i - \sum_{i,j} w_{i,j} z_i z_j

Quantum system:  H = -\sum_i b_i \sigma_i^z - \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z

D-Wave's annealer:

H = A(t) \sum_i \sigma_i^x - B(t) \left[ \sum_i b_i \sigma_i^z + \sum_{i,j} w_{i,j} \sigma_i^z \sigma_j^z \right]

At the start of the anneal, A(t = 0) \gg B(t = 0) \approx 0; at the end, A(t) \approx 0 \ll B(t).

b_i and w_{i,j} are tunable parameters.
Quantum sampling for training a Boltzmann machine
S. H. Adachi, M. P. Henderson, 2015, `Application of quantum annealing to training deep neural network’
The goal is to compute:

\frac{\partial L}{\partial w_{i,j}} \propto \langle v_i h_j \rangle_{training} - \langle v_i h_j \rangle_{model}

For restricted Boltzmann machines (with bipartite graph structure), the first term is easy to compute using:

P(h_j = 1 | v) = \mathrm{sigm}\left( b_j + \sum_i w_{i,j} v_i \right)

However, the second term is hard to compute!

Run quantum annealing N times and compute

\langle v_i h_j \rangle_{model} = \frac{1}{N} \sum_n v_i^n h_j^n
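The slides propose estimating the model expectation from annealer samples. The sketch below (not from the slides) mimics that estimator classically: it draws N samples from the exact Boltzmann distribution of a tiny model (standing in for annealer read-outs, which are assumed here to be Boltzmann-distributed) and averages v_i h_j. No actual D-Wave API is used; all parameters are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n_v, n_h = 3, 2
n = n_v + n_h
b = rng.normal(scale=0.5, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    return float(-b @ z - z @ w @ z)

# Exact Boltzmann distribution over all configurations (stand-in for annealer output statistics)
configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
probs = np.array([np.exp(-energy(z)) for z in configs])
probs /= probs.sum()

# Draw N "read-outs" and form the estimate <v_i h_j>_model ≈ (1/N) sum_n v_i^n h_j^n
N = 1000
draws = rng.choice(len(configs), size=N, p=probs)
estimate = np.zeros((n_v, n_h))
for idx in draws:
    z = configs[idx]
    estimate += np.outer(z[:n_v], z[n_v:])
estimate /= N

# Exact value for comparison
exact = sum(p * np.outer(z[:n_v], z[n_v:]) for p, z in zip(probs, configs))
print("sampled estimate:\n", np.round(estimate, 3))
print("exact <v_i h_j>_model:\n", np.round(exact, 3))
```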
Combine machine learning with quantum dynamics calculations to solve problems generally considered infeasible
Complexity of many-body quantum systems

[Diagram: schematic map of many-body complexity in the plane of Disorder vs Interactions, marking Anderson localization, weak localization, classical chaos, many-body localization (???), and non-Markovian environments; only a corner of this plane is amenable to theoretical study.]
Machine Learning for extrapolation of quantum properties

Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)

[Diagram: Hamiltonian parameters spanning Phase I, Phase II and Phase III.]

Can Phase I be used to predict the properties of Phases II and III?
Gaussian Process = Gaussian Random Function

[Figure: random functions drawn from a Gaussian process, plotted against the input variable.]
Gaussian Process Regression

[Figures: training points of a dynamical observable versus the input variable; random functions drawn from the Gaussian process; and random functions constrained to pass through the training points.]
Gaussian Process Regression – how does it work?

Observable:  \Omega = \Omega(x)  with  x = (x_1, x_2, \cdots, x_q)^\top  the Hamiltonian parameters.

Step 1. Calculate the observable at a few points of this multi-dimensional space.
Gaussian Process Regression – how does it work?

Step 2. Estimate correlations between the calculated points.

Assume some mathematical form for the correlations, e.g.,

R(x, x') = \prod_{i=1}^{q} \left( 1 + \sqrt{5}\,|x_i - x'_i|\,\omega_i + \frac{5 (x_i - x'_i)^2 \omega_i^2}{3} \right) \exp\left( -\sqrt{5}\,|x_i - x'_i|\,\omega_i \right),

and find the maximum likelihood estimates of the unknown coefficients.
Gaussian Process Regression – how does it work?

The mean is modelled as

\mu(x) = \sum_{j=1}^{s} h_j(x) \beta_j = h(x)^\top \beta

with

R(x, x') = \prod_{i=1}^{q} \left( 1 + \sqrt{5}\,|x_i - x'_i|\,\omega_i + \frac{5 (x_i - x'_i)^2 \omega_i^2}{3} \right) \exp\left( -\sqrt{5}\,|x_i - x'_i|\,\omega_i \right).

We do this by maximizing the log-likelihood function:

\log L(\omega | Y^n) = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right]

where

A = \begin{pmatrix} 1 & R(x_1, x_2) & \cdots & R(x_1, x_n) \\ R(x_2, x_1) & 1 & & \vdots \\ \vdots & & \ddots & \\ R(x_n, x_1) & \cdots & & 1 \end{pmatrix}

\sigma^2(\omega) = \frac{1}{n} \left( Y^n - I\beta \right)^\top A^{-1} \left( Y^n - I\beta \right)

\beta(\omega) = \left( I^\top A^{-1} I \right)^{-1} I^\top A^{-1} Y^n
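To make Steps 1 and 2 concrete, here is a minimal, self-contained sketch (not from the slides) of Gaussian process regression with the Matérn-type correlation function above and a constant mean: it builds the correlation matrix on a handful of training points, evaluates β and σ² from the closed-form expressions, and predicts the mean at new points. The hyperparameters ω are fixed by hand here; in practice they would be found by maximizing the log-likelihood (e.g. with a numerical optimizer), and the toy target function is purely illustrative.

```python
import numpy as np

def matern_corr(x1, x2, omega):
    """R(x, x') = prod_i (1 + sqrt(5)|d_i|w_i + 5 d_i^2 w_i^2 / 3) exp(-sqrt(5)|d_i|w_i)."""
    d = np.abs(np.atleast_1d(x1) - np.atleast_1d(x2)) * omega
    return float(np.prod((1 + np.sqrt(5) * d + 5 * d**2 / 3) * np.exp(-np.sqrt(5) * d)))

def gp_fit_predict(X, y, X_star, omega, nugget=1e-10):
    """Kriging with constant mean: beta = (1ᵀA⁻¹1)⁻¹ 1ᵀA⁻¹y, mu(x*) = beta + r*ᵀ A⁻¹ (y - 1·beta)."""
    n = len(X)
    A = np.array([[matern_corr(X[i], X[j], omega) for j in range(n)] for i in range(n)])
    A += nugget * np.eye(n)                       # numerical stabilization
    ones = np.ones(n)
    A_inv_y = np.linalg.solve(A, y)
    A_inv_1 = np.linalg.solve(A, ones)
    beta = (ones @ A_inv_y) / (ones @ A_inv_1)    # generalized least-squares constant mean
    resid = np.linalg.solve(A, y - beta * ones)
    sigma2 = (y - beta * ones) @ resid / n        # sigma^2(omega) from the slide
    preds = []
    for xs in X_star:
        r = np.array([matern_corr(xs, X[i], omega) for i in range(n)])
        preds.append(beta + r @ resid)
    return np.array(preds), beta, sigma2

# Illustrative 1D example: a few "calculated points" of a smooth observable
X = np.array([-4.0, -2.0, 0.0, 1.5, 3.5])
y = np.sin(X) + 0.1 * X                           # toy observable
X_star = np.linspace(-5, 5, 9)
mu, beta, sigma2 = gp_fit_predict(X, y, X_star, omega=np.array([1.0]))
print("constant mean beta =", round(beta, 3), " sigma^2 =", round(sigma2, 4))
print("predictions:", np.round(mu, 3))
```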
Gaussian Process Regression – how does it work?

Step 2. Estimate correlations, using the kernel R(x, x') above.

• Any simple function works for interpolation

[Figure: random functions constrained to pass through the training points.]

• Extrapolation is harder!
• Necessary to find kernels capable of extrapolating physically!
Gaussian Process Regression – kernel optimization
How to find the best kernel?
... electron and phonon dispersions, and V_{e-ph} is the electron-phonon coupling. We choose V_{e-ph} to be a combination of two qualitatively different terms V_{e-ph} = \alpha H_1 + \beta H_2, where

H_1 = \sum_{k,q} \frac{2i}{\sqrt{N}} \left[ \sin(k+q) - \sin(k) \right] c^{\dagger}_{k+q} c_k \left( b^{\dagger}_{-q} + b_q \right)   (3)

describes the Su-Schrieffer-Heeger (SSH) [35] electron-phonon coupling, and

H_2 = \sum_{k,q} \frac{2i}{\sqrt{N}} \sin(q)\, c^{\dagger}_{k+q} c_k \left( b^{\dagger}_{-q} + b_q \right)   (4)

is the breathing-mode model [36]. The ground state band of the model (2) represents polarons known to exhibit two sharp transitions as the ratio \alpha/\beta increases from zero to large values [37]. At \alpha = 0, the model (2) describes breathing-mode polarons, which have no sharp transitions [38]. At \beta = 0, the model (2) describes SSH polarons, which exhibit one sharp transition in the polaron phase diagram [35]. At these transitions, the ground state momentum of the polaron changes abruptly, as shown in Figure 2 (left). Our goal is to develop a ML method that, using properties from one of the phases, can predict the existence or absence of other phases.
Method. We use Gaussian Process (GP) regression as the prediction method [41], described in detail in the Supplemental Material [42]. The goal of the prediction is to infer an unknown function f(·) given some inputs x_i and outputs y_i. The assumption is that y_i = f(x_i). The function f is generally multidimensional so x_i is a vector.

GPs do not infer a single function f(·), but rather a distribution over functions, p(f|X, y), where X is a vector of all known x_i and y is a vector of the corresponding values y_i. This distribution is assumed to be normal. The joint Gaussian distribution of random variables f(x_i) is characterized by a mean \mu(x) and a covariance matrix K(·, ·). The matrix elements of the covariance K_{i,j} are specified by a kernel function k(x_i, x_j) that quantifies the similarity relation between the properties of the system at two points x_i and x_j in the multi-dimensional space.

Prediction at x_* is done by computing the conditional distribution of f(x_*) given y and X. The mean of the conditional distribution is [41]

\mu(x_*) = \sum_i d(x_*, x_i) y_i = \sum_i \alpha_i k(x_*, x_i)   (5)

where \alpha = K^{-1} y and d = K(x_*, X)^\top K(X, X)^{-1}. The predicted mean \mu(x_*) can be viewed as a linear combination of the training data y_i or as a linear combination of the kernels connecting all training points x_i and the point x_*, where the prediction is made. In order to train a GP model, one must choose an analytical representation for the kernel function.

It is – in principle – possible to use Eq. (5) for both interpolation and extrapolation. However, the kernel function is found by analyzing a given set of data in order to produce accurate interpolation results. There is no guarantee that the same kernel function works for extrapolation outside the range of available data.
[FIG. 1. Schematic diagram of the kernel construction method employed to develop a Gaussian Process model with extrapolation power. At each iteration, the kernel with the highest Bayesian information criterion (11) is selected. The search proceeds from no kernel to the simple kernels RBF, MAT, RQ, LIN, and then to combinations such as RQ × LIN, RQ + MAT, RQ + RBF, RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT.]
This is exemplified by the fact that the choice of k(x_*, x_i) in Eq. (5) is not unique. For example, k can be approximated by any of the following functions:

k_{LIN}(x_i, x_j) = x_i^\top x_j   (6)

k_{RBF}(x_i, x_j) = \exp\left( -\frac{1}{2} r^2(x_i, x_j) \right)   (7)

k_{MAT}(x_i, x_j) = \left( 1 + \sqrt{5 r^2(x_i, x_j)} + \frac{5}{3} r^2(x_i, x_j) \right) \exp\left( -\sqrt{5 r^2(x_i, x_j)} \right)   (8)

k_{RQ}(x_i, x_j) = \left( 1 + \frac{|x_i - x_j|^2}{2 \alpha \ell^2} \right)^{-\alpha}   (9)

where r^2(x_i, x_j) = (x_i - x_j)^\top \times M \times (x_i - x_j) and M is a diagonal matrix with different length-scales \ell_d for each dimension of x_i. The length-scale parameters \ell_d, \ell and \alpha are the free parameters. We describe them collectively by \theta. A GP is trained by finding the estimate of \theta (denoted by \hat{\theta}) that maximizes the logarithm of the marginal likelihood function:

\log p(y | X, \theta, M_i) = -\frac{1}{2} y^\top K^{-1} y - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi   (10)

For solving an interpolation problem, it is sufficient to choose any simple kernel (6)-(9). The efficiency of the interpolation depends on the kernel, but in the limit of a large number of training points y_i, any simple kernel function produces accurate results [41]. Eq. (5) shows that extrapolation is clearly sensitive to the particular choice of the kernel function. A possible solution to this problem is to use more complex functions for the kernels.
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Gaussian Process Regression – kernel optimization
[Figure: greedy kernel search tree — No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]

BIC = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right] - \frac{1}{2} M \log n
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
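The greedy, BIC-guided kernel search in the diagram can be mimicked classically. The sketch below (not from the paper) scores candidate kernels by log marginal likelihood penalized by (1/2) M log n, where M here is simply the number of base kernels in the expression (a stand-in for the parameter count in the formula above), and grows the kernel by adding or multiplying one base kernel per iteration. Hyperparameters are fixed rather than re-optimized at each step, and all data are illustrative.

```python
import numpy as np

# Base kernels with fixed illustrative hyperparameters (the real method optimizes these
# by maximum likelihood at every step; keeping them fixed is a simplification).
def k_lin(xi, xj):  return float(np.dot(xi, xj))
def k_rbf(xi, xj):  return float(np.exp(-0.5 * np.sum((xi - xj) ** 2)))
def k_rq(xi, xj):   return float((1 + np.sum((xi - xj) ** 2) / 2.0) ** -1.0)
def k_mat(xi, xj):
    r = np.sqrt(5 * np.sum((xi - xj) ** 2))
    return float((1 + r + r ** 2 / 3) * np.exp(-r))

BASE = {"LIN": k_lin, "RBF": k_rbf, "RQ": k_rq, "MAT": k_mat}

def gram(kernel, X, jitter=1e-6):
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return K + jitter * np.eye(len(X))

def log_marginal_likelihood(kernel, X, y):
    K = gram(kernel, X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return float(-0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi))

def bic(kernel, X, y, n_terms):
    # BIC-style score: log-likelihood penalized by (1/2) * (number of kernel terms) * log n
    return log_marginal_likelihood(kernel, X, y) - 0.5 * n_terms * np.log(len(y))

def greedy_kernel_search(X, y, depth=3):
    best_name, best_kernel, best_score, n_terms = None, None, -np.inf, 0
    for _ in range(depth):
        candidates = {}
        for name, k in BASE.items():
            if best_kernel is None:
                candidates[name] = k
            else:
                kb = best_kernel
                candidates[f"({best_name}) + {name}"] = lambda a, b, k=k, kb=kb: kb(a, b) + k(a, b)
                candidates[f"({best_name}) x {name}"] = lambda a, b, k=k, kb=kb: kb(a, b) * k(a, b)
        scored = {name: bic(k, X, y, n_terms + 1) for name, k in candidates.items()}
        name, score = max(scored.items(), key=lambda item: item[1])
        if score <= best_score:
            break                        # no candidate improves the criterion; stop
        best_name, best_kernel, best_score, n_terms = name, candidates[name], score, n_terms + 1
    return best_name, best_score

# Illustrative data: smooth trend plus oscillation in 1D
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 0.5 * X[:, 0] + np.sin(2 * X[:, 0])
print(greedy_kernel_search(X, y))
```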
Extrapolation across phase transitions
Goal: extrapolate quantum properties across phase transitions
How? Can’t use properties that diverge at the transitions
Idea: find the quantity that varies smoothly across the transition, “learn it”, extrapolate it, and derive the discontinuous properties from it
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Extrapolation across phase transitions

Heisenberg model:

\hat{H} = -\frac{J}{2} \sum_{i,j} \vec{S}_i \cdot \vec{S}_j

Free energy density at temperature T:

f(T, m) \approx \frac{1}{2} \left( 1 - \frac{T_c}{T} \right) m^2 + \frac{1}{12} \left( \frac{T_c}{T} \right)^3 m^4

BIC = -\frac{1}{2} \left[ n \log \sigma^2 + \log(\det(A)) + n \right] - \frac{1}{2} M \log n
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
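A small illustrative sketch (not from the slides) of the idea: the free energy density f(T, m) above is smooth in T and m, yet the equilibrium magnetization obtained by minimizing it over m switches from zero above T_c to nonzero below T_c. Learning and extrapolating the smooth f, then minimizing it, recovers the non-analytic order parameter; T_c and the grids are arbitrary illustrative values.

```python
import numpy as np

def free_energy_density(T, m, Tc=1.0):
    """Landau-type free energy density f(T, m) from the slide (mean-field Heisenberg form)."""
    return 0.5 * (1 - Tc / T) * m**2 + (1.0 / 12.0) * (Tc / T) ** 3 * m**4

def equilibrium_magnetization(T, Tc=1.0):
    """Minimize f(T, m) over m on a grid; the minimizer is the equilibrium order parameter."""
    m_grid = np.linspace(0.0, 2.0, 2001)
    f_vals = free_energy_density(T, m_grid, Tc)
    return m_grid[np.argmin(f_vals)]

for T in [0.6, 0.8, 0.95, 1.0, 1.1, 1.3]:
    print(f"T = {T:4.2f}   m_eq = {equilibrium_magnetization(T):.3f}")
# Below Tc = 1 the minimizer is nonzero (ordered phase); above Tc it is zero,
# even though f(T, m) itself varies smoothly with T.
```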
Extrapolation across phase transitions

[Figures: Gaussian process extrapolation results across the polaron phase transitions, using kernels built up by the greedy search: No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]

Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
Machine Learning for extrapolation
Rodrigo Vargas, John Sous, Mona Berciu and RK, Phys. Rev. Lett. (2018)
• Kernel complexity matters!
• It is difficult to build models with kernels that are very complex
• Can kernels be optimized using a quantum computing device?

[Figure: kernel search tree — No kernel → {RBF, MAT, RQ, LIN} → RQ × LIN, RQ + MAT, RQ + RBF, ... → RQ × LIN + RBF, RQ × LIN × RBF, RQ × LIN + MAT, ...]