Matrix Differential Calculus with Applications in Statistics and Econometrics



Matrix Differential Calculus

with Applications in Statistics

and Econometrics


WILEY SERIES IN PROBABILITY AND STATISTICS

Established by Walter A. Shewhart and Samuel S. Wilks

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Geof H. Givens, Harvey Goldstein, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, and Ruey S. Tsay

Editors Emeriti: J. Stuart Hunter, Iain M. Johnstone, Joseph B. Kadane, and Jozef L. Teugels

The Wiley Series in Probability and Statistics is well established and authoritative. It covers many topics of current research interest in both pure and applied statistics and probability theory. Written by leading statisticians and institutions, the titles span both state-of-the-art developments in the field and classical methods.

Reflecting the wide range of current research in statistics, the series encompasses applied, methodological, and theoretical statistics, ranging from applications and new techniques made possible by advances in computerized practice to rigorous treatment of theoretical approaches. This series provides essential and invaluable reading for all statisticians, whether in academia, industry, government, or research.

A complete list of the titles in this series can be found at http://www.wiley.com/go/wsps


Matrix Differential Calculus

with Applications in Statistics

and Econometrics

Third Edition

Jan R. Magnus
Department of Econometrics and Operations Research,
Vrije Universiteit Amsterdam, The Netherlands

and

Heinz Neudecker†
Amsterdam School of Economics,
University of Amsterdam, The Netherlands


This third edition first published 2019

© 2019 John Wiley & Sons

Edition History

John Wiley & Sons (1e, 1988) and John Wiley & Sons (2e, 1999)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Jan R. Magnus and Heinz Neudecker to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

ISBN: 9781119541202

Cover design by Wiley

Cover image: © phochi/Shutterstock

Typeset by the author in LaTeX

10 9 8 7 6 5 4 3 2 1


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

Part One — Matrices

1 Basic properties of vectors and matrices 3
    1 Introduction . . . 3
    2 Sets . . . 3
    3 Matrices: addition and multiplication . . . 4
    4 The transpose of a matrix . . . 6
    5 Square matrices . . . 6
    6 Linear forms and quadratic forms . . . 7
    7 The rank of a matrix . . . 9
    8 The inverse . . . 10
    9 The determinant . . . 10
    10 The trace . . . 11
    11 Partitioned matrices . . . 12
    12 Complex matrices . . . 14
    13 Eigenvalues and eigenvectors . . . 14
    14 Schur's decomposition theorem . . . 17
    15 The Jordan decomposition . . . 18
    16 The singular-value decomposition . . . 20
    17 Further results concerning eigenvalues . . . 20
    18 Positive (semi)definite matrices . . . 23
    19 Three further results for positive definite matrices . . . 25
    20 A useful result . . . 26
    21 Symmetric matrix functions . . . 27
    Miscellaneous exercises . . . 28
    Bibliographical notes . . . 30

2 Kronecker products, vec operator, and Moore-Penrose inverse 31
    1 Introduction . . . 31
    2 The Kronecker product . . . 31
    3 Eigenvalues of a Kronecker product . . . 33
    4 The vec operator . . . 34
    5 The Moore-Penrose (MP) inverse . . . 36
    6 Existence and uniqueness of the MP inverse . . . 37
    7 Some properties of the MP inverse . . . 38
    8 Further properties . . . 39
    9 The solution of linear equation systems . . . 41
    Miscellaneous exercises . . . 43
    Bibliographical notes . . . 45

3 Miscellaneous matrix results 47
    1 Introduction . . . 47
    2 The adjoint matrix . . . 47
    3 Proof of Theorem 3.1 . . . 49
    4 Bordered determinants . . . 51
    5 The matrix equation AX = 0 . . . 51
    6 The Hadamard product . . . 52
    7 The commutation matrix Kmn . . . 54
    8 The duplication matrix Dn . . . 56
    9 Relationship between Dn+1 and Dn, I . . . 58
    10 Relationship between Dn+1 and Dn, II . . . 59
    11 Conditions for a quadratic form to be positive (negative) subject to linear constraints . . . 60
    12 Necessary and sufficient conditions for r(A : B) = r(A) + r(B) . . . 63
    13 The bordered Gramian matrix . . . 65
    14 The equations X1A + X2B′ = G1, X1B = G2 . . . 67
    Miscellaneous exercises . . . 69
    Bibliographical notes . . . 70

Part Two — Differentials: the theory

4 Mathematical preliminaries 73
    1 Introduction . . . 73
    2 Interior points and accumulation points . . . 73
    3 Open and closed sets . . . 75
    4 The Bolzano-Weierstrass theorem . . . 77
    5 Functions . . . 78
    6 The limit of a function . . . 79
    7 Continuous functions and compactness . . . 80
    8 Convex sets . . . 81
    9 Convex and concave functions . . . 83
    Bibliographical notes . . . 86

5 Differentials and differentiability 87
    1 Introduction . . . 87
    2 Continuity . . . 88
    3 Differentiability and linear approximation . . . 90
    4 The differential of a vector function . . . 91
    5 Uniqueness of the differential . . . 93
    6 Continuity of differentiable functions . . . 94
    7 Partial derivatives . . . 95
    8 The first identification theorem . . . 96
    9 Existence of the differential, I . . . 97
    10 Existence of the differential, II . . . 99
    11 Continuous differentiability . . . 100
    12 The chain rule . . . 100
    13 Cauchy invariance . . . 102
    14 The mean-value theorem for real-valued functions . . . 103
    15 Differentiable matrix functions . . . 104
    16 Some remarks on notation . . . 106
    17 Complex differentiation . . . 108
    Miscellaneous exercises . . . 110
    Bibliographical notes . . . 110

6 The second differential 111
    1 Introduction . . . 111
    2 Second-order partial derivatives . . . 111
    3 The Hessian matrix . . . 112
    4 Twice differentiability and second-order approximation, I . . . 113
    5 Definition of twice differentiability . . . 114
    6 The second differential . . . 115
    7 Symmetry of the Hessian matrix . . . 117
    8 The second identification theorem . . . 119
    9 Twice differentiability and second-order approximation, II . . . 119
    10 Chain rule for Hessian matrices . . . 121
    11 The analog for second differentials . . . 123
    12 Taylor's theorem for real-valued functions . . . 124
    13 Higher-order differentials . . . 125
    14 Real analytic functions . . . 125
    15 Twice differentiable matrix functions . . . 126
    Bibliographical notes . . . 127

7 Static optimization 129
    1 Introduction . . . 129
    2 Unconstrained optimization . . . 130
    3 The existence of absolute extrema . . . 131
    4 Necessary conditions for a local minimum . . . 132
    5 Sufficient conditions for a local minimum: first-derivative test . . . 134
    6 Sufficient conditions for a local minimum: second-derivative test . . . 136
    7 Characterization of differentiable convex functions . . . 138
    8 Characterization of twice differentiable convex functions . . . 141
    9 Sufficient conditions for an absolute minimum . . . 142
    10 Monotonic transformations . . . 143
    11 Optimization subject to constraints . . . 144
    12 Necessary conditions for a local minimum under constraints . . . 145
    13 Sufficient conditions for a local minimum under constraints . . . 149
    14 Sufficient conditions for an absolute minimum under constraints . . . 154
    15 A note on constraints in matrix form . . . 155
    16 Economic interpretation of Lagrange multipliers . . . 155
    Appendix: the implicit function theorem . . . 157
    Bibliographical notes . . . 159

Part Three — Differentials: the practice

8 Some important differentials 163
    1 Introduction . . . 163
    2 Fundamental rules of differential calculus . . . 163
    3 The differential of a determinant . . . 165
    4 The differential of an inverse . . . 168
    5 Differential of the Moore-Penrose inverse . . . 169
    6 The differential of the adjoint matrix . . . 172
    7 On differentiating eigenvalues and eigenvectors . . . 174
    8 The continuity of eigenprojections . . . 176
    9 The differential of eigenvalues and eigenvectors: symmetric case . . . 180
    10 Two alternative expressions for dλ . . . 183
    11 Second differential of the eigenvalue function . . . 185
    Miscellaneous exercises . . . 186
    Bibliographical notes . . . 189

9 First-order differentials and Jacobian matrices 191
    1 Introduction . . . 191
    2 Classification . . . 192
    3 Derisatives . . . 192
    4 Derivatives . . . 194
    5 Identification of Jacobian matrices . . . 196
    6 The first identification table . . . 197
    7 Partitioning of the derivative . . . 197
    8 Scalar functions of a scalar . . . 198
    9 Scalar functions of a vector . . . 198
    10 Scalar functions of a matrix, I: trace . . . 199
    11 Scalar functions of a matrix, II: determinant . . . 201
    12 Scalar functions of a matrix, III: eigenvalue . . . 202
    13 Two examples of vector functions . . . 203
    14 Matrix functions . . . 204
    15 Kronecker products . . . 206
    16 Some other problems . . . 208
    17 Jacobians of transformations . . . 209
    Bibliographical notes . . . 210

10 Second-order differentials and Hessian matrices 211
    1 Introduction . . . 211
    2 The second identification table . . . 211
    3 Linear and quadratic forms . . . 212
    4 A useful theorem . . . 213
    5 The determinant function . . . 214
    6 The eigenvalue function . . . 215
    7 Other examples . . . 215
    8 Composite functions . . . 217
    9 The eigenvector function . . . 218
    10 Hessian of matrix functions, I . . . 219
    11 Hessian of matrix functions, II . . . 219
    Miscellaneous exercises . . . 220

Part Four — Inequalities

11 Inequalities 225
    1 Introduction . . . 225
    2 The Cauchy-Schwarz inequality . . . 226
    3 Matrix analogs of the Cauchy-Schwarz inequality . . . 227
    4 The theorem of the arithmetic and geometric means . . . 228
    5 The Rayleigh quotient . . . 230
    6 Concavity of λ1 and convexity of λn . . . 232
    7 Variational description of eigenvalues . . . 232
    8 Fischer's min-max theorem . . . 234
    9 Monotonicity of the eigenvalues . . . 236
    10 The Poincaré separation theorem . . . 236
    11 Two corollaries of Poincaré's theorem . . . 237
    12 Further consequences of the Poincaré theorem . . . 238
    13 Multiplicative version . . . 239
    14 The maximum of a bilinear form . . . 241
    15 Hadamard's inequality . . . 242
    16 An interlude: Karamata's inequality . . . 242
    17 Karamata's inequality and eigenvalues . . . 244
    18 An inequality concerning positive semidefinite matrices . . . 245
    19 A representation theorem for (Σ a_i^p)^(1/p) . . . 246
    20 A representation theorem for (tr A^p)^(1/p) . . . 247
    21 Hölder's inequality . . . 248
    22 Concavity of log|A| . . . 250
    23 Minkowski's inequality . . . 251
    24 Quasilinear representation of |A|^(1/n) . . . 253
    25 Minkowski's determinant theorem . . . 255
    26 Weighted means of order p . . . 256
    27 Schlömilch's inequality . . . 258
    28 Curvature properties of Mp(x, a) . . . 259
    29 Least squares . . . 260
    30 Generalized least squares . . . 261
    31 Restricted least squares . . . 262
    32 Restricted least squares: matrix version . . . 264
    Miscellaneous exercises . . . 265
    Bibliographical notes . . . 269


Part Five — The linear model

12 Statistical preliminaries 273
    1 Introduction . . . 273
    2 The cumulative distribution function . . . 273
    3 The joint density function . . . 274
    4 Expectations . . . 274
    5 Variance and covariance . . . 275
    6 Independence of two random variables . . . 277
    7 Independence of n random variables . . . 279
    8 Sampling . . . 279
    9 The one-dimensional normal distribution . . . 279
    10 The multivariate normal distribution . . . 280
    11 Estimation . . . 282
    Miscellaneous exercises . . . 282
    Bibliographical notes . . . 283

13 The linear regression model 285
    1 Introduction . . . 285
    2 Affine minimum-trace unbiased estimation . . . 286
    3 The Gauss-Markov theorem . . . 287
    4 The method of least squares . . . 290
    5 Aitken's theorem . . . 291
    6 Multicollinearity . . . 293
    7 Estimable functions . . . 295
    8 Linear constraints: the case M(R′) ⊂ M(X′) . . . 296
    9 Linear constraints: the general case . . . 300
    10 Linear constraints: the case M(R′) ∩ M(X′) = {0} . . . 302
    11 A singular variance matrix: the case M(X) ⊂ M(V) . . . 304
    12 A singular variance matrix: the case r(X′V⁺X) = r(X) . . . 305
    13 A singular variance matrix: the general case, I . . . 307
    14 Explicit and implicit linear constraints . . . 307
    15 The general linear model, I . . . 310
    16 A singular variance matrix: the general case, II . . . 311
    17 The general linear model, II . . . 314
    18 Generalized least squares . . . 315
    19 Restricted least squares . . . 316
    Miscellaneous exercises . . . 318
    Bibliographical notes . . . 319

14 Further topics in the linear model 321
    1 Introduction . . . 321
    2 Best quadratic unbiased estimation of σ² . . . 322
    3 The best quadratic and positive unbiased estimator of σ² . . . 322
    4 The best quadratic unbiased estimator of σ² . . . 324
    5 Best quadratic invariant estimation of σ² . . . 326
    6 The best quadratic and positive invariant estimator of σ² . . . 327
    7 The best quadratic invariant estimator of σ² . . . 329
    8 Best quadratic unbiased estimation: multivariate normal case . . . 330
    9 Bounds for the bias of the least-squares estimator of σ², I . . . 332
    10 Bounds for the bias of the least-squares estimator of σ², II . . . 333
    11 The prediction of disturbances . . . 335
    12 Best linear unbiased predictors with scalar variance matrix . . . 336
    13 Best linear unbiased predictors with fixed variance matrix, I . . . 338
    14 Best linear unbiased predictors with fixed variance matrix, II . . . 340
    15 Local sensitivity of the posterior mean . . . 341
    16 Local sensitivity of the posterior precision . . . 342
    Bibliographical notes . . . 344

Part Six — Applications to maximum likelihood estimation

15 Maximum likelihood estimation 347
    1 Introduction . . . 347
    2 The method of maximum likelihood (ML) . . . 347
    3 ML estimation of the multivariate normal distribution . . . 348
    4 Symmetry: implicit versus explicit treatment . . . 350
    5 The treatment of positive definiteness . . . 351
    6 The information matrix . . . 352
    7 ML estimation of the multivariate normal distribution: distinct means . . . 354
    8 The multivariate linear regression model . . . 354
    9 The errors-in-variables model . . . 357
    10 The nonlinear regression model with normal errors . . . 359
    11 Special case: functional independence of mean and variance parameters . . . 361
    12 Generalization of Theorem 15.6 . . . 362
    Miscellaneous exercises . . . 364
    Bibliographical notes . . . 365

16 Simultaneous equations 367
    1 Introduction . . . 367
    2 The simultaneous equations model . . . 367
    3 The identification problem . . . 369
    4 Identification with linear constraints on B and Γ only . . . 371
    5 Identification with linear constraints on B, Γ, and Σ . . . 371
    6 Nonlinear constraints . . . 373
    7 FIML: the information matrix (general case) . . . 374
    8 FIML: asymptotic variance matrix (special case) . . . 376
    9 LIML: first-order conditions . . . 378
    10 LIML: information matrix . . . 381
    11 LIML: asymptotic variance matrix . . . 383
    Bibliographical notes . . . 388


17 Topics in psychometrics 389
    1 Introduction . . . 389
    2 Population principal components . . . 390
    3 Optimality of principal components . . . 391
    4 A related result . . . 392
    5 Sample principal components . . . 393
    6 Optimality of sample principal components . . . 395
    7 One-mode component analysis . . . 395
    8 One-mode component analysis and sample principal components . . . 398
    9 Two-mode component analysis . . . 399
    10 Multimode component analysis . . . 400
    11 Factor analysis . . . 404
    12 A zigzag routine . . . 407
    13 A Newton-Raphson routine . . . 408
    14 Kaiser's varimax method . . . 412
    15 Canonical correlations and variates in the population . . . 414
    16 Correspondence analysis . . . 417
    17 Linear discriminant analysis . . . 418
    Bibliographical notes . . . 419

Part Seven — Summary

18 Matrix calculus: the essentials 423
    1 Introduction . . . 423
    2 Differentials . . . 424
    3 Vector calculus . . . 426
    4 Optimization . . . 429
    5 Least squares . . . 431
    6 Matrix calculus . . . 432
    7 Interlude on linear and quadratic forms . . . 434
    8 The second differential . . . 434
    9 Chain rule for second differentials . . . 436
    10 Four examples . . . 438
    11 The Kronecker product and vec operator . . . 439
    12 Identification . . . 441
    13 The commutation matrix . . . 442
    14 From second differential to Hessian . . . 443
    15 Symmetry and the duplication matrix . . . 444
    16 Maximum likelihood . . . 445
    Further reading . . . 448

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 449

Index of symbols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

Subject index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471


Preface

Preface to the first edition

There has been a long-felt need for a book that gives a self-contained and unified treatment of matrix differential calculus, specifically written for econometricians and statisticians. The present book is meant to satisfy this need. It can serve as a textbook for advanced undergraduates and postgraduates in econometrics and as a reference book for practicing econometricians. Mathematical statisticians and psychometricians may also find something to their liking in the book.

When used as a textbook, it can provide a full-semester course. Reasonable proficiency in basic matrix theory is assumed, especially with the use of partitioned matrices. The basics of matrix algebra, as deemed necessary for a proper understanding of the main subject of the book, are summarized in Part One, the first of the book's six parts. The book also contains the essentials of multivariable calculus, but geared to and often phrased in terms of differentials.

The sequence in which the chapters are read is not of great consequence. It is fully conceivable that practitioners start with Part Three (Differentials: the practice) and, depending on their predilections, carry on to Parts Five or Six, which deal with applications. Those who want a full understanding of the underlying theory should read the whole book, although even then they could go through the necessary matrix algebra only when the specific need arises.

Matrix differential calculus as presented in this book is based on differentials, and this sets the book apart from other books in this area. The approach via differentials is, in our opinion, superior to any other existing approach. Our principal idea is that differentials are more congenial than derivatives to multivariable functions as they crop up in econometrics, mathematical statistics, or psychometrics, although from a theoretical point of view the two concepts are equivalent.

The book falls into six parts. Part One deals with matrix algebra. It lists, and also often proves, items like the Schur, Jordan, and singular-value decompositions; concepts like the Hadamard and Kronecker products; the vec operator; the commutation and duplication matrices; and the Moore-Penrose inverse. Results on bordered matrices (and their determinants) and (linearly restricted) quadratic forms are also presented here.

Part Two, which forms the theoretical heart of the book, is entirely devoted to a thorough treatment of the theory of differentials, and presents the essentials of calculus but geared to and phrased in terms of differentials. First and second differentials are defined, ‘identification’ rules for Jacobian and Hessian matrices are given, and chain rules derived. A separate chapter on the theory of (constrained) optimization in terms of differentials concludes this part.

Part Three is the practical core of the book. It contains the rules for working with differentials, lists the differentials of important scalar, vector, and matrix functions (inter alia eigenvalues, eigenvectors, and the Moore-Penrose inverse) and supplies ‘identification’ tables for Jacobian and Hessian matrices.

Part Four, treating inequalities, owes its existence to our feeling that econometricians should be conversant with inequalities, such as the Cauchy-Schwarz and Minkowski inequalities (and extensions thereof), and that they should also master a powerful result like Poincaré's separation theorem. This part is to some extent also the case history of a disappointment. When we started writing this book we had the ambition to derive all inequalities by means of matrix differential calculus. After all, every inequality can be rephrased as the solution of an optimization problem. This proved to be an illusion, due to the fact that the Hessian matrix in most cases is singular at the optimum point.

Part Five is entirely devoted to applications of matrix differential calculus to the linear regression model. There is an exhaustive treatment of estimation problems related to the fixed part of the model under various assumptions concerning ranks and (other) constraints. Moreover, it contains topics relating to the stochastic part of the model, viz. estimation of the error variance and prediction of the error term. There is also a small section on sensitivity analysis. An introductory chapter deals with the necessary statistical preliminaries.

Part Six deals with maximum likelihood estimation, which is of course an ideal source for demonstrating the power of the propagated techniques. In the first of three chapters, several models are analysed, inter alia the multivariate normal distribution, the errors-in-variables model, and the nonlinear regression model. There is a discussion on how to deal with symmetry and positive definiteness, and special attention is given to the information matrix. The second chapter in this part deals with simultaneous equations under normality conditions. It investigates both identification and estimation problems, subject to various (non)linear constraints on the parameters. This part also discusses full-information maximum likelihood (FIML) and limited-information maximum likelihood (LIML), with special attention to the derivation of asymptotic variance matrices. The final chapter addresses itself to various psychometric problems, inter alia principal components, multimode component analysis, factor analysis, and canonical correlation.

All chapters contain many exercises. These are frequently meant to be complementary to the main text.

Page 15: Matrix Di erential Calculus with Applications in Statistics and … · 2018. 11. 16. · Matrix Di erential Calculus with Applications in Statistics and Econometrics Third Edition

Preface xv

A large number of books and papers have been published on the theory and applications of matrix differential calculus. Without attempting to describe their relative virtues and particularities, the interested reader may wish to consult Dwyer and Macphail (1948), Bodewig (1959), Wilkinson (1965), Dwyer (1967), Neudecker (1967, 1969), Tracy and Dwyer (1969), Tracy and Singh (1972), McDonald and Swaminathan (1973), MacRae (1974), Balestra (1976), Bentler and Lee (1978), Henderson and Searle (1979), Wong and Wong (1979, 1980), Nel (1980), Rogers (1980), Wong (1980, 1985), Graham (1981), McCulloch (1982), Schönemann (1985), Magnus and Neudecker (1985), Pollock (1985), Don (1986), and Kollo (1991). The papers by Henderson and Searle (1979) and Nel (1980), and Rogers’ (1980) book contain extensive bibliographies.

The two authors share the responsibility for Parts One, Three, Five, and Six, although any new results in Part One are due to Magnus. Parts Two and Four are due to Magnus, although Neudecker contributed some results to Part Four. Magnus is also responsible for the writing and organization of the final text.

We wish to thank our colleagues F. J. H. Don, R. D. H. Heijmans, D. S. G. Pollock, and R. Ramer for their critical remarks and contributions. The greatest obligation is owed to Sue Kirkbride at the London School of Economics who patiently and cheerfully typed and retyped the various versions of the book. Partial financial support was provided by the Netherlands Organization for the Advancement of Pure Research (Z. W. O.) and the Suntory Toyota International Centre for Economics and Related Disciplines at the London School of Economics.

London/Amsterdam                                              Jan R. Magnus
April 1987                                                    Heinz Neudecker

Preface to the first revised printing

Since this book first appeared — now almost four years ago — many of our colleagues, students, and other readers have pointed out typographical errors and have made suggestions for improving the text. We are particularly grateful to R. D. H. Heijmans, J. F. Kiviet, I. J. Steyn, and G. Trenkler. We owe the greatest debt to F. Gerrish, formerly of the School of Mathematics in the Polytechnic, Kingston-upon-Thames, who read Chapters 1–11 with awesome precision and care and made numerous insightful suggestions and constructive remarks. We hope that this printing will continue to trigger comments from our readers.

London/Tilburg/Amsterdam                                      Jan R. Magnus
February 1991                                                 Heinz Neudecker


Preface to the second edition

A further seven years have passed since our first revision in 1991. We are happy to see that our book is still being used by colleagues and students. In this revision we attempted to reach three goals. First, we made a serious attempt to keep the book up-to-date by adding many recent references and new exercises. Second, we made numerous small changes throughout the text, improving the clarity of exposition. Finally, we corrected a number of typographical and other errors.

The structure of the book and its philosophy are unchanged. Apart from a large number of small changes, there are two major changes. First, we interchanged Sections 12 and 13 of Chapter 1, since complex numbers need to be discussed before eigenvalues and eigenvectors, and we corrected an error in Theorem 1.7. Second, in Chapter 17 on psychometrics, we rewrote Sections 8–10 relating to the Eckart-Young theorem.

We are grateful to Karim Abadir, Paul Bekker, Hamparsum Bozdogan, Michael Browne, Frank Gerrish, Kaddour Hadri, Tõnu Kollo, Shuangzhe Liu, Daan Nel, Albert Satorra, Kazuo Shigemasu, Jos ten Berge, Peter ter Berg, Götz Trenkler, Haruo Yanai, and many others for their thoughtful and constructive comments. Of course, we welcome further comments from our readers.

Tilburg/Amsterdam                                             Jan R. Magnus
March 1998                                                    Heinz Neudecker

Preface to the third edition

Twenty years have passed since the appearance of the second edition and thirty years since the book first appeared. This is a long time, but the book still lives. Unfortunately, my coauthor Heinz Neudecker does not; he died in December 2017. Heinz was my teacher at the University of Amsterdam and I was fortunate to learn the subject of matrix calculus through differentials (then in its infancy) from his lectures and personal guidance. This technique is still a remarkably powerful tool, and Heinz Neudecker must be regarded as its founding father.

The original text of the book was written on a typewriter and then handed over to the publisher for typesetting and printing. When it came to the second edition, the typeset material could no longer be found, which is why the second edition had to be produced in an ad hoc manner which was not satisfactory. Many people complained about this, to me and to the publisher, and the publisher offered us to produce a new edition, freshly typeset, which would look good. In the mean time, my Russian colleagues had proposed to translate the book into Russian, and I realized that this would only be feasible if they had a good English LaTeX text. So, my secretary Josette Janssen at Tilburg University and I produced a LaTeX text with expert advice from Jozef Pijnenburg. In the process of retyping the manuscript, many small changes were made to improve the readability and consistency of the text, but the structure of the book was not changed. The English LaTeX version was then used as the basis for the Russian edition,

Matrichnoe Differenzial’noe Ischislenie s Prilozhenijami k Statistike i Ekonometrike,

translated by my friends Anatoly Peresetsky and Pavel Katyshev, and published by Fizmatlit Publishing House, Moscow, 2002. The current third edition is based on this English LaTeX version, although I have taken the opportunity to make many improvements to the presentation of the material.

Of course, this was not the only reason for producing a third edition. It was time to take a fresh look at the material and to update the references. I felt it was appropriate to stay close to the original text, because this is the book that Heinz and I conceived and the current text is a new edition, not a new book. The main changes relative to the second edition are as follows:

• Some subjects were treated insufficiently (some of my friends would say ‘incorrectly’) and I have attempted to repair these omissions. This applies in particular to the discussion on matrix functions (Section 1.21), complex differentiation (Section 5.17), and Jacobians of transformations (Section 9.17).

• The text on differentiating eigenvalues and eigenvectors and associated continuity issues has been rewritten; see Sections 8.7–8.11.

• Chapter 10 has been completely rewritten, because I am now convinced that it is not useful to define Hessian matrices for vector or matrix functions. So I now define Hessian matrices only for scalar functions and for individual components of vector functions and individual elements of matrix functions. This makes life much easier.

• I have added two additional sections at the end of Chapter 17 on psychometrics, relating to correspondence analysis and linear discriminant analysis.

• Chapter 18 is new. It can be read without consulting the other chapters and provides a summary of the whole book. It can therefore be used as an introduction to matrix calculus for advanced undergraduates or Master's and PhD students in economics, statistics, mathematics, and engineering who want to know how to apply matrix calculus without going into all the theoretical details.

In addition, many small changes have been made, references have been updated, and exercises have been added. Over the past 30 years, I received many queries, problems, and requests from readers, about once every 2 weeks, which amounts to about 750 queries in 30 years. I responded to all of them and a number of these problems appear in the current text as exercises.

I am grateful to Don Andrews, Manuel Arellano, Richard Baillie, Luc Bauwens, Andrew Chesher, Gerda Claeskens, Russell Davidson, Jean-Marie Dufour, Ronald Gallant, Eric Ghysels, Bruce Hansen, Grant Hillier, Cheng Hsiao, Guido Imbens, Guido Kuersteiner, Offer Lieberman, Esfandiar Maasoumi, Whitney Newey, Kazuhiro Ohtani, Enrique Sentana, Cezary Sieluzycki, Richard Smith, Götz Trenkler, and Farshid Vahid for general encouragement and specific suggestions; to Henk Pijls for answering my questions on complex differentiation and Michel van de Velden for help on psychometric issues; to Jan Brinkhuis, Chris Muris, Franco Peracchi, Andrey Vasnev, Wendun Wang, and Yuan Yue for commenting on the new Chapter 18; to Ang Li for exceptional research assistance in updating the literature; and to Ilka van de Werve for expertly redrawing the figures. No blame attaches to any of these people in case there are remaining errors, ambiguities, or omissions; these are entirely my own responsibility, especially since I have not always followed their advice.

Cross-references. The numbering of theorems, propositions, corollaries, figures, tables, assumptions, examples, and definitions is with two digits, so that Theorem 3.5 refers to Theorem 5 in Chapter 3. Sections are numbered 1, 2, . . . within each chapter but always referenced with two digits, so that Section 5 in Chapter 3 is referred to as Section 3.5. Equations are numbered (1), (2), . . . within each chapter, and referred to with one digit if it refers to the same chapter; if it refers to another chapter we write, for example, see Equation (16) in Chapter 5. Exercises are numbered 1, 2, . . . after a section.

Notation. Special symbols are used to denote the derivative (matrix) D and the Hessian (matrix) H. The differential operator is denoted by d. The third edition follows the notation of earlier editions with the following exceptions. First, the symbol for the vector (1, 1, . . . , 1)′ has been altered from a calligraphic s to ı (dotless i); second, the symbol for the imaginary root has been replaced by the more common i; third, v(A), the vector indicating the essentially distinct components of a symmetric matrix A, has been replaced by vech(A); fourth, the symbols for expectation, variance, and covariance (previously the calligraphic E, V, and C) have been replaced by E, var, and cov, respectively; and fifth, we now denote the normal distribution by N (previously a calligraphic N). A list of all symbols is presented in the Index of Symbols at the end of the book.

Brackets are used sparingly. We write trA instead of tr(A), while trAB denotes tr(AB), not (trA)B. Similarly, vecAB means vec(AB) and dXY means d(XY). In general, we only place brackets when there is a possibility of ambiguity.

I worked on the third edition between April and November 2018. I hope the book will continue to be useful for a few more years, and of course I welcome comments from my readers.

Amsterdam/Wapserveen                                          Jan R. Magnus
November 2018


Note to the reader:

This preview of the third edition of Matrix Differential Calculus contains only Chapter 18, the final chapter. This chapter is different from the other chapters in the book and can be read independently. It attempts to summarize matrix calculus for the user who does not want to go into all the details.

Chapter 18 is available free of charge. When quoting results from this chapter, please refer to:

Magnus, J. R. and H. Neudecker (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics, Third Edition, John Wiley, New York.


CHAPTER 18

Matrix calculus: the essentials

1 INTRODUCTION

This chapter differs from the other chapters in this book. It attempts to summarize the theory and the practical applications of matrix calculus in a few pages, leaving out all the subtleties that the typical user will not need. It also serves as an introduction for (advanced) undergraduates or Master's and PhD students in economics, statistics, mathematics, and engineering, who want to know how to apply matrix calculus without going into all the theoretical details. The chapter can be read independently of the rest of the book.

We begin by introducing the concept of a differential, which lies at the heart of matrix calculus. The key advantage of the differential over the more common derivative is the following. Consider the linear vector function f(x) = Ax, where A is an m × n matrix of constants. Then, f(x) is an m × 1 vector function of an n × 1 vector x, and the derivative Df(x) is an m × n matrix (in this case, the matrix A). But the differential df remains an m × 1 vector. In general, the differential df of a vector function f = f(x) has the same dimension as f, irrespective of the dimension of the vector x, in contrast to the derivative Df(x).

The advantage is even larger for matrices. The differential dF of a matrix function F(X) has the same dimension as F, irrespective of the dimension of the matrix X. The practical importance of working with differentials is huge and will be demonstrated through many examples.

We next discuss vector calculus and optimization, with and without constraints. We emphasize the importance of a correct definition and notation for the derivative, present the ‘first identification theorem’, which links the first differential with the first derivative, and apply these results to least squares. Then we extend the theory from vector calculus to matrix calculus and obtain the differentials of the determinant and inverse.



A brief interlude on quadratic forms follows, the primary purpose of which is to show that if x′Ax = 0 for all x, then this does not imply that A is zero, but only that A′ = −A. We then define the second differential and the Hessian matrix, prove the ‘second identification theorem’, which links the second differential with the Hessian matrix, and discuss the chain rule for second differentials. The first part of this chapter ends with four examples.

In the second (more advanced) part, we introduce the vec operator and the Kronecker product, and discuss symmetry (commutation and duplication matrices). Many examples are provided to clarify the technique. The chapter ends with an application to maximum likelihood estimation, where all elements discussed in the chapter come together.

The following notation is used. Unless specified otherwise, φ denotes a scalar function, f a vector function, and F a matrix function. Also, x denotes a scalar or vector argument, and X a matrix argument. All functions and variables in this chapter are real. Parentheses are used sparingly. We write dX, trX, and vecX without parentheses, and also dXY, trXY, and vecXY instead of d(XY), tr(XY), and vec(XY). However, we write vech(X) with parentheses for historical reasons.

2 DIFFERENTIALS

We assume that the reader is familiar with high-school calculus. This includes not only simple derivatives, such as

d(x^2)/dx = 2x,        d(e^x)/dx = e^x,        d(sin x)/dx = cos x,        (1)

but also the chain rule, for example:

d(sin x)^2/dx = (d(sin x)^2/d sin x) (d sin x/dx) = 2 sin x cos x = sin(2x).

We now introduce the concept of a differential, by expressing (1) as

d(x^2) = 2x dx,        d(e^x) = e^x dx,        d(sin x) = cos x dx,        (2)

where we now write these relations in terms of the differential operator d, rather than as derivatives with respect to x, to emphasize that each is a differential rather than a derivative. The two concepts are closely related, but they are not the same.

The concept of a differential may be confusing for students who remember their mathematics teacher explaining to them that it is wrong to view dx^2/dx as a fraction. They might wonder what dx and dx^2 really are. What does dx^2 = 2x dx mean? From a geometric point of view, it means that if we replace the graph of the function φ(x) = x^2 at some value x by its linear approximation, that is, by the tangent line at the point (x, x^2), then an increment dx in x leads to an increment dx^2 = 2x dx in x^2 in linear approximation. From an algebraic point of view, if we replace x by x + dx (‘increment dx’), then φ(x) is replaced by

φ(x + dx) = (x + dx)^2 = x^2 + 2x dx + (dx)^2.


For small dx, the term (dx)^2 will be very small and, if we ignore it, we obtain the linear approximation x^2 + 2x dx. The differential dx^2 is, for a given value of x, just a function of the real variable dx, given by the formula dx^2 = 2x dx.

This may sound complicated, but working with differentials is easy. The passage from (1) to (2) holds generally for any (differentiable) real-valued function φ, and the differential dφ is thus given by the formula

dφ = (dφ(x)/dx) dx.

Put differently,

dφ = α(x) dx   ⇐⇒   dφ(x)/dx = α(x),        (3)

where α may depend on x, but not on dx. Equation (3) is a special case of the first identification theorem (Theorem 18.1) in the next section. It shows that we can identify the derivative from the differential (and vice versa), and it shows that the new concept of a differential is equivalent to the familiar concept of a derivative. We will always work with the differential, as this has great practical advantages.
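As a quick numerical check of (3), consider the following minimal Python sketch (the function names phi and dphi are ours, chosen purely for illustration). It compares the true increment of φ(x) = x^2 with its differential dφ = 2x dx; the leftover is exactly (dx)^2, the second-order term mentioned above.

```python
# Minimal sketch: the differential of phi(x) = x^2 is dphi = 2x dx,
# and the error of this linear approximation is exactly (dx)^2.

def phi(x):
    return x ** 2          # phi(x) = x^2

def dphi(x, dx):
    return 2 * x * dx      # dphi = alpha(x) dx with alpha(x) = 2x, as in (3)

x = 1.5
for dx in (1e-1, 1e-2, 1e-3):
    exact = phi(x + dx) - phi(x)   # true increment of phi
    linear = dphi(x, dx)           # increment in linear approximation
    print(dx, exact - linear)      # prints (dx)^2 up to rounding: 1e-2, 1e-4, 1e-6
```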

The differential is an operator, in fact a linear operator, and we have

da = 0, d(ax) = a dx,

for any scalar constant a, and

d(x+ y) = dx+ dy, d(x− y) = dx− dy.

For the product and the ratio, we have

d(xy) = (dx)y + x dy,        d(1/x) = −dx/x^2    (x ≠ 0),

and, in addition to the differential of the exponential function d(e^x) = e^x dx,

d log x = dx/x    (x > 0).

The chain rule, well-known for derivatives, also applies to differentials and is then called Cauchy's rule of invariance. For example,

d(sin x)^2 = 2 sin x d sin x = 2 sin x cos x dx = sin(2x) dx,        (4)

or

d exp(x^2) = exp(x^2) d(x^2) = 2x exp(x^2) dx,

or, combining the two previous examples,

d exp(sin x^2) = exp(sin x^2) d(sin x^2) = exp(sin x^2) cos(x^2) d(x^2)
             = 2x exp(sin x^2) cos(x^2) dx.
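The last display can be checked numerically. The short Python sketch below is an illustrative check only (the helper names phi and alpha are ours): it compares a central finite difference of φ(x) = exp(sin x^2) with the coefficient of dx obtained from the chained differentials above.

```python
import numpy as np

def phi(x):
    return np.exp(np.sin(x ** 2))      # phi(x) = exp(sin x^2)

def alpha(x):
    # coefficient of dx in d exp(sin x^2) = 2x exp(sin x^2) cos(x^2) dx
    return 2 * x * np.exp(np.sin(x ** 2)) * np.cos(x ** 2)

x, h = 0.7, 1e-6
finite_difference = (phi(x + h) - phi(x - h)) / (2 * h)   # numerical derivative
print(finite_difference, alpha(x))                        # the two values agree closely
```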


The chain rule is a good example of the general principle that things are easier — sometimes a bit, sometimes a lot — in terms of differentials than in terms of derivatives. The chain rule in terms of differentials states that taking differentials of functions preserves composition of functions. This is easier than the chain rule in terms of derivatives. Consider, for example, the function z = h(x) = (sin x)^2 as the composition of the functions

y = g(x) = sin x,        z = f(y) = y^2,

so that h(x) = f(g(x)). Then dy = cos x dx, and hence

dz = 2y dy = 2y cos x dx = 2 sin x cos x dx,

as expected.

The chain rule is, of course, a key instrument in differential calculus. Suppose we realize that x in (4) depends on t, say x = t^2. Then, we do not need to compute the differential of (sin t^2)^2 all over again. We can use (4) and simply write

d(sin t2)2 = sin(2t2) dt2 = 2t sin(2t2) dt.

The chain rule thus allows us to apply the rules of calculus sequentially, oneafter another.
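To make these manipulations concrete, here is a small numerical sketch (ours, not part of the text) in Python/NumPy: it checks that the increment of φ(x) = x^2 is captured by the differential 2x dx up to a term of order (dx)^2, and that the chain-rule result d(sin t^2)^2 = 2t sin(2t^2) dt gives the correct first-order change.

```python
import numpy as np

# Sketch: a differential is the linear part of an increment.
# For phi(x) = x^2, phi(x + dx) - phi(x) = 2x dx + (dx)^2,
# so for small dx the increment is approximated by the differential 2x dx.
x, dx = 1.7, 1e-6
print((x + dx)**2 - x**2, 2*x*dx)        # equal up to (dx)^2 = 1e-12

# Cauchy's rule of invariance: with x = t^2,
# d(sin t^2)^2 = sin(2 t^2) dt^2 = 2t sin(2 t^2) dt.
t, dt = 0.9, 1e-6
f = lambda t: np.sin(t**2)**2
print(f(t + dt) - f(t), 2*t*np.sin(2*t**2)*dt)   # agree to first order
```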

In this section, we have only concerned ourselves with scalar functions of a scalar argument, and the reader may wonder why we bother to introduce differentials. They do not seem to have a great advantage over the more familiar derivatives. This is true, but when we come to vector functions of vector arguments, then the advantage will become clear.

3 VECTOR CALCULUS

Let x (n × 1) and y (m × 1) be two vectors and let y be a function of x, say y = f(x). What is the derivative of y with respect to x? To help us answer this question, we first consider the linear equation

y = f(x) = Ax,

where A is an m× n matrix of constants. The derivative is A and we write

∂f(x)/∂x′ = A.    (5)

The notation ∂f(x)/∂x′ is just notation, nothing else. We sometimes write the derivative as Df(x) or as Df, but we avoid the notation f′(x) because this may cause confusion with the transpose. The proposed notation emphasizes that we differentiate an m × 1 column vector f with respect to a 1 × n row vector x′, resulting in an m × n derivative matrix.


More generally, the derivative of f(x) is an m × n matrix containing all partial derivatives ∂fi(x)/∂xj, but in a specific ordering, namely

∂f(x)/∂x′ = ( ∂f1(x)/∂x1   ∂f1(x)/∂x2   . . .   ∂f1(x)/∂xn
              ∂f2(x)/∂x1   ∂f2(x)/∂x2   . . .   ∂f2(x)/∂xn
                  ...           ...                 ...
              ∂fm(x)/∂x1   ∂fm(x)/∂x2   . . .   ∂fm(x)/∂xn ).    (6)

There is only one definition of a vector derivative, and this is it. Of course, one can organize the mn partial derivatives in different ways, but these other combinations of the partial derivatives are not derivatives, have no practical use, and should be avoided.

Notice that each row of the derivative in (6) contains the partial derivatives of one element of f with respect to all elements of x, and that each column contains the partial derivatives of all elements of f with respect to one element of x. This is an essential characteristic of the derivative. As a consequence, the derivative of a scalar function y = φ(x), such as y = a′x (where a is a vector of constants), is a row vector; in this case, a′. So the derivative of a′x is a′, not a.

The rules in the previous section imply that the following rules apply to vector differentials, where x and y are vectors and a is a vector of real constants, all of the same order:

da = 0, d(x′) = (dx)′, d(a′x) = a′dx,

d(x+ y) = dx+ dy, d(x− y) = dx− dy,

and

d(x′y) = (dx)′y + x′dy.

Now we can see the advantage of working with differentials rather than with derivatives. When we have an m × 1 vector y, which is a function of an n × 1 vector of variables x, say y = f(x), then the derivative is an m × n matrix, but the differential dy or df remains an m × 1 vector. This is relevant for vector functions, and even more relevant for matrix functions and for second-order derivatives, as we shall see later. The practical advantage of working with differentials is therefore that the order does not increase but always stays the same.

Corresponding to the identification result (3), we have the following relationship between the differential and the derivative.

Theorem 18.1 (first identification theorem):

df = A(x) dx ⇐⇒ ∂f(x)/∂x′ = A(x).


This theorem shows that there is a one-to-one correspondence between first-order differentials and first-order derivatives. In other words, the differential identifies the derivative.

Example 18.1: Consider the linear function φ(x) = a′x, where a is a vector of constants. This gives

dφ = a′dx,

so that the derivative is a′, as we have seen before.

Example 18.2a: Next, consider the quadratic function φ(x) = x′Ax, where A is a matrix of constants. Here, we have

dφ = (dx)′Ax+ x′A dx = x′A′dx+ x′A dx = x′(A+A′) dx.

The derivative is x′(A + A′), and in the special case where A is symmetric, the derivative is 2x′A.
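As a quick numerical sanity check, here is a small Python/NumPy sketch (ours, not part of the original text) that compares the formula x′(A + A′) with a finite-difference gradient of φ(x) = x′Ax.

```python
import numpy as np

# Sketch: the derivative of phi(x) = x'Ax is the row vector x'(A + A'),
# verified here by central finite differences.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))     # deliberately not symmetric
x = rng.standard_normal(n)
phi = lambda x: x @ A @ x

eps = 1e-6
num_grad = np.array([(phi(x + eps*e) - phi(x - eps*e)) / (2*eps)
                     for e in np.eye(n)])
print(np.allclose(num_grad, x @ (A + A.T)))   # True
```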

Now suppose that z = f(y) and that y = g(x), so that z = f(g(x)). Then,

∂z/∂x′ = (∂z/∂y′)(∂y/∂x′).

This is the chain rule for vector functions. The corresponding result for differentials is the following.

Theorem 18.2 (chain rule for first differentials): Let z = f(y) and y = g(x), so that z = f(g(x)). Then,

dz = A(y)B(x) dx,

where A(y) and B(x) are defined through

dz = A(y) dy, dy = B(x) dx.

Example 18.3: Let x = (x1, x2, x3)′ and

f(x) = ( x1^2 − x2^2
         x1x2x3      ).

Then, the differential is

df = ( d(x1^2) − d(x2^2) ) = ( 2x1 dx1 − 2x2 dx2                )
     ( d(x1x2x3)         )   ( (dx1)x2x3 + x1(dx2)x3 + x1x2 dx3 )

   = ( 2x1     −2x2     0    ) ( dx1 )
     ( x2x3    x1x3     x1x2 ) ( dx2 )
                               ( dx3 ),


which identifies the derivative as

∂f(x)/∂x′ = ( 2x1     −2x2     0
              x2x3    x1x3     x1x2 ).
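The same derivative matrix can be verified numerically; the following Python/NumPy sketch (our illustration, with hypothetical helper names f and jacobian) compares the matrix above with central finite differences.

```python
import numpy as np

# Sketch verifying the derivative matrix of Example 18.3 by finite differences.
def f(x):
    x1, x2, x3 = x
    return np.array([x1**2 - x2**2, x1*x2*x3])

def jacobian(x):
    x1, x2, x3 = x
    return np.array([[2*x1, -2*x2, 0.0],
                     [x2*x3, x1*x3, x1*x2]])

x = np.array([1.2, -0.7, 0.5])
eps = 1e-6
num = np.column_stack([(f(x + eps*e) - f(x - eps*e)) / (2*eps)
                       for e in np.eye(3)])
print(np.allclose(num, jacobian(x)))   # True
```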

Example 18.4a: Let x = (x1, x2)′, y = (y1, y2)′, and

φ(y) = e^{y1} sin y2,    y1 = x1 x2^2,    y2 = x1^2 x2.

Then,

dφ = (de^{y1}) sin y2 + e^{y1} d sin y2 = a(y)′ dy,

where

a(y) = e^{y1} ( sin y2
                cos y2 ),    dy = ( dy1
                                    dy2 ).

Also,

dy = ( x2^2     2x1x2 ) ( dx1 )
     ( 2x1x2    x1^2  ) ( dx2 ) = B(x) dx.

Hence,

dφ = a(y)′ dy = a(y)′B(x) dx = c1 dx1 + c2 dx2,

where

c1 = x2 e^{y1} (x2 sin y2 + 2x1 cos y2),

c2 = x1 e^{y1} (x1 cos y2 + 2x2 sin y2),

so that the derivative is ∂φ(x)/∂x′ = (c1, c2).
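A small Python/NumPy sketch (ours) can confirm this chain-rule computation: the analytic derivative (c1, c2) agrees with a finite-difference gradient of φ viewed as a function of x.

```python
import numpy as np

# Sketch verifying Example 18.4a: the derivative (c1, c2) obtained from
# dphi = a(y)'B(x) dx matches a numerical gradient of phi(x).
def phi(x):
    x1, x2 = x
    y1, y2 = x1*x2**2, x1**2*x2
    return np.exp(y1) * np.sin(y2)

def grad(x):
    x1, x2 = x
    y1, y2 = x1*x2**2, x1**2*x2
    c1 = x2*np.exp(y1)*(x2*np.sin(y2) + 2*x1*np.cos(y2))
    c2 = x1*np.exp(y1)*(x1*np.cos(y2) + 2*x2*np.sin(y2))
    return np.array([c1, c2])

x = np.array([0.6, -0.4])
eps = 1e-6
num = np.array([(phi(x + eps*e) - phi(x - eps*e)) / (2*eps) for e in np.eye(2)])
print(np.allclose(num, grad(x)))   # True
```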

4 OPTIMIZATION

Let φ(x) be a scalar differentiable function that we wish to optimize with respect to an n × 1 vector x. Then we obtain the differential dφ = a(x)′ dx, and set a(x) = 0. Suppose, for example, that we wish to minimize the function

φ(x) = (1/2) x′Ax − b′x,    (7)

where the matrix A is positive definite. The differential is

dφ = x′A dx− b′ dx = (Ax − b)′ dx.

(Recall that a positive definite matrix is symmetric, by definition.) The solution x̂ needs to satisfy Ax̂ − b = 0, and hence x̂ = A−1b. The function φ has an absolute minimum at x̂, which can be seen by defining y = x − x̂ and writing

y′Ay = (x − A−1b)′A(x − A−1b) = 2φ(x) + b′A−1b.


Since A is positive definite, y′Ay has a minimum at y = 0 and hence φ(x) has a minimum at x = x̂. This holds for the specific linear-quadratic function (7) and it holds more generally for any (strictly) convex function. Such functions attain a (strict) absolute minimum.
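The following Python/NumPy sketch (ours, with an arbitrary positive definite A generated for illustration) solves the first-order condition Ax = b and checks that perturbing the solution never decreases φ.

```python
import numpy as np

# Sketch of the unconstrained problem (7): minimize (1/2) x'Ax - b'x with A
# positive definite. The first-order condition Ax = b gives the minimizer.
rng = np.random.default_rng(1)
n = 5
L = rng.standard_normal((n, n))
A = L @ L.T + n*np.eye(n)           # positive definite by construction
b = rng.standard_normal(n)

phi = lambda x: 0.5 * x @ A @ x - b @ x
x_hat = np.linalg.solve(A, b)       # x = A^{-1} b

# Any perturbation increases the value, as the convexity argument predicts.
for _ in range(3):
    y = rng.standard_normal(n) * 0.1
    print(phi(x_hat + y) >= phi(x_hat))   # True
```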

Next suppose there is a restriction, say g(x) = 0. Then we need to optimize subject to the restriction, and we need Lagrangian theory. This works as follows. First define the Lagrangian function, usually referred to as the Lagrangian,

ψ(x) = φ(x) − λg(x),

where λ is the Lagrange multiplier. Then we obtain the differential of ψ with respect to x,

dψ = dφ− λdg,

and set it equal to zero. The equations

∂φ(x)/∂x′ = λ ∂g(x)/∂x′,    g(x) = 0

are the first-order conditions. From these n + 1 equations in n + 1 unknowns (x and λ), we solve x and λ.

If the constraint g is a vector rather than a scalar, then we have not one but several (say, m) constraints. In that case we need m multipliers and it works like this. First, define the Lagrangian

ψ(x) = φ(x) − l′g(x),

where l = (λ1, λ2, . . . , λm)′ is a vector of Lagrange multipliers. Then, we obtain the differential of ψ with respect to x:

dψ = dφ− l′dg

and set it equal to zero. The equations

∂φ(x)/∂x′ = l′ ∂g(x)/∂x′,    g(x) = 0

constitute n + m equations (the first-order conditions). If we can solve these equations, then we obtain the solutions, say x̂ and l̂.

The Lagrangian method gives necessary conditions for a local constrained extremum to occur at a given point x̂. But how do we know that this point is in fact a maximum or a minimum? Sufficient conditions are available but they may be difficult to verify. However, in the special case where φ is linear-quadratic (or more generally, convex) and g is linear, φ attains an absolute minimum at the solution x̂ under the constraint g(x) = 0.


5 LEAST SQUARES

Suppose we are given an n × 1 vector y and an n × k matrix X with linearly independent columns, so that r(X) = k. We wish to find a k × 1 vector β, such that Xβ is ‘as close as possible’ to y in the sense that the ‘error’ vector e = y − Xβ is minimized. A convenient scalar measure of the ‘error’ would be e′e and our objective is to minimize

φ(β) = e′e/2 = (y − Xβ)′(y − Xβ)/2,    (8)

where we note that we write e′e/2 rather than e′e. This makes no difference, since any β which minimizes e′e will also minimize e′e/2, but it is a common trick, useful because we know that we are minimizing a quadratic function, so that a ‘2’ will appear in the derivative. The 1/2 neutralizes this 2.

Differentiating φ in (8) gives

dφ = e′ de = e′ d(y −Xβ) = −e′X dβ.

Hence, the optimum is obtained when X′e = 0, that is, when X′Xβ = X′y, from which we obtain

β̂ = (X′X)−1X′y,    (9)

the least-squares solution.

If there are constraints on β, say Rβ = r, then we need to solve

minimize φ(β)

subject to Rβ = r.

We assume that the m rows of R are linearly independent, and define the Lagrangian

ψ(β) = (y −Xβ)′(y −Xβ)/2− l′(Rβ − r),

where l is a vector of Lagrange multipliers. The differential is

dψ = d(y −Xβ)′(y −Xβ)/2− l′ d(Rβ − r)

= (y −Xβ)′ d(y −Xβ)− l′R dβ

= −(y −Xβ)′X dβ − l′R dβ.

Setting the differential equal to zero and denoting the restricted estimators by β̃ and l̃, we obtain the first-order conditions

(y − Xβ̃)′X + l̃′R = 0,    Rβ̃ = r,

or, written differently,

X′Xβ̃ − X′y = R′l̃,    Rβ̃ = r.


We do not know β̃ but we know Rβ̃. Hence, we premultiply by R(X′X)−1. Letting β̂ = (X′X)−1X′y as in (9), this gives

r − Rβ̂ = R(X′X)−1R′ l̃.

Since R has full row rank, we can solve for l̃:

l̃ = (R(X′X)−1R′)−1(r − Rβ̂),

and hence for β̃:

β̃ = β̂ + (X′X)−1R′ l̃ = β̂ + (X′X)−1R′(R(X′X)−1R′)−1(r − Rβ̂).

Since the constraint is linear and the function φ is linear-quadratic as in (7), it follows that the solution β̃ indeed minimizes φ(β) = e′e/2 under the constraint Rβ = r.
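The closed-form expressions above are easy to verify numerically. The sketch below (ours; the data are simulated purely for illustration) computes the unrestricted and restricted estimators and checks that the restricted estimator satisfies the constraint exactly and cannot have a smaller residual sum of squares.

```python
import numpy as np

# Sketch of restricted least squares: beta_hat from (9) and the restricted
# estimator beta_tilde, checked against the constraint R beta_tilde = r.
rng = np.random.default_rng(2)
n, k, m = 50, 4, 2
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)
R = rng.standard_normal((m, k))
r = rng.standard_normal(m)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
W = R @ XtX_inv @ R.T
l_tilde = np.linalg.solve(W, r - R @ beta_hat)
beta_tilde = beta_hat + XtX_inv @ R.T @ l_tilde

rss = lambda b: np.sum((y - X @ b)**2)
print(np.allclose(R @ beta_tilde, r))      # constraint holds exactly
print(rss(beta_tilde) >= rss(beta_hat))    # True
```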

6 MATRIX CALCULUS

We have moved from scalar calculus to vector calculus; now we move from vector calculus to matrix calculus. When discussing matrices we assume that the reader is familiar with matrix addition and multiplication, and also knows the concepts of a determinant |A| and an inverse A−1. An important function of a square matrix A = (aij) is its trace, which is defined as the sum of the diagonal elements of A: trA = Σi aii. We have

trA = trA′,

which is obvious because a matrix and its transpose have the same diagonal elements. Less obvious is

trA′B = trBA′

for any two matrices A and B of the same order (but not necessarily square). This follows because

trA′B = Σj (A′B)jj = Σj Σi aij bij = Σi Σj bij aij = Σi (BA′)ii = trBA′.    (10)

The rules for vector differentials in Section 3 carry over to matrix differentials. Let A be a matrix of constants and let α be a scalar. Then, for any X,

dA = 0, d(αX) = α dX, d(X′) = (dX)′,

and, for square X,

d trX = tr dX.


If X and Y are of the same order, then

d(X + Y ) = dX + dY, d(X − Y ) = dX − dY,

and, if the matrix product XY is defined,

d(XY ) = (dX)Y +XdY.

Two less trivial differentials are the determinant and the inverse. For nonsingular X we have

d|X| = |X| trX−1 dX,    (11)

and in particular, when |X| > 0,

d log |X| = d|X|/|X| = trX−1 dX.

The proof of (11) is a little tricky and is omitted (in this chapter, but not in Chapter 8).

The differential of the inverse is, for nonsingular X ,

dX−1 = −X−1(dX)X−1. (12)

This we can prove easily by considering the equation X−1X = I. Differentiating both sides gives

(dX−1)X +X−1dX = 0

and the result then follows by postmultiplying with X−1.
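Both (11) and (12) are first-order statements, so they can be checked numerically with a small perturbation dX. The following Python/NumPy sketch (ours) does exactly that.

```python
import numpy as np

# Sketch checking (11) and (12): for a small perturbation dX,
# |X + dX| - |X|          is approximately |X| tr(X^{-1} dX), and
# (X + dX)^{-1} - X^{-1}  is approximately -X^{-1} (dX) X^{-1}.
rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n)) + n*np.eye(n)   # comfortably nonsingular
dX = 1e-6 * rng.standard_normal((n, n))
Xinv = np.linalg.inv(X)

lhs_det = np.linalg.det(X + dX) - np.linalg.det(X)
rhs_det = np.linalg.det(X) * np.trace(Xinv @ dX)
print(np.isclose(lhs_det, rhs_det, rtol=1e-4))          # True

lhs_inv = np.linalg.inv(X + dX) - Xinv
rhs_inv = -Xinv @ dX @ Xinv
print(np.allclose(lhs_inv, rhs_inv, atol=1e-10))        # True
```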

The chain rule also applies to matrix functions. More precisely, if Z = F(Y) and Y = G(X), so that Z = F(G(X)), then

dZ = A(Y )B(X) dX,

where A(Y ) and B(X) are defined through

dZ = A(Y ) dY, dY = B(X) dX,

as in Theorem 18.2.

Regarding constrained optimization, treated for vector functions in Section 18.4, we note that this can be easily and elegantly extended to matrix constraints. If we have a matrix G (rather than a vector g) of constraints and a matrix X (rather than a vector x) of variables, then we define a matrix of multipliers L = (λij) of the same dimension as G = (gij). The Lagrangian then becomes

ψ(X) = φ(X)− trL′G(X),

where we have used the fact, also used in (10) above, that

trL′G = Σi Σj λij gij.


7 INTERLUDE ON LINEAR AND QUADRATIC FORMS

Before we turn from first to second differentials, that is, from linear forms to quadratic forms, we investigate under what conditions a linear or quadratic form vanishes. The sole purpose of this section is to help the reader appreciate Theorem 18.3 in the next section.

A linear form is an expression such as Ax. When Ax = 0, this does not imply that either A or x is zero. For example, if

A = (  1   −1
      −2    2
       3   −3 ),    x = ( 1
                          1 ),

then Ax = 0, but neither A = 0 nor x = 0.

However, when Ax = 0 for every x, then A must be zero, which can be seen by taking x to be each elementary vector ei in turn. (The ith elementary vector is the vector with one in the ith position and zeros elsewhere.)

A quadratic form is an expression such as x′Ax. When x′Ax = 0, this does not imply that A = 0 or x = 0 or Ax = 0. Even when x′Ax = 0 for every x, it does not follow that A = 0, as the example

A = (  0   1
      −1   0 )

demonstrates. This matrix A is skew-symmetric, that is, it satisfies A′ = −A. In fact, when x′Ax = 0 for every x then it follows that A must be skew-symmetric. This can be seen by taking x = ei, which implies that aii = 0, and then x = ei + ej, which implies that aij + aji = 0.

In the special case where x′Ax = 0 for every x and A is symmetric, then A is both symmetric (A′ = A) and skew-symmetric (A′ = −A), and hence A = 0.

8 THE SECOND DIFFERENTIAL

The second differential is simply the differential of the first differential:

d2f = d(df).

Higher-order differentials are similarly defined, but they are seldom needed.

Example 18.2b: Let φ(x) = x′Ax. Then, dφ = x′(A+A′) dx and

d2φ = d (x′(A+A′) dx) = (dx)′(A+A′) dx+ x′(A+A′) d2x

= (dx)′(A+A′) dx,

since d2x = 0.


The first differential leads to the first derivative (sometimes called the Jacobian matrix) and the second differential leads to the second derivative (called the Hessian matrix). We emphasize that the concept of Hessian matrix is only useful for scalar functions, not for vector or matrix functions. When we have a vector function f we shall consider the Hessian matrix of each element of f separately, and when we have a matrix function F we shall consider the Hessian matrix of each element of F separately.

Thus, let φ be a scalar function and let

dφ = a(x)′dx, da = (Hφ) dx, (13)

where

a(x)′ = ∂φ(x)/∂x′,    Hφ = ∂a(x)/∂x′ = (∂/∂x′)(∂φ(x)/∂x′)′.

The ijth element of the Hessian matrix Hφ is thus obtained by first calculating aj(x) = ∂φ(x)/∂xj and then (Hφ)ij = ∂aj(x)/∂xi. The Hessian matrix contains all second-order partial derivatives ∂2φ(x)/∂xi∂xj, and it is symmetric if φ is twice differentiable.

The Hessian matrix is often written as

Hφ = ∂2φ(x)/∂x∂x′,    (14)

where the expression on the right-hand side is a notation, the precise meaning of which is given by

∂2φ(x)/∂x∂x′ = (∂/∂x′)(∂φ(x)/∂x′)′.    (15)

Given (13) and using the symmetry of Hφ, we obtain the second differential as

d2φ = (da)′ dx = (dx)′(Hφ) dx,

which shows that the second differential of φ is a quadratic form in dx.

Now, suppose that we have obtained, after some calculations, that d2φ = (dx)′B(x) dx. Then,

(dx)′(Hφ − B(x)) dx = 0

for all dx. Does this imply that Hφ = B(x)? No, it does not, as we have seen in the previous section. It does, however, imply that

(Hφ − B(x))′ + (Hφ − B(x)) = 0,

and hence that Hφ = (B(x) + B(x)′)/2, using the symmetry of Hφ. This proves the following result.

Theorem 18.3 (second identification theorem):

d2φ = (dx)′B(x) dx ⇐⇒ Hφ = (B(x) + B(x)′)/2.


The second identification theorem shows that there is a one-to-one correspondence between second-order differentials and second-order derivatives, but only if we make the matrix B(x) in the quadratic form symmetric. Hence, the second differential identifies the second derivative.

Example 18.2c: Consider again the quadratic function φ(x) = x′Ax. Then we can start with dφ = x′(A + A′) dx, as in Example 18.2b, and obtain d2φ = (dx)′(A + A′) dx. The matrix in the quadratic form is already symmetric, so we obtain directly Hφ = A + A′.

Alternatively (and this is often quicker) we differentiate φ twice without writing out the first differential in its final form. From

dφ = (dx)′Ax+ x′A dx,

we thus obtain

d2φ = 2(dx)′A dx, (16)

which identifies the Hessian matrix as Hφ = A + A′. (Notice that the matrix A in (16) is not necessarily symmetric.)

Even with such a simple function as φ(x) = x′Ax, the advantage and elegance of using differentials is clear. Without differentials we would need to prove first that ∂a′x/∂x′ = a′ and ∂x′Ax/∂x′ = x′(A + A′), and then use (15) to obtain

∂2x′Ax/∂x∂x′ = ∂(x′(A + A′))′/∂x′ = ∂(A + A′)x/∂x′ = A + A′,

which is cumbersome in this simple case and not practical in more complex situations.
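For completeness, here is a Python/NumPy sketch (ours) that recovers Hφ = A + A′ for φ(x) = x′Ax from second-order central differences, with a deliberately nonsymmetric A.

```python
import numpy as np

# Sketch: the Hessian of phi(x) = x'Ax is A + A', checked by second-order
# central differences (exact for a quadratic function, up to rounding).
rng = np.random.default_rng(4)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
phi = lambda x: x @ A @ x

eps = 1e-4
I = np.eye(n)
H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        H[i, j] = (phi(x + eps*(I[i]+I[j])) - phi(x + eps*(I[i]-I[j]))
                   - phi(x - eps*(I[i]-I[j])) + phi(x - eps*(I[i]+I[j]))) / (4*eps**2)
print(np.allclose(H, A + A.T, atol=1e-6))   # True
```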

9 CHAIN RULE FOR SECOND DIFFERENTIALS

Let us now return to Example 18.2b. The function φ in this example is a function of x, and x is the argument of interest. This is why d2x = 0. But if φ is a function of x, which in turn is a function of t, then it is no longer true that d2x equals zero. More generally, suppose that z = f(y) and that y = g(x), so that z = f(g(x)). Then,

dz = A(y) dy

and

d2z = (dA) dy +A(y) d2y. (17)

This is true whether or not y depends on some other variables. If we think of z as a function of y, then d2y = 0, but if y depends on x then d2y is not zero; in fact,

dy = B(x) dx, d2y = (dB) dx.


This gives us the following result.

Theorem 18.4 (chain rule for second differentials): Let z = f(y) and y = g(x), so that z = f(g(x)). Then,

d2z = (dA)B(x) dx +A(y)(dB) dx,

where A(y) and B(x) are defined through

dz = A(y) dy, dy = B(x) dx.

In practice, one usually avoids Theorem 18.4 by going back to the first differential dz = A(y) dy and differentiating again. This gives (17), from which we obtain the result step by step.

Example 18.4b: Let

φ(y1, y2) = e^{y1} sin y2,    y1 = x1 x2^2,    y2 = x1^2 x2.

Then, by Theorem 18.4,

d2φ = (da)′B(x) dx+ a(y)′(dB) dx,

where

a(y) = e^{y1} ( sin y2
                cos y2 ),    B(x) = ( x2^2     2x1x2
                                      2x1x2    x1^2  ).

Now, letting

C(y) = e^{y1} ( sin y2     cos y2
                cos y2    −sin y2 )

and

D1(x) = 2 ( 0     x2
            x2    x1 ),    D2(x) = 2 ( x2    x1
                                       x1    0  ),

we obtain

da = C(y) dy = C(y)B(x) dx

and

dB = (dx1)D1(x) + (dx2)D2(x).

It is convenient to write dx1 and dx2 in terms of dx, which can be done by defining e1 = (1, 0)′ and e2 = (0, 1)′. Then, dx1 = e′1 dx and dx2 = e′2 dx, and hence

d2φ = (da)′B(x) dx + a(y)′(dB) dx
    = (dx)′B(x)C(y)B(x) dx + a(y)′((dx1)D1(x) + (dx2)D2(x)) dx
    = (dx)′B(x)C(y)B(x) dx + (dx)′e1 a(y)′D1(x) dx + (dx)′e2 a(y)′D2(x) dx
    = (dx)′ [B(x)C(y)B(x) + e1 a(y)′D1(x) + e2 a(y)′D2(x)] dx.


Some care is required where to position the scalars e′1 dx and e′2 dx in the matrix product. A scalar can be positioned anywhere in a matrix product, but we wish to position the two scalars in such a way that the usual matrix multiplication rules still apply.

Having obtained the second differential in the desired form, Theorem 18.3 implies that the Hessian is equal to

Hφ = B(x)C(y)B(x) + (1/2)(e1 a(y)′D1(x) + D1(x) a(y) e′1)
                  + (1/2)(e2 a(y)′D2(x) + D2(x) a(y) e′2).
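As a check on this calculation, the Python/NumPy sketch below (ours; the helper names phi and hessian are hypothetical) evaluates the Hessian formula for Example 18.4b and compares it with a finite-difference Hessian of φ(x1, x2) = e^{x1 x2^2} sin(x1^2 x2).

```python
import numpy as np

# Sketch verifying the Hessian of Example 18.4b against finite differences.
def phi(x):
    x1, x2 = x
    return np.exp(x1*x2**2) * np.sin(x1**2*x2)

def hessian(x):
    x1, x2 = x
    y1, y2 = x1*x2**2, x1**2*x2
    a  = np.exp(y1) * np.array([np.sin(y2), np.cos(y2)])
    B  = np.array([[x2**2, 2*x1*x2], [2*x1*x2, x1**2]])
    C  = np.exp(y1) * np.array([[np.sin(y2),  np.cos(y2)],
                                [np.cos(y2), -np.sin(y2)]])
    D1 = 2*np.array([[0.0, x2], [x2, x1]])
    D2 = 2*np.array([[x2, x1], [x1, 0.0]])
    e1, e2 = np.eye(2)
    H = B @ C @ B
    H += 0.5*(np.outer(e1, a @ D1) + np.outer(D1 @ a, e1))
    H += 0.5*(np.outer(e2, a @ D2) + np.outer(D2 @ a, e2))
    return H

x = np.array([0.7, 0.3])
eps = 1e-4
I = np.eye(2)
Hnum = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        Hnum[i, j] = (phi(x + eps*(I[i]+I[j])) - phi(x + eps*(I[i]-I[j]))
                      - phi(x - eps*(I[i]-I[j])) + phi(x - eps*(I[i]+I[j]))) / (4*eps**2)
print(np.allclose(Hnum, hessian(x), atol=1e-4))   # True
```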

10 FOUR EXAMPLES

Let us provide four examples to show how the second differential can be obtained. The first three examples relate to scalar functions and the fourth example to a matrix function. The matrix X has order n × q in Examples 18.5a and 18.6a, and order n × n in Examples 18.7a and 18.8a.

Example 18.5a: Let φ(X) = trX ′AX . Then,

dφ = d(trX ′AX) = tr d(X ′AX)

= tr(dX)′AX + trX ′A dX = trX ′(A+A′) dX

and

d2φ = d trX ′(A+A′) dX = tr(dX)′(A+A′) dX.

Example 18.6a: Let φ(X) = log |X ′X |. Then,

dφ = d log |X ′X | = tr(X ′X)−1d(X ′X)

= tr(X ′X)−1(dX)′X + tr(X ′X)−1X ′dX = 2 tr(X ′X)−1X ′dX

and

d2φ = 2 d(tr(X′X)−1X′dX)
    = 2 tr(d(X′X)−1)X′dX + 2 tr(X′X)−1(dX)′dX
    = −2 tr(X′X)−1(dX′X)(X′X)−1X′dX + 2 tr(X′X)−1(dX)′dX
    = −2 tr(X′X)−1(dX)′X(X′X)−1X′dX
      − 2 tr(X′X)−1X′(dX)(X′X)−1X′dX + 2 tr(X′X)−1(dX)′dX
    = 2 tr(X′X)−1(dX)′MdX − 2 tr(X′X)−1X′(dX)(X′X)−1X′dX,

where M = In − X(X′X)−1X′. Let us explain some of the steps in more detail. The second equality follows from considering (X′X)−1X′dX as a product of three matrices: (X′X)−1, X′, and dX (a matrix of constants), the third equality uses the differential of the inverse in (12), and the fourth equality separates dX′X into (dX)′X + X′dX.

Example 18.7a: Let φ(X) = trX^k for k = 1, 2, . . . Then, for k ≥ 1,

dφ = tr(dX)X^{k−1} + trX(dX)X^{k−2} + · · · + trX^{k−1}dX = k trX^{k−1}dX,

and for k ≥ 2,

d2φ = k tr(dX^{k−1}) dX = k Σ_{j=0}^{k−2} trX^j(dX)X^{k−2−j}dX.

Example 18.8a: Let F (X) = AX−1B. Then,

dF = A(dX−1)B = −AX−1(dX)X−1B

and

d2F = −A(dX−1)(dX)X−1B −AX−1(dX)(dX−1)B

= 2AX−1(dX)X−1(dX)X−1B.

These four examples provide the second differential; they do not yet provide the Hessian matrix. In Section 18.14, we shall discuss the same four examples and obtain the Hessian matrices.

11 THE KRONECKER PRODUCT AND VEC OPERATOR

The theory and the four examples in the previous two sections demonstrate the elegance and simplicity of obtaining first and second differentials of scalar, vector, and matrix functions. But we also want to relate these first and second differentials to Jacobian matrices (first derivatives) and Hessian matrices (second derivatives). For this we need some more machinery, namely the vec operator and the Kronecker product.

First, the vec operator. Consider an m × n matrix A. This matrix has n columns, say a1, . . . , an. Now define the mn × 1 vector vecA as the vector which stacks these columns one underneath the other:

vecA = ( a1
         a2
         ...
         an ).


For example, if

A = ( 1   2   3
      4   5   6 ),

then vecA = (1, 4, 2, 5, 3, 6)′. Of course, we have

d vecX = vec dX. (18)

If A and B are matrices of the same order, then we know from (10) that trA′B = Σij aij bij. But (vecA)′(vecB) is also equal to this double sum.

Hence,

trA′B = (vecA)′(vecB), (19)

an important equality linking the vec operator to the trace.

We also need the Kronecker product. Let A be an m × n matrix and B a p × q matrix. The mp × nq matrix defined by

( a11B   . . .   a1nB
   ...            ...
  am1B   . . .   amnB )

is called the Kronecker product of A and B and is written as A ⊗ B. The Kronecker product A ⊗ B is thus defined for any pair of matrices A and B, unlike the matrix product AB which exists only if the number of columns in A equals the number of rows in B or if either A or B is a scalar.

The following three properties justify the name Kronecker product:

A⊗B ⊗ C = (A⊗B)⊗ C = A⊗ (B ⊗ C),

(A+B)⊗ (C +D) = A⊗ C +A⊗D +B ⊗ C +B ⊗D,

if A and B have the same order and C and D have the same order (not necessarily equal to the order of A and B), and

(A⊗B)(C ⊗D) = AC ⊗BD,

if AC and BD exist.

The transpose of a Kronecker product is

(A⊗B)′ = A′ ⊗B′.

If A and B are square matrices (not necessarily of the same order), then

tr(A⊗ B) = (trA)(trB),

and if A and B are nonsingular, then

(A⊗B)−1 = A−1 ⊗B−1.


The Kronecker product and the vec operator are related through the equality

vecab′ = b⊗ a,

where a and b are column vectors of arbitrary order. Using this equality, we see that

vec(Abe′C) = vec(Ab)(C′e)′ = (C′e)⊗ (Ab)

= (C′ ⊗A)(e ⊗ b) = (C′ ⊗A) vec be′

for any vectors b and e. Then, writing B = Σj bj e′j, where bj and ej denote the jth column of B and I, respectively, we obtain the following important relationship, which is used frequently.

Theorem 18.5: For any matrices A, B, and C for which the product ABC is defined, we have

vecABC = (C′ ⊗A) vecB.
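Theorem 18.5 is easy to check numerically. In the Python/NumPy sketch below (ours), note that vec stacks columns, which corresponds to column-major ('F') ordering in NumPy.

```python
import numpy as np

# Sketch of Theorem 18.5: vec(ABC) = (C' kron A) vec(B).
vec = lambda M: M.reshape(-1, order='F')   # stack columns

rng = np.random.default_rng(5)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
C = rng.standard_normal((2, 5))

lhs = vec(A @ B @ C)
rhs = np.kron(C.T, A) @ vec(B)
print(np.allclose(lhs, rhs))   # True
```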

12 IDENTIFICATION

When we move from vector calculus to matrix calculus, we need an ordering of the functions and of the variables. It does not matter how we order them (any ordering will do), but an ordering is essential. We want to define matrix derivatives within the established theory of vector derivatives in such a way that trivial changes such as relabeling functions or variables have only trivial consequences for the derivative: rows and columns are permuted, but the rank is unchanged and the determinant (in the case of a square matrix) is also unchanged, apart possibly from its sign. This is what we need to achieve. The arrangement of the partial derivatives matters, because a derivative is more than just a collection of partial derivatives. It is a mathematical concept, a mathematical unit.

Thus motivated, we shall view the matrix function F(X) as a vector function f(x), where f = vecF and x = vecX. We then obtain the following extension of the first identification theorem:

d vecF = A(X) d vecX ⇐⇒ ∂ vecF(X)/∂(vecX)′ = A(X),

and, similarly, for the second identification theorem:

d2φ = (d vecX)′B(X) d vecX ⇐⇒ Hφ = (B(X) + B(X)′)/2,

where we notice, as in Section 18.8, that we only provide the Hessian matrix for scalar functions, not for vector or matrix functions.


13 THE COMMUTATION MATRIX

At this point, we need to introduce the commutation matrix. Let A be an m × n matrix. The vectors vecA and vecA′ contain the same mn elements, but in a different order. Hence, there exists a unique mn × mn matrix, which transforms vecA into vecA′. This matrix contains mn ones and mn(mn − 1) zeros and is called the commutation matrix, denoted by Kmn. (If m = n, we write Kn instead of Knn.) Thus,

Kmn vecA = vecA′. (20)

It can be shown that Kmn is orthogonal, i.e. K′mn = Kmn^{−1}. Also, premultiplying (20) by Knm gives KnmKmn vecA = vecA, which shows that KnmKmn = Imn. Hence,

K′mn = Kmn^{−1} = Knm.

The key property of the commutation matrix enables us to interchange (commute) the two matrices of a Kronecker product:

Kpm(A⊗B) = (B ⊗A)Kqn (21)

for any m × n matrix A and p × q matrix B. This is easiest shown, not by proving a matrix identity but by proving that the effect of the two matrices on an arbitrary vector is the same. Thus, let X be an arbitrary q × n matrix. Then, by repeated application of (20) and Theorem 18.5,

Kpm(A⊗B) vec X = Kpm vec BXA′ = vec AX ′B′

= (B ⊗A) vec X ′ = (B ⊗A)Kqn vec X.

Since X is arbitrary, (21) follows.
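The defining property (20) also tells us how to construct Kmn explicitly, column by column. The Python/NumPy sketch below (ours; the helper K is hypothetical) builds the commutation matrix this way and checks both (20) and the interchange rule (21).

```python
import numpy as np

# Sketch: build K_mn from its defining property K_mn vec A = vec A'.
vec = lambda M: M.reshape(-1, order='F')

def K(m, n):
    # Column (j*m + i) of K_mn is vec of the transpose of the (i, j) indicator.
    Kmn = np.zeros((m*n, m*n))
    for i in range(m):
        for j in range(n):
            E = np.zeros((m, n)); E[i, j] = 1.0
            Kmn[:, j*m + i] = vec(E.T)
    return Kmn

rng = np.random.default_rng(6)
m, n, p, q = 2, 3, 4, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, q))

print(np.allclose(K(m, n) @ vec(A), vec(A.T)))                        # (20)
print(np.allclose(K(p, m) @ np.kron(A, B), np.kron(B, A) @ K(q, n)))  # (21)
```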

The commutation matrix has many applications in matrix theory. Its importance in matrix calculus stems from the fact that it transforms d vecX′ into d vecX. The simplest example is the matrix function F(X) = X′, where X is an n × q matrix. Then,

d vecF = vec dX ′ = Knq vec dX,

so that the derivative is D vecF = Knq.

The commutation matrix is also essential in identifying the Hessian matrix from the second differential. The second differential of a scalar function often takes the form of a trace, either trA(dX)′BdX or trA(dX)BdX. We then have the following result, based on (19) and Theorem 18.5.

Theorem 18.6: Let φ be a twice differentiable real-valued function of an n × q matrix X. Then,

d2φ = trA(dX)′BdX ⇐⇒ Hφ = (1/2)(A′ ⊗ B + A ⊗ B′)


and

d2φ = trA(dX)BdX ⇐⇒ Hφ = (1/2)Kqn(A′ ⊗ B + B′ ⊗ A).

To identify the Hessian matrix from the first expression, we do not need the commutation matrix, but we do need the commutation matrix to identify the Hessian matrix from the second expression.

14 FROM SECOND DIFFERENTIAL TO HESSIAN

We continue with the same four examples as discussed in Section 18.10, showing how to obtain the Hessian matrices from the second differentials, using Theorem 18.6.

Example 18.5b: Let φ(X) = trX ′AX , where X is an n× q matrix. Then,

dφ = trX ′(A+A′) dX = trC′dX = (vecC)′d vecX,

where C = (A + A′)X, and d2φ = tr(dX)′(A + A′) dX. Hence, the derivative is Dφ = (vecC)′ and the Hessian is Hφ = Iq ⊗ (A + A′).
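A short Python/NumPy sketch (ours) confirms both parts of this example: the gradient with respect to vecX and the matrix Iq ⊗ (A + A′) governing the quadratic form in the second differential.

```python
import numpy as np

# Sketch for Example 18.5b: phi(X) = tr X'AX has derivative (vec C)' with
# C = (A+A')X, and its second differential is a quadratic form in vec dX
# with matrix I_q kron (A + A').
vec = lambda M: M.reshape(-1, order='F')

rng = np.random.default_rng(7)
n, q = 4, 3
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, q))

def phi(xvec):
    Xm = xvec.reshape(n, q, order='F')
    return np.trace(Xm.T @ A @ Xm)

x0, eps = vec(X), 1e-6
grad = np.array([(phi(x0 + eps*e) - phi(x0 - eps*e)) / (2*eps)
                 for e in np.eye(n*q)])
print(np.allclose(grad, vec((A + A.T) @ X)))     # Dphi = (vec C)'

dX = rng.standard_normal((n, q))                 # d2phi = tr (dX)'(A+A') dX
lhs = np.trace(dX.T @ (A + A.T) @ dX)
rhs = vec(dX) @ np.kron(np.eye(q), A + A.T) @ vec(dX)
print(np.allclose(lhs, rhs))                     # Hphi = I_q kron (A + A')
```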

Example 18.6b: Let φ(X) = log |X′X|, where X is an n × q matrix of full column rank. Then, letting C = X(X′X)−1 and M = In − X(X′X)−1X′,

dφ = 2 trC′dX = 2(vecC)′d vecX

and

d2φ = 2 tr(X′X)−1(dX)′MdX − 2 trC′(dX)C′dX.

This gives Dφ = 2(vecC)′ and

Hφ = 2(X ′X)−1 ⊗M − 2Kqn(C ⊗ C′).

Example 18.7b: Let φ(X) = trX^k for k = 1, 2, . . . , where X is a square n × n matrix. Then, for k ≥ 1,

dφ = k trX^{k−1}dX = k(vec (X′)^{k−1})′ d vecX,

and for k ≥ 2,

d2φ = k Σ_{j=0}^{k−2} trX^j(dX)X^{k−2−j}dX.

This gives Dφ = k(vec (X′)^{k−1})′ and

Hφ = (k/2) Σ_{j=0}^{k−2} Kn((X′)^j ⊗ X^{k−2−j} + (X′)^{k−2−j} ⊗ X^j).


Example 18.8b: Let F(X) = AX−1B, where X is a nonsingular n × n matrix. Then, using Theorem 18.5,

d vecF = −((X−1B)′ ⊗ (AX−1)) d vecX,

and hence

D vecF = −(X−1B)′ ⊗ (AX−1).

To obtain the Hessian matrix of the stth element of F, we let

Cts = X−1B et e′s AX−1,

where es and et are elementary vectors with 1 in the sth (respectively, tth) position and zeros elsewhere. Then,

d2Fst = 2e′sAX−1(dX)X−1(dX)X−1Bet = 2 trCts(dX)X−1(dX)

and hence

HFst = Kn(C′ts ⊗ X−1 + (X′)−1 ⊗ Cts).

15 SYMMETRY AND THE DUPLICATION MATRIX

Many matrices in statistics and econometrics are symmetric, for example variance matrices. When we differentiate with respect to symmetric matrices, we must take the symmetry into account and we need the duplication matrix.

Let A be a square n × n matrix. Then vech(A) will denote the n(n+1)/2 × 1 vector that is obtained from vecA by eliminating all elements of A above the diagonal. For example, for n = 3,

vecA = (a11, a21, a31, a12, a22, a32, a13, a23, a33)′

and

vech(A) = (a11, a21, a31, a22, a32, a33)′. (22)

In this way, for symmetric A, vech(A) contains only the generically distinct elements of A. Since the elements of vecA are those of vech(A) with some repetitions, there exists a unique n^2 × n(n+1)/2 matrix which transforms, for symmetric A, vech(A) into vecA. This matrix is called the duplication matrix and is denoted by Dn. Thus,

Dn vech(A) = vecA (A = A′). (23)

The matrix Dn has full column rank n(n+1)/2, so that D′nDn is nonsingular. This implies that vech(A) can be uniquely solved from (23), and we have

vech(A) = (D′nDn)−1D′n vecA    (A = A′).
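The duplication matrix, like the commutation matrix, can be constructed directly from its defining property. The Python/NumPy sketch below (ours; the helpers vech and duplication are hypothetical names) builds Dn for n = 3 and checks (23), the inverse relation above, and the determinant formula (25) given later in this section.

```python
import numpy as np

# Sketch: build D_n from its defining property (23), D_n vech(A) = vec A
# for symmetric A, and verify two of its standard properties.
vec = lambda M: M.reshape(-1, order='F')

def vech(M):
    n = M.shape[0]
    return np.concatenate([M[i:, i] for i in range(n)])

def duplication(n):
    D = np.zeros((n*n, n*(n+1)//2))
    col = 0
    for j in range(n):
        for i in range(j, n):
            E = np.zeros((n, n)); E[i, j] = E[j, i] = 1.0
            D[:, col] = vec(E); col += 1
    return D

rng = np.random.default_rng(8)
n = 3
S = rng.standard_normal((n, n)); A = S @ S.T + n*np.eye(n)   # symmetric
D = duplication(n)

print(np.allclose(D @ vech(A), vec(A)))                               # (23)
print(np.allclose(np.linalg.solve(D.T @ D, D.T @ vec(A)), vech(A)))   # inverse relation
lhs = np.linalg.det(D.T @ np.kron(A, A) @ D)
rhs = 2**(n*(n-1)//2) * np.linalg.det(A)**(n+1)
print(np.isclose(lhs, rhs))                                           # (25)
```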


One can show (but we will not do so here) that the duplication matrix is connected to the commutation matrix by

KnDn = Dn,    Dn(D′nDn)−1D′n = (1/2)(I_{n^2} + Kn).

Much of the interest in the duplication matrix is due to the importance of the matrix D′n(A ⊗ A)Dn, where A is an n × n matrix. This matrix is important, because the scalar function φ(X) = trAX′AX occurs frequently in statistics and econometrics, for example in the next section on maximum likelihood. When A and X are known to be symmetric we have

d2φ = 2 trA(dX)′A dX = 2(d vecX)′(A ⊗ A) d vecX
    = 2(d vech(X))′D′n(A ⊗ A)Dn d vech(X),

and hence, Hφ = 2D′n(A⊗A)Dn.

From the relationship (again not proved here)

Dn(D′nDn)−1D′n(A ⊗ A)Dn = (A ⊗ A)Dn,

which is valid for any n × n matrix A, not necessarily symmetric, we obtain the inverse

(D′n(A ⊗ A)Dn)−1 = (D′nDn)−1D′n(A−1 ⊗ A−1)Dn(D′nDn)−1,    (24)

where A is nonsingular. Finally, we present the determinant:

|D′n(A ⊗ A)Dn| = 2^{n(n−1)/2}|A|^{n+1}.    (25)

16 MAXIMUM LIKELIHOOD

This final section brings together most of the material that has been treated in this chapter: first and second differentials, the Hessian matrix, and the treatment of symmetry (duplication matrix).

We consider a sample of m × 1 vectors y1, y2, . . . , yn from the multivariate normal distribution with mean µ and variance Ω, where Ω is positive definite and n ≥ m + 1. The density of yi is

f(yi) = (2π)^{−m/2}|Ω|^{−1/2} exp(−(1/2)(yi − µ)′Ω−1(yi − µ)),

and since the yi are independent and identically distributed, the joint density of (y1, . . . , yn) is given by ∏i f(yi). The ‘likelihood’ is equal to the joint density, but now thought of as a function of the parameters µ and Ω, rather than of the observations. Its logarithm is the ‘loglikelihood’, which here takes the form

Λ(µ,Ω) = −(mn/2) log 2π − (n/2) log |Ω| − (1/2) Σ_{i=1}^{n} (yi − µ)′Ω−1(yi − µ).


The maximum likelihood estimators are obtained by maximizing the loglikelihood (which is the same, but usually easier, as maximizing the likelihood). Thus, we differentiate Λ and obtain

dΛ = −(n/2) d log |Ω| + (1/2) Σ_{i=1}^{n} (dµ)′Ω−1(yi − µ)
      − (1/2) Σ_{i=1}^{n} (yi − µ)′(dΩ−1)(yi − µ) + (1/2) Σ_{i=1}^{n} (yi − µ)′Ω−1dµ
   = −(n/2) d log |Ω| − (1/2) Σ_{i=1}^{n} (yi − µ)′(dΩ−1)(yi − µ) + Σ_{i=1}^{n} (yi − µ)′Ω−1dµ
   = −(n/2) tr(Ω−1dΩ + S dΩ−1) + Σ_{i=1}^{n} (yi − µ)′Ω−1dµ,    (26)

where

S = S(µ) = (1/n) Σ_{i=1}^{n} (yi − µ)(yi − µ)′.

Denoting the maximum likelihood estimators by µ̂ and Ω̂, letting Ŝ = S(µ̂), and setting dΛ = 0 then implies that

tr(Ω̂−1 − Ω̂−1ŜΩ̂−1) dΩ = 0

for all dΩ and

Σ_{i=1}^{n} (yi − µ̂)′Ω̂−1dµ = 0

for all dµ. This, in turn, implies that

Ω̂−1 = Ω̂−1ŜΩ̂−1,    Σ_{i=1}^{n} (yi − µ̂) = 0.

Hence, the maximum likelihood estimators are given by

µ̂ = (1/n) Σ_{i=1}^{n} yi = ȳ    (27)

and

Ω̂ = (1/n) Σ_{i=1}^{n} (yi − ȳ)(yi − ȳ)′.    (28)

We note that the condition that Ω is symmetric has not been imposed. But since the solution (28) is symmetric, imposing the condition would have made no difference.
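The estimators (27) and (28) can be checked by simulation. The Python/NumPy sketch below (ours; the sample is artificial and the helper loglik is hypothetical) computes µ̂ and Ω̂ and verifies that the score with respect to µ, obtained by numerically differentiating the loglikelihood, vanishes at the maximum likelihood estimates.

```python
import numpy as np

# Sketch: simulate a normal sample, compute the MLEs (27)-(28), and check
# that the numerical gradient of the loglikelihood with respect to mu is
# zero at (mu_hat, Omega_hat).
rng = np.random.default_rng(9)
m, n = 3, 200
mu_true = np.array([1.0, -2.0, 0.5])
S_ = rng.standard_normal((m, m)); Omega_true = S_ @ S_.T + m*np.eye(m)
y = rng.multivariate_normal(mu_true, Omega_true, size=n)

mu_hat = y.mean(axis=0)                              # (27)
Omega_hat = (y - mu_hat).T @ (y - mu_hat) / n        # (28)

def loglik(mu, Omega):
    _, logdet = np.linalg.slogdet(Omega)
    r = y - mu
    quad = np.einsum('ij,jk,ik->', r, np.linalg.inv(Omega), r)
    return -0.5*m*n*np.log(2*np.pi) - 0.5*n*logdet - 0.5*quad

eps = 1e-5
score_mu = np.array([(loglik(mu_hat + eps*e, Omega_hat)
                      - loglik(mu_hat - eps*e, Omega_hat)) / (2*eps)
                     for e in np.eye(m)])
print(np.allclose(score_mu, 0.0, atol=1e-4))   # True: dLambda = 0 at the MLE
```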


The second differential is obtained by differentiating (26) again. This gives

d2Λ = −(n/2) tr((dΩ−1)dΩ + (dS)dΩ−1 + S d2Ω−1) − n(dµ)′Ω−1dµ
      + Σ_{i=1}^{n} (yi − µ)′(dΩ−1)dµ.    (29)

We are usually not primarily interested in the Hessian matrix but in its expectation. Hence, we do not evaluate (29) further and first take expectations. Since E(S) = Ω and E(dS) = 0, we obtain

E d2Λ = −(n/2) tr((dΩ−1)dΩ + Ω d2Ω−1) − n(dµ)′Ω−1dµ
      = (n/2) trΩ−1(dΩ)Ω−1dΩ − n tr(dΩ)Ω−1(dΩ)Ω−1 − n(dµ)′Ω−1dµ
      = −(n/2) trΩ−1(dΩ)Ω−1dΩ − n(dµ)′Ω−1dµ,    (30)

using the facts that dΩ−1 = −Ω−1(dΩ)Ω−1 and

d2Ω−1 = −(dΩ−1)(dΩ)Ω−1 − Ω−1(dΩ)dΩ−1 = 2Ω−1(dΩ)Ω−1(dΩ)Ω−1.

To obtain the ‘information matrix’ we need to take the symmetry of Ω into account and this is where the duplication matrix appears. So far, we have avoided the vec operator and in practical situations one should work with differentials (rather than with derivatives) as long as possible. But we cannot go further than (30) without use of the vec operator. Thus, from (30),

−E d2Λ = (n/2) trΩ−1(dΩ)Ω−1dΩ + n(dµ)′Ω−1dµ
       = (n/2) (d vecΩ)′(Ω−1 ⊗ Ω−1) d vecΩ + n(dµ)′Ω−1dµ
       = (n/2) (d vech(Ω))′D′m(Ω−1 ⊗ Ω−1)Dm d vech(Ω) + n(dµ)′Ω−1dµ.

Hence, the information matrix for µ and vech(Ω) is

F = n ( Ω−1    0
        0      (1/2)D′m(Ω−1 ⊗ Ω−1)Dm ).

The results on the duplication matrix in Section 18.15 also allow us to obtain the inverse:

(F/n)−1 = ( Ω    0
            0    2(D′mDm)−1D′m(Ω ⊗ Ω)Dm(D′mDm)−1 )

and the determinant:

|F/n| = |Ω| · |2(D′mDm)−1D′m(Ω ⊗ Ω)Dm(D′mDm)−1| = 2^m|Ω|^{m+2}.


FURTHER READING

§2–3. Chapter 5 discusses differentials in more detail, and contains the first identification theorem (Theorem 5.6) and the chain rule for first differentials (Theorem 5.9), officially called ‘Cauchy’s rule of invariance’.

§4. Optimization is discussed in Chapter 7.

§5. See Chapter 11, Sections 11.29–11.32 and Chapter 13, Sections 13.4 and 13.19.

§6. The trace is discussed in Section 1.10, the extension from vector calculus to matrix calculus in Section 5.15, and the differentials of the determinant and inverse in Sections 8.3 and 8.4.

§7. See Section 1.6 for more detailed results.

§8–9. Second differentials are introduced in Chapter 6. The second identification theorem is proved in Section 6.8 and the chain rule for second differentials in Section 6.11.

§11. See Chapter 2 for many more details on the vec operator and the Kronecker product. Theorem 2.2 is restated here as Theorem 18.5.

§12. See Sections 5.15 and 10.2.

§13 and §15. The commutation matrix and the duplication matrix are discussed in Chapter 3.

§16. Many aspects of maximum likelihood estimation are treated in Chapter 15.
