Numerical range

From Wikipedia, the free encyclopedia

Contents

1 ND4J
   1.1 Distributed
   1.2 See also
   1.3 References
   1.4 External links

2 Newton's identities
   2.1 Mathematical statement
      2.1.1 Formulation in terms of symmetric polynomials
      2.1.2 Application to the roots of a polynomial
      2.1.3 Application to the characteristic polynomial of a matrix
      2.1.4 Relation with Galois theory
   2.2 Related identities
      2.2.1 A variant using complete homogeneous symmetric polynomials
      2.2.2 Expressing elementary symmetric polynomials in terms of power sums
      2.2.3 Expressing complete homogeneous symmetric polynomials in terms of power sums
      2.2.4 Expressing power sums in terms of elementary symmetric polynomials
      2.2.5 Expressing power sums in terms of complete homogeneous symmetric polynomials
      2.2.6 Expressions as determinants
   2.3 Derivation of the identities
      2.3.1 From the special case n = k
      2.3.2 Comparing coefficients in series
      2.3.3 As a telescopic sum of symmetric function identities
   2.4 See also
   2.5 References
   2.6 External links

3 Non-negative matrix factorization
   3.1 History
   3.2 Background
   3.3 Types
      3.3.1 Approximate non-negative matrix factorization
      3.3.2 Convex non-negative matrix factorization
      3.3.3 Nonnegative rank factorization
      3.3.4 Different cost functions and regularizations
      3.3.5 Online NMF
   3.4 Algorithms
      3.4.1 Exact NMF
   3.5 Relation to other techniques
   3.6 Uniqueness
   3.7 Clustering property
   3.8 Applications
      3.8.1 Text mining
      3.8.2 Spectral data analysis
      3.8.3 Scalable Internet distance prediction
      3.8.4 Non-stationary speech denoising
      3.8.5 Bioinformatics
      3.8.6 Nuclear Imaging
   3.9 Current research
   3.10 See also
   3.11 Sources and external links
      3.11.1 Notes
      3.11.2 Others

4 Nonlinear eigenproblem
   4.1 References

5 Nonnegative rank (linear algebra)
   5.1 Formal Definition
      5.1.1 A Fallacy
   5.2 Connection with the linear rank
   5.3 Computing the nonnegative rank
   5.4 Ancillary Facts
   5.5 References

6 Norm (mathematics)
   6.1 Definition
   6.2 Notation
   6.3 Examples
      6.3.1 Absolute-value norm
      6.3.2 Euclidean norm
      6.3.3 Taxicab norm or Manhattan norm
      6.3.4 p-norm
      6.3.5 Maximum norm (special case of: infinity norm, uniform norm, or supremum norm)
      6.3.6 Zero norm
      6.3.7 Other norms
      6.3.8 Infinite-dimensional case
   6.4 Properties
   6.5 Classification of seminorms: absolutely convex absorbing sets
   6.6 Generalizations
   6.7 See also
   6.8 Notes
   6.9 References

7 Normal basis
   7.1 Usage
   7.2 Primitive normal basis
   7.3 Free elements
   7.4 See also
   7.5 References

8 Null vector
   8.1 Examples
   8.2 References

9 Numerical range
   9.1 Properties
   9.2 Generalisations
   9.3 See also
   9.4 References
   9.5 Text and image sources, contributors, and licenses
      9.5.1 Text
      9.5.2 Images
      9.5.3 Content license

Chapter 1

    ND4J

ND4J is a free, open-source extension of the Java programming language operating on the Java Virtual Machine, though it is compatible with both Scala and Clojure.[1] It is effectively a scientific computing library for linear algebra and matrix manipulation in a production environment, integrating with Hadoop and Spark to work with distributed GPUs. ND4J has primarily been developed by the group in San Francisco that built Deeplearning4j, led by Adam Gibson.[2] It was created under an Apache Software Foundation license.

1.1 Distributed

ND4J's operations include distributed parallel versions. They can take place in a cluster and process massive amounts of data. Matrix manipulation occurs in parallel on GPUs or CPUs in the cloud, and can work within Spark or Hadoop clusters.

1.2 See also

• NumPy
• SciPy
• Torch
• Theano

1.3 References

[1] Official Website.

[2] Github Repository.

1.4 External links

• Official website


Chapter 2

Newton's identities

In mathematics, Newton's identities, also known as the Newton–Girard formulae, give relations between two types of symmetric polynomials, namely between power sums and elementary symmetric polynomials. Evaluated at the roots of a monic polynomial P in one variable, they allow expressing the sums of the k-th powers of all roots of P (counted with their multiplicity) in terms of the coefficients of P, without actually finding those roots. These identities were found by Isaac Newton around 1666, apparently in ignorance of earlier work (1629) by Albert Girard. They have applications in many areas of mathematics, including Galois theory, invariant theory, group theory, combinatorics, as well as further applications outside mathematics, including general relativity.

    2.1 Mathematical statement

2.1.1 Formulation in terms of symmetric polynomials

Let x_1, ..., x_n be variables, denote for k ≥ 1 by p_k(x_1, ..., x_n) the k-th power sum:

p_k(x_1,\ldots,x_n) = \sum_{i=1}^n x_i^k = x_1^k + \cdots + x_n^k,

and for k ≥ 0 denote by e_k(x_1, ..., x_n) the elementary symmetric polynomial (that is, the sum of all distinct products of k distinct variables), so

e_0(x_1,\ldots,x_n) = 1,
e_1(x_1,\ldots,x_n) = x_1 + x_2 + \cdots + x_n,
e_2(x_1,\ldots,x_n) = \sum_{1 \le i < j \le n} x_i x_j,
\ldots
e_k(x_1,\ldots,x_n) = 0, \quad \text{for } k > n.

Then Newton's identities can be stated as

k\, e_k(x_1,\ldots,x_n) = \sum_{i=1}^{k} (-1)^{i-1} e_{k-i}(x_1,\ldots,x_n)\, p_i(x_1,\ldots,x_n),

valid for all n ≥ 1 and k ≥ 1. Also, one has

0 = \sum_{i=k-n}^{k} (-1)^{i-1} e_{k-i}(x_1,\ldots,x_n)\, p_i(x_1,\ldots,x_n),


for all k > n ≥ 1.

Concretely, one gets for the first few values of k:

e_1(x_1,\ldots,x_n) = p_1(x_1,\ldots,x_n),
2 e_2(x_1,\ldots,x_n) = e_1(x_1,\ldots,x_n)\, p_1(x_1,\ldots,x_n) - p_2(x_1,\ldots,x_n),
3 e_3(x_1,\ldots,x_n) = e_2(x_1,\ldots,x_n)\, p_1(x_1,\ldots,x_n) - e_1(x_1,\ldots,x_n)\, p_2(x_1,\ldots,x_n) + p_3(x_1,\ldots,x_n).

The form and validity of these equations do not depend on the number n of variables (although the point where the left-hand side becomes 0 does, namely after the n-th identity), which makes it possible to state them as identities in the ring of symmetric functions. In that ring one has

e_1 = p_1,
2 e_2 = e_1 p_1 - p_2,
3 e_3 = e_2 p_1 - e_1 p_2 + p_3,
4 e_4 = e_3 p_1 - e_2 p_2 + e_1 p_3 - p_4,

and so on; here the left-hand sides never become zero. These equations allow one to recursively express the e_i in terms of the p_k; to be able to do the inverse, one may rewrite them as

p_1 = e_1,
p_2 = e_1 p_1 - 2 e_2,
p_3 = e_1 p_2 - e_2 p_1 + 3 e_3,
p_4 = e_1 p_3 - e_2 p_2 + e_3 p_1 - 4 e_4,
...

In general, we have

p_k(x_1,\ldots,x_n) = (-1)^{k-1} k\, e_k(x_1,\ldots,x_n) + \sum_{i=1}^{k-1} (-1)^{k-1+i} e_{k-i}(x_1,\ldots,x_n)\, p_i(x_1,\ldots,x_n),

valid for all n ≥ 1 and k ≥ 1. Also, one has

p_k(x_1,\ldots,x_n) = \sum_{i=k-n}^{k-1} (-1)^{k-1+i} e_{k-i}(x_1,\ldots,x_n)\, p_i(x_1,\ldots,x_n),

for all k > n ≥ 1.
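As a quick sanity check of the recursions above, the following small script (my own illustration, not part of the article) verifies k·e_k = Σ_{i=1}^{k} (−1)^{i−1} e_{k−i} p_i symbolically for three variables with sympy; the helpers p and e are written out directly from the definitions given earlier.

    # Symbolic spot-check of Newton's identities for n = 3 variables.
    # This is an added illustration, not code from the article.
    from itertools import combinations
    import sympy as sp

    x = sp.symbols('x1 x2 x3')
    n = len(x)

    def p(k):
        # k-th power sum p_k(x1, ..., xn)
        return sum(xi**k for xi in x)

    def e(k):
        # k-th elementary symmetric polynomial e_k(x1, ..., xn)
        if k == 0:
            return sp.Integer(1)
        return sp.Add(*[sp.Mul(*c) for c in combinations(x, k)])

    # k*e_k = sum_{i=1}^{k} (-1)^(i-1) * e_{k-i} * p_i, for k = 1, ..., n
    for k in range(1, n + 1):
        lhs = k * e(k)
        rhs = sum((-1)**(i - 1) * e(k - i) * p(i) for i in range(1, k + 1))
        assert sp.expand(lhs - rhs) == 0
    print("Newton's identities verified for n =", n)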

2.1.2 Application to the roots of a polynomial

The polynomial with roots x_i may be expanded as

\prod_{i=1}^n (x - x_i) = \sum_{k=0}^n (-1)^{n+k} e_{n-k}\, x^k,

where the coefficients e_k(x_1,\ldots,x_n) are the symmetric polynomials defined above. Given the power sums of the roots


p_k(x_1,\ldots,x_n) = \sum_{i=1}^n x_i^k,

the coefficients of the polynomial with roots x_1,\ldots,x_n may be expressed recursively in terms of the power sums as

e_0 = 1,
e_1 = p_1,
e_2 = \tfrac{1}{2}(e_1 p_1 - p_2),
e_3 = \tfrac{1}{3}(e_2 p_1 - e_1 p_2 + p_3),
e_4 = \tfrac{1}{4}(e_3 p_1 - e_2 p_2 + e_1 p_3 - p_4),
...

Formulating polynomials this way is useful in using the method of Delves and Lyness[1] to find the zeros of an analytic function.
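To make the recursion concrete, here is a small helper (a sketch of my own; the function name and layout are not from the article) that recovers e_1, ..., e_n from the power sums of the roots and checks the result against the cubic with roots 1, 2, 3.

    # Recover the elementary symmetric polynomials of the roots from the
    # power sums via e_k = (1/k) * sum_{i=1..k} (-1)^(i-1) * e_{k-i} * p_i.
    def elementary_from_power_sums(p):
        """p[k] = p_k for k = 1..n (p[0] is ignored). Returns [e_0, ..., e_n]."""
        n = len(p) - 1
        e = [1.0] + [0.0] * n
        for k in range(1, n + 1):
            e[k] = sum((-1) ** (i - 1) * e[k - i] * p[i]
                       for i in range(1, k + 1)) / k
        return e

    # Roots 1, 2, 3 have p_1 = 6, p_2 = 14, p_3 = 36.
    print(elementary_from_power_sums([0.0, 6.0, 14.0, 36.0]))
    # -> [1.0, 6.0, 11.0, 6.0], matching (x-1)(x-2)(x-3) = x^3 - 6x^2 + 11x - 6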

2.1.3 Application to the characteristic polynomial of a matrix

When the polynomial above is the characteristic polynomial of a matrix A (in particular when A is the companion matrix of the polynomial), the roots x_i are the eigenvalues of the matrix, counted with their algebraic multiplicity. For any positive integer k, the matrix A^k has as eigenvalues the powers x_i^k, and each eigenvalue x_i of A contributes its multiplicity to that of the eigenvalue x_i^k of A^k. Then the coefficients of the characteristic polynomial of A^k are given by the elementary symmetric polynomials in those powers x_i^k. In particular, the sum of the x_i^k, which is the k-th power sum s_k of the roots of the characteristic polynomial of A, is given by its trace:

s_k = \operatorname{tr}(A^k).

The Newton identities now relate the traces of the powers A^k to the coefficients of the characteristic polynomial of A. Using them in reverse to express the elementary symmetric polynomials in terms of the power sums, they can be used to find the characteristic polynomial by computing only the powers A^k and their traces.

This computation requires computing the traces of matrix powers A^k and solving a triangular system of equations. Both can be done in complexity class NC (solving a triangular system can be done by divide-and-conquer). Therefore, the characteristic polynomial of a matrix can be computed in NC. By the Cayley–Hamilton theorem, every matrix satisfies its characteristic polynomial, and a simple transformation allows one to find the matrix inverse in NC.

Rearranging the computations into an efficient form leads to the Faddeev–LeVerrier algorithm (1840); a fast parallel implementation of it is due to L. Csanky (1976). Its disadvantage is that it requires division by integers, so in general the field should have characteristic 0.
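The following sketch (my own, using numpy; the article does not prescribe an implementation) carries this out for a small matrix: it computes the traces s_k = tr(A^k), applies the Newton identities to obtain the e_k, and assembles det(xI − A) = x^n − e_1 x^{n−1} + e_2 x^{n−2} − ⋯ + (−1)^n e_n.

    # Characteristic polynomial from traces of matrix powers via Newton's
    # identities (illustrative sketch, not the article's code).
    import numpy as np

    def charpoly_from_traces(A):
        n = A.shape[0]
        Ak = np.eye(n)
        s = [0.0]                       # s[k] = tr(A^k); s[0] is unused
        for _ in range(n):
            Ak = Ak @ A
            s.append(np.trace(Ak))
        e = [1.0] + [0.0] * n
        for k in range(1, n + 1):
            e[k] = sum((-1) ** (i - 1) * e[k - i] * s[i]
                       for i in range(1, k + 1)) / k
        # coefficients of det(xI - A), highest degree first
        return [(-1) ** k * e[k] for k in range(n + 1)]

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    print(charpoly_from_traces(A))      # [1.0, -5.0, 5.0], i.e. x^2 - 5x + 5
    print(np.poly(A))                   # numpy's answer, for comparison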

2.1.4 Relation with Galois theory

For a given n, the elementary symmetric polynomials e_k(x_1,\ldots,x_n) for k = 1,\ldots,n form an algebraic basis for the space of symmetric polynomials in x_1,\ldots,x_n: every polynomial expression in the x_i that is invariant under all permutations of those variables is given by a polynomial expression in those elementary symmetric polynomials, and this expression is unique up to equivalence of polynomial expressions. This is a general fact known as the fundamental theorem of symmetric polynomials, and Newton's identities provide explicit formulae in the case of power sum symmetric polynomials. Applied to the monic polynomial t^n + \sum_{k=1}^n (-1)^k a_k t^{n-k} with all coefficients a_k considered as free parameters, this means that every symmetric polynomial expression S(x_1,\ldots,x_n) in its roots can be expressed instead as a polynomial expression P(a_1,\ldots,a_n) in terms of its coefficients only, in other words without requiring knowledge of the roots. This fact also follows from general considerations in Galois theory (one views the a_k as elements of a base field with roots in an extension field whose Galois group permutes them according to the full symmetric group, and the field fixed under all elements of the Galois group is the base field).

The Newton identities also permit expressing the elementary symmetric polynomials in terms of the power sum symmetric polynomials, showing that any symmetric polynomial can also be expressed in the power sums. In fact the first n power sums also form an algebraic basis for the space of symmetric polynomials.

2.2 Related identities

There are a number of (families of) identities that, while they should be distinguished from Newton's identities, are very closely related to them.

2.2.1 A variant using complete homogeneous symmetric polynomials

Denoting by h_k the complete homogeneous symmetric polynomial that is the sum of all monomials of degree k, the power sum polynomials also satisfy identities similar to Newton's identities, but not involving any minus signs. Expressed as identities in the ring of symmetric functions, they read

k\, h_k = \sum_{i=1}^k h_{k-i}\, p_i,

valid for all n ≥ k ≥ 1. Contrary to Newton's identities, the left-hand sides do not become zero for large k, and the right-hand sides contain ever more non-zero terms. For the first few values of k, one has

h_1 = p_1,
2 h_2 = h_1 p_1 + p_2,
3 h_3 = h_2 p_1 + h_1 p_2 + p_3.

These relations can be justified by an argument analogous to the one by comparing coefficients in power series given above, based in this case on the generating function identity

\sum_{k=0}^{\infty} h_k(X_1,\ldots,X_n)\, t^k = \prod_{i=1}^n \frac{1}{1 - X_i t}.

Proofs of Newton's identities, like these given below, cannot be easily adapted to prove these variants of those identities.

2.2.2 Expressing elementary symmetric polynomials in terms of power sums

As mentioned, Newton's identities can be used to recursively express elementary symmetric polynomials in terms of power sums. Doing so requires the introduction of integer denominators, so it can be done in the ring Λ_Q of symmetric functions with rational coefficients:

e_1 = p_1,
e_2 = \tfrac{1}{2}(p_1^2 - p_2),
e_3 = \tfrac{1}{6}(p_1^3 - 3 p_1 p_2 + 2 p_3),

and so forth. Applied to a monic polynomial, these formulae express the coefficients in terms of the power sums of the roots: replace each e_i by a_i and each p_k by s_k.

2.2.3 Expressing complete homogeneous symmetric polynomials in terms of power sums

The analogous relations involving complete homogeneous symmetric polynomials can be similarly developed, giving equations

h_1 = p_1,
h_2 = \tfrac{1}{2}(p_1^2 + p_2),
h_3 = \tfrac{1}{6}(p_1^3 + 3 p_1 p_2 + 2 p_3),

and so forth, in which there are only plus signs. These expressions correspond exactly to the cycle index polynomials of the symmetric groups, if one interprets the power sums p_i as indeterminates: the coefficient in the expression for h_k of any monomial p_1^{m_1} p_2^{m_2} \cdots p_l^{m_l} is equal to the fraction of all permutations of k that have m_1 fixed points, m_2 cycles of length 2, ..., and m_l cycles of length l. Explicitly, this coefficient can be written as 1/N where N = \prod_{i=1}^l (m_i!\, i^{m_i}); this N is the number of permutations commuting with any given permutation of the given cycle type. The expressions for the elementary symmetric functions have coefficients with the same absolute value, but a sign equal to the sign of the permutations in question, namely (-1)^{m_2 + m_4 + \cdots}.

It can be proved by considering the following:

m\, f(m; m_1, \ldots, m_n) = f(m-1; m_1 - 1, \ldots, m_n) + \cdots + f(m-n; m_1, \ldots, m_n - 1)

m_1 \prod_{i=1}^n \frac{1}{i^{m_i} m_i!} + \cdots + n\, m_n \prod_{i=1}^n \frac{1}{i^{m_i} m_i!} = m \prod_{i=1}^n \frac{1}{i^{m_i} m_i!}

2.2.4 Expressing power sums in terms of elementary symmetric polynomials

One may also use Newton's identities to express power sums in terms of elementary symmetric polynomials, which does not introduce denominators:

p_1 = e_1,
p_2 = e_1^2 - 2 e_2,
p_3 = e_1^3 - 3 e_2 e_1 + 3 e_3,
p_4 = e_1^4 - 4 e_2 e_1^2 + 4 e_3 e_1 + 2 e_2^2 - 4 e_4,
p_5 = e_1^5 - 5 e_2 e_1^3 + 5 e_3 e_1^2 + 5 e_2^2 e_1 - 5 e_4 e_1 - 5 e_3 e_2 + 5 e_5,
p_6 = e_1^6 - 6 e_2 e_1^4 + 6 e_3 e_1^3 + 9 e_2^2 e_1^2 - 6 e_4 e_1^2 - 12 e_3 e_2 e_1 + 6 e_5 e_1 - 2 e_2^3 + 3 e_3^2 + 6 e_4 e_2 - 6 e_6.

The first four formulas were obtained by Albert Girard in 1629 (thus before Newton).[2]

The general formula (for all positive integers m and n) is:

p_m = \sum_{\substack{r_1 + 2 r_2 + \cdots + n r_n = m \\ r_1 \ge 0, \ldots, r_n \ge 0}} (-1)^m \frac{m\,(r_1 + r_2 + \cdots + r_n - 1)!}{r_1!\, r_2! \cdots r_n!} \prod_{i=1}^n (-e_i)^{r_i},

which can be proved by considering the following:

f(m; r_1, \ldots, r_n) = f(m-1; r_1 - 1, \ldots, r_n) + \cdots + f(m-n; r_1, \ldots, r_n - 1)
= \frac{1}{(r_1 - 1)! \cdots r_n!} (m-1)(r_1 + \cdots + r_n - 2)! + \cdots + \frac{1}{r_1! \cdots (r_n - 1)!} (m-n)(r_1 + \cdots + r_n - 2)!
= \frac{1}{r_1! \cdots r_n!} \bigl[ r_1 (m-1) + \cdots + r_n (m-n) \bigr] (r_1 + \cdots + r_n - 2)!
= \frac{1}{r_1! \cdots r_n!} \bigl[ m (r_1 + \cdots + r_n) - m \bigr] (r_1 + \cdots + r_n - 2)!
= \frac{m\,(r_1 + \cdots + r_n - 1)!}{r_1! \cdots r_n!}

2.2.5 Expressing power sums in terms of complete homogeneous symmetric polynomials

Finally one may use the variant identities involving complete homogeneous symmetric polynomials similarly to express power sums in terms of them:

p_1 = + h_1,
p_2 = - h_1^2 + 2 h_2,
p_3 = + h_1^3 - 3 h_2 h_1 + 3 h_3,
p_4 = - h_1^4 + 4 h_2 h_1^2 - 4 h_3 h_1 - 2 h_2^2 + 4 h_4,
p_5 = + h_1^5 - 5 h_2 h_1^3 + 5 h_2^2 h_1 + 5 h_3 h_1^2 - 5 h_3 h_2 - 5 h_4 h_1 + 5 h_5,
p_6 = - h_1^6 + 6 h_2 h_1^4 - 9 h_2^2 h_1^2 - 6 h_3 h_1^3 + 2 h_2^3 + 12 h_3 h_2 h_1 + 6 h_4 h_1^2 - 3 h_3^2 - 6 h_4 h_2 - 6 h_5 h_1 + 6 h_6,

and so on. Apart from the replacement of each e_i by the corresponding h_i, the only change with respect to the previous family of identities is in the signs of the terms, which in this case depend just on the number of factors present: the sign of the monomial \prod_{i=1}^l h_i^{m_i} is -(-1)^{m_1 + m_2 + m_3 + \cdots}. In particular the above description of the absolute value of the coefficients applies here as well.

The general formula (for all positive integers m and n) is:

p_m = - \sum_{\substack{m_1 + 2 m_2 + \cdots + n m_n = m \\ m_1 \ge 0, \ldots, m_n \ge 0}} \frac{m\,(m_1 + m_2 + \cdots + m_n - 1)!}{m_1!\, m_2! \cdots m_n!} \prod_{i=1}^n (-h_i)^{m_i}

2.2.6 Expressions as determinants

One can obtain explicit formulas for the above expressions in the form of determinants, by considering the first n of Newton's identities (or its counterparts for the complete homogeneous polynomials) as linear equations in which the elementary symmetric functions are known and the power sums are unknowns (or vice versa), and applying Cramer's rule to find the solution for the final unknown. For instance taking Newton's identities in the form

e_1 = 1 \cdot p_1,
2 e_2 = e_1 p_1 - 1 \cdot p_2,
3 e_3 = e_2 p_1 - e_1 p_2 + 1 \cdot p_3,
\ldots
n e_n = e_{n-1} p_1 - e_{n-2} p_2 + \cdots + (-1)^n e_1 p_{n-1} + (-1)^{n-1} p_n

we consider p_1, p_2, p_3, \ldots, p_{n-1} and p_n as unknowns, and solve for the final one, giving

p_n =
\begin{vmatrix} 1 & 0 & \cdots & & e_1 \\ e_1 & 1 & 0 & \cdots & 2e_2 \\ e_2 & e_1 & 1 & & 3e_3 \\ \vdots & & \ddots & \ddots & \vdots \\ e_{n-1} & \cdots & e_2 & e_1 & n e_n \end{vmatrix}
\begin{vmatrix} 1 & 0 & \cdots & & \\ e_1 & 1 & 0 & \cdots & \\ e_2 & e_1 & 1 & & \\ \vdots & & \ddots & \ddots & \\ e_{n-1} & \cdots & e_2 & e_1 & (-1)^{n-1} \end{vmatrix}^{-1}

= \frac{1}{(-1)^{n-1}}
\begin{vmatrix} 1 & 0 & \cdots & & e_1 \\ e_1 & 1 & 0 & \cdots & 2e_2 \\ e_2 & e_1 & 1 & & 3e_3 \\ \vdots & & \ddots & \ddots & \vdots \\ e_{n-1} & \cdots & e_2 & e_1 & n e_n \end{vmatrix}

= \begin{vmatrix} e_1 & 1 & 0 & \cdots & \\ 2e_2 & e_1 & 1 & 0 & \cdots \\ 3e_3 & e_2 & e_1 & 1 & \\ \vdots & & \ddots & \ddots & \\ n e_n & e_{n-1} & \cdots & & e_1 \end{vmatrix}.


Solving for e_n instead of for p_n is similar, as are the analogous computations for the complete homogeneous symmetric polynomials; in each case the details are slightly messier than the final results, which are (Macdonald 1979, p. 20):

e_n = \frac{1}{n!}
\begin{vmatrix} p_1 & 1 & 0 & \cdots & \\ p_2 & p_1 & 2 & 0 & \cdots \\ \vdots & & \ddots & \ddots & \\ p_{n-1} & p_{n-2} & \cdots & p_1 & n-1 \\ p_n & p_{n-1} & \cdots & p_2 & p_1 \end{vmatrix}

p_n = (-1)^{n-1}
\begin{vmatrix} h_1 & 1 & 0 & \cdots & \\ 2h_2 & h_1 & 1 & 0 & \cdots \\ 3h_3 & h_2 & h_1 & 1 & \\ \vdots & & \ddots & \ddots & \\ n h_n & h_{n-1} & \cdots & & h_1 \end{vmatrix}

h_n = \frac{1}{n!}
\begin{vmatrix} p_1 & -1 & 0 & \cdots & \\ p_2 & p_1 & -2 & 0 & \cdots \\ \vdots & & \ddots & \ddots & \\ p_{n-1} & p_{n-2} & \cdots & p_1 & 1-n \\ p_n & p_{n-1} & \cdots & p_2 & p_1 \end{vmatrix}.

Note that the use of determinants makes the formula for h_n have additional minus signs compared to the one for e_n, while the situation for the expanded form given earlier is the opposite. As remarked in (Littlewood 1950, p. 84) one can alternatively obtain the formula for h_n by taking the permanent of the matrix for e_n instead of the determinant, and more generally an expression for any Schur polynomial can be obtained by taking the corresponding immanant of this matrix.
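As a numerical spot-check of the first determinant formula (my own illustration), take the roots 1, 2, 3, for which e_1 = 6, e_2 = 11, e_3 = 6 and p_3 = 1 + 8 + 27 = 36.

    # det of [[e1, 1, 0], [2e2, e1, 1], [3e3, e2, e1]] should equal p_3 = 36.
    import numpy as np

    M = np.array([[ 6.0,  1.0, 0.0],
                  [22.0,  6.0, 1.0],
                  [18.0, 11.0, 6.0]])
    print(np.linalg.det(M))             # 36.0 (up to floating-point rounding)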

2.3 Derivation of the identities

Each of Newton's identities can easily be checked by elementary algebra; however, their validity in general needs a proof. Here are some possible derivations.

2.3.1 From the special case n = k

One can obtain the k-th Newton identity in k variables by substitution into

\prod_{i=1}^k (t - x_i) = \sum_{i=0}^k (-1)^{k-i} e_{k-i}(x_1,\ldots,x_k)\, t^i

as follows. Substituting x_j for t gives

0 = \sum_{i=0}^k (-1)^{k-i} e_{k-i}(x_1,\ldots,x_k)\, x_j^i \quad \text{for } 1 \le j \le k.

Summing over all j gives

0 = (-1)^k k\, e_k(x_1,\ldots,x_k) + \sum_{i=1}^k (-1)^{k-i} e_{k-i}(x_1,\ldots,x_k)\, p_i(x_1,\ldots,x_k),

where the terms for i = 0 were taken out of the sum because p_0 is (usually) not defined. This equation immediately gives the k-th Newton identity in k variables. Since this is an identity of symmetric polynomials (homogeneous) of


degree k, its validity for any number of variables follows from its validity for k variables. Concretely, the identities in n < k variables can be deduced by setting k − n variables to zero. The k-th Newton identity in n > k variables contains more terms on both sides of the equation than the one in k variables, but its validity will be assured if the coefficients of any monomial match. Because no individual monomial involves more than k of the variables, the monomial will survive the substitution of zero for some set of n − k (other) variables, after which the equality of coefficients is one that arises in the k-th Newton identity in k (suitably chosen) variables.

2.3.2 Comparing coefficients in series

Another derivation can be obtained by computations in the ring of formal power series R[[t]], where R is Z[x_1,\ldots,x_n], the ring of polynomials in n variables x_1,\ldots,x_n over the integers.

Starting again from the basic relation

\prod_{i=1}^n (t - x_i) = \sum_{k=0}^n (-1)^k a_k t^{n-k}

and reversing the polynomials by substituting 1/t for t and then multiplying both sides by t^n to remove negative powers of t, gives

\prod_{i=1}^n (1 - x_i t) = \sum_{k=0}^n (-1)^k a_k t^k.

(The above computation should be performed in the field of fractions of R[[t]]; alternatively, the identity can be obtained simply by evaluating the product on the left side.)

Swapping sides and expressing the a_i as the elementary symmetric polynomials they stand for gives the identity

\sum_{k=0}^n (-1)^k e_k(x_1,\ldots,x_n)\, t^k = \prod_{i=1}^n (1 - x_i t).

One formally differentiates both sides with respect to t, and then (for convenience) multiplies by t, to obtain

\sum_{k=0}^n (-1)^k k\, e_k(x_1,\ldots,x_n)\, t^k
  = t \sum_{i=1}^n \Bigl[ (-x_i) \prod_{j \ne i} (1 - x_j t) \Bigr]
  = -\Bigl( \sum_{i=1}^n \frac{x_i t}{1 - x_i t} \Bigr) \prod_{j=1}^n (1 - x_j t)
  = -\Bigl[ \sum_{i=1}^n \sum_{j=1}^{\infty} (x_i t)^j \Bigr] \Bigl[ \sum_{\ell=0}^n (-1)^{\ell} e_{\ell}(x_1,\ldots,x_n)\, t^{\ell} \Bigr]
  = \Bigl[ \sum_{j=1}^{\infty} p_j(x_1,\ldots,x_n)\, t^j \Bigr] \Bigl[ \sum_{\ell=0}^n (-1)^{\ell-1} e_{\ell}(x_1,\ldots,x_n)\, t^{\ell} \Bigr],

where the polynomial on the right hand side was first rewritten as a rational function in order to be able to factor a product out of the summation, then the fraction in the summand was developed as a series in t, using the formula

\frac{X}{1 - X} = X + X^2 + X^3 + X^4 + X^5 + \cdots

and finally the coefficient of each t^j was collected, giving a power sum. (The series in t is a formal power series, but may alternatively be thought of as a series expansion for t sufficiently close to 0, for those more comfortable with that;


in fact one is not interested in the function here, but only in the coefficients of the series.) Comparing coefficients of t^k on both sides one obtains

(-1)^k k\, e_k(x_1,\ldots,x_n) = \sum_{j=1}^k (-1)^{k-j-1} p_j(x_1,\ldots,x_n)\, e_{k-j}(x_1,\ldots,x_n),

which gives the k-th Newton identity.

2.3.3 As a telescopic sum of symmetric function identities

The following derivation, given essentially in (Mead, 1992), is formulated in the ring of symmetric functions for clarity (all identities are independent of the number of variables). Fix some k > 0, and define the symmetric function r(i) for 2 ≤ i ≤ k as the sum of all distinct monomials of degree k obtained by multiplying one variable raised to the power i with k − i distinct other variables (this is the monomial symmetric function m_γ where γ is a hook shape (i,1,1,\ldots,1)). In particular r(k) = p_k; for r(1) the description would amount to that of e_k, but this case was excluded since here monomials no longer have any distinguished variable. All products p_i e_{k-i} can be expressed in terms of the r(j) with the first and last case being somewhat special. One has

p_i e_{k-i} = r(i) + r(i+1) \quad \text{for } 1 < i < k

since each product of terms on the left involving distinct variables contributes to r(i), while those where the variable from p_i already occurs among the variables of the term from e_{k-i} contribute to r(i + 1), and all terms on the right are so obtained exactly once. For i = k one multiplies by e_0 = 1, giving trivially

p_k e_0 = p_k = r(k).

Finally the product p_1 e_{k-1} for i = 1 gives contributions to r(i + 1) = r(2) like for other values i < k, but the remaining contributions produce k times each monomial of e_k, since any one of the variables may come from the factor p_1; thus

p_1 e_{k-1} = k\, e_k + r(2).

The k-th Newton identity is now obtained by taking the alternating sum of these equations, in which all terms of the form r(i) cancel out.

2.4 See also

• Power sum symmetric polynomial
• Elementary symmetric polynomial
• Symmetric function
• Fluid solutions, an article giving an application of Newton's identities to computing the characteristic polynomial of the Einstein tensor in the case of a perfect fluid, and similar articles on other types of exact solutions in general relativity.

2.5 References

[1] Delves, L. M. (1967). A Numerical Method of Locating the Zeros of an Analytic Function. Mathematics of Computation 21: 543–560. doi:10.2307/2004999.

[2] Tignol, Jean-Pierre (2004). Galois theory of algebraic equations (Reprinted ed.). River Edge, NJ: World Scientific. pp. 37–38. ISBN 981-02-4541-6.


Tignol, Jean-Pierre (2001). Galois theory of algebraic equations. Singapore: World Scientific. ISBN 978-981-02-4541-2.

    Bergeron, F.; Labelle, G.; and Leroux, P. (1998). Combinatorial species and tree-like structures. Cambridge:Cambridge University Press. ISBN 978-0-521-57323-8.

    Cameron, Peter J. (1999). Permutation Groups. Cambridge: Cambridge University Press. ISBN 978-0-521-65378-7.

    Cox, David; Little, John, and O'Shea, Donal (1992). Ideals, Varieties, and Algorithms. New York: Springer-Verlag. ISBN 978-0-387-97847-5.

Eppstein, D.; Goodrich, M. T. (2007). Space-efficient straggler identification in round-trip data streams via Newton's identities and invertible Bloom filters. Algorithms and Data Structures, 10th International Workshop, WADS 2007. Springer-Verlag, Lecture Notes in Computer Science 4619. pp. 637–648. arXiv:0704.3313

    Littlewood, D. E. (1950). The theory of group characters and matrix representations of groups. Oxford: OxfordUniversity Press. viii+310. ISBN 0-8218-4067-3.

    Macdonald, I. G. (1979). Symmetric functions and Hall polynomials. Oxford Mathematical Monographs.Oxford: The Clarendon Press, Oxford University Press. viii+180. ISBN 0-19-853530-9. MR 84g:05003.

    Macdonald, I. G. (1995). Symmetric functions and Hall polynomials. Oxford Mathematical Monographs (Sec-ond ed.). New York: Oxford Science Publications. The Clarendon Press, Oxford University Press. p. x+475.ISBN 0-19-853489-2. MR 96h:05207.

Mead, D. G. (1992). Newton's Identities. The American Mathematical Monthly (Mathematical Association of America) 99 (8): 749–751. doi:10.2307/2324242. JSTOR 2324242.

    Stanley, Richard P. (1999). Enumerative Combinatorics, Vol. 2. Cambridge University Press. ISBN 0-521-56069-1. (hardback). ISBN 0-521-78987-7 (paperback).

    Sturmfels, Bernd (1992). Algorithms in Invariant Theory. New York: Springer-Verlag. ISBN 978-0-387-82445-1.

    Tucker, Alan (1980). Applied Combinatorics (5/e ed.). New York: Wiley. ISBN 978-0-471-73507-6.

2.6 External links

• Newton–Girard formulas on MathWorld
• A Matrix Proof of Newton's Identities in Mathematics Magazine
• Application on the number of real roots

Chapter 3

    Non-negative matrix factorization

Illustration of approximate non-negative matrix factorization: the matrix V is represented by the two smaller matrices W and H, which, when multiplied, approximately reconstruct V.

    NMF redirects here. For the bridge convention, see new minor forcing.

Non-negative matrix factorization (NMF), also non-negative matrix approximation,[1][2] is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.

NMF finds applications in such fields as computer vision, document clustering,[1] chemometrics, audio signal processing[3] and recommender systems.[4][5]

3.1 History

In chemometrics non-negative matrix factorization has a long history under the name self modeling curve resolution.[6] In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the middle of the 1990s under the name positive matrix factorization.[7][8] It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations.[9][10]

3.2 Background

Let matrix V be the product of the matrices W and H,

V = W H.


Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows:

v_i = W h_i,

where v_i is the i-th column vector of the product matrix V and h_i is the i-th column vector of the matrix H.

When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m×n matrix, W is an m×p matrix, and H is a p×n matrix then p can be significantly less than both m and n.

Here's an example based on a text-mining application:

• Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document.

• Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns.

• The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, also a reasonable approximation to the input matrix V.

• From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H.

This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features.

It's useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: the higher a word's cell value, the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. This follows because each row in H represents a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W), where each feature is weighted by the feature's cell value from the document's column in H.
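The shapes in this example can be reproduced with a few lines of code (my own illustration with random data; the article names no library, and scikit-learn's NMF is used here only for convenience):

    # Shapes of the text-mining example above, with random non-negative data.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((10000, 500))            # stand-in for the document matrix

    model = NMF(n_components=10, init='nndsvda', max_iter=100, random_state=0)
    W = model.fit_transform(V)              # features matrix, 10000 x 10
    H = model.components_                   # coefficients matrix, 10 x 500

    print(W.shape, H.shape)                 # (10000, 10) (10, 500)
    print(np.linalg.norm(V - W @ H, 'fro')) # reconstruction error ||V - WH||_F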

    3.3 Types

    3.3.1 Approximate non-negative matrix factorization

Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that: V = WH + U. The elements of the residual matrix can either be negative or positive.

When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H, is that if one is able to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data.

    3.3.2 Convex non-negative matrix factorization

In standard NMF, matrix factor W ∈ ℝ₊^{m×k}, i.e., W can be anything in that space. Convex NMF[11] restricts the columns of W to convex combinations of the input data vectors.


3.3.3 Nonnegative rank factorization

In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization.[12][13][14] The problem of finding the NRF of V, if it exists, is known to be NP-hard.[15]

3.3.4 Different cost functions and regularizations

There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices.[1]

Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules.

The factorization problem in the squared error version of NMF may be stated as: given a matrix V, find non-negative matrices W and H that minimize the function

F(W, H) = \| V - W H \|_F^2.

Another type of NMF for images is based on the total variation norm.[16]

When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem,[17] although it may also still be referred to as NMF.[18]
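Written out directly, the two divergences mentioned above look as follows (helper functions of my own; the generalized Kullback–Leibler form for non-negative matrices is used, and the small eps guarding against log of zero is an added assumption):

    # Squared Frobenius error and generalized Kullback-Leibler divergence
    # between V and its reconstruction WH (illustrative helpers).
    import numpy as np

    def frobenius_cost(V, W, H):
        return np.linalg.norm(V - W @ H, 'fro') ** 2

    def kl_divergence(V, W, H, eps=1e-12):
        R = W @ H
        return np.sum(V * np.log((V + eps) / (R + eps)) - V + R)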

3.3.5 Online NMF

Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for optimization in these cases may or may not be the same as for standard NMF, but the algorithms need to be rather different.[19][20]

3.4 Algorithms

There are several ways in which the W and H may be found: Lee and Seung's multiplicative update rule[10] has been a popular method due to the simplicity of implementation. Since then, a few other algorithmic approaches have been developed.

Some successful algorithms are based on alternating non-negative least squares: in each step of such an algorithm, first H is fixed and W found by a non-negative least squares solver, then W is fixed and H is found analogously. The procedures used to solve for W and H may be the same[21] or different, as some NMF variants regularize one of W and H.[17] Specific approaches include the projected gradient descent methods,[21][22] the active set method,[4][23] and the block principal pivoting method[24] among several others.

The currently available algorithms are sub-optimal as they can only guarantee finding a local minimum, rather than a global minimum of the cost function. A provably optimal algorithm is unlikely in the near future as the problem has been shown to generalize the k-means clustering problem which is known to be NP-complete.[25] However, as in many other data mining applications, a local minimum may still prove to be useful.
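For the squared-error cost, the multiplicative updates of Lee and Seung take the form H ← H ∘ (WᵀV) / (WᵀWH) and W ← W ∘ (VHᵀ) / (WHHᵀ), where the multiplication and division are element-wise. A minimal sketch follows (my own; the random initialization and the small eps guarding against division by zero are assumptions, not part of the published rule):

    # Multiplicative updates for min ||V - WH||_F^2 (illustrative sketch).
    import numpy as np

    def nmf_multiplicative(V, r, n_iter=500, eps=1e-10, seed=0):
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, r))
        H = rng.random((r, n))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)   # update H with W fixed
            W *= (V @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
        return W, H

    V = np.random.default_rng(1).random((50, 40))  # non-negative test matrix
    W, H = nmf_multiplicative(V, r=5)
    print(np.linalg.norm(V - W @ H, 'fro'))        # residual after the updates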

3.4.1 Exact NMF

Exact solutions for the variants of NMF can be expected (in polynomial time) when additional constraints hold for matrix V. A polynomial time algorithm for solving nonnegative rank factorization if V contains a monomial sub matrix of rank equal to its rank was given by Campbell and Poole in 1981.[26] Kalofolias and Gallopoulos (2012)[27] solved the symmetric counterpart of this problem, where V is symmetric and contains a diagonal principal sub matrix of rank r. Their algorithm runs in O(rm^2) time in the dense case. Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, & Zhu (2013) give a polynomial time algorithm for exact NMF that works for the case where one of the factors W satisfies the separability condition.[28]

3.5 Relation to other techniques

In "Learning the parts of objects by non-negative matrix factorization" Lee and Seung proposed NMF mainly for parts-based decomposition of images. It compares NMF to vector quantization and principal component analysis, and shows that although the three techniques may be written as factorizations, they implement different constraints and therefore produce different results.


NMF as a probabilistic graphical model: visible units (V) are connected to hidden units (H) through weights W, so that V is generated from a probability distribution with mean \sum_a W_{ia} h_a.[9]:5

It was later shown that some types of NMF are an instance of a more general probabilistic model called multinomial PCA.[29] When NMF is obtained by minimizing the Kullback–Leibler divergence, it is in fact equivalent to another


instance of multinomial PCA, probabilistic latent semantic analysis,[30] trained by maximum likelihood estimation. That method is commonly used for analyzing and clustering textual data and is also related to the latent class model.

It has been shown[31][32] that NMF is equivalent to a relaxed form of K-means clustering: the matrix factor W contains cluster centroids and H contains cluster membership indicators, when using the least squares error as the NMF objective. This provides a theoretical foundation for using NMF for data clustering. However, note that k-means does not enforce non-negativity on its centroids, so the closest analogy is in fact with semi-NMF.[11]

NMF can be seen as a two-layer directed graphical model with one layer of observed random variables and one layer of hidden random variables.[33]

NMF extends beyond matrices to tensors of arbitrary order.[34][35][36] This extension may be viewed as a non-negative version of, e.g., the PARAFAC model.

Other extensions of NMF include joint factorisation of several data matrices and tensors where some factors are shared. Such models are useful for sensor fusion and relational learning.[37]

NMF is an instance of nonnegative quadratic programming (NQP), as are many other important problems including the support vector machine (SVM). However, SVM and NMF are related at a more intimate level than that of NQP, which allows direct application of the solution algorithms developed for either of the two methods to problems in both domains.[38]

3.6 Uniqueness

The factorization is not unique: a matrix and its inverse can be used to transform the two factorization matrices by, e.g.,[39]

W H = W B B^{-1} H.

If the two new matrices \tilde{W} = W B and \tilde{H} = B^{-1} H are non-negative they form another parametrization of the factorization.

The non-negativity of \tilde{W} and \tilde{H} applies at least if B is a non-negative monomial matrix. In this simple case it will just correspond to a scaling and a permutation.

More control over the non-uniqueness of NMF is obtained with sparsity constraints.[40]
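A tiny numerical illustration of this (my own) uses a non-negative monomial matrix B, i.e. a permutation combined with a positive rescaling: the product WH is unchanged and both transformed factors stay non-negative.

    # Non-uniqueness of NMF under B = permutation x positive diagonal.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.random((6, 3))
    H = rng.random((3, 5))

    P = np.eye(3)[[2, 0, 1]]                     # permutation matrix
    D = np.diag([0.5, 2.0, 4.0])                 # positive scaling
    B = P @ D                                    # non-negative monomial matrix
    B_inv = np.diag(1 / np.diag(D)) @ P.T        # its inverse, also non-negative

    W2, H2 = W @ B, B_inv @ H
    print(np.allclose(W @ H, W2 @ H2))           # True: same product
    print((W2 >= 0).all() and (H2 >= 0).all())   # True: factors stay non-negative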

3.7 Clustering property

NMF has an inherent clustering property,[31] i.e., it automatically clusters the columns of input data V = (v_1, \ldots, v_n). More specifically, the approximation of V by V \approx W H is achieved by minimizing the error function

\min_{W,H} \| V - W H \|_F, \quad \text{subject to } W \ge 0,\ H \ge 0.

If we add an additional orthogonality constraint on H, i.e., H H^T = I, then the above minimization is identical to the minimization of K-means clustering, except for the non-negativity constraints.

Furthermore, the computed H gives the cluster indicator, i.e., if H_{kj} > 0, that fact indicates that the input data v_j belongs to (is assigned to) the k-th cluster. And the computed W gives the cluster centroids, i.e., the k-th column gives the cluster centroid of the k-th cluster.

When the orthogonality H H^T = I is not explicitly imposed, the orthogonality may hold to a large extent, in which case the clustering property holds too, as may be found in some practical applications of NMF.

When the error function to be used is Kullback–Leibler divergence, NMF is identical to probabilistic latent semantic analysis, a popular document clustering method.[41]
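In code, this clustering reading is just an argmax over the rows of H (a sketch of my own; scikit-learn is an assumption here, and no orthogonality constraint is imposed, so this shows only the relaxed interpretation described above):

    # Hard cluster labels from the NMF coefficient matrix H.
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    V = rng.random((100, 30))            # 30 data points, one per column of V

    model = NMF(n_components=3, init='nndsvda', max_iter=300, random_state=0)
    W = model.fit_transform(V)           # columns of W act as cluster centroids
    H = model.components_                # H[k, j] ~ membership of column j in cluster k

    labels = H.argmax(axis=0)            # hard assignment for each column of V
    print(labels)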

    3.8 Applications


    3.8.1 Text mining

NMF can be used for text mining applications. In this process, a document-term matrix is constructed with the weights of various terms (typically weighted word frequency information) from a set of documents. This matrix is factored into a term-feature and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.

One specific application used hierarchical NMF on a small subset of scientific abstracts from PubMed.[42] Another research group clustered parts of the Enron email dataset[43] with 65,033 messages and 91,133 terms into 50 clusters.[44] NMF has also been applied to citations data, with one example clustering Wikipedia articles and scientific journals based on the outbound scientific citations in Wikipedia.[45]

Arora, Ge, Halpern, Mimno, Moitra, Sontag, Wu, & Zhu (2013) have given polynomial-time algorithms to learn topic models using NMF. The algorithm assumes that the topic matrix satisfies a separability condition that is often found to hold in these settings.[28]
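An end-to-end sketch of the document-term factorization described above (my own; the toy corpus, scikit-learn, and the tf-idf weighting are all assumptions made for illustration, not choices made in the article):

    # Factor a tf-idf document-term matrix and print the top terms per feature.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "matrix factorization for document clustering",
        "non-negative matrix factorization and topic models",
        "speech denoising with sparse coding",
        "audio signal processing and spectrograms",
    ]
    vec = TfidfVectorizer()
    V = vec.fit_transform(docs)                      # documents x terms
    model = NMF(n_components=2, init='nndsvda', random_state=0)
    W = model.fit_transform(V)                       # documents x features
    H = model.components_                            # features x terms

    terms = vec.get_feature_names_out()
    for k, row in enumerate(H):
        top = row.argsort()[::-1][:4]
        print("feature", k, ":", [terms[i] for i in top])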

    3.8.2 Spectral data analysis

NMF is also used to analyze spectral data; one such use is in the classification of space objects and debris.[46]

    3.8.3 Scalable Internet distance prediction

NMF is applied in scalable Internet distance (round-trip time) prediction. For a network with N hosts, with the help of NMF, the distances of all the N^2 end-to-end links can be predicted after conducting only O(N) measurements. This kind of method was first introduced in the Internet Distance Estimation Service (IDES).[47] Afterwards, as a fully decentralized approach, the Phoenix network coordinate system[48] was proposed. It achieves better overall prediction accuracy by introducing the concept of weight.

    3.8.4 Non-stationary speech denoising

Speech denoising has been a long lasting problem in audio signal processing. There are many algorithms for denoising if the noise is stationary. For example, the Wiener filter is suitable for additive Gaussian noise. However, if the noise is non-stationary, the classical denoising algorithms usually have poor performance because the statistical information of the non-stationary noise is difficult to estimate. Schmidt et al.[49] use NMF to do speech denoising under non-stationary noise, which is completely different from classical statistical approaches. The key idea is that a clean speech signal can be sparsely represented by a speech dictionary, but non-stationary noise cannot. Similarly, non-stationary noise can also be sparsely represented by a noise dictionary, but speech cannot.

The algorithm for NMF denoising goes as follows. Two dictionaries, one for speech and one for noise, need to be trained offline. Once a noisy speech signal is given, we first calculate the magnitude of its Short-Time Fourier Transform. Second, we separate it into two parts via NMF, one that can be sparsely represented by the speech dictionary, and the other part that can be sparsely represented by the noise dictionary. Third, the part that is represented by the speech dictionary is taken as the estimated clean speech.
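A rough sketch of the separation step (my own, under the assumptions in the text: the speech and noise dictionaries W_s and W_n were trained offline, V is the magnitude spectrogram of the noisy signal, and only the activations H are updated against the fixed, concatenated dictionary):

    # Supervised NMF separation: keep the speech part of the reconstruction.
    import numpy as np

    def separate_speech(V, W_s, W_n, n_iter=200, eps=1e-10, seed=0):
        W = np.hstack([W_s, W_n])                 # fixed dictionary [speech | noise]
        rng = np.random.default_rng(seed)
        H = rng.random((W.shape[1], V.shape[1]))
        for _ in range(n_iter):                   # multiplicative updates, H only
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        k = W_s.shape[1]
        return W_s @ H[:k]                        # estimated clean-speech magnitude

    # Example with random stand-in dictionaries and spectrogram:
    rng = np.random.default_rng(1)
    W_s, W_n = rng.random((257, 20)), rng.random((257, 20))
    V_noisy = rng.random((257, 100))
    print(separate_speech(V_noisy, W_s, W_n).shape)   # (257, 100)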

    3.8.5 Bioinformatics

NMF has been successfully applied in bioinformatics for clustering gene expression data and finding the genes most representative of the clusters.[50][51]

    3.8.6 Nuclear Imaging

NMF, also referred to in this field as factor analysis, has been used since the 1980s[52] to analyze sequences of images in SPECT and PET dynamic medical imaging. Non-uniqueness of NMF was addressed using sparsity constraints.[53]


3.9 Current research

Current research in nonnegative matrix factorization includes, but is not limited to, the following:

(1) Algorithmic: searching for global minima of the factors and factor initialization.[54]

(2) Scalability: how to factorize million-by-billion matrices, which are commonplace in Web-scale data mining, e.g., see Distributed Nonnegative Matrix Factorization (DNMF).[55]

(3) Online: how to update the factorization when new data comes in without recomputing from scratch, e.g., see online CNSC.[56]

(4) Collective (joint) factorization: factorizing multiple interrelated matrices for multiple-view learning, e.g. multi-view clustering, see CoNMF[57] and MultiNMF.[58]

3.10 See also

• Multilinear algebra
• Multilinear subspace learning
• Tensor
• Tensor decomposition
• Tensor software

    3.11 Sources and external links

3.11.1 Notes

[1] Inderjit S. Dhillon, Suvrit Sra (2005). Generalized Nonnegative Matrix Approximations with Bregman Divergences (PDF). NIPS.

    [2] Tandon, Rashish; Suvrit Sra (2010). Sparse nonnegative matrix approximation: new formulations and algorithms (PDF).TR.

    [3] Wang, Wenwu (2010). Instantaneous Versus Convolutive Non-Negative Matrix Factorization: Models, Algorithms andApplications to Audio Pattern Separation. In Wang, Wenwu. Machine Audition: Principles, Algorithms and Systems. IGIGlobal. pp. 353370. doi:10.4018/978-1-61520-919-4.ch015.

    [4] Rainer Gemulla, Erik Nijkamp, Peter J Haas, Yannis Sismanis (2011). Large-scale matrix factorization with distributedstochastic gradient descent (PDF). Proc. ACM SIGKDD Int'l Conf. on Knowledge discovery and data mining. pp. 6977.

    [5] Yang Bao et al (2014). TopicMF: Simultaneously Exploiting Ratings and Reviews for Recommendation. AAAI.

[6] William H. Lawton; Edward A. Sylvestre (1971). Self modeling curve resolution. Technometrics 13 (3): 617+. doi:10.2307/1267173.

    [7] P. Paatero, U. Tapper (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of errorestimates of data values. Environmetrics 5 (2): 111126. doi:10.1002/env.3170050203.

[8] Pia Anttila, Pentti Paatero, Unto Tapper, Olli Järvinen (1995). Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmospheric Environment 29 (14): 1705–1718. doi:10.1016/1352-2310(94)00367-T.

    [9] Daniel D. Lee and H. Sebastian Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature401 (6755): 788791. doi:10.1038/44565. PMID 10548103.

    [10] Daniel D. Lee and H. Sebastian Seung (2001). Algorithms for Non-negative Matrix Factorization. Advances in NeuralInformation Processing Systems 13: Proceedings of the 2000 Conference. MIT Press. pp. 556562.

    [11] C Ding, T Li, MI Jordan, Convex and semi-nonnegative matrix factorizations, IEEE Transactions on Pattern Analysis andMachine Intelligence, 32, 45-55, 2010

    [12] Berman, A.; R.J. Plemmons (1974). Inverses of nonnegative matrices. Linear and Multilinear Algebra 2: 161172.doi:10.1080/03081087408817055.

  • 3.11. SOURCES AND EXTERNAL LINKS 19

    [13] A. Berman, R.J. Plemmons (1994). Nonnegative matrices in the Mathematical Sciences. Philadelphia: SIAM.

    [14] Thomas, L.B. (1974). Problem 73-14, Rank factorization of nonnegative matrices. SIAM rev. 16 (3): 393394.doi:10.1137/1016064.

    [15] Vavasis, S.A. (2009). On the complexity of nonnegativematrix factorization. SIAM J. Optim. 20: 13641377. doi:10.1137/070709967.

    [16] Zhang, T.; Fang, B.; Liu, W.; Tang, Y. Y.; He, G.; Wen, J. (2008). Total variation norm-based nonnegative matrixfactorization for identifying discriminant representation of image patterns. Neurocomputing 71 (1012): 18241831.doi:10.1016/j.neucom.2008.01.022.

    [17] Hoyer, Patrik O. (2002). Non-negative sparse coding. Proc. IEEE Workshop on Neural Networks for Signal Processing.

    [18] Hsieh, C. J.; Dhillon, I. S. (2011). Fast coordinate descent methods with variable selection for non-negative matrix factor-ization (PDF). Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining- KDD '11. p. 1064. doi:10.1145/2020408.2020577. ISBN 9781450308137.

    [19] http://www.ijcai.org/papers07/Papers/IJCAI07-432.pdf

    [20] http://portal.acm.org/citation.cfm?id=1339264.1339709

    [21] Lin, Chih-Jen (2007). Projected Gradient Methods for Nonnegative Matrix Factorization (PDF). Neural Computation19 (10): 27562779. doi:10.1162/neco.2007.19.10.2756. PMID 17716011.

    [22] Lin, Chih-Jen (2007). On the Convergence of Multiplicative Update Algorithms for Nonnegative Matrix Factorization.IEEE Transactions on Neural Networks 18 (6): 1589. doi:10.1109/TNN.2007.895831.

    [23] Hyunsoo Kim and Haesun Park (2008). Nonnegative Matrix Factorization Based on Alternating Nonnegativity Con-strained Least Squares and Active Set Method (PDF). SIAM Journal on Matrix Analysis and Applications 30 (2): 713730.doi:10.1137/07069239x.

[24] Jingu Kim and Haesun Park (2011). Fast Nonnegative Matrix Factorization: An Active-set-like Method and Comparisons (PDF). SIAM Journal on Scientific Computing 33 (6): 3261–3281. doi:10.1137/110821172.

    [25] Ding, C. and He, X. and Simon, H.D., (2005). On the equivalence of nonnegative matrix factorization and spectralclustering. Proc. SIAM Data Mining Conf 4: 606610. doi:10.1137/1.9781611972757.70.

    [26] Campbell, S.L.; G.D. Poole (1981). Computing nonnegative rank factorizations.. Linear Algebra Appl. 35: 175182.doi:10.1016/0024-3795(81)90272-x.

    [27] Kalofolias, V.; Gallopoulos, E. (2012). Computing symmetric nonnegative rank factorizations. Linear Algebra Appl 436:421435. doi:10.1016/j.laa.2011.03.016.

    [28] Arora, Sanjeev; Ge, Rong; Halpern, Yoni; Mimno, David; Moitra, Ankur; Sontag, David; Wu, Yichen; Zhu, Michael(2013). Apractical algorithm for topic modeling with provable guarantees. Proceedings of the 30th International Conferenceon Machine Learning. arXiv:1212.4777.

    [29] Wray Buntine (2002). Variational Extensions to EM and Multinomial PCA (PDF). Proc. European Conference on MachineLearning (ECML-02). LNAI 2430. pp. 2334.

    [30] Eric Gaussier and Cyril Goutte (2005). Relation between PLSA and NMF and Implications (PDF). Proc. 28th internationalACM SIGIR conference on Research and development in information retrieval (SIGIR-05). pp. 601602.

    [31] C. Ding, X. He, H.D. Simon (2005). On the Equivalence of Nonnegative Matrix Factorization and Spectral Clustering.Proc. SIAM Int'l Conf. Data Mining, pp. 606-610. May 2005

    [32] Ron Zass and Amnon Shashua (2005). "A Unifying Approach to Hard and Probabilistic Clustering". International Con-ference on Computer Vision (ICCV) Beijing, China, Oct., 2005.

    [33] Max Welling et al (2004). Exponential Family Harmoniums with an Application to Information Retrieval. NIPS.

[34] Pentti Paatero (1999). The Multilinear Engine: A Table-Driven, Least Squares Program for Solving Multilinear Problems, including the n-Way Parallel Factor Analysis Model. Journal of Computational and Graphical Statistics 8 (4): 854–888. doi:10.2307/1390831. JSTOR 1390831.

[35] Max Welling and Markus Weber (2001). Positive Tensor Factorization. Pattern Recognition Letters 22 (12): 1255–1261. doi:10.1016/S0167-8655(01)00070-8.

    [36] Jingu Kim and Haesun Park (2012). Fast Nonnegative Tensor Factorization with an Active-set-like Method (PDF). High-Performance Scientic Computing: Algorithms and Applications. Springer. pp. 311326.

  • 20 CHAPTER 3. NON-NEGATIVE MATRIX FACTORIZATION

    [37] Kenan Yilmaz and A. Taylan Cemgil and Umut Simsekli (2011). Generalized Coupled Tensor Factorization (PDF). NIPS.

    [38] Vamsi K. Potluru and Sergey M. Plis and Morten Morup and Vince D. Calhoun and Terran Lane (2009). Ecient Mul-tiplicative updates for Support Vector Machines. Proceedings of the 2009 SIAM Conference on Data Mining (SDM). pp.12181229.

    [39] Wei Xu, Xin Liu & Yihong Gong (2003). Document clustering based on non-negative matrix factorization. Proceedings ofthe 26th annual international ACM SIGIR conference on Research and development in information retrieval. New York:Association for Computing Machinery. pp. 267273.

    [40] Julian Eggert, Edgar Krner, "Sparse coding and NMF", Proceedings. 2004 IEEE International Joint Conference on NeuralNetworks, 2004, pp. 2529-2533, 2004.

    [41] C Ding, T Li, W Peng, " On the equivalence between non-negative matrix factorization and probabilistic latent semanticindexing Computational Statistics & Data Analysis 52, 3913-3927

    [42] Nielsen, Finn rup; Balslev, Daniela; Hansen, Lars Kai (2005). Mining the posterior cingulate: segregation betweenmemory and pain components. NeuroImage 27 (3): 520522. doi:10.1016/j.neuroimage.2005.04.034. PMID 15946864.

    [43] Cohen, William (2005-04-04). Enron Email Dataset. Retrieved 2008-08-26.

    [44] Berry,MichaelW.; Browne,Murray (2005). Email SurveillanceUsingNon-negativeMatrix Factorization. Computationaland Mathematical Organization Theory 11 (3): 249264. doi:10.1007/s10588-005-5380-5.

    [45] Nielsen, Finn rup (2008). Clustering of scientic citations in Wikipedia. Wikimania.

    [46] Michael W. Berry et al. (2006). Algorithms and Applications for Approximate Nonnegative Matrix Factorization.

    [47] Yun Mao, Lawrence Saul and Jonathan M. Smith (2006). IDES: An Internet Distance Estimation Service for LargeNetworks. IEEE Journal on Selected Areas in Communications 24 (12): 22732284. doi:10.1109/JSAC.2006.884026.

    [48] Yang Chen, XiaoWang, Cong Shi, and et al. (2011). Phoenix: AWeight-based Network Coordinate SystemUsingMatrixFactorization (PDF). IEEETransactions onNetwork and ServiceManagement 8 (4): 334347. doi:10.1109/tnsm.2011.110911.100079.

    [49] Schmidt, M.N., J. Larsen, and F.T. Hsiao. (2007). Wind noise reduction using non-negative sparse coding, MachineLearning for Signal Processing, IEEE Workshop on, 431436

    [50] Devarajan, K. (2008). Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational Biology.PLoS Computational Biology 4 (7).

    [51] HyunsooKim andHaesun Park (2007). Sparse non-negativematrix factorizations via alternating non-negativity-constrainedleast squares for microarray data analysis. Bioinformatics 23 (12): 14951502. doi:10.1093/bioinformatics/btm134.PMID 17483501.

    [52] DiPaola; Bazin; Aubry; Aurengo; Cavailloles; Herry; Kahn (1982). Handling of dynamic sequences in in nuclearmedicine.IEEE Trans Nucl Sci. NS-29: 131021.

    [53] Sitek; Gullberg; Huesman (2002). Correction for ambiguous solutions in factor analysis using a penalized least squaresobjective. IEEE Trans Med Imaging 21 (3): 21625.

    [54] C. Boutsidis and E. Gallopoulos (2008). SVD based initialization: A head start for nonnegative matrix factorization.Pattern Recognition 41 (4): 13501362. doi:10.1016/j.patcog.2007.09.010.

    [55] Chao Liu, Hung-chih Yang, Jinliang Fan, Li-Wei He, and Yi-Min Wang (2010). Distributed Nonnegative Matrix Fac-torization for Web-Scale Dyadic Data Analysis on MapReduce (PDF). Proceedings of the 19th International World WideWeb Conference.

    [56] Dong Wang, Ravichander Vipperla, Nick Evans, Thomas Fang Zheng (2013). Online Non-Negative Convolutive PatternLearning for Speech Signals (PDF). IEEE Transactions on Signal Processing.

    [57] Xiangnan He, Min-Yen Kan, Peichu Xie and Xiao Chen (2014). Comment-based Multi-View Clustering of Web 2.0Items (PDF). Proceedings of the 23rd International World Wide Web Conference.

    [58] Jialu Liu, ChiWang, Jing Gao and Jiawei Han (2013). Multi-View Clustering via Joint Nonnegative Matrix Factorization(PDF). Proceedings of SIAM Data Mining Conference.

  • 3.11. SOURCES AND EXTERNAL LINKS 21

    3.11.2 Others J. Shen, G. W. Isral (1989). A receptor model using a specic non-negative transformation technique forambient aerosol. Atmospheric Environment 23 (10): 22892298. doi:10.1016/0004-6981(89)90190-X.

    Pentti Paatero (1997). Least squares formulation of robust non-negative factor analysis. Chemometrics andIntelligent Laboratory Systems 37 (1): 2335. doi:10.1016/S0169-7439(96)00044-5.

    Raul Kompass (2007). A Generalized Divergence Measure for Nonnegative Matrix Factorization. NeuralComputation 19 (3): 780791. doi:10.1162/neco.2007.19.3.780. PMID 17298233.

    Liu, W.X. and Zheng, N.N. and You, Q.B. (2006). Nonnegative Matrix Factorization and its applications inpattern recognition (PDF). Chinese Science Bulletin 51 (1718): 718. doi:10.1007/s11434-005-1109-6.

    Ngoc-Diep Ho, Paul Van Dooren and Vincent Blondel (2008). Descent Methods for Nonnegative MatrixFactorization. arXiv:0801.3199 [cs.NA].

    Andrzej Cichocki, Rafal Zdunek and Shun-ichi Amari (2008). NonnegativeMatrix and Tensor Factorization.IEEE Signal Processing Magazine 25 (1): 142145. doi:10.1109/MSP.2008.4408452.

    Cdric Fvotte, Nancy Bertin, and Jean-Louis Durrieu (2009). Nonnegative Matrix Factorization with theItakura-SaitoDivergence: WithApplication toMusicAnalysis. Neural Computation 21 (3): 793830. doi:10.1162/neco.2008.04-08-771. PMID 18785855.

    Ali Taylan Cemgil (2009). Bayesian Inference for Nonnegative Matrix Factorisation Models. ComputationalIntelligence and Neuroscience 2009 (2): 1. doi:10.1155/2009/785152. PMC 2688815. PMID 19536273.

    Chapter 4

    Nonlinear eigenproblem

    A nonlinear eigenproblem is a generalization of an ordinary eigenproblem to equations that depend nonlinearly on the eigenvalue. Specifically, it refers to equations of the form

    A(\lambda)x = 0,

    where x is a vector (the nonlinear eigenvector) and A is a matrix-valued function of the number λ (the nonlinear eigenvalue). (More generally, A(λ) could be a linear map, but most commonly it is a finite-dimensional, usually square, matrix.) A is usually required to be a holomorphic function of λ (in some domain). For example, an ordinary linear eigenproblem Bv = λv, where B is a square matrix, corresponds to A(λ) = B − λI, where I is the identity matrix. One common case is where A is a polynomial matrix, which is called a polynomial eigenvalue problem. In particular, the specific case where the polynomial has degree two is called a quadratic eigenvalue problem, and can be written in the form

    A(\lambda)x = (A_2 \lambda^2 + A_1 \lambda + A_0)x = 0,

    in terms of the constant square matrices A_2, A_1 and A_0. This can be converted into an ordinary linear generalized eigenproblem of twice the size by defining a new vector y = λx. In terms of x and y, the quadratic eigenvalue problem becomes

    \begin{pmatrix} A_0 & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} -A_1 & -A_2 \\ I & 0 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix},

    where I is the identity matrix. More generally, if A is a matrix polynomial of degree d, then one can convert the nonlinear eigenproblem into a linear (generalized) eigenproblem of d times the size.

    Besides converting them to ordinary eigenproblems, which only works if A is polynomial, there are other methods of solving nonlinear eigenproblems based on the Jacobi-Davidson algorithm or based on Newton's method (related to inverse iteration).
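    As a concrete illustration of this linearization (a minimal sketch, assuming NumPy and SciPy are available; the matrices are random placeholders, not taken from the article), the block pencil above can be passed to a generalized eigensolver and each computed pair checked against the original quadratic problem:

        import numpy as np
        from scipy.linalg import eig

        rng = np.random.default_rng(0)
        n = 4
        A0, A1, A2 = (rng.standard_normal((n, n)) for _ in range(3))

        Z, I = np.zeros((n, n)), np.eye(n)
        lhs = np.block([[A0, Z], [Z, I]])      # [A0 0; 0 I]
        rhs = np.block([[-A1, -A2], [I, Z]])   # [-A1 -A2; I 0]

        # Solve lhs z = lambda * rhs z, a generalized eigenproblem of twice the size.
        eigvals, eigvecs = eig(lhs, rhs)
        worst = max(
            np.linalg.norm((A2 * lam**2 + A1 * lam + A0) @ v[:n]) / np.linalg.norm(v[:n])
            for lam, v in zip(eigvals, eigvecs.T)
        )
        print(worst)   # small residual: each pair also satisfies the quadratic problem

    The top block v[:n] of each computed eigenvector is the original x, since the bottom block only stores y = λx.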

    4.1 References

    Françoise Tisseur and Karl Meerbergen, "The quadratic eigenvalue problem", SIAM Review 43 (2), 235–286 (2001).

    Gene H. Golub and Henk A. van der Vorst, "Eigenvalue computation in the 20th century", Journal of Computational and Applied Mathematics 123, 35–65 (2000).

    Philippe Guillaume, "Nonlinear eigenproblems", SIAM J. Matrix Anal. Appl. 20 (3), 575–595 (1999).

    Axel Ruhe, "Algorithms for the nonlinear eigenvalue problem", SIAM Journal on Numerical Analysis 10 (4), 674–689 (1973).


    Chapter 5

    Nonnegative rank (linear algebra)

    In linear algebra, the nonnegative rank of a nonnegative matrix is a concept similar to the usual linear rank of a real matrix, but adding the requirement that certain coefficients and entries of vectors/matrices have to be nonnegative. For example, the linear rank of a matrix is the smallest number of vectors such that every column of the matrix can be written as a linear combination of those vectors. For the nonnegative rank, it is required that the vectors must have nonnegative entries, and also that the coefficients in the linear combinations are nonnegative.

    5.1 Formal Definition

    There are several equivalent definitions, all modifying the definition of the linear rank slightly. Apart from the definition given above, there is the following: the nonnegative rank of a nonnegative m×n matrix A is equal to the smallest number q such that there exist a nonnegative m×q matrix B and a nonnegative q×n matrix C with A = BC (the usual matrix product). To obtain the linear rank, drop the condition that B and C must be nonnegative.

    Further, the nonnegative rank is the smallest number of nonnegative rank-one matrices into which the matrix can be decomposed additively:

    \operatorname{rank}_+(A) = \min\{ q \mid \sum_{j=1}^{q} R_j = A,\ \operatorname{rank} R_1 = \cdots = \operatorname{rank} R_q = 1,\ R_1, \ldots, R_q \ge 0 \},

    where R_j ≥ 0 stands for "R_j is nonnegative".[1] (To obtain the usual linear rank, drop the condition that the R_j have to be nonnegative.)

    Given a nonnegative m×n matrix A, the nonnegative rank rank₊(A) of A satisfies

    \operatorname{rank}(A) \le \operatorname{rank}_+(A) \le \min(m, n),

    where rank(A) denotes the usual linear rank of A.
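    In practice, an exact nonnegative factorization A = BC with q factors certifies rank₊(A) ≤ q. A minimal sketch of this idea (assuming NumPy and scikit-learn; NMF only returns an approximate factorization, so it supports the bound only up to the printed residual):

        import numpy as np
        from sklearn.decomposition import NMF

        rng = np.random.default_rng(1)
        B = rng.random((6, 3))                 # nonnegative 6x3
        C = rng.random((3, 8))                 # nonnegative 3x8
        A = B @ C                              # by construction, rank_+(A) <= 3

        model = NMF(n_components=3, init="random", max_iter=5000, random_state=0)
        W = model.fit_transform(A)             # nonnegative 6x3
        H = model.components_                  # nonnegative 3x8
        print(np.linalg.norm(A - W @ H))       # near-zero residual supports the bound q = 3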

    5.1.1 A Fallacy

    The rank of the matrix A is the largest number of columns which are linearly independent, i.e., none of the selected columns can be written as a linear combination of the other selected columns. It is not true that adding nonnegativity to this characterization gives the nonnegative rank: the nonnegative rank is in general strictly greater than the largest number of columns such that no selected column can be written as a nonnegative linear combination of the other selected columns.

    5.2 Connection with the linear rank

    It is always true that rank(A) ≤ rank₊(A). In fact rank₊(A) = rank(A) holds whenever rank(A) ≤ 2.[2] In the case rank(A) ≥ 3, however, rank(A) < rank₊(A) is possible. For example, the matrix



    A = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}

    satisfies rank(A) = 3 < 4 = rank₊(A).[2]
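    The linear rank of this example is easy to confirm numerically (a small sketch, assuming NumPy); the claim rank₊(A) = 4 is the nontrivial part and is taken from the cited reference.

        import numpy as np
        A = np.array([[1, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 1]], dtype=float)
        print(np.linalg.matrix_rank(A))        # 3 -- the usual linear rank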

    5.3 Computing the nonnegative rank

    The nonnegative rank of a matrix can be determined algorithmically.[2]

    It has been proved that determining whether rank+(A) = rank(A) is NP-hard.[3]

    Obvious questions concerning the complexity of nonnegative rank computation remain unanswered to date. For example, the complexity of determining the nonnegative rank of matrices of fixed rank k is unknown for k > 2.

    5.4 Ancillary Facts

    Nonnegative rank has important applications in combinatorial optimization:[4] the minimum number of facets of an extension of a polyhedron P is equal to the nonnegative rank of its so-called slack matrix.[5]

    5.5 References

    [1] Abraham Berman and Robert J. Plemmons. Nonnegative Matrices in the Mathematical Sciences. SIAM.

    [2] J. Cohen and U. Rothblum. Nonnegative ranks, decompositions and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149–168, 1993.

    [3] Stephen Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.

    [4] Mihalis Yannakakis. Expressing combinatorial optimization problems by linear programs. J. Comput. System Sci., 43(3):441–466, 1991.

    [5] See this blog post

    Chapter 6

    Norm (mathematics)

    This article is about linear algebra and analysis. For field theory, see Field norm. For ideals, see Ideal norm. For group theory, see Norm (group). For norms in descriptive set theory, see prewellordering.

    In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space, save possibly for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector). A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below.

    A simple example is the 2-dimensional Euclidean space R² equipped with the Euclidean norm. Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.

    A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

    6.1 Definition

    Given a vector space V over a subfield F of the complex numbers, a norm on V is a function p: V → R with the following properties:[1]

    For all a ∈ F and all u, v ∈ V,

    1. p(av) = |a| p(v) (absolute homogeneity or absolute scalability).
    2. p(u + v) ≤ p(u) + p(v) (triangle inequality or subadditivity).
    3. If p(v) = 0 then v is the zero vector (separates points).

    By the first axiom, absolute homogeneity, we have p(0) = 0 and p(−v) = p(v), so that by the triangle inequality

    p(v) ≥ 0 (positivity).

    A seminorm on V is a function p : V → R with the properties 1. and 2. above.

    Every vector space V with seminorm p induces a normed space V/W, called the quotient space, where W is the subspace of V consisting of all vectors v in V with p(v) = 0. The induced norm on V/W is clearly well-defined and is given by:

    p(W + v) = p(v).



    Two norms (or seminorms) p and q on a vector space V are equivalent if there exist two real constants c and C, with c > 0, such that

    for every vector v in V, one has that: c q(v) ≤ p(v) ≤ C q(v).

    A topological vector space is called normable (seminormable) if the topology of the space can be induced by a norm (seminorm).

    6.2 Notation

    If a norm p : V → R is given on a vector space V then the norm of a vector v ∈ V is usually denoted by enclosing it within double vertical lines: ‖v‖ = p(v). Such notation is also sometimes used if p is only a seminorm.

    For the length of a vector in Euclidean space (which is an example of a norm, as explained below), the notation |v| with single vertical lines is also widespread.

    In Unicode, the codepoint of the "double vertical line" character ‖ is U+2016. The double vertical line should not be confused with the "parallel to" symbol, Unicode U+2225 (∥). This is usually not a problem because the former is used in parenthesis-like fashion, whereas the latter is used as an infix operator. The double vertical line used here should also not be confused with the symbol used to denote lateral clicks, Unicode U+01C1 (ǁ). The single vertical line | is called "vertical line" in Unicode and its codepoint is U+007C.
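    A small sketch (Python assumed) that prints the characters named above directly from their codepoints, so they can be told apart:

        # Render each character mentioned in the paragraph from its Unicode codepoint.
        for name, cp in [("double vertical line", 0x2016),
                         ("parallel to", 0x2225),
                         ("lateral click letter", 0x01C1),
                         ("vertical line", 0x007C)]:
            print(f"U+{cp:04X}  {chr(cp)}  {name}")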

    6.3 Examples

    All norms are seminorms.
    The trivial seminorm has p(x) = 0 for all x in V.
    Every linear form f on a vector space defines a seminorm by x ↦ |f(x)|.

    6.3.1 Absolute-value norm

    The absolute value

    \|x\| = |x|

    is a norm on the one-dimensional vector spaces formed by the real or complex numbers.

    6.3.2 Euclidean norm

    Main article: Euclidean distance

    On an n-dimensional Euclidean space R^n, the intuitive notion of length of the vector x = (x_1, x_2, ..., x_n) is captured by the formula

    \|x\| := \sqrt{x_1^2 + \cdots + x_n^2}.

    This gives the ordinary distance from the origin to the point x, a consequence of the Pythagorean theorem. The Euclidean norm is by far the most commonly used norm on R^n, but there are other norms on this vector space as will be shown below. However, all these norms are equivalent in the sense that they all define the same topology.

    On an n-dimensional complex space C^n the most common norm is


    \|z\| := \sqrt{|z_1|^2 + \cdots + |z_n|^2} = \sqrt{z_1 \bar{z}_1 + \cdots + z_n \bar{z}_n}.

    In both cases we can also express the norm as the square root of the inner product of the vector and itself:

    \|x\| := \sqrt{x^* x},

    where x is represented as a column vector [x_1; x_2; ...; x_n] and x^* denotes its conjugate transpose.

    This formula is valid for any inner product space, including Euclidean and complex spaces. For Euclidean spaces, the inner product is equivalent to the dot product. Hence, in this specific case the formula can be also written with the following notation:

    \|x\| := \sqrt{x \cdot x}.

    The Euclidean norm is also called the Euclidean length, L² distance, ℓ² distance, L² norm, or ℓ² norm; see Lp space. The set of vectors in R^{n+1} whose Euclidean norm is a given positive constant forms an n-sphere.
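    A minimal numerical check of the formulas above (assuming NumPy): the Euclidean norm agrees with the square root of the inner product of the vector with itself, using the conjugate in the complex case.

        import numpy as np
        x = np.array([3.0, 7.0])
        print(np.linalg.norm(x), np.sqrt(x @ x))                   # both sqrt(58)
        z = np.array([1 + 2j, 3 - 1j])
        print(np.linalg.norm(z), np.sqrt((np.conj(z) @ z).real))   # both sqrt(15)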

    Euclidean norm of a complex number

    The Euclidean norm of a complex number is the absolute value (also called the modulus) of it, if the complex plane is identified with the Euclidean plane R². This identification of the complex number x + iy as a vector in the Euclidean plane makes the quantity √(x² + y²) (as first suggested by Euler) the Euclidean norm associated with the complex number.

    6.3.3 Taxicab norm or Manhattan norm

    Main article: Taxicab geometry

    \|x\|_1 := \sum_{i=1}^{n} |x_i|.

    The name relates to the distance a taxi has to drive in a rectangular street grid to get from the origin to the point x. The set of vectors whose 1-norm is a given constant forms the surface of a cross polytope of dimension equivalent to that of the norm minus 1. The taxicab norm is also called the ℓ¹ norm. The distance derived from this norm is called the Manhattan distance or ℓ¹ distance.

    The 1-norm is simply the sum of the absolute values of the columns. In contrast,

    \sum_{i=1}^{n} x_i

    is not a norm because it may yield negative results.

    6.3.4 p-norm

    Main article: Lp space

    Let p ≥ 1 be a real number.


    \|x\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.

    Note that for p = 1 we get the taxicab norm, for p = 2 we get the Euclidean norm, and as p approaches ∞ the p-norm approaches the infinity norm or maximum norm. Note that the p-norm is related to the Hölder mean.

    This definition is still of some interest for 0 < p < 1, but the resulting function does not define a norm,[2] because it violates the triangle inequality. What is true for this case of 0 < p < 1, even in the measurable analog, is that the corresponding Lp class is a vector space, and it is also true that the function

    \int_X |f(x) - g(x)|^p \, d\mu

    (without pth root) defines a distance that makes Lp(X) into a complete metric topological vector space. These spaces are of great interest in functional analysis, probability theory, and harmonic analysis. However, outside trivial cases, this topological vector space is not locally convex and has no continuous nonzero linear forms. Thus the topological dual space contains only the zero functional.

    The derivative of the p-norm is given by

    \frac{\partial}{\partial x_k} \|x\|_p = \frac{x_k \, |x_k|^{p-2}}{\|x\|_p^{p-1}}.

    For the special case of p = 2, this becomes

    \frac{\partial}{\partial x_k} \|x\|_2 = \frac{x_k}{\|x\|_2},

    or

    \frac{\partial}{\partial x} \|x\|_2 = \frac{x}{\|x\|_2}.
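    The derivative formula can be sanity-checked against central finite differences (a sketch assuming NumPy; the test point is arbitrary and avoids zero coordinates, where the formula does not apply):

        import numpy as np

        def pnorm(x, p):
            return np.sum(np.abs(x) ** p) ** (1.0 / p)

        def pnorm_grad(x, p):
            # x_k |x_k|^(p-2) / ||x||_p^(p-1), as in the text
            return x * np.abs(x) ** (p - 2) / pnorm(x, p) ** (p - 1)

        x, p, eps = np.array([0.5, -1.2, 2.0]), 3.0, 1e-6
        numeric = np.array([(pnorm(x + eps * e, p) - pnorm(x - eps * e, p)) / (2 * eps)
                            for e in np.eye(len(x))])
        print(np.max(np.abs(numeric - pnorm_grad(x, p))))          # ~1e-10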

    6.3.5 Maximum norm (special case of: infinity norm, uniform norm, or supremum norm)

    Main article: Maximum norm

    \|x\|_\infty := \max(|x_1|, \ldots, |x_n|).

    The set of vectors whose infinity norm is a given constant, c, forms the surface of a hypercube with edge length 2c.

    6.3.6 Zero norm

    In probability and functional analysis, the zero norm induces a complete metric topology for the space of measurable functions and for the F-space of sequences with F-norm (x_n) ↦ ∑_n 2^{−n} x_n/(1 + x_n), which is discussed by Stefan Rolewicz in Metric Linear Spaces.[3]

    Hamming distance of a vector from zero

    See also: Hamming distance and discrete metric

    In metric geometry, the discrete metric takes the value one for distinct points and zero otherwise. When applied coordinate-wise to the elements of a vector space, the discrete distance defines the Hamming distance, which is



    important in coding and information theory. In the field of real or complex numbers, the distance of the discrete metric from zero is not homogeneous in the non-zero point; indeed, the distance from zero remains one as its non-zero argument approaches zero. However, the discrete distance of a number from zero does satisfy the other properties of a norm, namely the triangle inequality and positive definiteness. When applied component-wise to vectors, the discrete distance from zero behaves like a non-homogeneous "norm", which counts the number of non-zero components in its vector argument; again, this non-homogeneous "norm" is discontinuous.

    In signal processing and statistics, David Donoho referred to the zero "norm" with quotation marks. Following Donoho's notation, the zero "norm" of x is simply the number of non-zero coordinates of x, or the Hamming distance of the vector from zero. When this "norm" is localized to a bounded set, it is the limit of p-norms as p approaches 0. Of course, the zero "norm" is not a B-norm, because it is not positive homogeneous. It is not even an F-norm, because it is discontinuous, jointly and severally, with respect to the scalar argument in scalar-vector multiplication and with respect to its vector argument. Abusing terminology, some engineers omit Donoho's quotation marks and inappropriately call the number-of-nonzeros function the L0 norm (sic.), also misusing the notation for the Lebesgue space of measurable functions.
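    A short sketch (assuming NumPy) of the number-of-nonzeros function and of the related fact that the sum ∑|x_i|^p, the p-th power of the p-norm, approaches that count as p tends to 0:

        import numpy as np
        x = np.array([0.0, 3.0, 0.0, -0.5, 2.0])
        print(np.count_nonzero(x))                    # Donoho's zero "norm": 3
        for p in (1.0, 0.5, 0.1, 0.01):
            print(p, np.sum(np.abs(x) ** p))          # tends to 3 as p -> 0
        # Not homogeneous: scaling x does not scale the count.
        print(np.count_nonzero(2 * x))                # still 3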


    6.3.7 Other norms

    Other norms on R^n can be constructed by combining the above; for example

    \|x\| := 2|x_1| + \sqrt{3|x_2|^2 + \max(|x_3|, 2|x_4|)^2}

    is a norm on R^4. For any norm and any injective linear transformation A we can define a new norm of x, equal to

    \|Ax\|.

    In 2D, with A a rotation by 45° and a suitable scaling, this changes the taxicab norm into the maximum norm. In 2D, each A applied to the taxicab norm, up to inversion and interchanging of axes, gives a different unit ball: a parallelogram of a particular shape, size and orientation. In 3D this is similar but different for the 1-norm (octahedrons) and the maximum norm (prisms with parallelogram base). All the above formulas also yield norms on C^n without modification.
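    The 2D remark can be checked directly (a sketch assuming NumPy): with A a 45° rotation combined with a suitable scaling (written here up to a reflection), the taxicab norm of Ax equals the maximum norm of x.

        import numpy as np
        A = 0.5 * np.array([[1.0,  1.0],
                            [1.0, -1.0]])             # scaled 45-degree rotation, up to a reflection
        rng = np.random.default_rng(2)
        for _ in range(5):
            x = rng.standard_normal(2)
            assert np.isclose(np.linalg.norm(A @ x, 1),
                              np.linalg.norm(x, np.inf))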

    6.3.8 Infinite-dimensional case

    The generalization of the above norms to an infinite number of components leads to the Lp spaces, with norms

    \|x\|_p = \left( \sum_{i \in \mathbb{N}} |x_i|^p \right)^{1/p} \quad \text{resp.} \quad \|f\|_{p,X} = \left( \int_X |f(x)|^p \, dx \right)^{1/p}

    (for complex-valued sequences x resp. functions f defined on X ⊆ R), which can be further generalized (see Haar measure). Any inner product induces in a natural way the norm ‖x‖ := √⟨x, x⟩. Other examples of infinite-dimensional normed vector spaces can be found in the Banach space article.

    6.4 Properties

    The concept of unit circle (the set of all vectors of norm 1) is different in different norms: for the 1-norm the unit circle in R² is a square, for the 2-norm (Euclidean norm) it is the well-known unit circle, while for the infinity norm it is a different square. For any p-norm it is a superellipse (with congruent axes). See the accompanying illustration. Note that due to the definition of the norm, the unit circle is always convex and centrally symmetric (therefore, for example, the unit ball may be a rectangle but cannot be a triangle).

    In terms of the vector space, the seminorm defines a topology on the space, and this is a Hausdorff topology precisely when the seminorm can distinguish between distinct vectors, which is again equivalent to the seminorm being a norm. The topology thus defined (by either a norm or a seminorm) can be understood either in terms of sequences or open sets. A sequence of vectors {v_n} is said to converge in norm to v if ‖v_n − v‖ → 0 as n → ∞. Equivalently, the topology consists of all sets that can be represented as a union of open balls.

    Two norms ‖·‖_α and ‖·‖_β on a vector space V are called equivalent if there exist positive real numbers C and D such that for all x in V

    C \|x\|_\alpha \le \|x\|_\beta \le D \|x\|_\alpha.

    For instance, on C^n, if p > r > 0, then

    \|x\|_p \le \|x\|_r \le n^{1/r - 1/p} \|x\|_p.


    In particular,

    \|x\|_2 \le \|x\|_1 \le \sqrt{n} \, \|x\|_2,
    \|x\|_\infty \le \|x\|_2 \le \sqrt{n} \, \|x\|_\infty,
    \|x\|_\infty \le \|x\|_1 \le n \, \|x\|_\infty.

    If the vector space is a finite-dimensional real or complex one, all norms are equivalent. On the other hand, in the case of infinite-dimensional vector spaces, not all norms are equivalent. Equivalent norms define the same notions of continuity and convergence and for many purposes do not need to be distinguished. To be more precise, the uniform structure defined by equivalent norms on the vector space is uniformly isomorphic.

    Every (semi-)norm is a sublinear function, which implies that every norm is a convex function. As a result, finding a global optimum of a norm-based objective function is often tractable.

    Given a finite family of seminorms p_i on a vector space, the sum

    p(x) := \sum_{i=0}^{n} p_i(x)

    is again a seminorm.

    For any norm p on a vector space V, we have that for all u and v ∈ V:

    p(u ± v) ≥ |p(u) − p(v)|.

    Proof: Applying the triangle inequality to both p(u − 0) and p(v − 0):

    p(u − 0) ≤ p(u − v) + p(v − 0), so p(u − v) ≥ p(u) − p(v)
    p(u − 0) ≤ p(u + v) + p(0 − v), so p(u + v) ≥ p(u) − p(v)
    p(v − 0) ≤ p(u − v) + p(u − 0), so p(u − v) ≥ p(v) − p(u)
    p(v − 0) ≤ p(u + v) + p(0 − u), so p(u + v) ≥ p(v) − p(u)

    Thus, p(u ± v) ≥ |p(u) − p(v)|.

    If X and Y are normed spaces and u : X → Y is a continuous linear map, then the norm of u and the norm of the transpose of u are equal.[4]

    For the lp norms, we have Hölder's inequality[5]

    |x^T y| \le \|x\|_p \, \|y\|_q, \qquad \frac{1}{p} + \frac{1}{q} = 1.

    A special case of this is the Cauchy–Schwarz inequality:[5]

    |x^T y| \le \|x\|_2 \, \|y\|_2.
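    A quick numerical check (assuming NumPy) of the equivalence inequalities and of Hölder's and the Cauchy–Schwarz inequalities on random vectors:

        import numpy as np
        rng = np.random.default_rng(3)
        n = 10
        x, y = rng.standard_normal(n), rng.standard_normal(n)

        # Equivalence of norms on R^n
        assert np.linalg.norm(x, 2) <= np.linalg.norm(x, 1) <= np.sqrt(n) * np.linalg.norm(x, 2)
        assert np.linalg.norm(x, np.inf) <= np.linalg.norm(x, 2) <= np.sqrt(n) * np.linalg.norm(x, np.inf)
        assert np.linalg.norm(x, np.inf) <= np.linalg.norm(x, 1) <= n * np.linalg.norm(x, np.inf)

        # Hoelder's inequality with conjugate exponents p = 3, q = 3/2, and Cauchy-Schwarz
        p, q = 3.0, 1.5
        assert abs(x @ y) <= np.linalg.norm(x, p) * np.linalg.norm(y, q)
        assert abs(x @ y) <= np.linalg.norm(x, 2) * np.linalg.norm(y, 2)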


    6.5 Classification of seminorms: absolutely convex absorbing sets

    All seminorms on a vector space V can be classified in terms of absolutely convex absorbing sets in V. To each such set, A, corresponds a seminorm p_A called the gauge of A, defined as

    p_A(x) := inf{α : α > 0, x ∈ αA}

    with the property that

    {x : p_A(x) < 1} ⊆ A ⊆ {x : p_A(x) ≤ 1}.

    Conversely: any locally convex topological vector space has a local basis consisting of absolutely convex sets. A common method to construct such a basis is to use a family (p) of seminorms p that separates points: the collection of all finite intersections of sets {p < 1/n} turns the space into a locally convex topological vector space so that every p is continuous. Such a method is used to design weak and weak* topologies.

    Norm case:

    Suppose now that (p) contains a single p: since (p) is separating, p is a norm, and A = {p < 1} is its open unit ball. Then A is an absolutely convex bounded neighbourhood of 0, and p = p_A is continuous.

    The converse is due to Kolmogorov: any locally convex and locally bounded topological vector space is normable. Precisely: if V is an absolutely convex bounded neighbourhood of 0, the gauge g_V (so that V = {g_V < 1}) is a norm.
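    The gauge construction can be illustrated numerically (a sketch, Python assumed; the bisection bound hi is an arbitrary placeholder): taking A to be the open unit ball of the 1-norm, the gauge p_A recovers the 1-norm itself.

        import numpy as np

        def gauge(x, in_A, hi=1e6, tol=1e-10):
            # Minkowski gauge inf{ a > 0 : x/a in A }, located by bisection;
            # assumes A is absorbing so that some a <= hi works.
            lo = 0.0
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                if in_A(x / mid):
                    hi = mid
                else:
                    lo = mid
            return hi

        in_open_l1_ball = lambda v: np.sum(np.abs(v)) < 1.0
        x = np.array([0.3, -1.1, 2.0])
        print(gauge(x, in_open_l1_ball), np.sum(np.abs(x)))   # both ~3.4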

    6.6 Generalizations

    There are several generalizations of norms and semi-norms. If p has absolute homogeneity but in place of subadditivity we require only that p(u + v) ≤ b (p(u) + p(v)) for some constant b ≥ 1, then p is called a quasi-seminorm, and the smallest value of b for which this holds is called the multiplier of p; if in addition p separates points then it is called a quasi-norm.

    On the other hand, if p satisfies the triangle inequality but in place of absolute homogeneity we require that p(av) = |a|^k p(v) for some fixed k with 0 < k ≤ 1, then p is called a k-seminorm.

    We have the following relationship between quasi-seminorms and k-seminorms:

    Suppose that q is a quasi-seminorm on a vector space X with multiplier b. If 0 < k < log_{2b} 2, then there exists a k-seminorm p on X equivalent to q.

    6.7 See also

    Normed vector space
    Asymmetric norm
    Matrix norm
    Gowers norm
    Mahalanobis distance
    Manhattan distance
    Relation of norms and metrics


    6.8 Notes

    [1] Prugovečki 1981, page 20.

    [2] Except in R¹, where it coincides with the Euclidean norm, and R⁰, where it is trivial.

    [3] Rolewicz, Stefan (1987), Functional analysis and control theory: Linear systems, Mathematics and its Applications (East European Series) 29 (Translated from the Polish by Ewa Bednarczuk ed.), Dordrecht; Warsaw: D. Reidel Publishing Co.; PWN-Polish Scientific Publishers, pp. xvi+524, ISBN 90-277-2186-6, MR 920371, OCLC 13064804.

    [4] Trèves, pp. 242–243.

    [5] Golub, Gene; Van Loan, Charles F. (1996). Matrix Computations (Third ed.). Baltimore: The Johns Hopkins University Press. p. 53. ISBN 0-8018-5413-X.

    6.9 References

    Bourbaki, Nicolas (1987). Chapters 1–5. Topological vector spaces. Springer. ISBN 3-540-13627-4.

    Prugovečki, Eduard (1981). Quantum mechanics in Hilbert space (2nd ed.). Academic Press. p. 20. ISBN 0-12-566060-X.

    Trèves, François (1995). Topological Vector Spaces, Distributions and Kernels. Academic Press, Inc. pp. 136–149, 195–201, 240–252, 335–390, 420–433. ISBN 0-486-45352-9.

    Khaleelulla, S. M. (1982). Counterexamples in Topological Vector Spaces. Lecture Notes in Mathematics 936. Springer-Verlag. pp. 3–5. ISBN 978-3-540-11565-6. Zbl 0482.46002.


    [Figure: Illustrations of unit circles in different norms.]

    Chapter 7

    Normal basis

    In mathematics, a normal basis in field theory is a special kind of basis for Galois extensions of finite degree, characterised as forming a single orbit for the Galois group. The normal basis theorem states that any finite Galois extension of fields has a normal basis. In algebraic number theory the study of the more refined question of the existence of a normal integral basis is part of Galois module theory.

    In the case of finite fields, this means that each of the basis elements is related to any one of them by applying the Frobenius p-th power mapping repeatedly, where p is the characteristic of the field. Let GF(p^m) be a field with p^m elements, and β an element of it such that the m elements

    {β, β^p, β^{p^2}, …, β^{p^{m−1}}}

    are linearly independent. Then this set forms a normal basis for GF(p^m) over GF(p).

    7.1 Usage

    This basis is frequently used in cryptographic applications that are based on the discrete logarithm problem, such as elliptic curve cryptography. Hardware implementations of normal basis arithmetic typically have far less power consumption than other bases. When representing elements as a binary string (e.g. in GF(2³) the most significant bit represents β^{2^2} = β⁴, the middle bit represents β^{2^1} = β², and the least significant bit represents β^{2^0} = β), we can square elements by doing a left circular shift (left shifting β⁴ would give β⁸, but since we are working in GF(2³) this wraps around to β). This makes the normal basis especially attractive for cryptosystems that utilize frequent squaring.
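    A small self-contained sketch of this squaring-by-shifting property (Python assumed; the field construction and the chosen normal element are illustrative, not taken from the article): build GF(2³) as GF(2)[t]/(t³ + t + 1), take β = t + 1, whose conjugates {β, β², β⁴} are linearly independent, and check that squaring a field element circularly shifts its coordinates in that basis.

        MOD = 0b1011                        # the reduction polynomial t^3 + t + 1

        def gf_mul(a, b):
            # Carry-less multiplication in GF(2^3) with reduction by MOD.
            r = 0
            while b:
                if b & 1:
                    r ^= a
                b >>= 1
                a <<= 1
                if a & 0b1000:
                    a ^= MOD
            return r

        beta = 0b011                        # beta = t + 1; {beta, beta^2, beta^4} is a normal basis here
        beta2 = gf_mul(beta, beta)
        beta4 = gf_mul(beta2, beta2)
        basis = [beta, beta2, beta4]

        def coords(x):
            # Coordinates of x over GF(2) in the normal basis (brute-force search).
            for c0 in (0, 1):
                for c1 in (0, 1):
                    for c2 in (0, 1):
                        if (c0 * basis[0]) ^ (c1 * basis[1]) ^ (c2 * basis[2]) == x:
                            return (c0, c1, c2)

        for x in range(1, 8):
            c = coords(x)
            shifted = (c[2], c[0], c[1])    # circular shift of the coordinate vector
            assert coords(gf_mul(x, x)) == shifted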

    7.2 Primitive normal basis

    A primitive normal basis of an extension of finite fields E/F is a normal basis for E/F which is generated by a primitive element of E. Lenstra and Schoof (1987) proved that every finite field extension possesses a primitive normal basis, the case when F is a prime field having been settled by Harold Davenport.

    7.3 Free elements

    If E/F is a Galois extension with group G and x in E generates a normal basis, then x is free in E/F. If x has the property that for every subgroup H of G, with fixed field E^H, x is free for E/E^H, then x is said to be completely