Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity


Transcript

Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
Amit Daniely, Roy Frostig, Yoram Singer
Google Research
NIPS, 2017
Presenter: Chao Jiang

Outline
1 Introduction
  - Overview
2 Review
  - Reproducing Kernel Hilbert Space
  - Duality
3 Terminology and Main Conclusion

Overview
- The authors define an object termed a computation skeleton that describes a distilled structure of feed-forward networks.
- They show that the representation generated by random initialization is sufficiently rich to approximately express the functions in H.
- All functions in H can be approximated by tuning the weights of the last layer, which is a convex optimization task.

Part I: Function Basis
- In $R^n$, any vector can be written as a linear combination of $n$ independent vectors; those $n$ vectors can be viewed as a basis.
- If $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$, then $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$.
- Up to here this is a review of vector bases; the same ideas extend to functions and function spaces.

Function Basis
- A function is an infinite vector.
- For a function defined on the interval $[a, b]$, take samples at spacing $\Delta x$.
- Sampling $f(x)$ at the points $a, x_1, \dots, x_n, b$ turns the function into the vector $(f(a), f(x_1), \dots, f(x_n), f(b))^T$.
- As $\Delta x \to 0$, this vector approximates the function more and more closely and, in the limit, becomes infinite-dimensional.

Inner Product of Functions
- Since functions are so close to vectors, we can define the inner product of functions similarly.
- For two functions $f$ and $g$ sampled at spacing $\Delta x$, the inner product can be defined as
  $\langle f, g \rangle = \lim_{\Delta x \to 0} \sum_i f(x_i) g(x_i)\, \Delta x = \int f(x) g(x)\, dx$.
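As a quick numerical check of the limit above, a minimal sketch in NumPy (the specific functions $f(x) = \sin x$ and $g(x) = \cos^2 x$ on $[0, \pi]$ are illustrative choices, not from the slides): the Riemann sum $\sum_i f(x_i) g(x_i)\, \Delta x$ approaches the exact integral as the grid is refined.

```python
import numpy as np

# Approximate <f, g> = ∫ f(x) g(x) dx on [a, b] by a Riemann sum.
f = np.sin
g = lambda x: np.cos(x) ** 2
a, b = 0.0, np.pi

for n in (10, 100, 10_000):
    x, dx = np.linspace(a, b, n, endpoint=False, retstep=True)
    inner = np.sum(f(x) * g(x)) * dx          # Σ_i f(x_i) g(x_i) Δx
    print(f"n = {n:6d}   <f, g> ≈ {inner:.6f}")

# The estimates converge to the exact value ∫_0^π sin(x) cos²(x) dx = 2/3.
```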
Inner Product of Functions (cont.)
- The function inner product appears everywhere and takes on different meanings in different contexts.
- For example, if $X$ is a continuous random variable with probability density function $f(x)$, i.e., $f(x) \ge 0$ and $\int f(x)\, dx = 1$, then the expectation is
  $E[g(X)] = \int f(x) g(x)\, dx = \langle f, g \rangle$.
- Similar to a vector basis, we can use a set of functions to represent other functions.
- The difference is that in a vector space finitely many vectors suffice for a complete basis, whereas in a function space we may need infinitely many basis functions.
- Two functions are regarded as orthogonal if their inner product is zero.
- In a function space we can likewise have a set of basis functions that are mutually orthogonal.
- Kernel methods have been widely used in a variety of data analysis techniques. Their motivation is to map a vector in $R^n$ to a vector in a feature space.

Eigen Decomposition
- For a real symmetric matrix $A$, there exist a real number $\lambda$ and a vector $q$ such that $A q = \lambda q$.
- For $A \in R^{n \times n}$, we can find $n$ eigenvalues $\lambda_i$ along with $n$ orthogonal eigenvectors $q_i$. As a result, $A$ can be decomposed as
  $A = \sum_{i=1}^{n} \lambda_i q_i q_i^T$.
- Here $\{q_i\}_{i=1}^{n}$ is a set of orthogonal basis vectors of $R^n$.
- A matrix describes a transformation of a linear space: the eigenvectors give the directions of the transformation and the eigenvalues give its scales.

Kernel Function
- Just as a function $f(x)$ can be viewed as an infinite vector, a function of two variables $K(x, y)$ can be viewed as an infinite matrix.
- If $K(x, y) = K(y, x)$ and $\int\!\int f(x) K(x, y) f(y)\, dx\, dy \ge 0$ for any function $f$, then $K(x, y)$ is symmetric and positive definite, and $K(x, y)$ is a kernel function.
- Analogous to matrix eigenvalues and eigenvectors, there exist eigenvalues $\lambda$ and eigenfunctions $\psi(x)$ such that
  $\int K(x, y) \psi(x)\, dx = \lambda \psi(y)$.
- The kernel then admits the expansion
  $K(x, y) = \sum_{i=0}^{\infty} \lambda_i \psi_i(x) \psi_i(y)$,   (1)
  where $\langle \psi_i, \psi_j \rangle = 0$ for $i \ne j$.
- Therefore $\{\psi_i\}_{i=1}^{\infty}$ forms a set of orthogonal basis functions for a function space.
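The expansion (1) is the function-space analogue of the eigendecomposition $A = \sum_i \lambda_i q_i q_i^T$ above, and it can be approximated numerically: evaluating $K$ on a finite grid yields a symmetric positive semidefinite Gram matrix whose eigenvectors are sampled versions of the eigenfunctions $\psi_i$. A minimal sketch (the Gaussian kernel and the grid are illustrative choices, not from the slides):

```python
import numpy as np

# Evaluate a kernel on a grid; the resulting Gram matrix is symmetric and PSD.
x = np.linspace(-3.0, 3.0, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)    # Gaussian kernel K(x, y)

# Eigendecomposition K = Σ_i λ_i q_i q_iᵀ, the discrete counterpart of
# ∫ K(x, y) ψ_i(x) dx = λ_i ψ_i(y).
eigvals, eigvecs = np.linalg.eigh(K)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

print("leading eigenvalues:", np.round(eigvals[:5], 3))
print("all eigenvalues nonnegative:", bool(np.all(eigvals > -1e-8)))

# Truncating the expansion to the leading r terms, as in (1), already
# reconstructs K accurately because the spectrum decays quickly.
r = 20
K_r = (eigvecs[:, :r] * eigvals[:r]) @ eigvecs[:, :r].T
print("rank-20 reconstruction error:", float(np.abs(K - K_r).max()))
```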
Reproducing Kernel Hilbert Space
- $\{\sqrt{\lambda_i}\, \psi_i\}_{i=1}^{\infty}$ is a set of orthogonal basis functions and spans a Hilbert space $H$.
- Any function in the space can be represented as a linear combination of the basis. Suppose
  $f = \sum_{i=1}^{\infty} f_i \sqrt{\lambda_i}\, \psi_i$;
  then we can denote $f$ as an infinite vector in $H$: $f = (f_1, f_2, \dots)_H^T$.
- For another function $g = (g_1, g_2, \dots)_H^T$, we have
  $\langle f, g \rangle_H = \sum_{i=1}^{\infty} f_i g_i$.

Reproducing Kernel Hilbert Space (cont.)
- We use $K(x, y)$ to denote the value of $K$ at the point $(x, y)$, $K(\cdot, \cdot)$ to denote the function itself, and $K(x, \cdot)$ to denote the $x$-th "row" of the infinite matrix:
  $K(x, \cdot) = \sum_{i=0}^{\infty} \lambda_i \psi_i(x)\, \psi_i$.
- In the space $H$, we can write
  $K(x, \cdot) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \dots)_H^T$.
- Therefore
  $\langle K(x, \cdot), K(y, \cdot) \rangle_H = \sum_{i=0}^{\infty} \lambda_i \psi_i(x) \psi_i(y) = K(x, y)$.
- This is the reproducing property, and so $H$ is called a reproducing kernel Hilbert space (RKHS).

Go Back: How to Map a Point into a Feature Space
- Define the mapping $\Phi(x) = K(x, \cdot) = (\sqrt{\lambda_1}\, \psi_1(x), \sqrt{\lambda_2}\, \psi_2(x), \dots)_H^T$; this maps the point $x$ into $H$.
- Then $\langle \Phi(x), \Phi(y) \rangle_H = \langle K(x, \cdot), K(y, \cdot) \rangle_H = K(x, y)$.
- As a result, we do not need to know explicitly what the mapping is, what the feature space is, or what the basis of the feature space is.
- Kernel trick: for a symmetric positive-definite function $K$, there must exist at least one mapping $\Phi$ and one feature space $H$ such that $\langle \Phi(x), \Phi(y) \rangle_H = K(x, y)$.
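To see the kernel trick in a case where the feature map is finite and explicit (the homogeneous degree-2 polynomial kernel below is an illustrative choice, not from the slides): for $K(x, y) = (x^\top y)^2$ on $R^2$, the map $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$ satisfies $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.

```python
import numpy as np

def kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = (x·y)²."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map for the kernel above, mapping R² into R³."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

print("K(x, y)          =", kernel(x, y))
print("<Phi(x), Phi(y)> =", float(np.dot(phi(x), phi(y))))
# The two values coincide: the kernel computes the feature-space inner
# product without ever constructing Phi explicitly.
```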
Duality
- In mathematical optimization theory, duality means that an optimization problem may be viewed from either of two perspectives, the primal problem or the dual problem (the duality principle). The solution of the dual problem provides a lower bound on the solution of the primal (minimization) problem.
- In other words, a minimization problem has an associated maximization problem, and the maximum of that dual problem is a lower bound on the minimum of the original problem.
- We want to minimize the function at the top of the graph. Its minimum is $P^*$, and $D^*$ is the maximum of its dual problem.
- $P^* - D^*$ is the duality gap.
- If $P^* - D^* > 0$, only weak duality holds.
- If $P^* - D^* = 0$, there is no duality gap, and we say that strong duality holds.

Notation
- $G = (V, E)$ is a directed acyclic graph.
- The set of neighbors incoming to a vertex $v$ is denoted $\mathrm{in}(v) := \{u \in V \mid uv \in E\}$.
- The $(d-1)$-dimensional sphere is denoted $S^{d-1} = \{x \in R^d \mid \|x\| = 1\}$.
- The ball of radius $B$ in a Hilbert space $H$ is $\{x \in H \mid \|x\|_H \le B\}$.
- Input: $x = (x^1, \dots, x^n)$, where $x^i \in S^{d-1}$.
- A network $N$ with a weight vector $w = \{w_{uv} \mid uv \in E\}$ defines a predictor $h_{N,w} : X \to R^k$.
- For a non-input node $v$, its output is $h_{v,w}(x) = \sigma_v\!\big(\sum_{u \in \mathrm{in}(v)} w_{uv}\, h_{u,w}(x)\big)$.
- The representation induced by the weights $w$ is $R_{N,w} = h_{\mathrm{rep}(N),w}$.

Computation Skeleton
- Definition 1. A computation skeleton $S$ is a directed acyclic graph (DAG) whose non-input nodes are labeled by activations.

Fully connected layer of a skeleton
- Terminology: an induced subgraph of a skeleton with $r + 1$ nodes, $u_1, \dots, u_r, v$, is called a fully connected layer if its edges are $u_1 v, \dots, u_r v$.

Convolution layer of a skeleton
- (Figure only on the original slide.)

Realization of a skeleton
- The $(5, 4)$-realization is a network with a single (one-dimensional) convolutional layer having 5 channels, stride of 2, and width of 4, followed by three fully connected layers.

Random Weights
- (Figure only on the original slide.)
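Before turning to the kernel view, a minimal sketch that ties together the Notation and Realization slides: the predictor is computed by the recursion $h_{v,w}(x) = \sigma_v(\sum_{u \in \mathrm{in}(v)} w_{uv}\, h_{u,w}(x))$ over the DAG. The tiny scalar graph, the ReLU activation, and the hard-coded weights below are illustrative choices, not the paper's setup.

```python
# A tiny DAG: "x1", "x2" are input nodes; "h1", "h2", "out" are non-input
# nodes with a ReLU activation. Each edge (u, v) carries a weight w_uv.
edges = {
    ("x1", "h1"): 0.5, ("x2", "h1"): -1.0,
    ("x1", "h2"): 2.0, ("x2", "h2"): 0.3,
    ("h1", "out"): 1.0, ("h2", "out"): 0.7,
}
relu = lambda z: max(z, 0.0)

def h(v, x, cache=None):
    """Output h_{v,w}(x) of node v, following the recursion on the Notation slide."""
    cache = {} if cache is None else cache
    if v in x:                                   # input node: emits its input
        return x[v]
    if v not in cache:
        incoming = [u for (u, t) in edges if t == v]            # in(v)
        pre = sum(edges[(u, v)] * h(u, x, cache) for u in incoming)
        cache[v] = relu(pre)                     # sigma_v of the weighted sum
    return cache[v]

x = {"x1": 1.0, "x2": 0.5}
print("h_out(x) =", h("out", x))
```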
Dual activation kernel
- A computation skeleton $S$ also defines a normalized kernel $k_S : X \times X \to [-1, 1]$ and a corresponding norm $\|\cdot\|_S$ on functions $f : X \to R$.
- This norm has the property that $\|f\|_S$ is small iff $f$ can be obtained by certain simple compositions of functions according to the structure of $S$.
- To define the kernel, we introduce a dual activation and a dual kernel. For $\rho \in [-1, 1]$, we denote by $N_\rho$ the multivariate Gaussian distribution on $R^2$ with mean 0 and covariance matrix
  $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$.
- (The remaining "Dual activation kernel" slides contain figures and equations that were not captured in the transcript.)
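The dual activation itself is defined on slides whose content was not captured; in the paper it is $\hat{\sigma}(\rho) = E_{(X, Y) \sim N_\rho}[\sigma(X)\sigma(Y)]$. Assuming that definition, a minimal Monte Carlo sketch for the (unnormalized) ReLU activation, checked against the closed-form arc-cosine expression $(\sqrt{1 - \rho^2} + (\pi - \arccos\rho)\,\rho) / (2\pi)$:

```python
import numpy as np

def dual_relu_mc(rho, n=1_000_000, seed=0):
    """Monte Carlo estimate of E[σ(X)σ(Y)] for (X, Y) ~ N_rho with σ = ReLU."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])     # the covariance matrix of N_rho
    xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return float(np.mean(np.maximum(xy[:, 0], 0.0) * np.maximum(xy[:, 1], 0.0)))

def dual_relu_exact(rho):
    """Closed-form value of the same expectation, for comparison."""
    return (np.sqrt(1.0 - rho ** 2) + (np.pi - np.arccos(rho)) * rho) / (2.0 * np.pi)

for rho in (-0.5, 0.0, 0.5, 0.9):
    print(f"rho = {rho:+.1f}   MC ≈ {dual_relu_mc(rho):.4f}   exact = {dual_relu_exact(rho):.4f}")
```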
