{"title": "Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 2364, "page_last": 2372, "abstract": "We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if and entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods.", "full_text": "Error-Minimizing Estimates and Universal\n\nEntry-Wise Error Bounds for Low-Rank Matrix\n\nCompletion\n\nFranz J. Kir\u00b4aly\u21e4\n\nDepartment of Statistical Science and\n\nCentre for Inverse Problems\nUniversity College London\nf.kiraly@ucl.ac.uk\n\nLouis Theran\u2020\n\nInstitute of Mathematics\nDiscrete Geometry Group\nFreie Universit\u00a8at Berlin\n\ntheran@math.fu-berlin.de\n\nAbstract\n\nWe propose a general framework for reconstructing and denoising single entries\nof incomplete and noisy entries. We describe: effective algorithms for deciding\nif and entry can be reconstructed and, if so, for reconstructing and denoising it;\nand a priori bounds on the error of each entry, individually. In the noiseless case\nour algorithm is exact. 
For rank-one matrices, the new algorithm is fast, admits a highly parallel implementation, and produces an error-minimizing estimate that is qualitatively close to our theoretical bound and to the state-of-the-art Nuclear Norm and OptSpace methods.\n\n1 Introduction\n\nMatrix Completion is the task of reconstructing a low-rank matrix from a subset of its entries; it occurs naturally in many practically relevant problems, such as missing feature imputation, multi-task learning [2], transductive learning [4], or collaborative filtering and link prediction [1, 8, 9]. Almost all known methods performing matrix completion are optimization methods such as the max-norm and nuclear norm heuristics [3, 9, 10], or OptSpace [5], to name a few amongst many. These methods have in common that in general: (a) they reconstruct the whole matrix; (b) error bounds are given for all of the matrix, not single entries; (c) theoretical guarantees are given based on the sampling distribution of the observations. These properties are all problematic in scenarios where: (i) one is interested only in predicting or imputing a specific set of entries; (ii) the entire data set is unwieldy to work with; or (iii) there are non-random \u201choles\u201d in the observations. All of these possibilities are very natural for the typical \u201cbig data\u201d setup.\n\nThe recent results of [6] suggest that a method capable of handling challenges (i)\u2013(iii) is within reach. By analyzing the algebraic-combinatorial structure of Matrix Completion, the authors provide algorithms that identify, for any fixed set of observations, exactly the entries that can be, in principle, reconstructed from them. 
Moreover, the theory developed indicates that, when a missing entry can be determined, it can be found by first exposing combinatorially determined polynomial relations between the known entries and the unknown ones, and then selecting a common solution.\n\nBridging the gap between the theory of [6] and practice poses the following challenges: to efficiently find the relevant polynomial relations, and to extend the methodology to the noisy case. In this paper, we show how to do both of these things in the case of rank one, and discuss how to instantiate the same scheme for general rank. It will turn out that finding the right set of polynomials and noisy estimation are intimately related: we can treat each polynomial as providing an estimate of the missing entry, and we can then take as our estimate the variance-minimizing weighted average. This technique also gives a priori lower bounds for a broad class of unbiased single-entry estimators in terms of the combinatorial structure of the observations and the noise model only. \n\n* Supported by the Mathematisches Forschungsinstitut Oberwolfach.\n\u2020 Supported by the European Research Council under the European Union\u2019s Seventh Framework Programme (FP7/2007-2013) / ERC grant agreement no 247029-SDModels.\n\n
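The variance-minimizing weighted average mentioned above can be made concrete in its simplest setting: for uncorrelated unbiased estimates of the same quantity with known variances, inverse-variance weights are optimal. A minimal sketch (the function name is ours, for illustration, and not part of the paper's implementation [7]):

```python
import numpy as np

def min_variance_combination(estimates, variances):
    # Uncorrelated unbiased estimates of one quantity: weights
    # proportional to 1/variance give the unbiased linear combination
    # (weights summing to 1) with the smallest possible variance.
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()
    combined = float(np.dot(w, estimates))
    combined_var = float(np.dot(w ** 2, variances))
    return combined, combined_var

# equal variances: reduces to the plain mean, and the variance halves
est, var = min_variance_combination([2.0, 4.0], [1.0, 1.0])
```

The single-entry estimator developed in Section 3 is exactly this construction, applied to the polynomial estimates, except that there the estimates may be correlated, which is what the path kernel accounts for.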
In detail, our contributions include:\n\n\u2022 the construction of a variance-minimal and unbiased estimator for any fixed missing entry of a rank-one matrix, under the assumption of known noise variances;\n\u2022 an explicit form for the variance of that estimator, which is a lower bound for the variance of any unbiased estimator of any fixed missing entry and thus yields a quantitative measure of how far the reconstruction of that entry, by any algorithm, can be trusted;\n\u2022 the description of a strategy to generalize the above to any rank;\n\u2022 a comparison of the estimator with two state-of-the-art optimization algorithms (OptSpace and Nuclear Norm), and an error assessment of the three matrix completion methods against the variance bound.\n\nAs mentioned, the restriction to rank one is not inherent in the overall scheme. We depend on rank one only in the sense that we understand the combinatorial-algebraic structure of rank-one matrix completion exactly, whereas the behavior in higher rank is not yet as well understood. Nonetheless, it is, in principle, accessible and, once available, can be \u201cplugged in\u201d to the results here without changing the complexity much. In this sense, the present paper is a proof of concept for a new approach to estimating and denoising in algebraic settings, based on combinatorially enumerating a set of polynomial estimators and then averaging them. For us, computational efficiency comes via a connection to the topology of graphs that is specific to this problem, but we suspect that this part, too, can be generalized somewhat.\n\n2 The Algebraic Combinatorics of Matrix Completion\n\nWe first briefly review facts about Matrix Completion that we require. The exposition is along the lines of [6].\n\nDefinition 2.1. A matrix M \u2208 {0, 1}^(m\u00d7n) is called a mask. 
If A is a partially known matrix, then the mask of A is the mask which has ones in exactly the positions which are known in A and zeros otherwise.\n\nDefinition 2.2. Let M be an (m\u00d7n) mask. We will call the unique bipartite graph G(M) which has M as bipartite adjacency matrix the completion graph of M. We will refer to the m vertices of G(M) corresponding to the rows of M as blue vertices, and to the n vertices of G(M) corresponding to the columns as red vertices. If e = (i, j) is an edge in Km,n (where Km,n is the complete bipartite graph with m blue and n red vertices), we will also write Ae instead of Aij for any (m\u00d7n) matrix A.\n\nA fundamental result, [6, Theorem 2.3.5], says that identifiability and reconstructability are, up to a null set, graph properties.\n\nTheorem 2.3. Let A be a generic\u00b9 and partially known (m\u00d7n) matrix of rank r, let M be the mask of A, and let i, j be integers. Whether Aij is reconstructible (uniquely, or up to finite choice) depends only on M and the true rank r; in particular, it does not depend on the true A.\n\nFor rank one, as opposed to higher rank, the set of reconstructible entries is easily obtainable from G(M) by combinatorial means:\n\nTheorem 2.4 ([6, Theorem 2.5.36 (i)]). Let G \u2286 Km,n be the completion graph of a partially known (m\u00d7n) matrix A. Then the set of uniquely reconstructible entries of A is exactly the set of Ae, with e in the transitive closure of G. In particular, all of A is reconstructible if and only if G is connected.\n\n\u00b9 In particular, if A is sampled from a continuous density, then the set of non-generic A is a null set.\n\n2.1 Reconstruction on the transitive closure\n\nWe extend Theorem 2.4\u2019s theoretical reconstruction guarantee by describing an explicit, algebraic algorithm for actually doing the reconstruction.\n\nDefinition 2.5. Let P \u2286 Km,n (or, C \u2286 Km,n) be a path (or, cycle), with a fixed start and end. 
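Theorem 2.4's criterion translates directly into an algorithm: an entry (i, j) is recoverable exactly when row vertex i and column vertex j lie in the same connected component of G(M). A minimal union-find sketch (our own illustration, not the paper's released code [7]):

```python
import numpy as np

def reconstructible_entries(mask):
    # Rank-one criterion of Theorem 2.4: entry (i, j) is uniquely
    # reconstructible iff row vertex i and column vertex j are in the
    # same connected component of the completion graph G(M).
    m, n = mask.shape
    parent = list(range(m + n))  # rows: 0..m-1, columns: m..m+n-1

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(m):
        for j in range(n):
            if mask[i, j]:
                parent[find(i)] = find(m + j)  # union row i with column j

    out = np.zeros((m, n), dtype=bool)
    for i in range(m):
        for j in range(n):
            out[i, j] = find(i) == find(m + j)
    return out

M = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 1]])
R = reconstructible_entries(M)  # block {rows 0,1} x {cols 0,1} and entry (2,2)
```

Connectivity only says *whether* an entry can be recovered; the paths and cycles defined next supply the actual reconstruction formulas.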
We\nwill denote by E+(P ) be the set of edges in P (resp. E+(C) and C) traversed from blue vertex to\na red one, and by E(P ) the set of edges traversed from a red vertex to a blue one 2. From now\non, when we speak of \u201coriented paths\u201d or \u201coriented cycles\u201d, we mean with this sign convention and\nsome \ufb01xed traversal order.\nLet A = Aij be a (m \u21e5 n) matrix of rank 1, and identify the entries Aij with the edges of Km,n.\nFor an oriented cycle C, we de\ufb01ne the polynomials\n\nPC(A) = Ye2E+(C)\nLC(A) = Xe2E+(C)\n\nAe,\n\nAe Ye2E(C)\nlog Ae Xe2E(C)\n\nand\n\nlog Ae,\n\nwhere for negative entries of A, we \ufb01x a branch of the complex logarithm.\nTheorem 2.6. Let A = Aij be a generic (m \u21e5 n) matrix of rank 1. Let C \u2713 Km,n be an oriented\ncycle. Then, PC(A) = LC(A) = 0.\nProof: The determinantal ideal of rank one is a binomial ideal generated by the (2 \u21e5 2) minors of A\n(where entries of A are considered as variables). The minor equations are exactly PC(A), where C\nis an elementary oriented four-cycle; if C is an elementary 4-cycle, denote its edges by a(C), b(C),\nc(C), d(C), with E+(C) = {a(C), d(C)}. Let C be the collection of the elementary 4-cycles, and\nde\ufb01ne LC(A) = {LC(A) : C 2 C} and PC(A) = {PC(A) : C 2 C}.\nBy sending the term log Ae to a formal variable xe, we see that the free Z-group generated by the\nLC(A) is isomorphic to H1(Km,n, Z). With this equivalence, it is straightforward that, for any\noriented cycle D, LD(A) lies in the Z-span of elements of LC(A) and, therefore, formally,\n\nLD(A) =XC2C\n\n\u21b5C \u00b7 LC(A)\n\nwith the \u21b5C 2 Z. Thus LD(\u00b7) vanishes when A is rank one, since the r.h.s. does. Exponentiating\n\u21e4\ncompletes the proof.\nCorollary 2.7. Let A = Aij be a (m \u21e5 n) matrix of rank 1. Let v, w be two vertices in Km,n. Let\nP, Q be two oriented paths in Km,n starting at v and ending at w. 
Then, for all A, it holds that L_P(A) = L_Q(A).\n\n3 A Combinatorial Algebraic Estimate for Missing Entries and Their Error\n\nWe now construct our estimator.\n\n3.1 The sampling model\n\nIn all of the following, we will assume that the observations arise from the following sampling process:\n\nAssumption 3.1. There is an unknown fixed, rank one, matrix A which is generic, and an (m\u00d7n) mask M \u2208 {0, 1}^(m\u00d7n) which is known. There is a (stochastic) noise matrix E \u2208 R^(m\u00d7n) whose entries are uncorrelated and which is multiplicatively centered with finite, non-zero\u00b3 variance; i.e., E(log Eij) = 0 and 0 < Var(log Eij) < \u221e for all i and j. The observed data is the matrix A \u2299 M \u2299 E = \u03a9(A \u2299 E), where \u2299 denotes the Hadamard (i.e., component-wise) product. That is, the observation is a matrix with entries Aij \u00b7 Mij \u00b7 Eij.\n\n\u00b2 Any fixed orientation of Km,n will give us the same result.\n\u00b3 The zero-variance case corresponds to exact reconstruction, which is handled already by Theorem 2.4.\n\nThe assumption of multiplicative noise is a necessary precaution in order for the presented estimator (and in fact, any estimator) for the missing entries to have bounded variance, as shown in Example 3.2 below. This is not, in practice, a restriction, since an infinitesimal additive error \u03b4Aij on an entry of A is equivalent to an infinitesimal multiplicative error \u03b4 log Aij = \u03b4Aij/Aij, and additive variances can be directly translated into multiplicative variances if the density function for the noise is known\u2074. The previous observation implies that the multiplicative noise model is as powerful as any additive one that allows bounded variance estimates.\n\nExample 3.2. Consider a (2\u00d72)-matrix A of rank 1. The unique equation between the entries is then A11 A22 = A12 A21. Solving for any entry will have another entry in the denominator, for example A11 = A12 A21 / A22. Thus we get an estimator for A11 when substituting observed and noisy entries for A12, A21, A22. When A22 approaches zero, the estimation error for A11 approaches infinity. In particular, if the density function of the noisy observation of A22 is too dense around the value 0, then the estimate for A11 given by the equation will have unbounded variance. In such a case, one can show that no estimator for A11 has bounded variance.\n\n3.2 Estimating entries and error bounds\n\nIn this section, we construct the unbiased estimator for the entries of a rank-one matrix with minimal variance. First, we define some notation to ease the exposition:\n\nNotations 3.3. We will denote by aij = log Aij and \u03b5ij = log Eij the logarithmic entries and noise. Thus, for some path P in Km,n we obtain\n\nL_P(A) = \u2211_{e\u2208E+(P)} ae \u2212 \u2211_{e\u2208E\u2212(P)} ae.\n\nDenote by bij = aij + \u03b5ij the logarithmic (observed) entries, and by B the (incomplete) matrix which has the (observed) bij as entries. Denote by \u03c3ij = Var(bij) = Var(\u03b5ij).\n\nThe components of the estimator will be built from the L_P:\n\nLemma 3.4. Let G = G(M) be the graph of the mask M. Let x = (v, w) \u2208 Km,n be any edge with v red. Let P be an oriented path in G(M) starting at v and ending at w. Then,\n\nL_P(B) = \u2211_{e\u2208E+(P)} be \u2212 \u2211_{e\u2208E\u2212(P)} be\n\nis an unbiased estimator for ax with variance Var(L_P(B)) = \u2211_{e\u2208P} \u03c3e.\n\nProof: By linearity of expectation and centeredness of the \u03b5ij, it follows that\n\nE(L_P(B)) = \u2211_{e\u2208E+(P)} E(be) \u2212 \u2211_{e\u2208E\u2212(P)} E(be);\n\nthus L_P(B) is unbiased. Since the \u03b5e are uncorrelated, the be also are; thus, by Bienaym\u00e9\u2019s formula, we obtain\n\nVar(L_P(B)) = \u2211_{e\u2208E+(P)} Var(be) + \u2211_{e\u2208E\u2212(P)} Var(be),\n\nand the statement follows from the definition of \u03c3e. \u220e\n\nIn the following, we will consider the following parametric estimator as a candidate for estimating ax:\n\nNotations 3.5. Fix an edge x = (v, w) \u2208 Km,n. Let P be a basis for the v\u2013w path space and denote #P by p. 
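The single-path estimator L_P(B) of Lemma 3.4 is a signed sum of logarithms along a path. A numerical sketch (the vertex-list encoding, sign convention starting from the row side, and function name are our own illustration, not the paper's implementation):

```python
import numpy as np

def path_estimate(logB, log_vars, path):
    # 'path' alternates row and column vertices, given as
    # [('r', i1), ('c', j1), ('r', i2), ..., ('c', jk)], starting at the
    # row and ending at the column of the target entry. Edges traversed
    # row->column contribute +log B_e, column->row contribute -log B_e.
    # The signed sum estimates log A_{i1, jk}; its variance is the sum
    # of the edge noise variances along the path (Lemma 3.4).
    est, var = 0.0, 0.0
    for (s1, u), (s2, v) in zip(path, path[1:]):
        i, j = (u, v) if s1 == 'r' else (v, u)
        est += (1.0 if s1 == 'r' else -1.0) * logB[i, j]
        var += log_vars[i, j]
    return est, var

# rank-one example A_ij = u_i * v_j, noiseless, so the estimate is exact:
u, v = np.array([1.0, 2.0]), np.array([3.0, 5.0])
logB = np.log(np.outer(u, v))
path = [('r', 0), ('c', 1), ('r', 1), ('c', 0)]       # estimates log A_00
est, var = path_estimate(logB, np.zeros((2, 2)), path)  # log(5*6/10) = log 3
```

Different paths to the same entry give different (correlated) estimates; the estimator X(α) defined next averages them with weights chosen to minimize the combined variance.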
For \u03b1 \u2208 R^p, set X(\u03b1) = \u2211_{P\u2208P} \u03b1_P L_P(B). Furthermore, we will denote by 1 the p-vector of ones.\n\n\u2074 The multiplicative noise assumption causes the observed entries and the true entries to have the same sign. A change of sign can be modeled by adding another multiplicative binary random variable to the model which takes values \u00b11; this adds an independent combinatorial problem for the estimation of the sign, which can be done by maximum likelihood. In order to keep the exposition short and easy, we did not include this in the exposition.\n\nThe following Lemma follows immediately from Lemma 3.4 and Theorem 2.6:\n\nLemma 3.6. E(X(\u03b1)) = (1\u22a4\u03b1) \u00b7 ax; in particular, X(\u03b1) is an unbiased estimator for ax if and only if 1\u22a4\u03b1 = 1.\n\nWe will now show that minimizing the variance of X(\u03b1) can be formulated as a quadratic program with coefficients entirely determined by the variances \u03c3e, the measurements be and the graph G(M). In particular, we will expose an explicit formula for the \u03b1 minimizing the variance. The formula will make use of the following path kernel. For fixed vertices s and t, an s\u2013t path is the sum of a cycle in H1(G, Z) and the formal edge x_st; the s\u2013t path space is the linear span of all the s\u2013t paths. We discuss its relevant properties in Appendix A.\n\nDefinition 3.7. For an edge e \u2208 Km,n and a path P, set c_{e,P} = \u00b11 if e \u2208 E\u00b1(P), otherwise c_{e,P} = 0. Let P, Q \u2208 P be any fixed oriented paths. Define the (weighted) path kernel k : P \u00d7 P \u2192 R by\n\nk(P, Q) = \u2211_{e\u2208Km,n} c_{e,P} \u00b7 c_{e,Q} \u00b7 \u03c3e.\n\nUnder our assumption that Var(be) > 0 for all e \u2208 Km,n, the path kernel is positive definite, since it is a sum of positive semi-definite functions, one per edge; in particular, its kernel matrix has full rank. Here is the variance-minimizing unbiased estimator:\n\nProposition 3.8. 
Let x = (s, t) be a pair of vertices, and P a basis for the s\u2013t path space in G with p elements. Let \u03a3 be the p\u00d7p kernel matrix of the path kernel with respect to the basis P. For any \u03b1 \u2208 R^p, it holds that Var(X(\u03b1)) = \u03b1\u22a4\u03a3\u03b1. Moreover, under the condition 1\u22a4\u03b1 = 1, the variance Var(X(\u03b1)) is minimized by \u03b1 = \u03a3\u207b\u00b9 1 / (1\u22a4\u03a3\u207b\u00b9 1).\n\nProof: By inserting definitions, we obtain\n\nX(\u03b1) = \u2211_{P\u2208P} \u03b1_P L_P(B) = \u2211_{P\u2208P} \u03b1_P \u2211_{e\u2208Km,n} c_{e,P} be.\n\nWriting b = (be) \u2208 R^(mn) as a vector, and C = (c_{e,P}) \u2208 R^(mn\u00d7p) as a matrix, we obtain X(\u03b1) = b\u22a4C\u03b1. By using that Var(\u03bb \u00b7 Z) = \u03bb\u00b2 Var(Z) for any scalar \u03bb, and the uncorrelatedness of the be, a calculation yields Var(X(\u03b1)) = \u03b1\u22a4\u03a3\u03b1. In order to determine the minimum of the variance in \u03b1, consider the Lagrangian\n\nL(\u03b1, \u03bb) = \u03b1\u22a4\u03a3\u03b1 + \u03bb (1 \u2212 \u2211_{P\u2208P} \u03b1_P),\n\nwhere the slack term models the condition 1\u22a4\u03b1 = 1. A straightforward computation yields\n\n\u2202L/\u2202\u03b1 = 2\u03a3\u03b1 \u2212 \u03bb1.\n\nDue to positive definiteness of \u03a3, the function Var(X(\u03b1)) is convex, thus \u03b1 = \u03a3\u207b\u00b9 1 / (1\u22a4\u03a3\u207b\u00b9 1) is the unique \u03b1 minimizing the variance while satisfying 1\u22a4\u03b1 = 1. \u220e\n\nRemark 3.9. The above setup works in wider generality: (i) if Var(be) = 0 is allowed and there is an s\u2013t path of all zero-variance edges, the path kernel becomes positive semi-definite; (ii) similarly, if P is replaced with any set of paths at all, the same may occur. In both cases, we may replace \u03a3\u207b\u00b9 with the Moore\u2013Penrose pseudo-inverse and the proposition still holds: (i) reduces to the exact reconstruction case of Theorem 2.4; (ii) produces the optimal estimator with respect to P, which is optimal provided that P is spanning, and adding paths to P does not make the estimate worse.\n\nOur estimator is optimal over a fairly large class.\n\nTheorem 3.10. 
Let \u00c2ij be any estimator for an entry Aij of the true matrix that is: (i) unbiased; (ii) a deterministic piecewise smooth function of the observations; and (iii) independent of the noise model. Let A*ij be the estimator from Proposition 3.8. Then Var(A*ij) \u2264 Var(\u00c2ij).\n\nWe give a complete proof in the full version. Here, we prove the special case of log-normal noise, which gives an alternate viewpoint on the path kernel.\n\nProof: As above, we work with the formal logarithm aij of Aij. For log-normal noise, the \u03b5e are independently distributed normals with variance \u03c3e. It then follows that, for any P in the i\u2013j path space,\n\nL_P(B) ~ N(aij, \u2211_{e\u2208P} \u03c3e),\n\nand the kernel matrix \u03a3 of the path kernel is the covariance matrix for the L_P in our path basis. Thus, the L_P have distribution N(aij \u00b7 1, \u03a3). It is well known that any multivariate normal has a linear reparameterization so that the coordinates are independent; a computation shows that, here, \u03b1 = \u03a3\u207b\u00b9 1 / (1\u22a4\u03a3\u207b\u00b9 1) gives the correct linear map. Thus, the estimator A*ij is the sample mean of the coordinates in the new parameterization. Since this is a sufficient statistic, we are done via the Lehmann\u2013Scheff\u00e9 Theorem. \u220e\n\n3.3 Rank 2 and higher\n\nAn estimator for rank 2 and higher, together with a variance analysis, can be constructed similarly once all the solving polynomials are known. The main difficulty lies in the fact that these polynomials are not parameterized by cycles anymore, but by specific subgraphs of G(M), see [6, Section 2.5], and that they are not necessarily linear in the missing entry Ae. 
However, even with approximate oracles for evaluating these polynomials and estimating their covariances, an estimator similar to X(\u03b1) can be constructed and analyzed; in particular, we still need only to consider a basis for the space of \u201ccircuits\u201d through the missing entry and not a costly brute-force enumeration.\n\n3.4 The algorithms\n\nWe now give the algorithms for estimating/denoising entries and computing the variance bounds; an implementation is available from [7]. Since the path matrix C, the path kernel matrix \u03a3, and the optimal \u03b1 are required for both, we show how to compute them first.\n\nAlgorithm 1 Calculates the path kernel \u03a3 and \u03b1.\nInput: index (i, j), an (m\u00d7n) mask M, variances \u03c3.\nOutput: path matrix C, path kernel \u03a3 and minimizer \u03b1.\n1: Find a linearly independent set of paths P in the graph G(M), starting from i and ending at j.\n2: Determine the matrix C = (c_{e,P}) with e \u2208 G(M), P \u2208 P; set c_{e,P} = \u00b11 if e \u2208 E\u00b1(P), otherwise c_{e,P} = 0.\n3: Define a diagonal matrix S = diag(\u03c3), with See = \u03c3e for e \u2208 G(M).\n4: Compute the kernel matrix \u03a3 = C\u22a4SC.\n5: Calculate \u03b1 = \u03a3\u207b\u00b9 1 / (1\u22a4\u03a3\u207b\u00b9 1).\n6: Output C, \u03a3 and \u03b1.\n\nWe can find a basis for the path space in linear time. To keep the notation manageable, we will conflate formal sums of the xe, cycles in H1(G, Z), and their representations as vectors in Rm. 
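The core of Algorithm 1 (steps 3 to 5) reduces to a few lines of linear algebra. A hedged sketch, assuming the signed path matrix C has already been computed (e.g., by Algorithm 2); the function name is ours:

```python
import numpy as np

def path_kernel_weights(C, sigma):
    # C: signed (edges x paths) incidence matrix (c_{e,P}),
    # sigma: vector of edge noise variances.
    S = np.diag(sigma)
    Sigma = C.T @ S @ C                      # path kernel matrix (step 4)
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)         # Sigma^{-1} 1
    alpha = w / (ones @ w)                   # minimizer of Prop. 3.8 (step 5)
    min_var = float(alpha @ Sigma @ alpha)   # equals 1 / (1^T Sigma^{-1} 1)
    return alpha, min_var

# two edge-disjoint paths with unit edge variances of length 2 each:
# equal weights, and averaging halves the per-path variance of 2
C = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
alpha, v = path_kernel_weights(C, np.ones(4))
```

When paths share edges, the off-diagonal entries of the kernel matrix become nonzero and the optimal weights deviate from the uniform average, which is exactly what distinguishes this estimator from naively averaging the path estimates.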
Correctness is proven in Appendix A.\n\nAlgorithm 2 Calculates a basis P of the path space.\nInput: index (i, j), an (m\u00d7n) mask M.\nOutput: a basis P for the space of oriented i\u2013j paths.\n1: If (i, j) is not an edge of M, and i and j are in different connected components, then P is empty. Output \u2205.\n2: Otherwise, if (i, j) is not an edge of M, add a \u201cdummy\u201d copy.\n3: Compute a spanning forest F of M that does not contain (i, j), if possible.\n4: For each edge e \u2208 M \\ F, compute the fundamental cycle Ce of e in F.\n5: If (i, j) is an edge in M, output {x(i,j)} \u222a {Ce \u2212 x(i,j) : e \u2208 M \\ F}.\n6: Otherwise, let P(i,j) = C(i,j) \u2212 x(i,j). Output {Ce \u2212 P(i,j) : e \u2208 M \\ (F \u222a {(i, j)})}.\n\nAlgorithms 3 and 4 can then make use of the calculated C, \u03b1, \u03a3 to determine an estimate for any entry Aij and its minimum variance bound. The algorithms follow the exposition in Section 3.2, from where correctness follows; Algorithm 3 additionally provides treatment for the sign of the entries.\n\nAlgorithm 3 Estimates the entry Aij.\nInput: index (i, j), an (m\u00d7n) mask M, log-variances \u03c3, the partially observed and noisy matrix B.\nOutput: The variance-minimizing estimate for Aij.\n1: Calculate C and \u03b1 with Algorithm 1.\n2: Store B as a vector b = (log |Be|) and a sign vector s = (sgn Be) with e \u2208 G(M).\n3: Calculate \u00c2ij = \u00b1 exp(b\u22a4C\u03b1). The sign is + if each column of s\u22a4|C| (|\u00b7| component-wise) contains an odd number of entries 1, else \u2212.\n4: Return \u00c2ij.\n\nAlgorithm 4 Determines the variance of the entry log(Aij).\nInput: index (i, j), an (m\u00d7n) mask M, log-variances \u03c3.\nOutput: The variance lower bound for log(Aij).\n1: Calculate \u03a3 and \u03b1 with Algorithm 1.\n2: Return \u03b1\u22a4\u03a3\u03b1.\n\nAlgorithm 4 can be used to obtain the variance bound independently of the observations. 
The variance bound is relative, due to its multiplicativity, and can be used to approximate absolute bounds when any reconstruction estimate \u00c2ij (in particular, not necessarily the one from Algorithm 3) is available. Namely, if \u03c3\u0302ij is the estimated variance of the logarithm, we obtain an upper confidence/deviation bound \u00c2ij \u00b7 exp(\u221a\u03c3\u0302ij) for \u00c2ij, and a lower confidence/deviation bound \u00c2ij \u00b7 exp(\u2212\u221a\u03c3\u0302ij), corresponding to the log-confidence log \u00c2ij \u00b1 \u221a\u03c3\u0302ij. Also note that if Aij is not reconstructible from the mask M, then the deviation bounds will be infinite.\n\n4 Experiments\n\n4.1 Universal error estimates\n\nFor three different masks, we calculated the predicted minimum variance for each entry of the mask. The mask sizes are all 140\u00d7140. The multiplicative noise was assumed to have variance \u03c3e = 1 for each entry. Figure 1 shows the predicted a priori minimum variances for each of the masks. The structure of the mask affects the expected error. Known entries generally have least variance, and it is less than the initial variance of 1, which implies that the (independent) estimates coming from other paths can be used to successfully denoise observed data. For unknown entries, the structure of the mask is mirrored in the pattern of the predicted errors; a diffuse mask gives a similar error on each missing entry, while the more structured masks have structured error which is determined by combinatorial properties of the completion graph.\n\nFigure 1: The figure shows three pairs of masks and predicted variances. A pair consists of two adjacent squares. The left half is the mask, which is depicted by a red/blue heatmap with red entries known and blue unknown. 
The right half is a multicolor heatmap with color scale, showing the predicted variance of the completion. Variances were calculated by our implementation of Algorithm 4.\n\nFigure 2: (a) mean squared errors; (b) error vs. predicted variance. For 10 randomly chosen masks and a 50\u00d750 true matrix, matrix completions were performed with Nuclear Norm (green), OptSpace (red), and Algorithm 3 (blue) under multiplicative noise with variance increasing in increments of 0.1. For each completed entry, minimum variances were predicted by Algorithm 4. 2(a) shows the mean squared error of the three algorithms for each noise level, coded by the algorithms\u2019 respective colors. 2(b) shows a bin-plot of errors (y-axis) versus predicted variances (x-axis) for each of the three algorithms: for each completed entry, a pair (predicted error, true error) was calculated, the predicted error being the predicted variance, and the actual prediction error being the squared logarithmic error (i.e., (log |a_true| \u2212 log |a_predicted|)\u00b2 for an entry a). Then, the points were binned into 11 bins with equal numbers of points. 
The figure shows the mean of the errors (second coordinate) of the value pairs with predicted variance (first coordinate) in each of the bins; the color corresponds to the particular algorithm; each group of bars is centered on the minimum value of the associated bin.\n\n4.2 Influence of noise level\n\nWe generated 10 random masks of size 50\u00d750 with 200 entries sampled uniformly, and a random (50\u00d750) matrix of rank one. The multiplicative noise was chosen entry-wise independent, with variance \u03c3i = (i \u2212 1)/10 at noise level i. Figure 2(a) compares the Mean Squared Error (MSE) for three algorithms: Nuclear Norm (using the implementation of Tomioka et al. [10]), OptSpace [5], and Algorithm 3. It can be seen that on these masks, Algorithm 3 is competitive with the other methods and even outperforms them for low noise.\n\n4.3 Prediction of estimation errors\n\nThe data are the same as in Section 4.2, as are the compared algorithms. Figure 2(b) compares the error of each of the methods with the variance predicted by Algorithm 4 each time the noise level changed. The figure shows that, for any of the algorithms, the mean of the actual error increases with the predicted error, showing that the error estimate is useful for a priori prediction of the actual error, independently of the particular algorithm. Note that by construction of the data, this statement holds in particular for entry-wise predictions. Furthermore, in quantitative comparison, Algorithm 3 also outperforms the other two in each of the bins. The qualitative reversal between the algorithms in Figures 2(a) and 2(b) comes from the different error measures and the conditioning on the bins.\n\n5 Conclusion\n\nIn this paper, we have introduced an algebraic-combinatorics-based method for reconstructing and denoising single entries of an incomplete and noisy matrix, and for calculating confidence bounds of single-entry estimations for arbitrary algorithms. 
We have evaluated these methods against state-of-the-art matrix completion methods. Our method is competitive and yields the first known a priori variance bounds for reconstruction. These bounds coarsely predict the performance of all the methods. Furthermore, our method can reconstruct and estimate the error for single entries. It can be restricted to using only a small number of nearby observations and smoothly improves as more information is added, making it attractive for applications on large-scale data. These results are an instance of a general algebraic-combinatorial scheme and viewpoint that we argue is crucial for the future understanding and practical treatment of big data.\n\nReferences\n\n[1] E. Acar, D. Dunlavy, and T. Kolda. Link prediction on evolving data using matrix and tensor factorizations. In Data Mining Workshops, ICDMW\u201909, IEEE International Conference on, pages 262\u2013269. IEEE, 2009.\n[2] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying. A spectral regularization framework for multi-task structure learning. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in NIPS 20, pages 25\u201332. MIT Press, Cambridge, MA, 2008.\n[3] E. J. Cand\u00e8s and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717\u2013772, 2009. doi: 10.1007/s10208-009-9045-5.\n[4] A. Goldberg, X. Zhu, B. Recht, J. Xu, and R. Nowak. Transduction with matrix completion: Three birds with one stone. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 757\u2013765. 2010.\n[5] R. H. Keshavan, A. Montanari, and S. Oh. Matrix completion from a few entries. IEEE Trans. Inform. Theory, 56(6):2980\u20132998, 2010. doi: 10.1109/TIT.2010.2046205.\n[6] F. J. Kir\u00e1ly, L. Theran, R. Tomioka, and T. Uno. The algebraic combinatorial approach for low-rank matrix completion. Preprint, arXiv:1211.4116v4, 2012. URL http://arxiv.org/abs/1211.4116.\n[7] F. J. Kir\u00e1ly and L. Theran. AlCoCoMa, 2013. http://mloss.org/software/view/524/.\n[8] A. Menon and C. Elkan. Link prediction via matrix factorization. In Machine Learning and Knowledge Discovery in Databases, pages 437\u2013452, 2011.\n[9] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in NIPS 17, pages 1329\u20131336. MIT Press, Cambridge, MA, 2005.\n[10] R. Tomioka, K. Hayashi, and H. Kashima. On the extension of trace norm to tensors. In NIPS Workshop on Tensors, Kernels, and Machine Learning, 2010.\n", "award": [], "sourceid": 1129, "authors": [{"given_name": "Franz", "family_name": "Kiraly", "institution": "TU Berlin"}, {"given_name": "Louis", "family_name": "Theran", "institution": "Freie Universit\u00e4t Berlin"}]}