Singular Value Decomposition
Intuition and logic behind SVD
Before talking about SVD, let us discuss the root cause: why we wanted a decomposition like this in the first place.
In linear algebra, people have some high beauty standards: a deep love for square matrices. These were demigods to them, blessed with so many good properties. But in this extreme love for square matrices, the rectangular matrices were left behind.
Let us discuss one of the main things that people wanted to do even with rectangular matrices: the diagonalization of a matrix. This is done for square matrices, and it is a decomposition of a square matrix A into 3 different matrices, A = P D P^(-1), where D is diagonal (its entries are the eigenvalues of A) and the columns of P are the corresponding eigenvectors.
This kind of decomposition helps people in a lot of ways and is considered good. One example is computing powers of a matrix, since A^k = P D^k P^(-1), along with many other things which are not in the scope of this text.
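As a quick sanity check of the diagonalization described above, here is a minimal sketch with NumPy (the specific matrix is my own illustrative choice, not from the text):

```python
import numpy as np

# Diagonalize a small square matrix A as A = P D P^{-1}
# using its eigendecomposition.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, P = np.linalg.eig(A)   # columns of P are eigenvectors of A
D = np.diag(eigvals)            # eigenvalues on the principal diagonal

# Multiplying the three pieces back together recovers A.
reconstructed = P @ D @ np.linalg.inv(P)
assert np.allclose(reconstructed, A)
```

Note this only works because A here happens to be diagonalizable; the whole point of the discussion is that no such P and D exist for a rectangular matrix.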
So, thus came the idea of SVD: diagonalization is a thing only for square matrices, and for rectangular matrices we can't do this kind of decomposition. Hence we sought a different technique.
Instead of diagonalizing the matrix, we decompose it in a different way and work things out from there. Even finding an "inverse" for a rectangular matrix uses this technique, and the result is called a pseudo-inverse. So, let us deal with this new thing.
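To make the pseudo-inverse remark concrete, here is a hedged sketch of how the Moore–Penrose pseudo-inverse of a rectangular matrix is built out of the SVD (the matrix is an arbitrary example of mine; `np.linalg.pinv` does essentially this internally):

```python
import numpy as np

# A 3x2 rectangular matrix: no ordinary inverse exists.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Pseudo-inverse from the SVD: invert the nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T   # V D^{-1} U^T

assert np.allclose(A_pinv, np.linalg.pinv(A))
# Since A has full column rank, the pseudo-inverse is a left inverse:
assert np.allclose(A_pinv @ A, np.eye(2))
```

In general one inverts only the singular values above a small tolerance, which is how the pseudo-inverse stays well defined even when the matrix is rank-deficient.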
Before going into the SVD, I shall write down some of the questions I got while reading the text on SVD in the book Foundations of Data Science by Blum.
- If a matrix has rows in d-dimensional space but its rank is only k, what is the best k-dimensional subspace to which the matrix can be shrunk?
To know the answer, we shall first go through the situations where this is helpful in real life. Let us take the example of machine learning. In this field, we typically deal with huge datasets whose matrices are often of much lower rank than full rank. In these situations, we prefer to fit the matrix into a subspace of lower dimension, which eventually helps us in doing the math over it. Great, it seems motivation enough. Then how?
The best fit subspace
Given that we have a matrix A, and we want to find the best fit subspace, we denote the decomposition as follows (calm down, I shall dissect each piece): A = U D V^T.
Let us say that the columns of U are u_1, ..., u_r and the columns of V are v_1, ..., v_r. The entries σ_1, ..., σ_r of the matrix D are on the principal diagonal, and D is a diagonal matrix. We shall define the vectors in each matrix soon.
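The shape of this decomposition can be checked numerically. A minimal sketch (the random matrix is my own example, and I use NumPy's reduced SVD, which keeps only r columns):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # a rectangular matrix

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(sigma)                        # singular values on the diagonal

assert np.allclose(U @ D @ Vt, A)         # A = U D V^T reconstructs A
assert np.allclose(U.T @ U, np.eye(3))    # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))  # columns of V are orthonormal
assert np.all(sigma >= 0)                 # singular values are non-negative
```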
Let us say we have got some "kind of" eigenvalues and eigenvectors for the matrix A: a number σ ≥ 0 and a unit vector v such that A^T A v = σ² v, i.e., |Av| = σ. So, we follow the following procedure.
We call σ a singular value of the matrix A and v a singular vector. The main difference between a singular value and an eigenvalue is that a singular value is always non-negative, while an eigenvalue can be positive as well as negative. Now we shall come back to the point of finding the best fit subspace. We say that the vectors v_1, ..., v_k are the orthonormal vectors that shall span the subspace, and we need to find them out. Initially, we shall go by the basic method.
Let us say that we have n points in d-dimensional space, and the points are the rows a_1, ..., a_n of the matrix A. The matrix is usually of low rank when we speak about these datasets. So, we plot the points in d-dimensional space and figure out that one single unit vector which has the least total perpendicular distance from those points, or equivalently the maximum projection of those points onto it. We shall call it the top singular vector. How shall we figure it out? It is the unit vector v_1 for which |Av| is largest: v_1 = arg max over |v| = 1 of |Av|.
We call v_1 the top singular vector, and the length σ_1(A) = |Av_1| is the top singular value of A.
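One way to actually hunt for this maximizing vector is power iteration on A^T A; this is an assumed method of mine, not the book's construction, but it converges to the same v_1. A sketch, compared against NumPy's own SVD:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))

# Power iteration: repeatedly apply A^T A and renormalize; the iterate
# swings toward the direction that |Av| stretches the most.
v = rng.standard_normal(4)
for _ in range(500):
    v = A.T @ (A @ v)
    v /= np.linalg.norm(v)

sigma1 = np.linalg.norm(A @ v)   # top singular value = |A v_1|

# Compare with numpy's SVD: same value, same direction up to sign.
U, s, Vt = np.linalg.svd(A)
assert np.isclose(sigma1, s[0])
assert np.isclose(abs(v @ Vt[0]), 1.0)
```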
That would have been great if all the points lay on that one line, but generally it is not the case: some of the points may not be aligned with the line, and we then need a bigger subspace. The question is how to find such a subspace. We find a vector v_2 perpendicular to v_1 which, among all such vectors, again has the maximum projection of the points onto it; the subspace now contains v_1 along with v_2. If all the points lie on this plane, the perpendicular distance from the points to the subspace becomes zero. How do we know when to stop entertaining this kind of iteration? We continue the process of finding a new perpendicular vector and checking the projection; when we find that the projection of A onto every remaining direction is zero, we stop, and no more new vectors are entertained.
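The greedy process above can be sketched in code. This is my own illustrative version, in which the "find the best remaining direction" step is delegated to `np.linalg.svd` for brevity, and the perpendicularity is enforced by subtracting off (deflating) each found direction:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((20, 2))
A = B @ rng.standard_normal((2, 5))     # a rank-2 matrix of points in 5-d

vectors = []
residual = A.copy()
while True:
    U, s, Vt = np.linalg.svd(residual)
    if s[0] < 1e-10:                    # nothing left to project: stop
        break
    v = Vt[0]                           # best direction for what remains
    vectors.append(v)
    # Remove each row's component along v, leaving only the part
    # perpendicular to all vectors found so far.
    residual = residual - np.outer(residual @ v, v)

assert len(vectors) == 2                # the process stops at rank(A)
```

The stopping rule mirrors the text: once the projection of every row onto any new direction is (numerically) zero, the subspace already contains all the points.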
Before defining the Frobenius norm, let us do some calculations. Let us say each row of the matrix A is a_i, and the vectors that span the whole row space of A are v_1, ..., v_r. Then Σ_j (a_i · v_j)² = |a_i|²,
since the value a_i · v_j is the projection of the row onto v_j, and the v_j span the subspace containing the rows, so summing the squared projections gives just the squared magnitude of the row itself. When we speak of the whole matrix, it becomes: Σ_i Σ_j (a_i · v_j)² = Σ_i |a_i|².
But Σ_i |a_i|² is just the sum of the squares of all the entries of A, and this quantity is what we call the squared Frobenius norm: ||A||_F² = Σ_(i,j) a_ij².
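These identities are easy to verify numerically. A small sketch (random matrix of my choosing), which also checks the well-known companion fact that the squared Frobenius norm equals the sum of the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))

row_lengths_sq = sum(np.dot(a, a) for a in A)   # sum_i |a_i|^2
frob_sq = np.sum(A**2)                          # sum of all entries squared
sigma_sq = np.sum(np.linalg.svd(A, compute_uv=False)**2)

assert np.isclose(row_lengths_sq, frob_sq)      # rows vs. entries
assert np.isclose(frob_sq, sigma_sq)            # entries vs. singular values
assert np.isclose(frob_sq, np.linalg.norm(A, 'fro')**2)
```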