Thursday, January 25, 2007

On making the sparsity parameter an implicit variable in the program

Currently the sparsity parameter k ( as in card(x) <> ) must be hand tuned when we run the solver. What if we turn this parameter into a variable in the program. I.e., we let the program "search" for the appropriate sparsity. Just something to think about.... I think it should be fairly easy to modify current program to do this.

just a thought...

I wonder how well we can apply machine translation techniques to heterogeneous data modeling. Something to look into when I got some time.

Wednesday, January 24, 2007

Post. Response.

I agree that our paper should focus more on the mathematical technique. As such I'm wondering about the type of real world experiments that we should include, because, of course, without some really good real world results, our goose is as good as cooked. It seems to me that there is a bit of a trade off between two factors concerning the real world data:

One factor tends toward the simplicity of any tests that we include. In other words we should keep the descriptions of tests quick and simple so as not to dilute the core of the paper, i.e., the focus on the mathematical technique. This means performing tests that do not require significant explanation or specialization, or at the very least, we should distill the explanations to as simple a form as possible.

At the same time I feel we should be cautious about keeping our experiments too simple. My main concern is that simply comparing our method on a few data sets and publishing our results isn't going to be interesting enough to warrant an acceptance into the conference. Naturally (and hopefully) once we run these experiments we'll be able to find something interesting to say beyond just comparing methods. Specifically something we talked about in meeting today regarding quantitative data we could report, was comparing the variance explained versus sparsity. This would be appropriate toward PCA and CCA methods. As to FDA methods we talked today about using datasets that were used in prior sparse SVM papers. (Something I'll look into.) I could also talk to Ton about more sparse data sets.

The approach we're taking for now, seeing as how we have a semi-comfortable cushion of time left before the deadline, is to run this baby on as many tests as possible and choose a direction once the smoke clears. So if PCA methods work out best we can focus more on the speed/performance benefits of our new sparse method, where as if we can include an interesting CCA test, that would be a good contribution for this paper seeing as how there is fewer sparse CCA papers out there.

On 1/24/07, Bharath Kumar SV wrote:Hi Gert

Me and David have been working on this sparse eigen value problems. i
extended the framework of jason weston and it kind of comes out to be
a neat framework and seems to be working. but we need to set up
experiments similar to ur SIAM paper so as to compare this approach.
interestingly, i was re-reading ur sparse pca paper and finally
realized that the reason for SDP might be to make the problem linear
in its objective as we are maximizing a convex objective. even without
zero-norm constraint and just using L1 constraint, the problem is
still hard to solve as it is maximization of a convex function. to
over come that, u guys have done this lifting stuff and then relaxed
the zero-norm constraint. interestingly for sparse pca, the zero-norm
relaxation is effectively yielding 1-norm relaxation.

Now, if you remember the short report i showed following Laurent El
Ghaoui's work. I realized that there we have maximzation of sum of max
functions..which agian is hard to solve. I am actually going to do
Weston type extension to that method and see how it relates to the
present algorithm we have.

We can target for ICML as a more general paper on sparse component
analysis where we talk about sparsity for generalized eigen value
problems and then how pca, cca, fda, sparse dictionary learing etc.
falls into this framework. But somehow I am wondering whether it
dilutes the paper by not focussing on one thing. The other approach i
am thinking is do this generalized eigen value problem and reduce pca
as a special case and do the type of experiements you people have done
showing the validity of the method. this covers the unsuprevised part.
for the supervised part, we cna use fda and try to learn sparse linear
discriminators. people have done feature selection for svms and we do
it using fda.

The other things we might give a passing reference and we can sum it
all in a good journal paper.

Please let us know what you think.

Regards
Bharath

Thursday, January 18, 2007

next steps...

Just to reiterate, we're breaking this project into two parts. The first part comprises the submission to ICML, in which we present our statistical method as generally and as accessibly as possible. The second part will be applying the method to a very specific application ( the vocab selection problem ) which will be submitted to a specialized conference.

As for the organization of the ICML entry, I'm looking at Gert's NIPS paper on sparse PCA as an example, seeing as how our method is very similar to the one on this paper. There are three main components to this paper: mathematical formulation, results on synthetic data, and results on a real world dataset.

As to the mathematical formulation for this paper, this part is essentially complete unless we opt perform some alternative formulations of the mathematics. I had a meeting with Bharath yesterday in which we talked about some alternative methods for approximating the zero-norm function. Some of these methods lead to non-convex problem and thus we are cursed with the existence of many local minima. However these methods may approximate the solution better, and they are mathematically more interesting (though I may be saying that because I've been staring at the same damn formulations since last summer.) The real benefits of some of these alternative methods, according to Bharath, is that they handle scalability more elegantly. We're concerned right now over how long it is going to take to solve some of these systems.

Though the execution time of these program is a concern, there is a light at the end of the tunnel, so to speak. Experiments that we have run so far do finish in an appropriate time. The amount of time we can expect to wait is related to the dimensionality of the data points we are looking at. Based on experiments run on synthetic data, we can process 10 dimensional data ( a 10 x 10 matrix ) in about less than half a second; 100 dimensional data in about 3 to 5 seconds and 1000 dimensional data in about 1 to 1.5 hours. Many of the the interesting problems out there live in this 1000 dimensional world, and its good news to see that the waiting time isn't too bad (not only that but the waiting time isn't to bad on SeDuMi, which is a lot slower than some of the other packages out there.

Keypoints:
  • Look at alternative formulations only after we decide which experiments we are going to run.
  • Even if dimensionality is high, we might not need to wait too long to run experiments.

As for synthetic data experiments, I haven't given much thought to what we might be able to do other than some of the experiments that i have already run. Specifically I have one test which compares the correlation scores of various methods when applied on random data. There was also an additional experiment, in which data was constructed that was explicitly sparse (unlike the random data in the last experiment) in an attempt to ensure that the sparse CCA method did, in fact, hone in on the correct dimensions. One thing I didn't do is check how other methods perform under this inherently sparse data. Plain vanilla CCA did the best in the non-sparse experiments, and this is to be expected. In the first experiment CCA was used as an upperbound for the performance of the sparse methods. One thing I haven't seen, is if CCA underperforms on inherently sparse data. I suspect this would be the case based on some results that Gert published in his paper.

If sCCA outperforms CCA on inherently sparse data, and does so consistently, then it gives a lot of relevance to our method since it could be argued that a lot of real world data is sparse.

Keypoints
  • Compare all methods on the inherently sparse data
  • If sCCA outperforms CCA on sparse data, it gives us a lot of relevance.

The real world application is going to be a crucial component of this paper, and since we've decided to postpone doing the vocab selection problem because it's too narrow of an application, the next major task to find a collection of experiments that we can run that we can compare our results to. Gert already mentioned using the French/English text dataset, CCA has also been used in a few medical applications, there is the Pixels That Sound paper, and also the Image/Hypertext experiment that Bharath told me about. These all sound like good candidates, hopefully they can be run in a tractable amount of time.

The first priority for now is to find these and other datasets/experiments to compare ourselves to. After we find some we can look and them and decide if they will fit easily into our framework, or if we need to change our formulation and implementation to fit these experiments.

Wednesday, January 17, 2007

notes from meeting with doug

I had a talk with Doug last night about the direction in which to take this project (at least as far as the the ICML deadline is concerned.) As was the original motivation, I was planning on showing how you can use CCA to (try to) solve a vocabulary selection task, specifically on our music/text data set.

A key insight is that this puts an emphasis on the application, not the method. With this being a general machine learning conference, emphasis on this admittedly narrow application may not be the best way to impress upon the community the usefulness of CCA, much less sparse CCA. It is true that CCA is a far lesser known cousin to PCA. While the name is thrown around the hallways of machine learning departments all around, it does seem like CCA is one of those tools that falls short of being standard canonical instruction material. I'm not touting that CCA should be taught in all machine learning intro courses or anything like that, CCA is in some respects deservingly relegated to the second ranks of statistical tools because the conditions in which such a tool would be used (i.e., when dealing with heterogeneous) is rarer.

However dealing with heterogeneous data is a very obvious ground for exploration, and anyone interested in such a type of data modeling should be aware of this tool. This is the point that I will try drive home for this paper, that, in a general machine learning framework, CCA is a very important tool when it comes to modeling heterogeneous data, and that additionally there is this slick way to impose sparsity on the solution. So in some respect part of this may be doing some consciousness raising, depending on just how much previous research has been on concerning CCA.

The next step is to scour the web for any good CCA data sets that may be around, and performing a more comprehensive literature review.

Thursday, January 11, 2007

Sanity Check

As SCCA is based on solving an SPCA problem I ran a sanity check by running SPCA on synthetic data as done in Gert and Jordan's paper. Sure enough it does work. It the test they created a synthetic random vector v = ( 1 0 1 0 1 0 1 0 1 0 )' and running SPCA on U + Sig * v * v'. Where U is noise, and Sig is a signal to noise ratio.

You can run SCCA with a cardinality constraint set to k = 1 to 10. Very interesting experiment. The inherent cardinality of the example is 5. This becomes apparent in the SPCA direction when k = 6. When k = 5 sometimes the PCA directions flips negative other times positive. As k goes up to 10 the non-zero elements become less sparse in w. Similarly as k goes to 1.

This bring up an imporant issue to keep in mind while I'm running experiments, and that is that the value of k, the intrinsic cardinality of the problem, is an unknown parameter that will need to be honed in on. (Since this value is bounded by the dimentionality of the data, we can perform a binary search on the data to find this value.)

Back on Track

So took a bit of a hiatus on the sparse CCA project last quarter to help out my lab mates, Doug and Luke, on another music research paper. Now that that's over with I'm back to my project and committed to getting serious work done as, for one, my time and money here in UCSD is running out, and two, there's a great opportunity to get a research paper out of this as ICML has their deadline next month.

The bad news is that that gives about a month to get this baby off of the ground. With a little bit of luck (i.e. hard work) perhaps I can make the deadline.

Anyway, at the end of last quarter I ran some experiments comparing sCCA with other methods of inducing sparsity. I wanna talk about some of the results here. Before I do that I want to add a note about something I stumbled across as solution to the regularization issue that I encountered a few months ago.

The regularization issue arises because of the problem constraint, x'X'Xx = 1 (and similiarly for y) In the math, it turns out that in solving the CCA problem involves inverting the matrix X'X, which is an estimation of the covariance matrix for our model. Note that a problem could surface if the eigenvalues of X'X are small, because any small error in the estimation of these values blows up when you invert the matrix.

The solution to this is to add a small diagonal component to the matrix ( X'X + a*I, for a small a ) which effectively restricts the degrees of freedom of the solutions. Admittedly I'm not too clear on why this works , I need to get back to the math and figure this out.

Of course you can get rid of this regularization issue entirely by replacing the contraint above with x'x = 1. Doing this, the matrix X'X never shows up in the math and hence, no regularization problem. Doing this means that we are solving the CCA problem (x'X'Yy) subject to the constraints that the vectors x and y are of unit length. In doing this we lose the interpretability of the correlation score, x'X'Yy, because they are no longer correlations proper, i.e., values between 0 and 1. However in the case when we are only concerned with the correlation directions and not the values then this is okay. Besides you can always go back and find the scale factor that will give you a correlation using elementary algebra.

This said, I think this is a much better method to solve CCA as opposed to the standard method with regularization as is commonly presented. The old method imposes a regularization parameter which has to be hand tuned, and I have yet to come across a principled way of setting this parameter. Through this second method, you can simply set the constraint x'x = 1 (and similarly for y) and forget all about regularization.

Now lets shift gears to some quantitative studies I performed at the end of last quarter. I compared sCCA to the unconstrained CCA problem and with two alternative methods of imposing sparsity. One method of imposing sparsity is simply to solve the CCA problem in which you constrain the 1-norm of the canonical directions. This is a very popular method of imposing sparsity constraints. Another method is to simply threshold the unconstrained CCA directions, in my experiment I simply pick the top n directions where n is a randomized parameter in the experiments.



In all experiments we constrain the 2-norm of the canonical directions to be unit length, hence we can compare the resulting scores. The figure above shows the covariance scores for each method on synthetic data. It's a little hard to read so let me say here that bar 1 shows the score for 1-Norm CCA, bar 2 is for sparse CCA, bar 3 is for thresholded CCA, and bar 4 is for plain-vanilla unconstrained CCA (These values are averages over about 50 trials, their standard deviations are all low). As expected all scores do no better than than unconstrained CCA. It's encouraging that sCCA (bar 2) does as well as it does, especially compared to how poorly norm-1 CCA performs, however
a quick look at the actual sparsity of sCCA shows that it is not much sparser than the unconstrained CCA directions. See the figure below.


Again, it looks a littel crappy, sorry, but in this figure the panels, reading them left to right, top to bottom, the first panel shows the values of one of the canonical directions for the 1-Norm CCA. Very sparse wouldn't you say? Panel 2 shows sCCA, Panel 3 shows thresholded CCA where we keep the top 10 largest entries, and Panel 4 shows the unconstrained CCA direction.

The important thing to notice is that sparse CCA really isn't that much more sparse than the unconstrained direction. This could mean that adding sparsity constraints is a bit of overkill, of course the unconstrained solution is itself pretty sparse. If I can do it quickly, I'd like to set up a test on synthetic data which isn't so sparse.

I need to go back and check out Gert's Sparse PCA paper again and see just how sparse their solutions are.


So to recap here, I'm concerned that even though Sparse CCA is scoring pretty high, it isn't doing much more than plain CCA would. If this is the case then adding the sparsity constraints is unnecessary overhead. There are a few nuances to dig into though, because my sparse CCA solution is based on solving a sparse PCA problem, which of coarse has worked well for Gert.