Friday, August 25, 2006

Meeting with Gert

I had a meeting with Gert this afternoon. We focused on the dual formulation of the sparse CCA problem. The sparse CCA problem can be viewed as a quadratically constrained quadratic program (QCQP): one optimizes the quadratic function x'X'Yy over the vectors x, y, subject to the quadratic constraints x'X'Xx = 1 and y'Y'Yy = 1. This program can be recast as a semidefinite program (SDP), and from that the dual (which is also an SDP) can be derived.
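For my own reference, here is the standard lift from the QCQP to an SDP (a sketch only; the notation Sxx = X'X, Syy = Y'Y, Sxy = X'Y is mine, and the exact form Gert and I settle on for the sparse case may differ). Stacking w = (x, y) and replacing the rank-one matrix ww' by a matrix variable Z, the relaxed problem is

$$
\max_{Z \succeq 0} \ \tfrac{1}{2}\,\mathrm{tr}(A Z)
\quad \text{s.t.} \quad \mathrm{tr}(B_x Z) = 1,\ \ \mathrm{tr}(B_y Z) = 1,
$$
$$
A = \begin{bmatrix} 0 & S_{xy} \\ S_{yx} & 0 \end{bmatrix},\qquad
B_x = \begin{bmatrix} S_{xx} & 0 \\ 0 & 0 \end{bmatrix},\qquad
B_y = \begin{bmatrix} 0 & 0 \\ 0 & S_{yy} \end{bmatrix}.
$$

Taking the Lagrangian dual of this lifted program is what gives the SDP dual mentioned above.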

My dual formulation is essentially correct, though there are some modifications, which Gert and I spoke about, that I will need to include regarding how the cardinality constraints are introduced. So it's back to pencil and paper to crunch out a little more math.

Right now I need to get some data off our servers and onto my laptop, because I'm going to be out of town for the next 2 weeks and I'd like to start some demo experiments with our audio data. I'm pretty sure I can get at least a good 4 hours of work in each day. Most of my friends back in town are back in school, so it's not like I'll be very distracted on weekdays.

Links: Mathematical Solvers

Here are links to some mathematical solvers I'll be needing.

  • MOSEK (www.mosek.com) - used to solve quadratically constrained quadratic problems, conic quadratic problems, and general convex nonlinear problems. It contains interfaces for C/C++, .NET and Java and comes with a toolbox for use in MATLAB. A temporary license is available for students.

  • SeDuMi (sedumi.mcmaster.ca) - Hehe, so do me! Lol. Anyway, SeDuMi is a semidefinite program (SDP) solver and second-order cone program (SOCP) solver. It's a MATLAB toolbox and is available under the GNU General Public License. A minimal calling sketch follows below.
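To remind myself of the calling convention (a minimal sketch based on my reading of the SeDuMi documentation, so treat the details as my assumption): SeDuMi solves problems of the form min c'z subject to A*z = b with z constrained to a cone K, and semidefinite blocks are passed in vectorized form.

%toy SDP: minimize trace(C*Z) s.t. trace(Z) = 1, Z a 2x2 positive semidefinite matrix
C  = [2 1; 1 0];
At = reshape(eye(2), 1, 4);   %each equality constraint is one row: the constraint matrix, vectorized
b  = 1;
c  = C(:);
K.s = 2;                      %one 2x2 semidefinite block
[z, yDual, info] = sedumi(At, b, c, K);
Z = reshape(z, 2, 2);         %recover the matrix variable (optimal Z picks out C's smallest eigenvector)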

Thursday, August 24, 2006

CCA matlab code

The following code performs CCA. Inputs are the data matrices X and Y, with one data point per row. The outputs are the directions of maximum correlation, wx and wy, i.e., the vectors that maximize (X*wx)'*(Y*wy).

%Start CCA:


%calculate covariance matrices
Sxx = X'*X ;
Syy = Y'*Y ;
Syx = Y'*X ;
Sxy = X'*Y ;

%r = dimension of X's feature space, c = dimension of Y's feature space
[r c] = size( Sxy );

A = [ zeros(r,r) Sxy; ...
      Syx        zeros(c,c) ];

B = [ Sxx        zeros(r,c); ...
      zeros(c,r) Syy ];

%solve the generalized eigenvalue problem: A*w = c*B*w
[V D] = eig(A, B);
%[V D] = eig( inv(B)*A ); %alternative way of solving problem

%find the directions of max correlation
[maxVal maxInd] = max( diag(D,0) );

w = V(:,maxInd);  %get column corresponding to max eigenvalue
wx = w(1:r);      %direction in X's feature space
wy = w(r+1:end);  %direction in Y's feature space

Some notes on CCA

I spent the last day or so playing around with a CCA (Canonical Correlation Analysis) demo I created. It was pretty simple: I generated a random cluster of data for the first view, and for the second view I randomly rotated the first view and added noise (a rough sketch of the generation code is at the end of this post). The results were a little surprising, though. When I didn't change the second view (that is, when the first and second views are identical), I intuitively expected CCA to find the main diagonals going through the clusters as the directions of maximum correlation; however, this turned out to be false. While the main diagonals do give maximal correlation, other directions give the same correlation too, and CCA seems to pick up on these non-diagonal directions more often than not. This leads to the conclusion that, in general, the optimization problem underlying CCA does not have a unique global optimum.

When the two views were randomized (i.e., the second view was a rotated, noisy copy of the first), off-diagonal directions were again picked out as having maximal correlation (a value of 0.5). When the diagonals were tested, they usually came close to 0.5, though were not exactly equal to it.

The problem that I have with these results, on a gut level, is that CCA picks out correlation directions that may not be the most interpretable directions.

Nevertheless, when you look at the two directions, they do correspond to each other. For example, the correlation direction in one view is the same direction in the other, just rotated by the appropriate amount. So in a sense the CCA correlation directions provide a "compass" between the two views; in other words, they provide a canonical representation of the clusters by giving us corresponding "axes" in the form of correlation directions.
Figure: Correlation directions found by CCA.
The second view is generated by rotating the first view through a random angle only. Green lines are the correlation directions. The direction of maximal correlation is that line closest to the cluster diagonal.
Figure: Correlation directions found by CCA.
The second view is generated by rotating the first view then adding random noise. Green lines are the correlation directions. The direction of maximal correlation is that line closest to the cluster diagonal.
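For the record, the two views shown above were generated along these lines (a sketch; the cluster shape, sample size, and noise level here are placeholder choices, not the exact values I used):

n = 200;
X = randn(n,2) * [3 0; 0 1];              %elongated random cluster, first view
theta = 2*pi*rand;                        %random rotation angle
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];
Y = X * R' + 0.1*randn(n,2);              %second view: rotated copy of the first plus noise

These X and Y then go straight into the CCA script in the post above.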

Tuesday, August 22, 2006

Dual Formulations, Demos, Plans

This week was a little short on work thanks to a Vegas trip last weekend and Bats Day up in Disneyland this weekend. But I did get a little done. I finished an SDP formulation of the sparse CCA dual. I'm fairly certain it's correct; however, the formulation ultimately leads to some problems in recovering the canonical correlation directions. I have emailed my advisor, Gert, about this and await his reply. (Note: I wanna look at formulating some of the quadratic constraints using the Schur complement instead of the straightforward method I used.)

I started a little matlab demo of CCA to get some more intuition about the technique. I think I have a logical bug in my code because my canonical directions seem to be exactly 90 degrees off from what they should be. I'll poke around at that a little more tomorrow.

As for the next steps in the project:

  1. As soon as I finish up this little demo (should only take another hour or two of playing around with it) I want to demo sparse CCA (sCCA), first in matlab, using QCQP or SDP solvers callable from matlab. I would like to play around with the primal and dual formulations to see how similar I can get the results to be. It will be a good sanity check that my dual formulation is correct. But like I said, there may be an issue regarding the recovery of the correlation directions.

  2. Once I'm done playing around in matlab land I may take a crack at using a standard mathematical solver, just to get acquainted with it. Using a solver written in a lower level language should outperform matlab in terms of speed.

  3. Next step is to run a sCCA test with the Audio-Text data set that will be the subject of the main experiment. I can take a few songs and text and see what output I can get. I don't expect to see any signal in these results. This step will serve mostly to cement the implementation details of the experiment.

  4. After the previous steps are done I should be ready to run the complete test on our test data. The faster I can get these babies going the better. With a little luck, early results could translate into a quick and dirty paper submission for the deadline at the end of the month.
here goes nothing...

Friday, August 11, 2006

Tijl De Bie CCA papers

Tijl De Bie has published some CCA papers (link here) that I should include in my literature review. One paper of interest relates CCA to the topic of mutual information.

Thursday, August 10, 2006

A quick and dirty intro to CCA.

CCA -

A good introduction to CCA can be found in the paper Eigenproblems in Pattern Recognition. In short CCA works as follows: CCA assumes that you have two views of the same underlying data. Think two data clouds that "live" in different feature spaces. The goal of CCA is to find two linear projections of these data sets such that the correlation of the two projections is maximized.

For example, previous applications of CCA to real-world problems have included finding correlations between two sets of text documents in two different languages. CCA should, in theory, pick up words that are strongly correlated with each other across documents, effectively learning word translations. In another application, researchers applied CCA to find pixels that are strongly correlated with sound in a video.

Mathematically, CCA is posed as the following maximization problem:

find the directions x, y that maximize the correlation x'X'Yy

such that |Xx| = 1 and |Yy| = 1.

The apostrophe operator ' denotes matrix transpose. X and Y are the two data matrices, with one data point per row. The values Xx and Yy are the projections of the data onto the directions x and y. If you fix the projection lengths to one, then the correlation of the projections is given by (Xx)'(Yy) = x'X'Yy.

The solution to the optimization problem above reduces to a generalized eigenvalue problem. It should also be noted that the dual formulation of the problem is kernelizable.
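Concretely, writing Sxx = X'X, Syy = Y'Y, Sxy = X'Y and introducing a Lagrange multiplier for each norm constraint, the stationarity conditions collapse into the generalized eigenvalue problem

$$
\begin{bmatrix} 0 & S_{xy} \\ S_{yx} & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
= \lambda
\begin{bmatrix} S_{xx} & 0 \\ 0 & S_{yy} \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix},
$$

where at the optimum the eigenvalue λ equals the canonical correlation x'X'Yy. This is exactly the eig(A, B) problem solved in the matlab code in the August 24 post.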

Sparsity -

In general, the solution directions that result from the CCA problem are linear combinations of all the dimensions of the respective feature space. This can lead to problems in interpreting the results of CCA. For example, in the language translation task above, the directions that maximize correlation would be linear combinations of words in the respective word spaces. What does it mean to have 0.5 of one word plus 0.3 of another word? The researchers in the Pixels That Sound example encounter the same problem: they find that the projection direction in pixel space that is correlated with sound is a linear combination of many (nearly all?) of the pixels in the pixel space. This means that picking out specific pixels that correlate with sound is difficult, because CCA tells you that all of the pixels are at least slightly correlated.

To alleviate this problem you can add a sparsity constraint to the optimization problem: you restrict the number of nonzero entries in the directions x and y. This means that instead of linear combinations of all the dimensions in your feature space, you get a small collection of highly correlated dimensions. In the language translation example you should get a small collection of relevant words. In the Pixels That Sound example this gives you a small collection of highly correlated pixels.

Of course, adding the sparsity constraint makes the optimization problem more difficult to solve (in the worst case it makes the problem NP-hard). So the practical solution is to relax the sparsity constraint and solve an approximate problem. This kind of relaxation is performed for sparse PCA (a similar statistical method) by my advisor et al. in this paper.
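To make the sparsity constraint concrete, the exact (hard) problem adds cardinality constraints to the CCA program above,

$$
\max_{x,y}\ x' X' Y y \quad \text{s.t.}\quad \|Xx\| = 1,\ \|Yy\| = 1,\ \mathrm{card}(x) \le k_x,\ \mathrm{card}(y) \le k_y,
$$

where card(·) counts the nonzero entries and k_x, k_y are the sparsity budgets. One common convex surrogate replaces each cardinality constraint by an l1-norm constraint such as ||x||_1 ≤ k'_x, which tends to push small coefficients exactly to zero; whether this is exactly the relaxation we end up using is still to be worked out.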


Duality - (What I've been working on so far.)

Optimization problems such as the one above have duals, i.e., alternate formulations. Sometimes these dual formulations are easier to solve than the original problem, lead to interesting insights, or allow the problem to be kernelized. Other times they are useless. I have spent the last week or so playing around with the sparse CCA problem in hopes of deriving a dual formulation. The original problem is posed as a quadratically constrained quadratic program (QCQP). I tried to formulate a dual that is also a QCQP but came across some problems. I was about to call it quits when I came across some references indicating that, in general, the dual of a QCQP is a semidefinite program (SDP), so I think I'm going to give that a crack for a few more hours.