Thursday, January 18, 2007

next steps...

Just to reiterate, we're breaking this project into two parts. The first part is the submission to ICML, in which we present our statistical method as generally and as accessibly as possible. The second part will apply the method to a very specific application (the vocab selection problem) and will be submitted to a specialized conference.

As for the organization of the ICML submission, I'm using Gert's NIPS paper on sparse PCA as a model, since our method is very similar to the one in that paper. It has three main components: mathematical formulation, results on synthetic data, and results on a real-world dataset.

As for the mathematical formulation, this part is essentially complete unless we opt to pursue some alternative formulations. I had a meeting with Bharath yesterday in which we talked about alternative methods for approximating the zero-norm function. Some of these methods lead to non-convex problems, and thus we are cursed with many local minima. However, these methods may approximate the solution better, and they are mathematically more interesting (though I may be saying that because I've been staring at the same damn formulations since last summer). The real benefit of some of these alternatives, according to Bharath, is that they handle scalability more elegantly. Our concern right now is how long it is going to take to solve some of these systems.
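For concreteness, here are two representative surrogates for the zero-norm of the kind that come up in this setting. This is just my own throwaway Python illustration, not the exact formulations from the meeting with Bharath; the epsilon value is made up.

```python
import numpy as np

# Illustration only: two common surrogates for the zero-norm ||w||_0
# (the number of nonzero entries of w).

def l1_surrogate(w):
    """Convex relaxation: sum of absolute values."""
    return np.sum(np.abs(w))

def log_surrogate(w, eps=1e-3):
    """Non-convex surrogate: sum of log(1 + |w_i|/eps), scaled so each
    nonzero entry contributes roughly 1 as eps -> 0."""
    return np.sum(np.log(1.0 + np.abs(w) / eps)) / np.log(1.0 + 1.0 / eps)

w = np.array([0.0, 0.0, 0.5, -2.0, 0.0])
print(np.count_nonzero(w))   # true zero-norm: 2
print(l1_surrogate(w))       # 2.5 -- scale-dependent, but convex
print(log_surrogate(w))      # ~2.0 -- tighter, but non-convex
```

The trade-off is exactly the one described above: the convex surrogate keeps the problem easy to solve globally, while the non-convex one hugs the true count more closely at the cost of local minima.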

Though the execution time of these programs is a concern, there is light at the end of the tunnel, so to speak. The experiments we have run so far finish in a reasonable amount of time, and the wait scales with the dimensionality of the data points. Based on experiments on synthetic data, we can process 10-dimensional data (a 10 x 10 matrix) in under half a second, 100-dimensional data in about 3 to 5 seconds, and 1000-dimensional data in about 1 to 1.5 hours. Many of the interesting problems out there live in this 1000-dimensional world, so it's good news that the waiting time isn't too bad (especially since these numbers come from SeDuMi, which is a lot slower than some of the other packages out there).

Keypoints:
  • Look at alternative formulations only after we decide which experiments we are going to run.
  • Even if dimensionality is high, we might not need to wait too long to run experiments.

As for synthetic data experiments, I haven't given much thought to what we might do beyond the experiments I have already run. Specifically, I have one test that compares the correlation scores of various methods applied to random data. There was also an additional experiment in which the data was constructed to be explicitly sparse (unlike the random data in the first experiment), as a check that the sparse CCA method did, in fact, home in on the correct dimensions. One thing I didn't do is check how the other methods perform on this inherently sparse data. Plain-vanilla CCA did best in the non-sparse experiments, which is to be expected; in the first experiment CCA served as an upper bound on the performance of the sparse methods. What I haven't seen is whether CCA underperforms on inherently sparse data. I suspect it would, based on some results Gert published in his paper.
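To make "inherently sparse" concrete, here is a rough sketch (throwaway Python, not my actual experiment script) of how such two-view data could be generated, with a plain-vanilla CCA baseline built from whitening plus an SVD. The dimensions, noise levels, and regularization are invented for illustration.

```python
import numpy as np

# Two views driven by one shared latent signal that only loads on the
# first k coordinates of each view; all other coordinates are pure noise.
rng = np.random.RandomState(0)
n, p, q, k = 500, 30, 30, 3           # samples, view dims, # active dims

z = rng.randn(n, 1)                    # shared latent signal
a = np.zeros((p, 1)); a[:k] = 1.0      # sparse loading for view X
b = np.zeros((q, 1)); b[:k] = 1.0      # sparse loading for view Y
X = z @ a.T + rng.randn(n, p)          # only first k columns carry signal
Y = z @ b.T + rng.randn(n, q)

def cca_first_direction(X, Y, reg=1e-3):
    """Plain CCA: whiten both views, take the top singular pair of the
    whitened cross-covariance (regularized for numerical stability)."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    Sxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    return s[0], Wx.T @ U[:, 0], Wy.T @ Vt[0, :]

corr, wx, wy = cca_first_direction(X, Y)
print(corr)               # first canonical correlation on the sparse data
print(np.round(wx, 2))    # dense weight vector: vanilla CCA spreads mass
                          # over the noise dims, which a sparse method
                          # should avoid
```

The point of the comparison would be to run both vanilla CCA and sCCA on data like this and see whether the dense weights hurt CCA's held-out correlation.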

If sCCA consistently outperforms CCA on inherently sparse data, that strengthens the case for our method, since it can be argued that a lot of real-world data is sparse.

Keypoints:
  • Compare all methods on the inherently sparse data.
  • If sCCA outperforms CCA on sparse data, it strengthens the case for our method.

The real-world application is going to be a crucial component of this paper, and since we've decided to postpone the vocab selection problem because it's too narrow an application, the next major task is to find a collection of experiments we can run and compare our results against. Gert already mentioned the French/English text dataset; CCA has also been used in a few medical applications; there is the Pixels That Sound paper; and there's the Image/Hypertext experiment that Bharath told me about. These all sound like good candidates, and hopefully they can be run in a tractable amount of time.

The first priority for now is to track down these and other datasets/experiments to compare ourselves against. Once we have some, we can look at them and decide whether they fit easily into our framework or whether we need to change our formulation and implementation to accommodate them.
