Understanding UMAP (2019) (pair-code.github.io)
35 points by josh-sematic on May 6, 2024 | 17 comments


Check out PaCMAP, AFAIK it’s the current SoTA in dimensionality reduction: https://github.com/YingfanWang/PaCMAP
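
For reference, a minimal usage sketch (assuming the pacmap package from that repo; parameter names follow its README, so double-check against the current docs):

    # pip install pacmap
    import numpy as np
    import pacmap

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))   # stand-in for your high-dimensional data

    reducer = pacmap.PaCMAP(n_components=2, n_neighbors=10)
    X_2d = reducer.fit_transform(X)
    print(X_2d.shape)                 # (1000, 2)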


Yes, TriMap also does quite well, though PaCMAP is faster. This paper [1] (linked from the GitHub repo you mentioned) goes into a fantastic amount of detail comparing UMAP, t-SNE, PaCMAP, and TriMap.

[1] https://jmlr.org/papers/volume22/20-1061/20-1061.pdf


Thanks for the link, that is a really nice paper!

I found it really refreshing that they report a bunch of things they tried that didn't work, in a way that clarifies the problem and gives a lot of insight into the strengths and limitations of their final method and of the leading alternatives.


You're welcome! There is a talk too [1], which is how I learned about the paper.

[1] https://www.youtube.com/watch?v=sD-uDZ8zXkc


>1. Hyperparameters really matter

>2. Cluster sizes in a UMAP plot mean nothing

>3. Distances between clusters might not mean anything

>4. Random noise doesn’t always look random.

>5. You may need more than one plot

Oh OK, so it's basically impossible to know whether you are learning something or inventing garbage.
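
Points 1 and 4 are easy to reproduce for yourself. A rough sketch (assuming umap-learn is installed) that embeds pure noise under a few hyperparameter settings; the layouts can look quite different, and "clusters" can appear where there is no structure at all:

    import numpy as np
    import umap  # pip install umap-learn

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 30))    # pure noise: any apparent clusters are artifacts

    for n_neighbors in (5, 50):
        for min_dist in (0.0, 0.5):
            emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            random_state=0).fit_transform(X)
            print(n_neighbors, min_dist, emb.shape)  # plot emb to compare the layouts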


The big takeaway is that, despite these caveats, UMAP is still useful.

I've used UMAP in the past. It's not quite as bad as you're suggesting. Points 2, 3, and 4 are things you're going to want to verify quantitatively anyway. Despite this, it's still a fine way to throw points up and start exploring - just don't treat it as the be-all and end-all.
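
One cheap quantitative check (just a sketch, assuming scikit-learn and umap-learn) is sklearn's trustworthiness score, which penalizes points that end up as close neighbors in the 2-D embedding without being neighbors in the original space:

    import umap                                   # pip install umap-learn
    from sklearn.datasets import load_digits
    from sklearn.manifold import trustworthiness

    X, _ = load_digits(return_X_y=True)
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

    # 1.0 means the embedding introduces no "false" neighbors; lower is worse
    print(trustworthiness(X, emb, n_neighbors=15))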


And ironically, these exact problems are why people originally stopped using t-SNE in favor of UMAP.


Judging by my Twitter feed, some scientists are starting to push back on the use of UMAP (and t-SNE).


I think a common problem is that these techniques get repurposed for problems they weren't meant to solve. I've seen multiple people fall into the trap of using these visualizations to guess whether a dataset can be classified with high accuracy. I'm talking about cases where labels already exist, but the visualization is used as a compute-cheap preliminary step to decide whether to bother with classification at all, whether to reach for a weak or a strong classifier, and so on.

The problem, of course, is that the insights from the visualization are "one-sided": if instances from different classes look separated, you know a decent classifier will do the job well. But if they don't appear separated, you can't conclude that they can't be accurately classified - for all you know, you just don't have the right hyperparameters. You're also projecting d-dimensional data down to 2D/3D, which is heavily lossy; even with the right hyperparameters there's a chance you won't see clear separation. If you want to classify, just classify.
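
And "just classify" really is cheap - here's a minimal sketch (assuming scikit-learn; synthetic data stands in for your labeled set) of a cross-validated linear baseline, which answers the "is this separable at all?" question more directly than any 2-D plot:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic placeholder: swap in your own feature matrix X and labels y
    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=10, random_state=0)

    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5)
    print(scores.mean(), scores.std())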


What do they suggest you use instead?


PCA for starting out


As a sibling comment already mentioned, that doesn't really work for non-linear data. UMAP and t-SNE are techniques used on non-linear data.


PCA performs horribly on non-linear data.


How will you know if you have non-linear data....
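
For what it's worth, PCA's own diagnostics won't really answer that: on the classic Swiss roll, the top two components can capture a large share of the variance even though no linear projection unrolls the manifold. A minimal sketch (assuming scikit-learn):

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import PCA

    # Swiss roll: a 2-D sheet rolled up non-linearly inside 3-D space
    X, _ = make_swiss_roll(n_samples=2000, random_state=0)

    pca = PCA(n_components=2).fit(X)
    # A high explained-variance ratio here does NOT mean the projection
    # preserves the manifold - PCA flattens the roll onto itself
    print("explained variance (2 PCs):", pca.explained_variance_ratio_.sum())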


Really enjoying this


Yes, the visualizations are wonderful. It must have taken quite some time to produce the data that lets you play with the hyperparameters so smoothly in some of the examples.


really confused by this



