Understanding UMAP (2019) (pair-code.github.io)
35 points by josh-sematic on May 6, 2024 | 17 comments


Check out PaCMAP, AFAIK it’s the current SoTA in dimensionality reduction: https://github.com/YingfanWang/PaCMAP
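
For reference, a minimal usage sketch (assuming the pacmap package from that repo; parameter names follow its README, so double-check against the current docs):

    # pip install pacmap
    import numpy as np
    import pacmap

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))   # stand-in for your high-dimensional data

    reducer = pacmap.PaCMAP(n_components=2, n_neighbors=10)
    X_2d = reducer.fit_transform(X)
    print(X_2d.shape)                 # (1000, 2)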


Yes, TriMap also does quite well, though PaCMAP is faster. This paper [1] (linked from the GitHub repo you mentioned) goes into a fantastic amount of detail comparing UMAP, t-SNE, PaCMAP, and TriMap.

[1] https://jmlr.org/papers/volume22/20-1061/20-1061.pdf


Thanks for the link, that is a really nice paper!

I found it really refreshing that they report a bunch of things they tried that didn't work, in a way that clarifies the problem and gives a lot of insight into the strengths and limitations of their final method and of the leading alternatives.


You're welcome! There is a talk too [1], which is how I learned about the paper.

[1] https://www.youtube.com/watch?v=sD-uDZ8zXkc


>1. Hyperparameters really matter

>2. Cluster sizes in a UMAP plot mean nothing

>3. Distances between clusters might not mean anything

>4. Random noise doesn’t always look random.

>5. You may need more than one plot

Oh OK, so it's basically impossible to know whether you are learning something or inventing garbage.
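
Points 1 and 4 are easy to reproduce for yourself. A rough sketch (assuming umap-learn is installed) that embeds pure noise under a few hyperparameter settings; the layouts can look quite different, and "clusters" can appear where there is no structure at all:

    import numpy as np
    import umap  # pip install umap-learn

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 30))    # pure noise: any apparent clusters are artifacts

    for n_neighbors in (5, 50):
        for min_dist in (0.0, 0.5):
            emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            random_state=0).fit_transform(X)
            print(n_neighbors, min_dist, emb.shape)  # plot emb to compare the layouts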


The big takeaway is that, despite these caveats, UMAP is still useful.

I've used UMAP in the past. It's not quite as bad as you're suggesting. Points 2, 3, and 4 are things you're going to want to verify quantitatively anyway. Despite this, it's still a fine way to throw points up and start exploring - just don't treat it as the be-all and end-all.
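
One cheap quantitative check (just a sketch, assuming scikit-learn and umap-learn) is sklearn's trustworthiness score, which penalizes points that end up as close neighbors in the 2-D embedding without being neighbors in the original space:

    import umap                                   # pip install umap-learn
    from sklearn.datasets import load_digits
    from sklearn.manifold import trustworthiness

    X, _ = load_digits(return_X_y=True)
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

    # 1.0 means the embedding introduces no "false" neighbors; lower is worse
    print(trustworthiness(X, emb, n_neighbors=15))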


And ironically, these exact problems are why people originally stopped using t-SNE in favor of UMAP.


Judging by my Twitter feed, some scientists are starting to push back on the use of UMAP (and t-SNE).


I think a common problem is that these techniques get repurposed for problems they weren't meant to solve. I've seen multiple people fall into the trap of using these visualizations to guess whether a dataset can be classified with high accuracy. I'm talking about cases where labels already exist, but the visualization is used as a compute-cheap preliminary step to decide whether to bother with classification at all, whether to reach for a weak or a strong classifier, and so on.

The problem, of course, is that the insights from the visualization are "one-sided": if instances from different classes look separated, you know a decent classifier will do the job well. But if they don't appear separated, you can't conclude that they can't be accurately classified - for all you know, you just don't have the right hyperparameters. You're also projecting d-dimensional data down to 2D/3D, which is heavily lossy; even with the right hyperparameters there's a chance you won't see clear separation. If you want to classify, just classify.
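
And "just classify" really is cheap - here's a minimal sketch (assuming scikit-learn; synthetic data stands in for your labeled set) of a cross-validated linear baseline, which answers the "is this separable at all?" question more directly than any 2-D plot:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic placeholder: swap in your own feature matrix X and labels y
    X, y = make_classification(n_samples=1000, n_features=50,
                               n_informative=10, random_state=0)

    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(clf, X, y, cv=5)
    print(scores.mean(), scores.std())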


What do they suggest you use instead?


PCA for starting out


As a sibling comment already mentioned, that doesn't really work for non-linear data. UMAP and t-SNE are techniques used on non-linear data.


PCA performs horribly on non-linear data.


How will you know if you have non-linear data....
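
For what it's worth, PCA's own diagnostics won't really answer that: on the classic Swiss roll, the top two components can capture a large share of the variance even though no linear projection unrolls the manifold. A minimal sketch (assuming scikit-learn):

    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import PCA

    # Swiss roll: a 2-D sheet rolled up non-linearly inside 3-D space
    X, _ = make_swiss_roll(n_samples=2000, random_state=0)

    pca = PCA(n_components=2).fit(X)
    # A high explained-variance ratio here does NOT mean the projection
    # preserves the manifold - PCA flattens the roll onto itself
    print("explained variance (2 PCs):", pca.explained_variance_ratio_.sum())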


Really enjoying this


Yes, the visualizations are wonderful. It must have taken quite some time to produce the data that lets you play with the hyperparameters so smoothly in some of the examples.


really confused by this



