In machine learning, dimensionality simply refers to the number of features (i.e. input variables) in a dataset. When the number of features is very large relative to the number of observations, certain algorithms struggle to train effective models.
Let me put it this way: imagine you have a straight line one hundred meters long and you drop a coin, say a one-euro piece, somewhere on it. It wouldn't be too hard to find: you walk along the line, and it takes you a couple of minutes. Now say you have a square one hundred meters on each side and you drop the coin somewhere in it. Finding it would be much harder, like searching across a football field; it could take you a couple of days. Now imagine a cube one hundred meters across. That's like a 30-storey building the size of a football stadium. The difficulty of searching the space grows dramatically with each dimension you add.
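The same effect shows up in a related way with distances: in high dimensions, randomly sampled points all end up roughly equally far apart, which is part of why nearest-neighbor-style searches degrade. Here is a minimal sketch of that "distance concentration" using only the standard library (the function name and the specific point counts are illustrative choices, not from any particular library):

```python
import math
import random

def distance_spread(n_points, n_dims, seed=0):
    # Sample random points in the unit hypercube and measure how much the
    # farthest point's distance from the origin exceeds the nearest one's.
    # As n_dims grows, this relative spread shrinks toward zero: all points
    # become almost equally far away (distance concentration).
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(n_dims)]
        dists.append(math.sqrt(sum(x * x for x in p)))
    return (max(dists) - min(dists)) / min(dists)

low_dim = distance_spread(1000, 2)      # large spread: distances vary a lot
high_dim = distance_spread(1000, 1000)  # small spread: distances bunch together
print(low_dim, high_dim)
```

Running this, the spread in 1000 dimensions comes out far smaller than in 2 dimensions, which is the curse of dimensionality in miniature.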
How do you combat the curse of dimensionality?
- Change the algorithm.
- Reduce the dimensionality of the data.
To reduce the dimensionality, a first option is to apply Principal Component Analysis (PCA). If that doesn't work well, other options include:
- Feature selection algorithms.
- Non-linear dimensionality reduction.
- Feature hashing.
- Clustering using K-Means.
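As a quick sketch of the first option, here is PCA applied to data that lives in a 50-feature space but really only varies along 3 underlying directions. This assumes scikit-learn and NumPy are installed; the sample sizes and noise level are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples with 50 features, but only 3 underlying directions of variation
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # far fewer than the original 50 columns
```

Because almost all of the variance comes from the 3 latent directions, PCA compresses the 50 columns down to a handful while keeping 95% of the information (in the variance sense).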
Why apply dimensionality reduction?
- It increases efficiency.
- It helps with data compression and reduces storage space.
- It reduces computation costs.
- It helps remove redundant features.
- It speeds up the time required to perform the same computations.
- It can improve classification performance.
- It makes the data easier to interpret and model.
Why shouldn't you apply dimensionality reduction?
- It may lead to some amount of information loss.