Memory requirement for PCA/kPCA

Question:

Is there a way to know exactly how much memory I will need to do PCA/kPCA in Python?

For example, if I have a matrix of N rows and M columns:

  • What memory will I need if N = 100, 10000, 100000?

  • And does M have an effect on the memory needed for PCA/kPCA?

Answers:

Interesting question. From what I can glean from the paper Online principal component analysis in high dimension: which algorithm to choose? by Cardot & Degras (2015), the complexity of PCA depends on the SVD (singular value decomposition) step. Thus, the space complexity of ‘batch’ (in-memory) PCA is, using your notation, at least O(NM) for the data plus O(M²) for the covariance matrix, so O(NM + M²) in total.
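
To see which dimension drives which term, note that the data matrix holds N × M floats while the covariance matrix is only M × M, independent of N. A minimal NumPy sketch (with made-up sizes, purely for illustration):

    import numpy as np

    # Hypothetical sizes, only to show which term each dimension drives.
    N, M = 1_000, 50

    X = np.random.rand(N, M)      # data matrix: N * M floats
    C = np.cov(X, rowvar=False)   # covariance matrix: M x M, independent of N

    print(X.nbytes)               # 1000 * 50 * 8 bytes = 400,000
    print(C.shape)                # (50, 50)
    print(C.nbytes)               # 50 * 50 * 8 bytes = 20,000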

This is all backed up by this tutorial paper by Li et al. (Note that your N is their m and your M is their n.) They give more detail, though, showing that about 3 copies of each matrix are made, and how the choice of algorithm matters.

This answer to a question about NumPy’s SVD implementation might help you work out the exact memory footprint of the process. Or check out a tool like memory-profiler, which will measure it for you directly.
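
For instance, memory_usage from memory-profiler can wrap a PCA fit and report its peak memory. A rough sketch, assuming scikit-learn’s PCA (the exact numbers will depend on your solver and BLAS):

    import numpy as np
    from memory_profiler import memory_usage
    from sklearn.decomposition import PCA

    def run_pca(n_samples, n_features, n_components=10):
        # Build a random data matrix and fit PCA on it.
        X = np.random.rand(n_samples, n_features)
        PCA(n_components=n_components).fit(X)

    # Peak memory (in MiB) observed while fitting PCA on a 10,000 x 500 matrix.
    peak = max(memory_usage((run_pca, (10_000, 500))))
    print(f"peak memory: {peak:.1f} MiB")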

In short, the memory required depends on the algorithm but is very roughly 3 × (NM + M²). You can assume that each float in those arrays needs 8 bytes of memory.
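
Plugging your example sizes into that rule of thumb (a back-of-envelope sketch; M = 1000 is an arbitrary choice here, since you did not fix it):

    def pca_memory_estimate_bytes(n_rows, n_cols, bytes_per_float=8, copies=3):
        """Rough estimate: ~3 copies of (the N*M data plus the M*M covariance)."""
        return copies * (n_rows * n_cols + n_cols ** 2) * bytes_per_float

    M = 1000  # arbitrary column count, purely for illustration
    for N in (100, 10_000, 100_000):
        gib = pca_memory_estimate_bytes(N, M) / 2 ** 30
        print(f"N = {N:>7}, M = {M}: ~{gib:.2f} GiB")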

Answered By: kwinkunks