Find the most similar subsequence in another sequence when both are numeric and huge

Question:

I have two huge numeric np.arrays (let’s call them S1 and S2, such that len(S1) >> len(S2) >> N, where N is a very large number). I wish to find the most likely candidate part of S1 to be equal to S2.

The naive approach would be to compute a running difference between S2 and sliding windows of S1. This would take too long (about 170 hours for a single comparison).

Another approach I thought of was to manually create a matrix of windows, M, where each row i of M is S1[i:(i+len(S2))]. Then, under this approach, we can broadcast a difference operation. It is also infeasible, because it takes a long time (less than the most naive approach, but still) and it uses all the RAM I have.

Can we parallelize this using a convolution? Can we use torch/keras to do something similar? Bear in mind that I am looking for the best candidate, so the values of some convolution only have to preserve order: the most likely candidate should have the smallest value.

Asked By: David Harar


Answers:

I am assuming you are doing this as a stepping stone to finding a perfect match.

My reason for assuming this is that you say:

I wish to find the most likely candidate part of S1 to be equal to S2.

  • Start with the first value in the small array.

  • Make a list of all indices of the big array that match that first value of the small array. That should be very fast. Let’s call that array indices; it may have values [2, 123, 457, 513, …]

  • Now look at the second value in the small array. Search through all positions indices+1 of the big array, and test for matches to that second value. This may be faster, as there are relatively few comparisons to make. Write those successful hits into a new, smaller, indices array.

  • Now look at the third value in the small array, and so on. Eventually the indices array will have shrunk to size 1, when you have found the single matched position.
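Here is a minimal NumPy sketch of these steps (exact matches assumed; the function name find_matches and the demo data are mine, not from the answer):

import numpy as np

def find_matches(s1, s2):
    # Candidate starts: every position where the first value of s2 matches.
    indices = np.flatnonzero(s1[:len(s1) - len(s2) + 1] == s2[0])
    # Narrow the candidates one element of s2 at a time, overwriting the
    # indices array so only one copy is ever kept in memory.
    for k in range(1, len(s2)):
        if len(indices) <= 1:
            break
        indices = indices[s1[indices + k] == s2[k]]
    return indices

rng = np.random.default_rng(0)
s1 = rng.integers(0, 256, 10_000_000)
s2 = s1[123_456:123_456 + 50].copy()
print(find_matches(s1, s2))  # [123456], assuming the pattern occurs only once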

If the individual numerical values in each array are 0-255, you might want to "clump" them into, say, 4 values at a time, to speed things up. But if they are floats, you won’t be able to.
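One way to read "clumping", sketched under the assumption that the data is uint8 (the packing scheme is my own illustration): pack each run of 4 consecutive bytes into a single uint32, so the first filtering step compares 4 values at once.

import numpy as np

rng = np.random.default_rng(0)
s1 = rng.integers(0, 256, 1_000_000, dtype=np.uint8)
s2 = s1[500_000:500_000 + 50].copy()

# r[i] packs s1[i:i+4] into one 32-bit integer.
r = ((s1[:-3].astype(np.uint32) << 24) |
     (s1[1:-2].astype(np.uint32) << 16) |
     (s1[2:-1].astype(np.uint32) << 8) |
      s1[3:].astype(np.uint32))
key = (int(s2[0]) << 24) | (int(s2[1]) << 16) | (int(s2[2]) << 8) | int(s2[3])

# One vectorized comparison now filters on the first 4 values of s2 at once;
# the surviving indices can then enter the loop above at k = 4.
indices = np.flatnonzero(r == key)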

Typically the first few steps of this approach will be slow, because it will be inspecting many positions. But (assuming the numbers are fairly random), each successive step becomes much faster. Therefore the determining factor in how long it will take, will be the first few steps through the small array.

This would demand memory size as large as the largest plausible length of indices. (You could overwrite each indices list with the next version, so you would only need one copy.)

You could parallelise this:

You could give each parallel process a chunk of the big array (s1). You could make the chunks overlap by len(s2)-1, but you only need to consider start positions within the non-overlapping part of each chunk: the last few elements are just there to allow you to detect sequences that end there (but not start there).
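A small sketch of the chunking arithmetic (the helper name chunk_bounds is illustrative, not from the answer):

def chunk_bounds(n, n_chunks, overlap):
    # Slices covering [0, n); each chunk carries `overlap` extra trailing
    # elements so matches straddling a boundary are still found.
    step = -(-n // n_chunks)  # ceiling division
    return [(start, min(start + step + overlap, n))
            for start in range(0, n, step)]

print(chunk_bounds(20, 3, 4))  # [(0, 11), (7, 18), (14, 20)]

# Each worker would run the filtering loop on s1[start:stop], add `start`
# back onto the indices it returns, and keep only starts < start + step.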

Proviso

As @Kelly Bundy points out, this won’t help you if you are not on a journey that ultimately ends in finding a perfect match.

Answered By: ProfDFrancis

Since you are working on a large dataset, the efficiency of the algorithm matters far more than using more computing resources (e.g. cores) to compute the result faster.

It looks like you want to detect a 1D pattern in a larger 1D array, while possibly having good resistance to noise. This operation can typically be done using a correlation, more specifically a cross-correlation. It can be computed efficiently using a fast Fourier transform (FFT): the idea is to compute FFT^-1(FFT(a) * conj(FFT(b))). The same method is often used to compute convolutions efficiently, and it is very useful for detecting patterns. In fact, the FFT was developed in part to detect nuclear weapon tests in seismograph data during the Cold War. Since then, FFTs have been used in a lot of scientific projects, including the detection of gravitational waves (more specifically, the merger of two black holes, which leaves an event signal with a specific shape in a very large dataset). Using FFTs is generally efficient since they run in O(n log n) (as opposed to O(n²) for a naive correlation).
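For reference, here is a minimal sketch of the manual FFT route (computing the cross-correlation as a convolution with the reversed pattern); the array names a and b are placeholders:

import numpy as np
import scipy.signal

a = np.random.rand(100)
b = np.random.rand(10)

# Zero-pad both signals to the full linear-correlation length, multiply
# the spectra, and transform back.
n = len(a) + len(b) - 1
manual = np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b[::-1], n), n)

assert np.allclose(manual, scipy.signal.correlate(a, b, mode='full'))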

Doing this processing manually is quite cumbersome. Fortunately, SciPy provides correlation functions that can use FFTs: scipy.signal.correlate. The signal needs to be normalized for the correlation to provide interesting results; in practice, subtracting the mean is often sufficient. Here is an example:

import scipy.signal
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# What needs to be found
needle = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1])

# The "large" dataset containing the searched pattern,but with some small noise
haystack = np.hstack([np.random.rand(3000), needle + np.random.rand(len(needle))*0.001, np.random.rand(2000)])

# Actual correlation (with a simple pseudo-normalization)
mean = haystack.mean()
corr = scipy.signal.correlate(haystack-mean, needle-mean, method='fft')

# Clean the results to make the interesting part more visible in plots
plotted = np.where(corr < 0, 0, corr**2)

# Print the 5 best candidates
print(np.argpartition(corr, -5)[-5:])

plt.plot(plotted)
plt.show()

Here is the resulting output:

(Plot of the clipped, squared correlation; a sharp peak stands out at the matching location.)

[2331 3017 2743 2742 3016]

Here, we can see that the correlation succeeds in finding the pattern among the 5 best candidates despite the noise. Its location is 3016; since the correlation is computed in "full" mode, peak index 3016 corresponds to the needle starting at index 3016 - (len(needle) - 1) = 3000 in the haystack. It is actually the best candidate here, but there is no guarantee that the correlation finds the best candidate directly. You need to take the K first candidates and check each of them to find the right location. The bigger the dataset, the bigger K; you can set K to 1% of the dataset if it is pretty big and the searched pattern is clearly identifiable compared to the other values in the dataset. The K first values can be sorted so that the best candidates are checked first. The check can be done using the L2 norm, np.sum((haystack_block - needle)**2)**0.5. The candidate with the smallest L2 norm is the best one. The correlation is a hint to avoid computing the L2 norm of every possible candidate (5 instead of several thousand in the above example).
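A short sketch of that verification step, continuing the example above (the variable names starts and dists are mine):

# Take the K best correlation peaks and verify them with an exact L2 check.
K = 5
cand = np.argpartition(corr, -K)[-K:]
starts = cand - (len(needle) - 1)  # "full" correlation lag -> start index
starts = starts[(starts >= 0) & (starts + len(needle) <= len(haystack))]
dists = [np.sum((haystack[s:s + len(needle)] - needle)**2)**0.5 for s in starts]
print(starts[int(np.argmin(dists))])  # 3000: where the needle was inserted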

Answered By: Jérôme Richard