Most resource-efficient way to calculate distance between coordinates

Question:

I am trying to find all observations that are located within 100 meters of a set of coordinates.

I have two dataframes: Dataframe1 has 400 rows with coordinates, and for each row I need to find all the observations from Dataframe2 that are located within 100 meters of that location and count them.

Both dataframes are formatted like this:

| Y    | X    |  observations_within100m  |
|:----:|:----:|:-------------------------:|
|100   |100   |          22               |
|110   |105   |          25               |
|110   |102   |          11               |

I am looking for the most efficient way to do this computation, as Dataframe2 has over 200,000 dwelling locations. I know it can be done by applying a distance function inside a for loop, but I was wondering what the best method is here.

Asked By: TvCasteren


Answers:

If the area you’re working on is small, you could make a grid of all known locations, then for each grid point precompute a list of which entries in df1 are within 100 m of that point.

Step 2 would be to go through the 200k rows of df2 and increment the counts for the df1 entries precomputed for the corresponding grid point.

Otherwise, this problem is similar to collision detection, for which there are smart implementations (pygame has one, for example, though I don’t know how efficient it is). Depending on how sparse the area is, there can be gains from dividing it into cells, so for each of the 200k points you only have to check distances against the df1 entries in nearby cells instead of against all 400. A rough sketch of the cell idea is shown below.
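
A minimal sketch of that cell idea, assuming both frames are plain pandas DataFrames with numeric X and Y columns in metres (count_within_100m and the cell-size constant are just illustrative names, not part of any library):

import numpy as np
import pandas as pd

CELL = 100  # cell edge length in metres, equal to the search radius

def count_within_100m(df1, df2):
    # Assign every df2 point to a 100 m x 100 m cell.
    cx = np.floor(df2.X.to_numpy() / CELL).astype(int)
    cy = np.floor(df2.Y.to_numpy() / CELL).astype(int)
    cells = {}
    for i, key in enumerate(zip(cx.tolist(), cy.tolist())):
        cells.setdefault(key, []).append(i)

    x2, y2 = df2.X.to_numpy(), df2.Y.to_numpy()
    counts = []
    for x, y in zip(df1.X.to_numpy(), df1.Y.to_numpy()):
        gx, gy = int(np.floor(x / CELL)), int(np.floor(y / CELL))
        # Any point within 100 m must lie in the same cell or one of its 8 neighbours.
        idx = [i for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               for i in cells.get((gx + dx, gy + dy), [])]
        if idx:
            d2 = (x2[idx] - x) ** 2 + (y2[idx] - y) ** 2
            counts.append(int((d2 <= CELL ** 2).sum()))
        else:
            counts.append(0)
    return pd.Series(counts, index=df1.index)

df1['observations_within100m'] = count_within_100m(df1, df2)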

Hope the answer was helpful and good luck!

Answered By: A.Berg

In addition to my comment, one quick-and-dirty way that is much better than a for loop is to find the points that fall inside the circle of radius 100 centred on each (X, Y) from df1.

You may try this:

distance = 100

# For each df1 row, count the df2 rows whose squared distance to (row.X, row.Y)
# is at most distance**2.
df1['num_observations'] = df1.apply(
    lambda row: len(
        df2[(df2.X.sub(row.X) ** 2 + df2.Y.sub(row.Y) ** 2).le(distance**2)]
    ),
    axis=1,
)

The points within the desired distance of a centre (x1, y1) must obey the equation (x - x1)^2 + (y - y1)^2 <= distance^2, so there is no need to take a square root.

Of course, there are several optimizations you can apply on top of this; for example, you don’t need to search the whole of df2 for each row, only the part of it that falls inside the bounding box of each circle. A sketch of that refinement follows.
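
For example, a quick sketch of that bounding-box prefilter, keeping the same df1/df2 layout as above (count_nearby is just an illustrative helper name, not the only way to write this):

distance = 100

def count_nearby(row, r=distance):
    # Cheap rectangular prefilter: only rows inside the 200 m x 200 m bounding box can qualify.
    box = df2[df2.X.between(row.X - r, row.X + r) & df2.Y.between(row.Y - r, row.Y + r)]
    # Exact circle test on the much smaller candidate set.
    return int(((box.X - row.X) ** 2 + (box.Y - row.Y) ** 2 <= r ** 2).sum())

df1['num_observations'] = df1.apply(count_nearby, axis=1)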

Answered By: SomeDude