Forming a symmetric matrix counting instances of being in same cluster
Question:
I have a database of cities that are divided into clusters for each year. In other words, I applied a community detection algorithm based on modularity to separate databases, each containing the cities in a different year.
The final database (a mock example) looks like this:
v1 city cluster year
0 "city1" 0 2000
1 "city2" 2 2000
2 "city3" 1 2000
3 "city4" 0 2000
4 "city5" 2 2000
0 "city1" 2 2001
1 "city2" 1 2001
2 "city3" 0 2001
3 "city4" 0 2001
4 "city5" 0 2001
0 "city1" 1 2002
1 "city2" 2 2002
2 "city3" 0 2002
3 "city4" 0 2002
4 "city5" 1 2002
Now I would like to count how many times each city ends up in the same cluster as each other city over the years.
So in the mock example above I should end up with a 5 x 5 symmetric matrix whose rows and columns are cities, where each entry is the number of times that cities i and j are in the same cluster (regardless of which cluster) across all years:
      city1 city2 city3 city4 city5
city1     .     0     0     1     1
city2     0     .     0     0     1
city3     0     0     .     2     1
city4     1     0     2     .     1
city5     1     1     1     1     .
I am working in Python, but a solution in MATLAB or R is fine too.
Thank you
Answers:
In R, co-occurrence matrices are computed straightforwardly with table and [t]crossprod. We can compute the matrices by year and take the sum, like so:
con <- textConnection('
v1 city cluster year
0 "city1" 0 2000
1 "city2" 2 2000
2 "city3" 1 2000
3 "city4" 0 2000
4 "city5" 2 2000
0 "city1" 2 2001
1 "city2" 1 2001
2 "city3" 0 2001
3 "city4" 0 2001
4 "city5" 0 2001
0 "city1" 1 2002
1 "city2" 2 2002
2 "city3" 0 2002
3 "city4" 0 2002
4 "city5" 1 2002
')
d <- read.table(con, header = TRUE)
close(con)
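# For each year, multiply the city-by-cluster indicator table by its transpose,
# then sum the resulting city-by-city matrices over years.
# Note: the 'simplify' argument of apply() requires R >= 4.1.0.
x <- with(d, Reduce(`+`, apply(table(city, cluster, year), 3L, tcrossprod, simplify = FALSE)))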
x
city
city city1 city2 city3 city4 city5
city1 3 0 0 1 1
city2 0 3 0 0 1
city3 0 0 3 2 1
city4 1 0 2 3 1
city5 1 1 1 1 3
There are threes on the diagonal because cities match themselves every year. If you prefer, say, zeros on the diagonal, then you can add:
diag(x) <- 0
If you don’t like the redundant annotation with "city", then you can add:
dimnames(x) <- unname(dimnames(x))
And if you want to store the result as a formally symmetric, formally sparse matrix, then you can add:
library(Matrix)
x <- as(x, "CsparseMatrix")
x
5 x 5 sparse Matrix of class "dsCMatrix"
city1 city2 city3 city4 city5
city1 . . . 1 1
city2 . . . . 1
city3 . . . 2 1
city4 1 . 2 . 1
city5 1 1 1 1 .
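For readers working in Python, as the question mentions, here is a minimal sketch of the same per-year idea using only the standard library. The names `rows` and `cooccurrence_by_year` and the (city, cluster, year) tuple layout are assumptions made for illustration. For each year it builds a city-by-cluster indicator matrix, multiplies it by its own transpose, and adds the result to a running total.

```python
# Mock data from the question as (city, cluster, year) tuples.
rows = [
    ("city1", 0, 2000), ("city2", 2, 2000), ("city3", 1, 2000),
    ("city4", 0, 2000), ("city5", 2, 2000),
    ("city1", 2, 2001), ("city2", 1, 2001), ("city3", 0, 2001),
    ("city4", 0, 2001), ("city5", 0, 2001),
    ("city1", 1, 2002), ("city2", 2, 2002), ("city3", 0, 2002),
    ("city4", 0, 2002), ("city5", 1, 2002),
]


def cooccurrence_by_year(rows):
    """Sum, over years, the product of each year's indicator matrix with its transpose."""
    cities = sorted({city for city, _, _ in rows})
    clusters = sorted({cluster for _, cluster, _ in rows})
    years = sorted({year for _, _, year in rows})
    city_idx = {c: i for i, c in enumerate(cities)}
    cluster_idx = {k: i for i, k in enumerate(clusters)}
    n = len(cities)
    total = [[0] * n for _ in range(n)]
    for y in years:
        # Indicator matrix for this year: ind[i][k] == 1 if city i is in cluster k.
        ind = [[0] * len(clusters) for _ in range(n)]
        for city, cluster, year in rows:
            if year == y:
                ind[city_idx[city]][cluster_idx[cluster]] += 1
        # Entry (i, j) of ind times its transpose is the dot product of rows i and j.
        for i in range(n):
            for j in range(n):
                total[i][j] += sum(a * b for a, b in zip(ind[i], ind[j]))
    return cities, total


cities, counts = cooccurrence_by_year(rows)

# Optional: zero the diagonal, since every city matches itself every year.
counts_no_diag = [
    [0 if i == j else v for j, v in enumerate(row)]
    for i, row in enumerate(counts)
]
```

On the mock data, `counts` has threes on the diagonal, and `counts_no_diag` has zeros there instead. With NumPy available, the inner double loop could presumably be replaced by a single matrix product per year.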
In R, we may use crossprod with table: paste the 'year' and 'cluster' columns into a single string, get the table of that string with city, apply crossprod on the output, then set the diagonal values to 0:
out <- `diag<-`(crossprod(with(df1, table(paste(year, cluster), city))), 0)
Output:
out
city
city city1 city2 city3 city4 city5
city1 0 0 0 1 1
city2 0 0 0 0 1
city3 0 0 0 2 1
city4 1 0 2 0 1
city5 1 1 1 1 0
If we need a sparse option:
library(Matrix)
Matrix(out)
5 x 5 sparse Matrix of class "dsCMatrix"
city
city city1 city2 city3 city4 city5
city1 . . . 1 1
city2 . . . . 1
city3 . . . 2 1
city4 1 . 2 . 1
city5 1 1 1 1 .
Data:
df1 <- structure(list(
  v1 = c(0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L),
  city = c("city1", "city2", "city3", "city4", "city5",
           "city1", "city2", "city3", "city4", "city5",
           "city1", "city2", "city3", "city4", "city5"),
  cluster = c(0L, 2L, 1L, 0L, 2L, 2L, 1L, 0L, 0L, 0L, 1L, 2L, 0L, 0L, 1L),
  year = c(2000L, 2000L, 2000L, 2000L, 2000L,
           2001L, 2001L, 2001L, 2001L, 2001L,
           2002L, 2002L, 2002L, 2002L, 2002L)
), class = "data.frame", row.names = c(NA, -15L))
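The trick of combining year and cluster into a single key also carries over to Python. What follows is a rough sketch under assumptions, not a translation of the R one-liner: the names `records`, `pair_counts` and `matrix` are hypothetical. It groups cities by (year, cluster) and counts each unordered pair of cities that share a group.

```python
from collections import defaultdict
from itertools import combinations

# Mock data from the question as (city, cluster, year) tuples.
records = [
    ("city1", 0, 2000), ("city2", 2, 2000), ("city3", 1, 2000),
    ("city4", 0, 2000), ("city5", 2, 2000),
    ("city1", 2, 2001), ("city2", 1, 2001), ("city3", 0, 2001),
    ("city4", 0, 2001), ("city5", 0, 2001),
    ("city1", 1, 2002), ("city2", 2, 2002), ("city3", 0, 2002),
    ("city4", 0, 2002), ("city5", 1, 2002),
]


def pair_counts(records):
    """Count how often each unordered pair of cities shares a (year, cluster) group."""
    groups = defaultdict(set)
    for city, cluster, year in records:
        # The (year, cluster) tuple plays the role of paste(year, cluster) in R.
        groups[(year, cluster)].add(city)
    counts = defaultdict(int)
    for members in groups.values():
        # Sorting makes each pair appear as (smaller name, larger name).
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return dict(counts)


pairs = pair_counts(records)

# Expand the pair counts into a full symmetric matrix with a zero diagonal.
names = sorted({city for city, _, _ in records})
matrix = [
    [0 if a == b else pairs.get((min(a, b), max(a, b)), 0) for b in names]
    for a in names
]
```

Storing only the pairs that actually co-occur keeps the result sparse, which may matter when there are many cities and few shared clusters. The dense `matrix` is built only at the end for display.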