How to combine columns into a new table – Python or R

Question

Scenario:

If I have this table, let’s call it df:

survey_answer_1___1	survey_answer_1___2	survey_answer_1___3	survey_answer_2___1
1	1	0	1
0	1	0	0
0	0	0	1
1	1	1	0

Using R or Python, how do I split and transform df into survey_answer_1 and survey_answer_2 like this:

survey_answer_1:

1	2	3
2	3	1

survey_answer_2:

1	2
2	0

Where the column names of the new tables are extracted from df column names after '___'. The values in the new cells is the count of 1s in each column in df. This should be done automatically (tables should not be "hard-coded"), as there are many other columns in my data file that this should be applied on as well.

split() can be used to extract the numbers after '___' for column names. I tried implementing the rest using a dictionary, but it is not working.

Asked By: fnaber

||

Source

Answer 1

For answer 1, you could do the following:

# grab correct columns
df_answer_1 = df[[col for col in df.columns if col.startswith('survey_answer_1')]] 

# change column names
df_answer_1.columns = [col[-1] for col in df_answer_1.columns]

# sum up columns
answer_1_sums = df_answer_1.sum()

You can do the same for answer 2.

Answered By: jprebys

Answer 2

Using R / tidyverse, first dplyr::summarize() all columns to sums; then tidyr::pivot_longer(); then split() by survey_answer; then purrr::map() over the resulting list to drop all-NA columns :

library(dplyr)
library(tidyr)
library(purrr)

survey_dfs <- df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  map((d) select(d, where((col) !all(is.na(col))) & !survey_answer))

survey_dfs

$survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

$survey_answer_2
# A tibble: 1 × 2
    `1`   `2`
  <dbl> <dbl>
1     2     0

This gives you a named list of dataframes, which is best practice in most cases. If you really want the resulting dataframes loose in the global environment, you can replace the map() call with an assign() call within purrr::iwalk():

df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  iwalk((d, dname) {
    d <- select(d, where((col) !all(is.na(col))) & !survey_answer)
    assign(dname, d, pos = 1)
  })

survey_answer_1

# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

Answered By: zephryl

Answer 3

Assuming data is in csv:

survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0

Read data:

import csv

with open('input.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    df = [row for row in reader]

Process data:

from collections import defaultdict, Counter

dd = defaultdict(Counter)
for row in df:
    for k, v in row.items():
        key1, key2 = k.split('___')
        dd[key1][int(key2)] += int(v)

Print result:

for k in dd:
    print(k, sorted(dd[k].items()))

Answered By: fen1x

Answer 4

Here an R example where the new columns can be arbitrary values

df <- as.data.frame(matrix(c(1,0,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0), 4, 5, dim=list(
  1:4, paste0("survey_answer_", c(1,1,1,2,2), "__", c(1,2,3,1,5)) )))

df 
#>   survey_answer_1__1 survey_answer_1__2 survey_answer_1__3 survey_answer_2__1
#> 1                  1                  1                  0                  1
#> 2                  0                  1                  0                  0
#> 3                  0                  0                  0                  1
#> 4                  1                  1                  1                  0
#>   survey_answer_2__5
#> 1                  0
#> 2                  0
#> 3                  0
#> 4                  0

var <- Map(c, names(df), strsplit(names(df), "__"))

result <- tapply(var, sapply(var,"[", 2), (x) 
       setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3)))

#to assign the resuilt list to new  datafrae variables:
list2env(result, environment())


survey_answer_1
#> 1 2 3 
#> 2 3 1
survey_answer_2
#> 1 5 
#> 2 0

Answered By: Ric Villalba

Answer 5

In Python:

# raw data
df = {"survey_answer_1___1":[1,0,0,1], "survey_answer_1___2":[1,1,0,1], "survey_answer_1___3":[0,0,0,1], "survey_answer_2___1":[1,0,1,0], "survey_answer_2___2":[0,0,0,0]}
# sum up the answers
for k in df:
    sum_df[k] = sum(df[k])
# extract answer_1
survey_answer_1 = {[k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_1")]}
survey_answer_1
{'1': 2, '2': 3, '3': 1}
# extract answer_2
survey_answer_2 = {k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_2")}
survey_answer_2
{'1': 2, '2': 0}

Answered By: Paul Smith

How to combine columns into a new table – Python or R

Question:

Answers: