How to combine columns into a new table – Python or R

Question:

Scenario:

If I have this table, let’s call it df:

survey_answer_1___1 survey_answer_1___2 survey_answer_1___3 survey_answer_2___1 survey_answer_2___2
1 1 0 1 0
0 1 0 0 0
0 0 0 1 0
1 1 1 0 0

Using R or Python, how do I split and transform df into survey_answer_1 and survey_answer_2 like this:

survey_answer_1:

1 2 3
2 3 1

survey_answer_2:

1 2
2 0

The column names of the new tables are the part of each df column name after '___', and the value in each new cell is the count of 1s in the corresponding df column. This should be done automatically (the tables should not be "hard-coded"), because my data file has many other columns that need the same treatment.

split() can be used to extract the numbers after '___' for column names. I tried implementing the rest using a dictionary, but it is not working.

Asked By: fnaber


Answers:

For answer 1, you could do the following:

# grab correct columns
df_answer_1 = df[[col for col in df.columns if col.startswith('survey_answer_1')]] 

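# change column names: keep only the part after '___'
# (col[-1] would break on suffixes with more than one digit, e.g. '___10')
df_answer_1.columns = [col.split('___')[-1] for col in df_answer_1.columns]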

# sum up columns
answer_1_sums = df_answer_1.sum()

You can do the same for answer 2.
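
Since the question asks for this to happen automatically for every survey question, the same idea can be looped over all columns instead of being repeated per prefix. The following is only a sketch, assuming every column name contains '___' and holds 0/1 values; the name `tables` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "survey_answer_1___1": [1, 0, 0, 1],
    "survey_answer_1___2": [1, 1, 0, 1],
    "survey_answer_1___3": [0, 0, 0, 1],
    "survey_answer_2___1": [1, 0, 1, 0],
    "survey_answer_2___2": [0, 0, 0, 0],
})

# Sum every column (values are 0/1, so the sum is the count of 1s),
# then file each total under the part of its name before '___'.
counts = {}
for col, total in df.sum().items():
    question, option = col.rsplit("___", 1)
    counts.setdefault(question, {})[option] = int(total)

# One single-row DataFrame per survey question, keyed by question name.
tables = {q: pd.DataFrame([c]) for q, c in counts.items()}

for name, table in tables.items():
    print(name)
    print(table)
```

Keeping the results in a dict avoids creating one variable per question; `tables["survey_answer_1"]` gives the first table.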

Answered By: jprebys

Using R / tidyverse: first dplyr::summarize() all columns to sums, then tidyr::pivot_longer(), then split() by survey_answer, and finally purrr::map() over the resulting list to drop the all-NA columns and the survey_answer column:

library(dplyr)
library(tidyr)
library(purrr)

survey_dfs <- df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  map(\(d) select(d, where(\(col) !all(is.na(col))) & !survey_answer))

survey_dfs 
$survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1

$survey_answer_2
# A tibble: 1 × 2
    `1`   `2`
  <dbl> <dbl>
1     2     0

This gives you a named list of dataframes, which is best practice in most cases. If you really want the resulting dataframes loose in the global environment, you can replace the map() call with an assign() call within purrr::iwalk():

df %>% 
  summarize(across(everything(), sum)) %>% 
  pivot_longer(
    everything(), 
    names_to = c("survey_answer", ".value"), 
    names_sep = "___"
  ) %>% 
  split(.$survey_answer, drop = TRUE) %>% 
  iwalk(\(d, dname) {
    d <- select(d, where(\(col) !all(is.na(col))) & !survey_answer)
    assign(dname, d, pos = 1)
  })

survey_answer_1
# A tibble: 1 × 3
    `1`   `2`   `3`
  <dbl> <dbl> <dbl>
1     2     3     1
Answered By: zephryl

Assuming the data is in a CSV file (input.csv):

survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0

Read data:

import csv

with open('input.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    df = [row for row in reader]

Process data:

from collections import defaultdict, Counter

dd = defaultdict(Counter)
for row in df:
    for k, v in row.items():
        key1, key2 = k.split('___')
        dd[key1][int(key2)] += int(v)

Print result:

for k in dd:
    print(k, sorted(dd[k].items()))
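
To try the three steps above without a file on disk, they can be combined into one script. This is a sketch under one assumption: `io.StringIO` stands in for input.csv, and the names `csv_text`, `counts` and `result` are chosen here for illustration.

```python
import csv
import io
from collections import Counter, defaultdict

# Stand-in for input.csv (assumption for this sketch): same rows as above.
csv_text = """survey_answer_1___1,survey_answer_1___2,survey_answer_1___3,survey_answer_2___1,survey_answer_2___2
1,1,0,1,0
0,1,0,0,0
0,0,0,1,0
1,1,1,0,0
"""

# counts[question][option] accumulates the number of 1s per column.
counts = defaultdict(Counter)
for row in csv.DictReader(io.StringIO(csv_text)):
    for name, value in row.items():
        question, option = name.rsplit("___", 1)
        counts[question][int(option)] += int(value)

# Plain dicts with options in ascending order, for easier printing.
result = {q: dict(sorted(c.items())) for q, c in counts.items()}
print(result)
# {'survey_answer_1': {1: 2, 2: 3, 3: 1}, 'survey_answer_2': {1: 2, 2: 0}}
```

Because the counts are built with `+=`, an option that nobody selected still appears with a count of 0.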
Answered By: fen1x

Here is an R example where the suffixes that become the new column names can be arbitrary values (note that this example uses "__" as the separator):

df <- as.data.frame(matrix(c(1,0,0,1,1,1,0,1,0,0,0,1,1,0,1,0,0,0,0,0), 4, 5, dimnames=list(
  1:4, paste0("survey_answer_", c(1,1,1,2,2), "__", c(1,2,3,1,5)) )))

df 
#>   survey_answer_1__1 survey_answer_1__2 survey_answer_1__3 survey_answer_2__1
#> 1                  1                  1                  0                  1
#> 2                  0                  1                  0                  0
#> 3                  0                  0                  0                  1
#> 4                  1                  1                  1                  0
#>   survey_answer_2__5
#> 1                  0
#> 2                  0
#> 3                  0
#> 4                  0

var <- Map(c, names(df), strsplit(names(df), "__"))

result <- tapply(var, sapply(var,"[", 2), \(x) 
       setNames(colSums(df[sapply(x,"[",1)]) , sapply(x,"[",3)))

# to assign the elements of the result list to new variables:
list2env(result, environment())


survey_answer_1
#> 1 2 3 
#> 2 3 1
survey_answer_2
#> 1 5 
#> 2 0
Answered By: Ric Villalba

In Python:

# raw data
df = {"survey_answer_1___1":[1,0,0,1], "survey_answer_1___2":[1,1,0,1], "survey_answer_1___3":[0,0,0,1], "survey_answer_2___1":[1,0,1,0], "survey_answer_2___2":[0,0,0,0]}
# sum up the answers
sum_df = {}
for k in df:
    sum_df[k] = sum(df[k])
# extract answer_1
survey_answer_1 = {k[-1]: sum_df[k] for k in sum_df if k.startswith("survey_answer_1")}
survey_answer_1
{'1': 2, '2': 3, '3': 1}
# extract answer_2
survey_answer_2 = {k[-1]:sum_df[k] for k in sum_df if k.startswith("survey_answer_2")}
survey_answer_2
{'1': 2, '2': 0}
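
The question mentions that the real data has many other columns, so the prefixes probably should not be typed out by hand. A possible generalization is sketched below; the helper `group_counts` and the extra `respondent_id` column are hypothetical, added only to show that columns without '___' are skipped.

```python
def group_counts(columns, sep="___"):
    """Group per-column counts of 1s by the text before the last `sep`."""
    grouped = {}
    for name, values in columns.items():
        question, found, option = name.rpartition(sep)
        if not found:
            continue  # no separator in the name: not a survey column
        grouped.setdefault(question, {})[option] = sum(values)
    return grouped


# Data from above plus one hypothetical non-survey column.
data = {
    "survey_answer_1___1": [1, 0, 0, 1],
    "survey_answer_1___2": [1, 1, 0, 1],
    "survey_answer_1___3": [0, 0, 0, 1],
    "survey_answer_2___1": [1, 0, 1, 0],
    "survey_answer_2___2": [0, 0, 0, 0],
    "respondent_id": [101, 102, 103, 104],
}

tables = group_counts(data)
print(tables["survey_answer_1"])  # {'1': 2, '2': 3, '3': 1}
print(tables["survey_answer_2"])  # {'1': 2, '2': 0}
```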
Answered By: Paul Smith