Split string column to multiple columns in R/Python

Question:

I’m trying to split the following string

str = "A (B) C, D (E) F, G, H, a (b) c"

into 9 separate strings like:
A, B, C, D, E, {F, G, H}, a, b, c

I’ve tried

library(stringi)

str = "A (B) C, D (E) F, G, H, a (b) c"
strr = stri_split_regex(str, "\\(.*?\\)")
strr

which returns strr as
A, {C, D}, {F, G, H, a}, c

The actual string I’m working with looks like

str2 = "Independent Spirit Award  (Co-Nominee)  for Anomalisa, Academy Award  (Co-Nominee)  for Anomalisa, Independent Spirit Award  (Co-Winner)  for Synecdoche, New York, Independent Spirit Award  (Nominee)  for Synecdoche, New York"

and I want that to be separated into

Independent Spirit Award; Co-Nominee; for Anomalisa; Academy Award; Co-Nominee; for Anomalisa; Independent Spirit Award; Co-Winner; for Synecdoche, New York; Independent Spirit Award; Nominee; for Synecdoche, New York;

So I think I need to split the string at the brackets, keeping the text both inside and outside of them. The tricky part is that the commas are placed irregularly: I only want to split at the last comma before the next '(', so that the text right after that comma ends up in a separate column.

Asked By: alexj

||

Answers:

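This pattern splits on an opening or closing parenthesis, or on the last comma before an opening parenthesis, and it also consumes any whitespace adjacent to the delimiter. The lookahead (?=[^,]+\)) only accepts a comma that has no other comma between it and the next closing parenthesis, which is why "F, G, H" and "Synecdoche, New York" stay intact.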

For str:

library(stringi)

stri_split_regex(str, "\\s*(\\(|\\)|,(?=[^,]+\\)))\\s*")
[[1]]
[1] "A"       "B"       "C"       "D"       "E"       "F, G, H" "a"      
[8] "b"       "c"

For str2:

stri_split_regex(str2, "\\s*(\\(|\\)|,(?=[^,]+\\)))\\s*")
[[1]]
 [1] "Independent Spirit Award" "Co-Nominee"              
 [3] "for Anomalisa"            "Academy Award"           
 [5] "Co-Nominee"               "for Anomalisa"           
 [7] "Independent Spirit Award" "Co-Winner"               
 [9] "for Synecdoche, New York" "Independent Spirit Award"
[11] "Nominee"                  "for Synecdoche, New York"
Answered By: zephryl
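
Since the title also mentions Python, here is a minimal sketch of the same idea using the standard re module. It is not part of the answer above. The names split_fields and to_rows are hypothetical helpers introduced only for illustration, and to_rows assumes every record consists of exactly three parts (text before the brackets, text inside them, text after them). The group is made non-capturing because re.split would otherwise include the matched delimiters in its result.

```python
import re

# Same pattern as in the stringi answer, but with a non-capturing group:
# with a capturing group, re.split would also return the delimiters.
PATTERN = re.compile(r"\s*(?:\(|\)|,(?=[^,]+\)))\s*")


def split_fields(text):
    """Split on '(' and ')' and on the last comma before each '('."""
    return PATTERN.split(text)


def to_rows(fields, width=3):
    """Group the flat list into rows of `width` items (assumed: 3 per record)."""
    return [fields[i:i + width] for i in range(0, len(fields), width)]


s = "A (B) C, D (E) F, G, H, a (b) c"
print(split_fields(s))
# ['A', 'B', 'C', 'D', 'E', 'F, G, H', 'a', 'b', 'c']

print(to_rows(split_fields(s)))
# [['A', 'B', 'C'], ['D', 'E', 'F, G, H'], ['a', 'b', 'c']]
```

If the rows are meant to become columns of a data frame, the list returned by to_rows could be passed to a data frame constructor. That step is left out here because the question does not show the surrounding table.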