Split a dataframe by length of characters

Question:

I have a table like

--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------

I want to split the above dataframe so it contains the three columns below.

----------------------
a | b   | c          |
----------------------
1 | M A | HELLO,WORLD|
2 | M 1A| HELLO WORLD|
----------------------

I have used the code below, but it does not work as expected: splitting on every comma breaks HELLO,WORLD into two pieces. Is there a way to put characters 2-5 in column b and all characters after the fifth in column c?

df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])
Asked By: lunbox

Answers:

You can use df.Val.str.extract(...) to split a string column into multiple columns.
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
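
As a rough sketch of what that could look like with the sample rows from the question: the regular expression below is my own assumption about the layout (two comma-delimited fields, then everything else), not something taken from the question, so adjust it to your real data.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Val": ["1, M A ,HELLO,WORLD", "2, M 1A,HELLO WORLD"]})

# Assumed pattern: text up to the first comma -> a,
# text up to the second comma -> b, everything after that -> c
pattern = r"^(?P<a>[^,]*),(?P<b>[^,]*),(?P<c>.*)$"
out = df["Val"].str.extract(pattern)

# extract keeps surrounding whitespace; strip it if it is not wanted
out = out.apply(lambda col: col.str.strip())
print(out)
```

The named groups become the column names, so no separate renaming step is needed.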

If you are having this issue because the file you are reading is actually fixed-width rather than comma-separated, you might want to use read_fwf(...) from the pandas library to avoid the problem in the first place.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
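
A minimal sketch of the fixed-width route, assuming the file really does contain lines shaped like the two sample rows. The in-memory `raw` string stands in for your file, and the column offsets are assumptions read off the sample, so check them against the actual data.

```python
import io
import pandas as pd

# Stand-in for the real file: the two rows from the question (19 characters each)
raw = "1, M A ,HELLO,WORLD\n2, M 1A,HELLO WORLD\n"

# Assumed field boundaries as half-open [start, end) character offsets;
# the commas at offsets 1 and 7 fall between the fields and are skipped
colspecs = [(0, 1), (2, 7), (8, 19)]

fwf_df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    header=None,            # the sample has no header line
    names=["a", "b", "c"],
    dtype=str,              # keep every field as text
)
print(fwf_df)
```

read_fwf strips the padding spaces from each field by default, so column b should come back without the leading and trailing blanks.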

Answered By: Erik

Why split the dataframe by character length if the goal is really to split on the first occurrence of a comma? If that is the goal:

df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])

This gives:

   b             c
0  M A   HELLO,WORLD
1  M 1A  HELLO WORLD
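
Note that the sample rows in the question start with a number, so splitting on only the first comma would put that number in its own column and everything else in the second. A sketch of the same idea with the split limited to the first two commas, which should give the three columns asked for (the name three_cols is just illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Val": ["1, M A ,HELLO,WORLD", "2, M 1A,HELLO WORLD"]})

# maxsplit=2 stops after the second comma, so any later comma stays in c
three_cols = pd.DataFrame(
    [v.split(",", 2) for v in df.Val],
    columns=["a", "b", "c"],
)
print(three_cols)
```

The values keep their surrounding spaces; call .str.strip() on a column if you want them removed.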
Answered By: Bhargav

If you want to do this in PySpark, you need F.concat_ws, which concatenates the elements of an array and returns them as a single string, and F.slice, which takes a given number of elements from an array starting at a given position (positions are 1-based). Because you have to specify that length, you also need the size of the array, which you can get with F.size.

from pyspark.sql import functions as F

(
    df
    .withColumn('Splitted', F.split(df['val'], ','))
    .withColumn('a', F.col('Splitted')[0])
    .withColumn('b', F.col('Splitted')[1])
    .withColumn(
        'c',
        # rejoin everything from the third element onwards
        F.concat_ws(
            ',',
            F.slice('Splitted', 3, F.size('Splitted') - 2)))
).show()

Output:

+-------------------+--------------------+---+-----+-----------+
|                val|            Splitted|  a|    b|          c|
+-------------------+--------------------+---+-----+-----------+
|1, M A ,HELLO,WORLD|[1,  M A , HELLO,...|  1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2,  M 1A, HELLO ...|  2| M 1A|HELLO WORLD|
+-------------------+--------------------+---+-----+-----------+
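
To see why the slice starts at 3 and has length size - 2, here is the same sequence of steps mimicked in plain Python. This is only an illustration of the logic, not PySpark code, and the function name split_keep_rest is made up for the example.

```python
def split_keep_rest(val: str) -> tuple:
    splitted = val.split(",")             # plays the role of F.split(df['val'], ',')
    a = splitted[0]                       # F.col('Splitted')[0]
    b = splitted[1]                       # F.col('Splitted')[1]
    # F.slice counts from 1, so start=3 means "the third element";
    # the first two elements are already used, leaving size - 2 elements
    start, length = 3, len(splitted) - 2
    rest = splitted[start - 1:start - 1 + length]
    c = ",".join(rest)                    # plays the role of F.concat_ws(',', ...)
    return a, b, c


print(split_keep_rest("1, M A ,HELLO,WORLD"))
print(split_keep_rest("2, M 1A,HELLO WORLD"))
```

For the first row the array has four elements, so the slice has length 2 and the comma inside HELLO,WORLD is restored by the join.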

If splitting the strings from your original table df with column 'Val' is purely position-based (as you write), you can slice them as follows to obtain df1:

df1 = pd.DataFrame(columns=['a', 'b', 'c'],
                   data=[[row[0], row[2:7], row[8:]] for row in df.Val])
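
The same position-based slicing can also be written with the pandas .str accessor instead of a Python loop. A small sketch using the sample rows from the question, with the offsets assumed from those two rows:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"Val": ["1, M A ,HELLO,WORLD", "2, M 1A,HELLO WORLD"]})

# .str[...] applies ordinary Python string indexing to every row
df1 = pd.DataFrame({
    "a": df["Val"].str[0],     # first character
    "b": df["Val"].str[2:7],   # characters at offsets 2 to 6
    "c": df["Val"].str[8:],    # everything from offset 8 onwards
})
print(df1)
```

This only works while every row has its fields at the same offsets; a first field of two digits would shift everything by one.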
Answered By: Westfalenmats