Split a dataframe by length of characters
Question:
I have a table like
--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------
I want to split the above dataframe so it contains the three columns below.
----------------------
a | b | c |
----------------------
1 | M A | HELLO,WORLD|
1 | M 1A| HELLO WORLD|
----------------------
I have used the below code, but it does not work as expected. Is there a way to contain all characters after 5 characters in column c, etc. and character 2-5 in column b?
df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])
Answers:
You can use df.Val.str.extract(...)
to split a string column into multiple columns.
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
If you are having this issue because the file you are reading is actually formatted in fixed-width instead of comma-separated for example you might want to use read_fwf(...)
from the pandas library to avoid this problem in the first place.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
Why split a dataframe by length of characters to achieve a goal for splitting based on the first occurrence of comma? If so,
df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])
Gives #
b c
0 M A HELLO,WORLD
1 M 1A HELLO WORLD
If you want to this in PySpark, you need F.concat_ws which concatenates elements of a list and return it as string, and you need F.slice which slices elements of a list from ‘head’ with a specified ‘length’. Because you need to set the length, you need the size of the array which you can have with F.size.
(
df
.withColumn('Splitted', F.split(df['val'], ','))
.withColumn('a', F.col('Splitted')[0])
.withColumn('b', F.col('Splitted')[1])
.withColumn(
'c',
F.concat_ws(
',',
F.slice('Splitted', 3, F.size('Splitted') - 2)))
).show()
Output:
+-------------------+--------------------+---+-----+-----------+
| val| Splitted| a| b| c|
+-------------------+--------------------+---+-----+-----------+
|1, M A ,HELLO,WORLD|[1, M A , HELLO,...| 1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2, M 1A, HELLO ...| 2| M 1A|HELLO WORLD|
+-------------------+--------------------+---+-----+-----------+
If splitting the strings from your original table df with column ‘Val’ is just based on positions (as you write), you can slice them as follows to obtain df1
df1 = pd.DataFrame(columns = ['a', 'b', 'c'],
data = [[row[0], row[2:7], row[8:]] for row in df.Val]])
I have a table like
--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------
I want to split the above dataframe so it contains the three columns below.
----------------------
a | b | c |
----------------------
1 | M A | HELLO,WORLD|
1 | M 1A| HELLO WORLD|
----------------------
I have used the below code, but it does not work as expected. Is there a way to contain all characters after 5 characters in column c, etc. and character 2-5 in column b?
df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])
You can use df.Val.str.extract(...)
to split a string column into multiple columns.
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
If you are having this issue because the file you are reading is actually formatted in fixed-width instead of comma-separated for example you might want to use read_fwf(...)
from the pandas library to avoid this problem in the first place.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
Why split a dataframe by length of characters to achieve a goal for splitting based on the first occurrence of comma? If so,
df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])
Gives #
b c
0 M A HELLO,WORLD
1 M 1A HELLO WORLD
If you want to this in PySpark, you need F.concat_ws which concatenates elements of a list and return it as string, and you need F.slice which slices elements of a list from ‘head’ with a specified ‘length’. Because you need to set the length, you need the size of the array which you can have with F.size.
(
df
.withColumn('Splitted', F.split(df['val'], ','))
.withColumn('a', F.col('Splitted')[0])
.withColumn('b', F.col('Splitted')[1])
.withColumn(
'c',
F.concat_ws(
',',
F.slice('Splitted', 3, F.size('Splitted') - 2)))
).show()
Output:
+-------------------+--------------------+---+-----+-----------+
| val| Splitted| a| b| c|
+-------------------+--------------------+---+-----+-----------+
|1, M A ,HELLO,WORLD|[1, M A , HELLO,...| 1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2, M 1A, HELLO ...| 2| M 1A|HELLO WORLD|
+-------------------+--------------------+---+-----+-----------+
If splitting the strings from your original table df with column ‘Val’ is just based on positions (as you write), you can slice them as follows to obtain df1
df1 = pd.DataFrame(columns = ['a', 'b', 'c'],
data = [[row[0], row[2:7], row[8:]] for row in df.Val]])