Get unique list of chars used in a given column
Question:
I have a csv file that I process with pandas. The column is called raw_value.
I want to retrieve the unique chars in this column.
x=df.manual_raw_value.unique()
gives me the unique rows. However, I want the set of all distinct chars used in this column,
which is:
alphabet= 6 , 3 5 1 8 V O T R E A 2 . é è / :
raw_value
6,35
11,68
VOTRE
AVEL AR VRO
2292
questions.
nb
les
937,99
à
et
TTC
1
620
Echéance
vos
ROB21
Pièce
AGRIAL
désignation
des
taux
13s
2
par
le
mois,
32
21/07/2016
FR
au
0
téléphonique
BROYEUR
et
ST
TVA
de
des
ECHEANCIER
à
ne
lieu
481,67
N°0016
de
ministère
de
20/11/2015
Si
vous
59
cas
EUR
3.19
2
contrôle
assurances
BAS
et
4423873
renseignements
6104219
C9DECOMPTEDIVERS
6635
DE
10825
EDIT_1
All three solutions work perfectly.
I chose the second one:
set(df.raw_value.apply(list).sum())
However, it returns some encoded chars. Is it related to encoding?
How can I decode and display the real chars? Here is what it prints:
{' ',
'!',
'"',
'%',
'&',
"'",
'(',
')',
'*',
'+',
',',
'-',
'.',
'/',
'0',
'1',
'2',
'3',
'4',
'5',
'6',
'7',
'8',
'9',
':',
'=',
'>',
'?',
'@',
'_',
'a',
'b',
'c',
'd',
'e',
'f',
'g',
'h',
'i',
'j',
'k',
'l',
'm',
'n',
'o',
'p',
'q',
'r',
's',
't',
'u',
'v',
'w',
'x',
'y',
'z',
'\x82',
'\x87',
'\x94',
'\xa1',
'\xa7',
'\xaa',
'\xab',
'\xac',
'\xae',
'\xaf',
'\xb0',
'\xb4',
'\xb9',
'\xbb',
'\xc2',
'\xc3',
'\xe2'}
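On the EDIT_1 output: entries such as xc3, xc2 and xe2 look like the individual bytes of UTF-8 encoded accented chars, which is what you would get if the column holds byte strings instead of decoded text. That is an assumption about the data, not something confirmed in this thread. A minimal Python 3 sketch of the difference, using a word from the sample:

```python
# Assumption: the column values are UTF-8 encoded bytes, not decoded text.
text = "Pièce"
raw_bytes = text.encode("utf-8")

# Splitting the byte string yields single bytes; 'è' becomes two of them.
byte_level = {raw_bytes[i:i + 1] for i in range(len(raw_bytes))}

# Decoding first yields whole characters.
char_level = set(raw_bytes.decode("utf-8"))

print(sorted(byte_level))
print(sorted(char_level))
```

If that is the cause, decoding the values as UTF-8 before building the set should give the real chars.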
Answers:
You can first convert each raw value to a list of chars, then stack the result into a single Series of chars and get the unique elements.
df.applymap(list).raw_value.apply(pd.Series).stack().unique()
Out[620]: array(['6', ',', '3', ..., 'ô', 'D', 'M'], dtype=object)
You can also convert each raw value to a list, concatenate the lists, and take the set of the result.
set(df.raw_value.apply(list).sum())
An even simpler approach is to concatenate the raw values directly into one string and apply set to it, since a string is itself a sequence of chars.
set(df.raw_value.sum())
Note that the first approach will include nan in the results, while the second and third approaches exclude nan.
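Since missing values come up here, a small sketch of the string-based idea that drops them explicitly may help. The tiny DataFrame is made up for illustration, and the sketch uses str.join instead of .sum(), which is a swapped-in variant and not the code from this answer:

```python
import pandas as pd

# Hypothetical sample: three values from the question plus one missing entry.
df = pd.DataFrame({"raw_value": ["6,35", "VOTRE", None, "Pièce"]})

# Drop missing values, make sure everything is text, then join into one string.
values = df["raw_value"].dropna().astype(str)
unique_chars = set("".join(values))

print(sorted(unique_chars))
```

Dropping the missing values up front keeps the result independent of how each approach happens to treat nan.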
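I know this question has been answered already, but here is another way: join all the values into one string and take its set. Joining with an empty string avoids adding a space that is not in the data.
x = set(''.join(df.raw_value.values))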
There is another way, using reduce:
from functools import reduce
# Each value is a string, so *a and *b unpack into individual chars.
reduce(lambda a, b: set((*a, *b)), df['raw_value'])
If your dataframe is large but there are some characters you already know appear in your column, you can speed this up by using strip() to remove those characters before concatenating. Keep in mind that strip() only trims them from the start and end of each value, and that the known characters then have to be added back to the result. Also, you can add the strings directly instead of adding lists. For example, the following code assumes you know the digits 0123456789 appear in your column.
set(df['raw_value'].str.strip('0123456789').sum()) | set('0123456789')
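To make the effect of strip() concrete, here is a pure-Python sketch on a few values from the question's sample (pandas' .str.strip trims the same way). The variable names are just for illustration:

```python
import string

values = ["13s", "ROB21", "C9DECOMPTEDIVERS", "2292"]
known = set(string.digits)  # assumption: all ten digits occur in the column

# strip() only trims leading and trailing digits; the 9 inside the third value stays.
stripped = [v.strip(string.digits) for v in values]
print(stripped)

# Add the known characters back, since stripping removed most of them.
unique_chars = set("".join(stripped)) | known
print(sorted(unique_chars))
```

The gain comes from the shorter strings being concatenated. The result is only right if the characters you strip really do all occur in the column.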
I don't think converting the raw values to lists and concatenating them is a good idea: it takes a lot of memory and time. Declaring a set and updating it in place should be much faster:
unique_characters = set()
df.raw_value.apply(lambda x: unique_characters.update(x))
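The same idea can be written as a plain loop, which avoids building a throwaway Series of None values from apply. This is a sketch on a made-up three-row sample, with missing values dropped first as an extra precaution:

```python
import pandas as pd

# Hypothetical sample based on the first rows of the question's data.
df = pd.DataFrame({"raw_value": ["6,35", "11,68", "VOTRE", None]})

unique_characters = set()
for value in df["raw_value"].dropna():
    # set.update adds every char of the string to the set in place.
    unique_characters.update(value)

print(sorted(unique_characters))
```

Only one set is kept in memory, and it never grows beyond the number of distinct chars.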