Get unique list of chars used in a given column

Question

I have a CSV file that I process with pandas. The column is called raw_value, and I want to retrieve the unique characters that appear in this column.

x=df.manual_raw_value.unique()

This returns the unique rows. However, I want to retrieve every distinct character used in this column,
which would be:
alphabet= 6 , 3 5 1 8 V O T R E A 2 . é è / :

   raw_value
    6,35
    11,68
    VOTRE
    AVEL AR VRO
    2292
    questions.
    nb
    les
    937,99
    à
    et
    TTC
    1
    620
    Echéance
    vos
    ROB21
    Pièce
    AGRIAL
    désignation
    des
    taux
    13s
    2
    par
    le
    mois,
    32
    21/07/2016
    FR
    au
    0
    téléphonique
    BROYEUR
    et
    ST
    TVA
    de
    des
    ECHEANCIER
    à
    ne
    lieu
    481,67
    N°0016
    de
    ministère
    de
    20/11/2015
    Si
    vous
    59
    cas
    EUR
    3.19
    2
    contrôle
    assurances
    BAS
    et
    4423873
    renseignements
    6104219
    C9DECOMPTEDIVERS
    6635
    DE
    10825

EDIT_1

All three solutions work perfectly.
I chose the second one:

set(df.raw_value.apply(list).sum())

However, it returns some encoded characters. Is this related to encoding?
How can I decode them and display the real characters? Here is what it prints:

{' ',
 '!',
 '"',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ':',
 '=',
 '>',
 '?',
 '@',
 '_',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '\x82',
 '\x87',
 '\x94',
 '\xa1',
 '\xa7',
 '\xaa',
 '\xab',
 '\xac',
 '\xae',
 '\xaf',
 '\xb0',
 '\xb4',
 '\xb9',
 '\xbb',
 '\xc2',
 '\xc3',
 '\xe2'}

Asked By: vincent75


Answer 1

You can first convert each raw value to a list of characters, then stack the result into a single Series of characters and take its unique elements.

df.applymap(list).raw_value.apply(pd.Series).stack().unique()
Out[620]: array(['6', ',', '3', ..., 'ô', 'D', 'M'], dtype=object)

You can also do this by converting each raw value to a list, concatenating the lists, and then taking the set of the combined list.

set(df.raw_value.apply(list).sum())

An even simpler approach is to concatenate the raw values directly into one string and apply set to it, because a string is essentially a sequence of characters.

set(df.raw_value.sum())

Note that the first approach includes NaN in the results, while the second and third approaches exclude it.
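
As a rough, self-contained illustration of the third approach, here is a minimal sketch on a tiny hypothetical frame. The sample rows are made up, and the explicit dropna() and astype(str) calls are assumptions added so that missing or non-string values cannot end up in the joined text.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
df = pd.DataFrame({"raw_value": ["6,35", "VOTRE", None, "Echéance"]})

# Drop missing values, force everything to str, and join once into one string.
text = "".join(df["raw_value"].dropna().astype(str))

# A set of a string is the set of its characters; sort it for stable display.
unique_chars = sorted(set(text))
print(unique_chars)
```

Joining once with "".join avoids building many intermediate strings or lists, which is what repeated addition through sum() does.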

Answered By: Allen Qin

Answer 2

I know this question has been answered already, but here is another way to answer it:

x = set(' '.join(df.manual_raw_value.values))

Answered By: Ziyad Moraished

Answer 3

There is another way:

from functools import reduce
import numpy as np

reduce(lambda a, b: set((*a, *b)), df['raw_value'].apply(np.array))

Answered By: jedlin

Answer 4

If your dataframe is large and you already know that certain characters appear in your column, you can speed this up by using strip() to remove those characters from the ends of each value. You can also concatenate the values as strings instead of adding lists. For example, the following code assumes you know that the digits 0123456789 appear in your column.

set(list(df['raw_value'].str.strip('0123456789').sum()))
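
A minimal sketch of that idea on a hypothetical four-row frame. Because str.strip() only trims the start and end of each value, and the trimmed characters disappear from the result, the known characters are unioned back in at the end. The sample values and the name `known` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"raw_value": ["6,35", "ROB21", "2292", "13s"]})

# Characters we assume are present in the column.
known = set("0123456789")

# Trim known characters from both ends of every value:
# "6,35" -> ",", "ROB21" -> "ROB", "2292" -> "", "13s" -> "s"
stripped = df["raw_value"].str.strip("0123456789")

# Collect what is left and add the known characters back.
unique_chars = set("".join(stripped)) | known
print(sorted(unique_chars))
```

The shortcut is only valid if every character in `known` really does occur in the column. Otherwise the union reports characters that are not there.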

Answered By: user7868

Answer 5

I don't think converting the raw values to lists and concatenating them is a good idea, because it needs a lot of memory and time. Declaring a set and updating it in place should be much faster:

unique_characters = set()
df.raw_value.apply(lambda x: unique_characters.update(x))
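
The same idea can be written as a plain loop, which avoids building a throwaway Series of None values from apply(). This is only a sketch on made-up sample rows. The dropna() call and the str() conversion are assumptions added to guard against missing or non-string cells.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
df = pd.DataFrame({"raw_value": ["à", "TTC", None, "21/07/2016"]})

unique_characters = set()
for value in df["raw_value"].dropna():
    # set.update() iterates over the string and adds each character.
    unique_characters.update(str(value))

print(sorted(unique_characters))
```

Only the set of distinct characters is held in memory, which stays small no matter how many rows there are.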

Answered By: Zorojuro
