Remove duplicate numbers separated by a symbol in a string using Hive's REGEXP_REPLACE

Question:

I have a Spark DataFrame with a string column containing numbers separated by ;, for example: 862;1595;17;862;49;862;19;100;17;49. I would like to remove the duplicated numbers, leaving the following: 862;1595;17;49;19;100

As far as patterns go, I have tried:

  1. "\b(\d+(?:\.\d+)?) ([^;]+); (?=.*\b\1 \2\b)
  2. (?<=b1:.*)b(w+):?
  3. \b(+)\b(?=.*?\b1\b)
  4. (b[^,]+)(?=.*, *1(?:,|$)), *

But nothing has yielded what I need thus far.

Asked By: Cyrus Mohammadian


Answers:

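Try the following query, which removes duplicate numbers from a string column. It drops every number that appears again later in the string, so the last occurrence of each number is the one that is kept: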

SELECT regexp_replace
       (
           your_column,
           '(?<=^|;)(?<num>.*?);(?=.*(?<=;)\k<num>(?=;|$))',
           ''
       )
FROM your_table;
Answered By: RomanPerekhrest
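
A note on what the pattern does, as a minimal sketch rather than part of the answer itself. Hive's regexp_replace (and Spark's) is, as far as I know, backed by java.util.regex, so the pattern can be tried out in plain Java. The class and method names below (DedupDemo, removeEarlierDuplicates, keepFirstOccurrences) are made up for illustration. The first method applies the pattern from the answer unchanged. The second is a different technique, not a regex: it splits on ; and keeps the first occurrence of each number, which is the order the question asks for.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class DedupDemo {
    // The pattern from the answer as a Java string literal
    // (the backslash in \k<num> has to be doubled in Java source).
    static final String PATTERN =
        "(?<=^|;)(?<num>.*?);(?=.*(?<=;)\\k<num>(?=;|$))";

    // Removes "n;" wherever the same n appears again later in the string,
    // so the LAST occurrence of each number survives.
    static String removeEarlierDuplicates(String s) {
        return s.replaceAll(PATTERN, "");
    }

    // Alternative without a regex lookahead: split on ';' and let a
    // LinkedHashSet drop repeats while preserving first-seen order.
    static String keepFirstOccurrences(String s) {
        Set<String> seen = new LinkedHashSet<>(Arrays.asList(s.split(";")));
        return String.join(";", seen);
    }

    public static void main(String[] args) {
        String input = "862;1595;17;862;49;862;19;100;17;49";
        System.out.println(removeEarlierDuplicates(input)); // 1595;862;19;100;17;49
        System.out.println(keepFirstOccurrences(input));    // 862;1595;17;49;19;100
    }
}
```

The regex result contains the right set of numbers but not in the order shown in the question, because it is the earlier copies that get removed. If first-occurrence order matters, then in Spark 2.4 or later the built-in functions split, array_distinct and array_join can probably be combined to the same effect as the second method, although as far as I can tell the documentation does not spell out an ordering guarantee for array_distinct.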