hive

Remove duplicate numbers separated by a symbol in a string using Hive's REGEXP_REPLACE

Question: I have a Spark DataFrame with a string column that contains numbers separated by ;, for example: 862;1595;17;862;49;862;19;100;17;49. I would like to remove the duplicated numbers, leaving the following: 862;1595;17;49;19;100. As far as patterns go, I have tried "\b(\d+(?:\.\d+)?) ([^;]+); (?=.*\b\1 …

Total answers: 1

Avro, Hive or HBASE – What to use for 10 mio. records daily?

Question: I have the following requirements: I need to process around 20,000 elements per day (let's call them baskets), each of which generates between 100 and 1,000 records (let's call them products in basket). A single record has about 10 columns, each row …

Total answers: 1

Schema for pyarrow.ParquetDataset > partition columns

Question: I have a pandas DataFrame:

    import pandas as pd
    df = pd.DataFrame(data={"col1": [1, 2], "col2": [3.0, 4.0], "col3": ["foo", "bar"]})

Using s3fs:

    from s3fs import S3FileSystem
    s3fs = S3FileSystem(**kwargs)

I can write this as a parquet dataset:

    import pyarrow as pa
    import pyarrow.parquet as pq
    tbl = pa.Table.from_pandas(df) …

Total answers: 2