bigdata

Efficiently editing large input file based on simple lookup with python dataframes

Question: I have a very large txt file (currently 6 GB, 50M rows) with a structure like this…
id amount batch transaction sequence
a2asd 12.6 123456 12394891237124 0
bs9dj 0.6 123456 12394891237124 1
etc…
I read the file like this… inputFileDf = pd.read_csv(filename, header=None, …

Total answers: 1

How to use multiprocessing Pool when evaluating many images using scikit-learn pipeline?

Question: I used a GridSearchCV pipeline to train several different image classifiers in scikit-learn. The pipeline has two stages, a scaler and a classifier. The training ran successfully, and this turned out to be the best hyper-parameter setting: Pipeline(steps=[('scaler', MinMaxScaler()), ('classifier', …

Total answers: 2

Transform spark data frame

Question: I have a data frame in Spark with the following format.
+----------+---------+
|Column 1  | Values  |
+----------+---------+
| A        | value1  |
| B        | value2  |
| C        | value2  |
| A        | value1  |
| B        | value3  |
| C        | value1  |
| A …

Total answers: 1

Memory Error when parsing a large number of files

Question: I am parsing 6k CSV files to merge them into one. I need this for their joint analysis and for training an ML model. There are too many files, and my computer ran out of memory when I simply concatenated them. S = '' for f …

Total answers: 1
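A minimal sketch of one way to merge many CSV files without holding them all in memory: stream each file row by row into a single output file. This is not the answer from the original thread. The function name `merge_csv_files` is hypothetical, and the sketch assumes every input file starts with the same header row, which the excerpt does not confirm.

```python
import csv
import os
import tempfile


def merge_csv_files(paths, out_path):
    """Append the rows of every CSV in `paths` to `out_path`, one row at a time.

    Assumption: every input file begins with the same header row.
    Returns the number of data rows written (header excluded).
    """
    header_written = False
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # skip empty files
                if not header_written:
                    writer.writerow(header)  # keep only the first header
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written


# Small demonstration with two temporary files.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for index, body in enumerate(["x,y\n1,2\n3,4\n", "x,y\n5,6\n"]):
    demo_path = os.path.join(demo_dir, f"part{index}.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write(body)
    demo_paths.append(demo_path)

merged_path = os.path.join(demo_dir, "merged.csv")
total_rows = merge_csv_files(demo_paths, merged_path)
```

Because only one row is held at a time, memory use stays flat regardless of how many files are merged; the trade-off is that no validation or type conversion happens along the way.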

How to get names of scheduled queries in bigquery

Question: Using a Python client to connect to BigQuery, how can we get the names of all the scheduled queries present in that project? I tried following this link: https://cloud.google.com/bigquery/docs/reference/datatransfer/libraries but found no information on the names of the scheduled queries. Asked By: Runtime …

Total answers: 1

How to make python for loops faster

Question: I have a list of dictionaries, like this:
[{'user': '123456', 'db': 'db1', 'size': '8628'}
{'user': '123456', 'db': 'db1', 'size': '7168'}
{'user': '123456', 'db': 'db1', 'size': '38160'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db3', 'size': '8628'}
{'user': '222345', 'db': 'db5', 'size': '840'}
{'user': '34521', 'db': …

Total answers: 4

Creating a long masking list (Python)

Question: Here is what I have:
long_list = a very long list of integer values (6M+ entries)
wanted_list = a list of integer values that are of interest (70K entries)
What I need:
mask_list = a list of booleans of the same length as long_list, describing whether each element …

Total answers: 3
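A minimal sketch of one common approach to this kind of mask, not taken from the thread's answers. The excerpt is cut off, so this assumes the mask should mark whether each element of `long_list` appears in `wanted_list`. The function name `build_mask` is hypothetical.

```python
def build_mask(long_list, wanted_list):
    """Return one boolean per element of long_list.

    Assumption: True means the element occurs in wanted_list.
    Converting wanted_list to a set makes each membership test
    roughly constant time instead of a scan of the whole list.
    """
    wanted = set(wanted_list)
    return [value in wanted for value in long_list]


# Tiny stand-ins for the 6M-entry and 70K-entry lists in the question.
mask_list = build_mask([5, 3, 9, 3, 7], [3, 7])
```

With the set built once up front, the total work grows with the combined length of the two lists rather than with their product.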

Summing numbers in two different .txt files in Python

Question: I am currently trying to sum two .txt files, each containing over 35 million values, and put the result in a third file.
File 1: 2694.28 2694.62 2694.84 2695.17
File 2: 1.483429484776452 2.2403221757269196 1.101004844694236 1.6119626937837102
File 3: 2695.76343 2696.86032 2695.941 2696.78196 …

Total answers: 2
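A minimal sketch of summing two such files line by line, so that neither file has to be loaded in full. It is not the answer from the thread. It assumes one value per line in both inputs, and the rounding to five decimals is inferred from the File 3 sample above rather than stated in the question. The function name `sum_files` is hypothetical.

```python
import os
import tempfile


def sum_files(path_a, path_b, path_out, ndigits=5):
    """Write the line-by-line sum of two number files to path_out.

    Assumptions: one value per line in each input; results rounded to
    `ndigits` decimals. zip() stops at the shorter file, so surplus
    lines in the longer one are ignored.
    Returns the number of sums written.
    """
    count = 0
    with open(path_a) as file_a, open(path_b) as file_b, open(path_out, "w") as out:
        for line_a, line_b in zip(file_a, file_b):
            total = float(line_a) + float(line_b)
            out.write(f"{round(total, ndigits)}\n")
            count += 1
    return count


# Demonstration using the first two sample values from the question.
work_dir = tempfile.mkdtemp()
first_path = os.path.join(work_dir, "file1.txt")
second_path = os.path.join(work_dir, "file2.txt")
result_path = os.path.join(work_dir, "file3.txt")
with open(first_path, "w") as handle:
    handle.write("2694.28\n2694.62\n")
with open(second_path, "w") as handle:
    handle.write("1.483429484776452\n2.2403221757269196\n")

lines_written = sum_files(first_path, second_path, result_path)
```

Iterating over the file objects directly reads one line at a time, which keeps memory use small even for tens of millions of values.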

Masking the email and phone number in PySpark

Question: I want to mask the email: the first and last characters before '@' remain unmasked and the rest should be masked. For the phone number, the first and last digits remain unmasked and the rest are masked. Asked By: Kishan Yadav …

Total answers: 4
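A minimal plain-Python sketch of the masking rule described in the question, not the PySpark solution from the thread. The function names are hypothetical, and `*` as the mask character is an assumption. To apply this to a Spark column, the functions would still need to be wrapped in a UDF or rewritten with column expressions, which is not shown here.

```python
def mask_middle(text, mask_char="*"):
    """Keep the first and last character and mask everything between."""
    if len(text) <= 2:
        return text  # nothing lies between the first and last character
    return text[0] + mask_char * (len(text) - 2) + text[-1]


def mask_email(email, mask_char="*"):
    """Mask the part before '@', keeping its first and last character."""
    local, separator, domain = email.partition("@")
    if not separator:
        return email  # not an email address, leave unchanged
    return mask_middle(local, mask_char) + separator + domain


def mask_phone(phone, mask_char="*"):
    """Mask every digit except the first and last; keep other characters."""
    digit_positions = [i for i, ch in enumerate(phone) if ch.isdigit()]
    if len(digit_positions) <= 2:
        return phone
    inner = set(digit_positions[1:-1])
    return "".join(mask_char if i in inner else ch for i, ch in enumerate(phone))


masked_email = mask_email("john.doe@example.com")
masked_phone = mask_phone("9876543210")
```

Leaving separators such as spaces or hyphens in the phone number untouched is a design choice made for this sketch; the question does not say how they should be handled.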

Merge two different dataframes in pyspark

Question: I have two different dataframes, one is date combinations, and one is city pairs:
df_date_combinations:
+-------------------+-------------------+
|            fs_date|            ss_date|
+-------------------+-------------------+
|2022-06-01T00:00:00|2022-06-02T00:00:00|
|2022-06-01T00:00:00|2022-06-03T00:00:00|
|2022-06-01T00:00:00|2022-06-04T00:00:00|
+-------------------+-------------------+
city pairs:
+---------+--------------+---------+--------------+
|fs_origin|fs_destination|ss_origin|ss_destination|
+---------+--------------+---------+--------------+
|      TLV|           NYC|      NYC|           TLV|
|      TLV|           ROM|      ROM|           TLV|
|      TLV|           BER|      BER|           TLV|
+---------+--------------+---------+--------------+
I want to …

Total answers: 2