dask-dataframe

populate SQL database with dask dataframe and dump into a file

Question: reproduce the error and the use case in this colab. I have multiple large tables that I read and analyze through Dask (dataframe). After doing the analysis, I would like to push them into a local database (in this case an sqlite engine through sqlalchemy …

Total answers: 1

dask-ml preprocessing raises AttributeError

Question: I use Dask dataframe and dask-ml to manipulate my data. When I use the dask-ml MinMaxScaler, I get this error. Is there a way to prevent this error and make it work? import dask.dataframe as dd from dask_ml.preprocessing import MinMaxScaler df = dd.read_csv('path to csv', parse_dates=['CREATED_AT'], dtype={'ODI_UPDATED_AT': 'object'}) scaler …

Total answers: 1

DASK: merge throws error when one side's key is NA whereas pd.merge works

Question: I have these sample dataframes: tdf1 = pd.DataFrame([{"id": 1, "val": 4}, {"id": 2, "val": 5}, {"id": 3, "val": 6}, {"id": pd.NA, "val": 7}, {"id": 4, "val": 8}]) tdf2 = pd.DataFrame([{"some_id": 1, "name": "Josh"}, {"some_id": 3, "name": "Jake"}]) pd.merge(tdf1, tdf2, how="left", left_on="id", …

Total answers: 1

Create a Dataframe in Dask

Question: I'm just starting to use Dask as a possible replacement (?) for pandas. The first thing that hit me is that I can't seem to find a way to create a dataframe from a couple of lists/arrays. In regular pandas I just do: pd.DataFrame({'a': a, 'b': b, …}) but I can't find an equivalent way …

Total answers: 1

Create multiple CSVs in Dask with different names

Question: I am using dask.dataframe.to_csv to write CSVs. I was expecting 13 CSVs, but it writes only 2 (overwriting the old ones, which I do not want): for i in range(13): df_temp=df1.groupby("temp").get_group(unique_cases[i]) df_temp.to_csv(path_store_csv+"*.csv") I also tried this but it did not work: for i …

Total answers: 1

Creating and Merging Multiple Datasets Does Not Fit Into Memory, Use Dask?

Question: I'm not quite sure how to ask this question, but I need some clarification on how to make use of Dask's ability to "handle datasets that don't fit into memory", because I'm a little confused on how it works from the CREATION …

Total answers: 1

randomly accessing a row of Dask dataframe is taking a long time

Question: I have a Dask dataframe of 100 million rows of data. I am trying to iterate over this dataframe without loading the entire thing into RAM. As an experiment, I try to access the row with index equal to 1: %time dask_df.loc[1].compute() The time …

Total answers: 4

Dealing with huge pandas data frames

Question: I have a huge database (of 500GB or so) and was able to put it in pandas. The database contains something like 39705210 observations. As you can imagine, Python has a hard time even opening it. Now, I am trying to use Dask in order to export it to …

Total answers: 1

How does dask know variable states before it runs map_partitions?

Question: In the dask code below I set x to 1 and 2 right before executing two map_partitions calls. The result seems fine; however, I don't fully understand it. If dask waits to run the two map_partitions only when it finds the compute(), and at the …

Total answers: 1

Replacing existing column in dask map_partitions gives SettingWithCopyWarning

Question: I'm replacing column id2 in a dask dataframe using map_partitions. The result is that the values are replaced, but with a pandas warning. What is this warning, and how do I apply the .loc suggestion in the example below? pdf = pd.DataFrame({ 'dummy2': [10, 10, 10, 20, …

Total answers: 1