Write dataframe groupby id and date into a one csv file

Question:

Following my previous question on dynamicallay creating sub-folders and writing files to sub-folders, I realised I needed further help, having applied to real dataset.

Suppose this is the dataframe that I have:

data = {'user': [7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 11, 11, 11], 
 'session_id': [15, 15, 15, 15, 31, 31, 31, 20, 20,
                 20, 43, 43, 43], 
 'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41','2016-04-13 07:58:42',
             '2016-04-13 07:58:43', '2016-04-01 20:29:37', '2016-04-01 20:29:42', 
             '2016-04-01 20:29:47', '2016-04-13 21:04:27', '2016-04-13 21:04:28', 
             '2016-04-13 21:04:29', '2016-03-30 06:21:59', '2016-03-30 06:22:04', 
             '2016-03-30 06:22:09'], 
 'lat': [41.1872084, 41.1870716, 41.1869719, 41.1868664, 41.1471521, 41.1472466, 
          41.1473038, 40.0121007, 40.0121084, 40.0121534, 41.2372125, 41.2371444, 41.2369725], 
 'lon': [-8.6038931, -8.6037318, -8.6036908, -8.6036423, -8.5878757, -8.5874314, -8.586632, 
         -8.5992162, -8.5992024, -8.5991788, -8.6720773, -8.6721269, -8.6718833]}

df = pd.DataFrame(data)
df
    user    session_id  logtime               lat          lon
0   7            15    2016-04-13 07:58:40  41.187208   -8.603893
1   7            15    2016-04-13 07:58:41  41.187072   -8.603732
2   7            15    2016-04-13 07:58:42  41.186972   -8.603691
3   7            15    2016-04-13 07:58:43  41.186866   -8.603642
4   7            31    2016-04-01 20:29:37  41.147152   -8.587876
5   7            31    2016-04-01 20:29:42  41.147247   -8.587431
6   7            31    2016-04-01 20:29:47  41.147304   -8.586632
7   7            20    2016-04-13 21:04:27  40.012101   -8.599216
8   7            20    2016-04-13 21:04:28  40.012108   -8.599202
9   7            20    2016-04-13 21:04:29  40.012153   -8.599179
10  11           43    2016-03-30 06:21:59  41.237212   -8.672077
11  11           43    2016-03-30 06:22:04  41.237144   -8.672127
12  11           43    2016-03-30 06:22:09  41.236973   -8.671883

I am re-organising this dataframe by creating sub-folder for each user. And then creating a CSV file in that user sub-folder. The file should contain user’s session log,logtime, lat, lon. However, I want to write sessions I user covered in the same day into 1 file like file1.csv.

This answer to my previous question is fine:

import os
import pandas as pd

df = pd.read_csv('mydata.csv', parse_dates=['logtime'])

# make base dir
base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)

for user_id, user_data in df.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)

    for file_id, (sess_id, data) in enumerate(user_data.groupby(['session_id'])):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)

Which creates 1 file for each session of a user, resulting in dir structure:

$ tree  Data/
Data/
├── 11
│   └── file1.csv
└── 7
    ├── file1.csv
    ├── file2.csv
    └── file3.csv

2 directories, 4 files

File contents:

$ cat Data/7/file1.csv 
logtime,lat,lon
2016-04-13 07:58:40,41.1872084,-8.6038931
2016-04-13 07:58:41,41.1870716,-8.6037318
2016-04-13 07:58:42,41.1869719,-8.6036908
2016-04-13 07:58:43,41.1868664,-8.6036423

$ cat Data/7/file2.csv 
logtime,lat,lon
2016-04-13 21:04:27,40.0121007,-8.5992162
2016-04-13 21:04:28,40.0121084,-8.5992024
2016-04-13 21:04:29,40.0121534,-8.5991788

$ cat Data/7/file3.csv 
logtime,lat,lon
2016-04-01 20:29:37,41.1471521,-8.5878757
2016-04-01 20:29:42,41.1472466,-8.5874314
2016-04-01 20:29:47,41.1473038,-8.586632

Since session 15 and 20 of user 7 were covered in one day, I would combined these into file1.csv altogether, maintaining the time order (logs of session 15 then 20).

To do this, I modified the code above to:

for ..
     # now group by 'session_id', 'logtime'
    for file_id, (sess_id, data) in enumerate(user_data.groupby(['session_id', 'logtime'])):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)

Giving:

$ tree Data/
Data/
├── 11
│   ├── file1.csv
│   ├── file2.csv
│   └── file3.csv
└── 7
    ├── file10.csv
    ├── file1.csv
    ├── file2.csv
    ├── file3.csv
    ├── file4.csv
    ├── file5.csv
    ├── file6.csv
    ├── file7.csv
    ├── file8.csv
    └── file9.csv

2 directories, 13 files

Creating 1 file for each row of user 7.

Required:

$ tree  Data/
Data/
├── 11
│   └── file1.csv
|   |       2016-03-30 06:21:59,41.237212,-8.672077
|   |       2016-03-30 06:22:04,41.237144,-8.672127
|   |       2016-03-30 06:22:09,41.236973,-8.671883
└── 7
    ├── file1.csv
    |       2016-04-13 07:58:40,41.187208,-8.603893
    |       2016-04-13 07:58:41,41.187072,-8.603732
    |       2016-04-13 07:58:42,41.186972,-8.603691
    |       2016-04-13 07:58:43,41.186866,-8.603642
    |       2016-04-13 21:04:27,40.012101,-8.599216
    |       2016-04-13 21:04:28,40.012108,-8.599202
    |       2016-04-13 21:04:29,40.012153,-8.599179
    └── file2.csv
            2016-04-01 20:29:37,41.147152,-8.587876
            2016-04-01 20:29:42,41.147247,-8.587431
            2016-04-01 20:29:47,41.147304,-8.586632

2 directories, 3 files
Asked By: arilwan

||

Answers:

This should solve your problem. You can’t combine two sessions at one day because you groupby session_id so in each iteration you only can handle one session. Since you are only interested in sessions that took place at the same day, groupby the day in the 2nd step.

base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)

for user_id, user_data in df.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    
    for file_id, (sess_id, data) in enumerate(user_data.groupby(user_data['logtime'].dt.date)):
        filename = f'{user_folder}/file{file_id + 1}.csv'
        data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)
Answered By: Rabinzel
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.