Create sub-directories and files from a pandas DataFrame

Question:

I have this DataFrame:

import pandas as pd

data = {'user': [7, 7, 7, 7, 7, 7, 7, 11, 11, 11],
 'session_id': [15, 15, 15, 15, 31, 31, 31, 43, 43, 43],
 'logtime': ['2016-04-13 07:58:40','2016-04-13 07:58:41','2016-04-13 07:58:42',
            '2016-04-13 07:58:43','2016-04-01 20:29:37','2016-04-01 20:29:42',
            '2016-04-01 20:29:47','2016-03-30 06:21:59','2016-03-30 06:22:04',
            '2016-03-30 06:22:09'],
 'lat': [41.1872084,41.1870716,41.1869719,41.1868664,41.1471521,
         41.1472466,41.1473038,41.2372125,41.2371444,41.2369725],
 'lon': [-8.6038931,-8.6037318,-8.6036908,-8.6036423,-8.5878757,
         -8.5874314,-8.586632,-8.6720773,-8.6721269,-8.6718833]}

d = pd.DataFrame(data)
d
   user  session_id              logtime        lat       lon
0     7          15  2016-04-13 07:58:40  41.187208 -8.603893
1     7          15  2016-04-13 07:58:41  41.187072 -8.603732
2     7          15  2016-04-13 07:58:42  41.186972 -8.603691
3     7          15  2016-04-13 07:58:43  41.186866 -8.603642
4     7          31  2016-04-01 20:29:37  41.147152 -8.587876
5     7          31  2016-04-01 20:29:42  41.147247 -8.587431
6     7          31  2016-04-01 20:29:47  41.147304 -8.586632
7    11          43  2016-03-30 06:21:59  41.237212 -8.672077
8    11          43  2016-03-30 06:22:04  41.237144 -8.672127
9    11          43  2016-03-30 06:22:09  41.236973 -8.671883

I want to:

  • Create a sub-directory (in the current working directory) for each user.

  • Within each user’s sub-directory, create one CSV file for each of that user’s sessions.

  • Write each session’s logtime, lat and lon (without the session ID) to its file, naming the files file1.csv, file2.csv, and so on.

  • Repeat for the next user until all users are processed.

Expected output

The final directory structure and file contents should look like this (file contents shown beneath each file name):

Data/
├── 11
│   └── file1.csv
│          logtime,lat,lon
│          2016-03-30 06:21:59,41.2372125,-8.6720773
│          2016-03-30 06:22:04,41.2371444,-8.6721269
│          2016-03-30 06:22:09,41.2369725,-8.6718833
└── 7
    ├── file1.csv
    │      logtime,lat,lon
    │      2016-04-13 07:58:40,41.1872084,-8.6038931
    │      2016-04-13 07:58:41,41.1870716,-8.6037318
    │      2016-04-13 07:58:42,41.1869719,-8.6036908
    │      2016-04-13 07:58:43,41.1868664,-8.6036423
    └── file2.csv
           logtime,lat,lon
           2016-04-01 20:29:37,41.1471521,-8.5878757
           2016-04-01 20:29:42,41.1472466,-8.5874314
           2016-04-01 20:29:47,41.1473038,-8.586632
Asked By: arilwan


Answers:

This could be done with os.makedirs and groupby:

import os

# Create the data folder if needed (relative to the current working
# directory); change the path to suit
base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)

# d is the DataFrame from the question
for (user_id, sess_id), data in d.groupby(['user', 'session_id']):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)

    # Drop the grouping columns so only logtime, lat, lon are written
    filename = f'{user_folder}/file_{sess_id}.csv'
    data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)

Note that this names each file after its session_id. If you want the numbering from the question instead, you can use two nested groupby loops, something like this:

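for user_id, user_data in d.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)

    # enumerate numbers each user's sessions starting from 1
    for file_id, (sess_id, data) in enumerate(user_data.groupby('session_id'), start=1):
        filename = f'{user_folder}/file{file_id}.csv'
        ...  # write data to filename as in the first loop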
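
For reference, the nested-loop idea can be written out in full roughly as follows. This is only a sketch: the helper name write_sessions, the shortened sample frame and the temporary base folder are assumptions made for illustration, and the file numbers follow ascending session_id order because groupby sorts its keys by default.

```python
import os
import tempfile

import pandas as pd


def write_sessions(frame, base_folder):
    """Write one CSV per session, numbered per user; return the paths written."""
    written = []
    for user_id, user_data in frame.groupby('user'):
        user_folder = os.path.join(base_folder, str(user_id))
        os.makedirs(user_folder, exist_ok=True)

        # enumerate(..., start=1) gives file1, file2, ... within each user
        sessions = user_data.groupby('session_id')
        for file_id, (sess_id, data) in enumerate(sessions, start=1):
            filename = os.path.join(user_folder, f'file{file_id}.csv')
            # Keep only logtime, lat, lon in the output file
            data.drop(columns=['user', 'session_id']).to_csv(filename, index=False)
            written.append(filename)
    return written


# Hypothetical, shortened sample in the same shape as the question's data
sample = pd.DataFrame({
    'user': [7, 7, 7, 11],
    'session_id': [15, 31, 31, 43],
    'logtime': ['2016-04-13 07:58:40', '2016-04-01 20:29:37',
                '2016-04-01 20:29:42', '2016-03-30 06:21:59'],
    'lat': [41.1872084, 41.1471521, 41.1472466, 41.2372125],
    'lon': [-8.6038931, -8.5878757, -8.5874314, -8.6720773],
})

# A throwaway folder keeps the demo out of the working directory
base = tempfile.mkdtemp()
written = write_sessions(sample, base)
print([os.path.relpath(p, base) for p in written])
```

Passing 'Data' (or any other path) as base_folder instead of the temporary directory would produce the layout shown in the question.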
Answered By: Quang Hoang

Another possible solution:

import os

# Create folders, assuming the current working directory as root
for folder in d['user'].unique():
    os.makedirs(str(folder), exist_ok=True)

# Number each user's sessions with ngroup(), then write one CSV per session.
# group_keys=False keeps 'user' out of the index so it can be grouped on again.
(d.groupby('user', group_keys=False)
  .apply(lambda x: x.assign(id=x.groupby('session_id').ngroup() + 1))
  .groupby(['user', 'session_id'])
  .apply(lambda y: y.iloc[:, 2:(len(y.columns) - 1)]
         .to_csv(os.path.join(
             os.getcwd(), str(y['user'].unique()[0]),
             f'file{y.id.unique()[0]}.csv'), index=False)))
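
The per-user file number is the key step in the expression above. As a hedged alternative to ngroup() inside apply, the same numbering can probably be obtained with a grouped dense rank, which ranks each session_id within its user. The frame name sessions and the column name file_no are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical, shortened sample with the question's grouping columns
sessions = pd.DataFrame({
    'user': [7, 7, 7, 11],
    'session_id': [15, 31, 31, 43],
})

# Dense rank within each user: the smallest session_id gets 1, the next 2, ...
file_no = (sessions.groupby('user')['session_id']
           .rank(method='dense')
           .astype(int))

print(sessions.assign(file_no=file_no))
```

With the number stored as an ordinary column, the files can then be written in a plain loop over groupby(['user', 'file_no']).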
Answered By: PaulS