Create sub-directories and files from a pandas DataFrame
Question:
Having this dataframe at hand:
import pandas as pd

data = {'user': [7, 7, 7, 7, 7, 7, 7, 11, 11, 11],
        'session_id': [15, 15, 15, 15, 31, 31, 31, 43, 43, 43],
        'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41', '2016-04-13 07:58:42',
                    '2016-04-13 07:58:43', '2016-04-01 20:29:37', '2016-04-01 20:29:42',
                    '2016-04-01 20:29:47', '2016-03-30 06:21:59', '2016-03-30 06:22:04',
                    '2016-03-30 06:22:09'],
        'lat': [41.1872084, 41.1870716, 41.1869719, 41.1868664, 41.1471521,
                41.1472466, 41.1473038, 41.2372125, 41.2371444, 41.2369725],
        'lon': [-8.6038931, -8.6037318, -8.6036908, -8.6036423, -8.5878757,
                -8.5874314, -8.586632, -8.6720773, -8.6721269, -8.6718833]}
d = pd.DataFrame(data)
d
   user  session_id              logtime        lat       lon
0     7          15  2016-04-13 07:58:40  41.187208 -8.603893
1     7          15  2016-04-13 07:58:41  41.187072 -8.603732
2     7          15  2016-04-13 07:58:42  41.186972 -8.603691
3     7          15  2016-04-13 07:58:43  41.186866 -8.603642
4     7          31  2016-04-01 20:29:37  41.147152 -8.587876
5     7          31  2016-04-01 20:29:42  41.147247 -8.587431
6     7          31  2016-04-01 20:29:47  41.147304 -8.586632
7    11          43  2016-03-30 06:21:59  41.237212 -8.672077
8    11          43  2016-03-30 06:22:04  41.237144 -8.672127
9    11          43  2016-03-30 06:22:09  41.236973 -8.671883
And I want to:
- Create a sub-directory (in the current working directory) for each user.
- Within each user's sub-directory, create one CSV file for each session of that user.
- Write each session's logtime, lat and lon (without the session ID) to its file, naming the files file1.csv, file2.csv, etc.
- Move on to the next user, until all users are processed.
Expected output
The final directory structure and file contents should look like this (file contents shown inline):
Data/
├── 11
│   └── file1.csv
│         logtime,lat,lon
│         2016-03-30 06:21:59,41.2372125,-8.6720773
│         2016-03-30 06:22:04,41.2371444,-8.6721269
│         2016-03-30 06:22:09,41.2369725,-8.6718833
└── 7
    ├── file1.csv
    │     logtime,lat,lon
    │     2016-04-13 07:58:40,41.187208,-8.603893
    │     2016-04-13 07:58:41,41.187072,-8.603732
    │     2016-04-13 07:58:42,41.186972,-8.603691
    │     2016-04-13 07:58:43,41.186866,-8.603642
    └── file2.csv
          logtime,lat,lon
          2016-04-01 20:29:37,41.147152,-8.587876
          2016-04-01 20:29:42,41.147247,-8.587431
          2016-04-01 20:29:47,41.147304,-8.586632
Answers:
This can be done with os.makedirs and groupby:
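import os

# make the data folder if needed, change the path if needed
# (relative path, so it is created in the current working directory)
base_folder = 'Data'
os.makedirs(base_folder, exist_ok=True)

# d is the DataFrame from the question; one group per (user, session) pair
for (user_id, sess_id), data in d.groupby(['user', 'session_id']):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    filename = f'{user_folder}/file_{sess_id}.csv'
    # drop the grouping columns so only logtime, lat, lon are written
    data.drop(['user', 'session_id'], axis=1).to_csv(filename, index=False)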
Note that this names each file after its session_id. To number the files per user instead, as in the question, use two nested groupby calls, something like this:
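for user_id, user_data in d.groupby('user'):
    user_folder = f'{base_folder}/{user_id}'
    os.makedirs(user_folder, exist_ok=True)
    # enumerate supplies the per-user file counter, starting at 1
    for file_id, (sess_id, data) in enumerate(user_data.groupby('session_id'), start=1):
        filename = f'{user_folder}/file{file_id}.csv'
        ...  # write data to filename as in the first loop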
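For completeness, here is a minimal self-contained sketch of the nested-groupby idea above. The helper name write_session_files is made up for illustration, the sample frame is just a few rows taken from the question, and the output goes to a temporary directory (an assumption made here so that nothing in the working directory is overwritten).

```python
import os
import tempfile

import pandas as pd


def write_session_files(df, base_folder):
    """Write one CSV per session into a sub-directory per user.

    Hypothetical helper: files are numbered file1.csv, file2.csv, ...
    within each user, in ascending session_id order.
    """
    written = []
    for user_id, user_data in df.groupby('user'):
        user_folder = os.path.join(base_folder, str(user_id))
        os.makedirs(user_folder, exist_ok=True)
        # enumerate(..., start=1) supplies the per-user file counter
        sessions = user_data.groupby('session_id')
        for file_id, (sess_id, data) in enumerate(sessions, start=1):
            filename = os.path.join(user_folder, f'file{file_id}.csv')
            # keep only logtime, lat, lon in the output
            data.drop(columns=['user', 'session_id']).to_csv(filename, index=False)
            written.append(filename)
    return written


# A few rows from the question, enough to cover two users and three sessions
sample = pd.DataFrame({
    'user': [7, 7, 7, 11],
    'session_id': [15, 15, 31, 43],
    'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41',
                '2016-04-01 20:29:37', '2016-03-30 06:21:59'],
    'lat': [41.1872084, 41.1870716, 41.1471521, 41.2372125],
    'lon': [-8.6038931, -8.6037318, -8.5878757, -8.6720773],
})

out_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
paths = write_session_files(sample, out_dir)
```

Because groupby sorts its keys by default, the numbering follows ascending session_id. If the files should instead follow the order in which sessions first appear in the data, passing sort=False to the inner groupby is one option.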
Another possible solution:
import os

# Create folders, assuming current working directory as root
for folder in d['user'].unique():
    os.makedirs(str(folder), exist_ok=True)

# Number each user's sessions (id = 1, 2, ...), then write one file per session.
# group_keys=False keeps 'user' out of the index, so the second groupby is unambiguous.
(d.groupby('user', group_keys=False)
  .apply(lambda x: x.assign(id=x.groupby('session_id').ngroup() + 1))
  .groupby(['user', 'session_id'])
  .apply(lambda y: y.iloc[:, 2:(len(y.columns) - 1)]
                    .to_csv(os.path.join(os.getcwd(), str(y['user'].unique()[0]),
                                         f'file{y.id.unique()[0]}.csv'),
                            index=False)))
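The per-user numbering can also be computed without a nested apply. The sketch below is a variation, not the method used above: it swaps in a dense rank of session_id within each user to get the file number, then writes the files in a plain loop. The column name file_no, the small sample frame (rows taken from the question) and the temporary output directory are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# A few rows from the question, enough to cover two users and three sessions
sample = pd.DataFrame({
    'user': [7, 7, 7, 11],
    'session_id': [15, 15, 31, 43],
    'logtime': ['2016-04-13 07:58:40', '2016-04-13 07:58:41',
                '2016-04-01 20:29:37', '2016-03-30 06:21:59'],
    'lat': [41.1872084, 41.1870716, 41.1471521, 41.2372125],
    'lon': [-8.6038931, -8.6037318, -8.5878757, -8.6720773],
})

# Dense rank of session_id within each user:
# 1 for the user's smallest session_id, 2 for the next, and so on
sample['file_no'] = (sample.groupby('user')['session_id']
                           .rank(method='dense')
                           .astype(int))

out_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
for (user_id, file_no), rows in sample.groupby(['user', 'file_no']):
    user_folder = os.path.join(out_dir, str(user_id))
    os.makedirs(user_folder, exist_ok=True)
    # select the output columns by name rather than by position
    rows[['logtime', 'lat', 'lon']].to_csv(
        os.path.join(user_folder, f'file{file_no}.csv'), index=False)
```

Selecting the columns by name avoids depending on their position in the frame, which the iloc slice relies on.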