Make a 3D NumPy array using a for loop in Python

Question:

I have two-dimensional training data (200 results with 4 features each).

I ran 100 different applications with 10 repetitions each, resulting in 1000 CSV files.

I want to stack the results from all the CSV files for machine learning,
but I don’t know how.

Each of my CSV files looks like the one below.

test1.csv converted to a NumPy array:

[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]

I tried the Python code below.

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))
cnt=0
for f in csv_files:
    cnt +=1
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    if cnt==1:
        a = np.array(preprocess(f))
        b = np.array(app)
    else:
        a = np.vstack((a, np.array(preprocess(f))))
        b = np.append(b,app)
print(a)
print(b)

The preprocess function returns the df.to_numpy() result for each CSV file.

I expected a to have shape (1000, 200, 4), like this:

[[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]],
[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]],
...
[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]]

However, I’m getting a with shape (200000, 4):

[[0 'crc32_pclmul' 445 0]
 [0 'crc32_pclmul' 270 4096]
 [0 'crc32_pclmul' 234 8192]
 ...
 [249 'intel_pmt' 272 4096]
 [249 'intel_pmt' 224 8192]
 [249 'intel_pmt' 268 12288]]

I want to access each CSV file's results as a[0] through a[999], with each sub-array having shape (200, 4).
How can I solve this problem? I’m quite lost.

Asked By: mangosrk


Answers:

Create a new list outside the loop, and append each array to that list after reading its file.

Answered By: AKSHAY KUMAR

You have to switch from vstack to stack, collecting the per-file arrays in a list first:

la = []
lb = []
for f in csv_files:
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    la.append(preprocess(f))  # collect each per-file 2D array
    lb.append(app)
a = np.stack(la, axis=0)  # join along a new first axis
b = np.array(lb)

vstack only concatenates along the existing first (row) axis, whereas stack joins the arrays along a new axis.
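
To make the difference concrete, here is a minimal sketch using three stand-in arrays in place of the real CSV data. The name parts and the fill values are made up for illustration; only the (200, 4) shape is taken from the question.

```python
import numpy as np

# Three stand-in "files", each a (200, 4) array filled with its own index.
parts = [np.full((200, 4), i) for i in range(3)]

stacked_rows = np.vstack(parts)       # concatenates along the existing row axis
stacked_new = np.stack(parts, axis=0)  # adds a new leading axis

print(stacked_rows.shape)    # (600, 4)
print(stacked_new.shape)     # (3, 200, 4)
print(stacked_new[1].shape)  # (200, 4)
```

With the stacked result, stacked_new[i] gives back the i-th input array, which is the access pattern the question asks for.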

Answered By: MSS

Well, yes, that is what vstack (and append) does. They join arrays along an existing axis: the row axis for vstack, while append without an axis argument flattens its inputs first.

a1=np.arange(10).reshape(2,5)
# [[0,1,2,3,4],
#  [5,6,7,8,9]]
a2=np.arange(10,20).reshape(2,5)
# [[10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]
np.vstack((a1,a2))
# [[ 0,  1,  2,  3,  4],
#  [ 5,  6,  7,  8,  9],
#  [10, 11, 12, 13, 14],
#  [15, 16, 17, 18, 19]]

b1=np.arange(5)
b2=np.arange(5,10)
np.append(b1,b2)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

If you instead want to join those arrays along a new axis, you need to add that axis yourself, or use the more flexible stack.

np.vstack(([a1],[a2]))
#array([[[ 0,  1,  2,  3,  4],
#       [ 5,  6,  7,  8,  9]],
#
#      [[10, 11, 12, 13, 14],
#       [15, 16, 17, 18, 19]]])
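
The "more flexible stack" option can be sketched like this, reusing a1 and a2 from the example. The variable names s0, manual and s_last are only for illustration. np.stack creates the new axis itself, and its axis argument chooses where that axis goes.

```python
import numpy as np

a1 = np.arange(10).reshape(2, 5)
a2 = np.arange(10, 20).reshape(2, 5)

# stack adds the new axis for you (axis=0 is the default).
s0 = np.stack((a1, a2))

# Same result as adding the axis by hand and then calling vstack.
manual = np.vstack((a1[np.newaxis], a2[np.newaxis]))

# The new axis can also go elsewhere, for example at the end.
s_last = np.stack((a1, a2), axis=-1)

print(s0.shape)                    # (2, 2, 5)
print(np.array_equal(s0, manual))  # True
print(s_last.shape)                # (2, 5, 2)
```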

Or, for 1-D arrays, use vstack instead of append:

np.vstack((b1,b2))
#array([[0, 1, 2, 3, 4],
#       [5, 6, 7, 8, 9]])

But more importantly, you shouldn’t be doing this inside a loop in the first place. Each of those functions (stack, vstack, append) allocates a new array and copies all the data on every call.
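
A small check illustrates the copying. This is only a sketch with made-up arrays (a and grown are illustrative names): the result of vstack is a separate array that shares no memory with its input, so growing an array this way inside a loop copies everything accumulated so far on each iteration.

```python
import numpy as np

a = np.zeros((2, 4))
grown = np.vstack((a, np.ones((2, 4))))

print(grown is a)                  # False: a new array object
print(np.shares_memory(grown, a))  # False: the data was copied
print(a.shape, grown.shape)        # (2, 4) (4, 4): the input is unchanged
```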

It would probably be more efficient to append each np.array(preprocess(f)) and each app to a plain Python list, and call stack or vstack only once, after you’ve read them all.

Or, even better, append preprocess(f) and app directly to Python lists, and call np.array once on each complete list after the loop.

So, something like

la = []
lb = []
for f in csv_files:
    seperator = '_'
    app = os.path.basename(f).split(seperator, 1)[0]

    la.append(preprocess(f))  # plain list append, no array copy
    lb.append(app)
a = np.array(la)  # build the 3D array once, after the loop
b = np.array(lb)
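
For anyone who wants to try the pattern without the CSV files, here is a self-contained sketch. fake_preprocess is a hypothetical stand-in for the asker's preprocess(f); it is assumed to return an object-dtype (rows, 4) array mixing integers and strings, roughly what df.to_numpy() gives for such data. The sketch also shows that np.stack raises a ValueError when one input has a different shape, which makes a malformed file easy to spot.

```python
import numpy as np

def fake_preprocess(i, rows=200):
    # Hypothetical stand-in for preprocess(f): an object array of
    # shape (rows, 4) holding ints and a string column.
    out = np.empty((rows, 4), dtype=object)
    out[:, 0] = np.arange(rows)
    out[:, 1] = f"app{i}"
    out[:, 2] = 0
    out[:, 3] = 4096
    return out

la = [fake_preprocess(i) for i in range(5)]
a = np.array(la)
print(a.shape, a.dtype)  # (5, 200, 4) object
print(a[3][0, 1])        # app3

# A file with a different number of rows makes np.stack fail loudly.
la.append(fake_preprocess(5, rows=199))
try:
    np.stack(la)
    stack_failed = False
except ValueError:
    stack_failed = True
print(stack_failed)      # True
```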
Answered By: chrslg