Why does numpy.in1d() run faster in my example than in my project?
Question:
The two tables I have to deal with in my project look like this:
TABLE1:
code_material | yield |
---|---|
1000100001 | 1000 |
1000100002 | 500 |
1000100003 | 1024 |
where code_material is a ten-digit number whose value does not exceed 4e+9
TABLE2:
code_material_manufacturing | code_material | input/output | qty |
---|---|---|---|
1000100001 | 1000154210 | IN | 100 |
1000100001 | 1123484257 | IN | 100 |
1000100001 | 1000100001 | OUT | 50 |
What I want to do is take each code_material from TABLE1 and then search the first column of TABLE2 to find its index.
Some values of 'code_material' in TABLE1 and 'input/output' in TABLE2 may include the character '-', like '-1000100001', indicating that they are half-finished products.
I use str as the dtype for both columns, and np.in1d(), like this:
import numpy as np
import pandas as pd

# read data from excel
TABLE1 = pd.read_excel(path1,
                       dtype={'code_material': str, 'yield': float})
TABLE2 = pd.read_excel(path2,
                       dtype={'code_material_manufacturing': str, 'code_material': str,
                              'input/output': str, 'qty': float})
# convert to numpy arrays
_output = TABLE1['code_material'].to_numpy()
output = _output[np.char.count(_output, '-') == 0]  # remove half-finished products
table2 = TABLE2.to_numpy()
# some operations to help me find those outputs, not casts
_idx_notproduction = np.argwhere(table2[:, 2] == 'IN')
idx_notproduction = np.argwhere(np.in1d(output, table2[_idx_notproduction, 1]))
# operating segment
j = 0
output = output.tolist()
while j < len(output):
    production = output[j]
    idx_in_table2 = np.argwhere(table2[:, 0] == production)
    # find those input casts
    idx_input = idx_in_table2[:-1]  # sliced to prevent production from counting itself in
    inputs = table2[idx_input, 1][np.char.count(table2[idx_input, 1], '-') == 0]
    idx = np.in1d(inputs, table2[:, 0])  # here's the in1d that confuses me
    j += 1
It takes about 0.00423s each time.
But when I tried a similar standalone instance, I found that np.in1d() ran almost an order of magnitude faster than in my project (about 0.000563 s each call). Here is my example:
import time
import numpy as np

arr1 = np.random.randint(1, int(3e+9), (1, 5), dtype=np.int64)     # average of 5 codes per search
arr2 = np.random.randint(1, int(3e+9), (1, 1170), dtype=np.int64)  # len(TABLE2) = 1170 in my project
arr1, arr2 = arr1.astype(str), arr2.astype(str)
cost = 0
a = range(1000)  # number of timing runs; `a` was left undefined in the original snippet
for i in a:
    s = time.perf_counter()  # for timing
    idx = np.in1d(arr1, arr2)
    cost += time.perf_counter() - s
print(cost / len(a))
I would like to ask what causes such a big speed difference between the two in1d() calls. Is it possible to use this cause to optimize my project code up to that speed?
Answers:
Here is an answer built from previously posted comments, as requested by the OP:
The problem is most likely that TABLE2.to_numpy() results in a NumPy array containing pure-Python objects. Such objects are very inefficient (in both time and memory). You need to select one specific column and then convert it to a NumPy array; the operation you use will only be reasonably fast if all the dataframe columns have the same type. Besides, note that comparing strings is expensive, as indicated by @hpaulj: to compare "1000100001" with another string, NumPy compares each of the 10 characters in a basic loop, while comparing one integer with another takes only about one instruction.
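As an illustration of that first point, here is a small sketch with made-up data shaped like TABLE2 (the values are illustrative, not from the OP's files): a full-frame to_numpy() yields an object array because the columns have mixed types, while a single column can be converted to a native fixed-width NumPy string dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like TABLE2 (names/values are illustrative)
df = pd.DataFrame({
    'code_material_manufacturing': ['1000100001', '1000100002'] * 585,
    'code_material': ['1000154210', '1123484257'] * 585,
    'input/output': ['IN', 'OUT'] * 585,
    'qty': [100.0, 50.0] * 585,
})

whole = df.to_numpy()                     # mixed columns -> dtype('O'), slow per-element ops
col = df['code_material'].to_numpy()      # still object: Pandas stores strings as Python objects
native = df['code_material'].to_numpy(dtype='U10')  # native fixed-width NumPy string dtype

print(whole.dtype, col.dtype, native.dtype)
```

np.in1d() on the `native` array avoids the per-object reference counting and indirection described above, which is essentially what the OP's synthetic example (int arrays cast with astype(str)) was benchmarking.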
Also note that Pandas always stores strings as pure-Python objects. AFAIK, NumPy then needs to update the reference count of each object and take care of locking/releasing the GIL, not to mention the memory indirections involved; strings are also generally stored as Unicode, which tends to be more expensive still (due to additional checks). All of this is far more expensive than comparing integers. Please reconsider the need for strings: you can use a sentinel if needed (e.g. negative integers) and even map the sentinels to a set of predefined strings.
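A hedged sketch of that integer route (illustrative values, not the OP's data): converting the codes to int64 turns the '-' prefix into a negative sentinel for half-finished products, and np.in1d() then compares machine integers instead of strings:

```python
import numpy as np

# Illustrative codes; '-' marks half-finished products
codes = np.array(['1000100001', '-1000100002', '1000100003'])

# astype(np.int64) keeps the sign, so '-...' becomes a negative sentinel
codes_int = codes.astype(np.int64)

haystack = np.array([1000100003, 1000154210, -1000100002], dtype=np.int64)
mask = np.in1d(codes_int, haystack)
print(mask)  # half-finished codes keep their sign, so they still match exactly
```

This only works because, as the question states, the codes never exceed 4e+9, which fits comfortably in int64.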
Last but not least, note that Pandas supports a dtype called category. It is usually significantly faster than plain strings when the number of unique strings is much smaller than the number of rows.
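A minimal sketch of the category route (again with illustrative data): the unique strings are stored once and each row holds only a small integer code, so membership tests like isin() operate on integers rather than Python string objects:

```python
import pandas as pd

# 1170 rows but only 2 distinct codes (illustrative)
s = pd.Series(['1000100001', '1000100002', '1000100001'] * 390, dtype='category')

print(len(s.cat.categories))  # unique strings stored once: 2
print(s.cat.codes.dtype)      # per-row integer codes: int8

# membership test on the categorical column
mask = s.isin(['1000100001'])
print(mask.sum())
```

The payoff here grows with repetition: the OP's manufacturing codes repeat heavily across TABLE2's 1170 rows, which is exactly the case category is designed for.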