Why does numpy.in1d() run faster in my example than in my project?
Question:
The two tables I have to deal with in my project look like this:
TABLE1:
code_material | yield |
---|---|
1000100001 | 1000 |
1000100002 | 500 |
1000100003 | 1024 |
where code_material is a ten-digit number whose value does not exceed 4e+9
TABLE2:
code_material_manufacturing | code_material | input/output | qty |
---|---|---|---|
1000100001 | 1000154210 | IN | 100 |
1000100001 | 1123484257 | IN | 100 |
1000100001 | 1000100001 | OUT | 50 |
What I want to do is take each code_material from TABLE1 and then search the first column of TABLE2 to find its index.
Some values of 'code_material' in TABLE1 and 'input/output' in TABLE2 may include the character '-', like '-1000100001', indicating that they are half-finished products.
I use str as the dtype for both columns, and np.in1d(), like this:
import numpy as np
import pandas as pd

# read data from excel
TABLE1 = pd.read_excel(path1,
                       dtype={'code_material': str, 'yield': float})
TABLE2 = pd.read_excel(path2,
                       dtype={'code_material_manufacturing': str, 'code_material': str,
                              'input/output': str, 'qty': float})
# convert to numpy arrays
_output = TABLE1['code_material'].to_numpy()
output = _output[np.char.count(_output, '-') == 0]  # remove half-finished products
table2 = TABLE2.to_numpy()
# some operations to help me find those outputs, not casts
_idx_notproduction = np.argwhere(table2[:, 2] == 'IN')
idx_notproduction = np.argwhere(np.in1d(output, table2[_idx_notproduction, 1]))
# operating segment
j = 0
output = output.tolist()
while j < len(output):
    production = output[j]
    idx_in_table2 = np.argwhere(table2[:, 0] == production)
    # find those input casts
    idx_input = idx_in_table2[:-1]  # sliced to prevent production from counting itself in
    inputs = table2[idx_input, 1][np.char.count(table2[idx_input, 1], '-') == 0]
    idx = np.in1d(inputs, table2[:, 0])  # here's the in1d that confuses me
    j += 1
It takes about 0.00423s each time.
But when I tried a similar standalone instance, I found that np.in1d() ran almost an order of magnitude faster than in my project (about 0.000563 s each call). Here is my example:
import time
import numpy as np

arr1 = np.random.randint(1, int(3e+9), (1, 5), dtype=np.int64)     # average of 5 codes per search
arr2 = np.random.randint(1, int(3e+9), (1, 1170), dtype=np.int64)  # len(TABLE2) = 1170 in my project
arr1, arr2 = arr1.astype(str), arr2.astype(str)
cost = 0
a = range(1000)  # number of timing runs; `a` was left undefined in the original snippet
for i in a:
    s = time.perf_counter()  # for timing
    idx = np.in1d(arr1, arr2)
    cost += time.perf_counter() - s
print(cost / len(a))
I would like to ask what causes such a big speed difference between the two in1d() calls. Is it possible to use this cause to optimize my project code up to that speed?
Answers:
Here is an answer built from previously posted comments, as requested by the OP:
The problem is most likely that TABLE2.to_numpy() results in a NumPy array containing pure-Python objects. Such objects are very inefficient (in both time and memory). You need to select one specific column and then convert it to a NumPy array; the operation you use will only be reasonably fast if all the dataframe columns have the same type. Besides, note that comparing strings is expensive, as indicated by @hpaulj: to compare "1000100001" with another string, NumPy compares each of the 10 characters in a basic loop, while comparing one integer with another takes only about one instruction.
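As an illustration of that first point, here is a small sketch with made-up data shaped like TABLE2 (the values are illustrative, not from the OP's files): a full-frame to_numpy() yields an object array because the columns have mixed types, while a single column can be converted to a native fixed-width NumPy string dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical data shaped like TABLE2 (names/values are illustrative)
df = pd.DataFrame({
    'code_material_manufacturing': ['1000100001', '1000100002'] * 585,
    'code_material': ['1000154210', '1123484257'] * 585,
    'input/output': ['IN', 'OUT'] * 585,
    'qty': [100.0, 50.0] * 585,
})

whole = df.to_numpy()                     # mixed columns -> dtype('O'), slow per-element ops
col = df['code_material'].to_numpy()      # still object: Pandas stores strings as Python objects
native = df['code_material'].to_numpy(dtype='U10')  # native fixed-width NumPy string dtype

print(whole.dtype, col.dtype, native.dtype)
```

np.in1d() on the `native` array avoids the per-object reference counting and indirection described above, which is essentially what the OP's synthetic example (int arrays cast with astype(str)) was benchmarking.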
Also note that Pandas always stores strings as pure-Python objects. AFAIK, NumPy then needs to update the reference count of each object and take care of locking/releasing the GIL, not to mention the memory indirections involved; strings are also generally stored as Unicode, which tends to be more expensive still (due to additional checks). All of this is far more expensive than comparing integers. Please reconsider the need for strings: you can use a sentinel if needed (e.g. negative integers) and even map the sentinels to a set of predefined strings.
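A hedged sketch of that integer route (illustrative values, not the OP's data): converting the codes to int64 turns the '-' prefix into a negative sentinel for half-finished products, and np.in1d() then compares machine integers instead of strings:

```python
import numpy as np

# Illustrative codes; '-' marks half-finished products
codes = np.array(['1000100001', '-1000100002', '1000100003'])

# astype(np.int64) keeps the sign, so '-...' becomes a negative sentinel
codes_int = codes.astype(np.int64)

haystack = np.array([1000100003, 1000154210, -1000100002], dtype=np.int64)
mask = np.in1d(codes_int, haystack)
print(mask)  # half-finished codes keep their sign, so they still match exactly
```

This only works because, as the question states, the codes never exceed 4e+9, which fits comfortably in int64.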
Last but not least, note that Pandas supports a dtype called category. It is usually significantly faster than plain strings when the number of unique strings is much smaller than the number of rows.
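A minimal sketch of the category route (again with illustrative data): the unique strings are stored once and each row holds only a small integer code, so membership tests like isin() operate on integers rather than Python string objects:

```python
import pandas as pd

# 1170 rows but only 2 distinct codes (illustrative)
s = pd.Series(['1000100001', '1000100002', '1000100001'] * 390, dtype='category')

print(len(s.cat.categories))  # unique strings stored once: 2
print(s.cat.codes.dtype)      # per-row integer codes: int8

# membership test on the categorical column
mask = s.isin(['1000100001'])
print(mask.sum())
```

The payoff here grows with repetition: the OP's manufacturing codes repeat heavily across TABLE2's 1170 rows, which is exactly the case category is designed for.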