Why does numpy.in1d() run much faster in my example than in my project?

Question:

The two tables I have to deal with in my project look like this:

TABLE1:

code_material yield
1000100001 1000
1000100002 500
1000100003 1024

where code_material is a ten-digit number whose value does not exceed 4e9.

TABLE2:

code_material_manufacturing input/output code_material qty
1000100001 1000154210 IN 100
1000100001 1123484257 IN 100
1000100001 1000100001 OUT 50

What I want to do is take each code_material from TABLE1 and then search the first column of TABLE2 to find its indices.
Some values of 'code_material' in TABLE1 and 'input/output' in TABLE2 may include a leading '-' character, like '-1000100001', indicating that they are half-finished products.

I use str as the dtype for both columns, and use np.in1d() like this:

# read data from excel
TABLE1 = pd.read_excel(path1,
                       dtype={'code_material': str, 'yield': float})
TABLE2 = pd.read_excel(path2,
                       dtype={'code_material_manufacturing': str,
                              'input/output': str,
                              'code_material': str, 'qty': float})

# convert to numpy array
_output = TABLE1['code_material'].to_numpy()
output = _output[np.char.count(_output, '-') == 0]  # drop half-finished products

table2 = TABLE2.to_numpy()

# some operations to help me find those outputs, not casts
_idx_notproduction = np.argwhere(table2[:, 2] == 'IN')
idx_notproduction = np.argwhere(np.in1d(output, table2[_idx_notproduction, 1]))

# operating segment
j = 0
output = output.tolist()

while j < len(output):
  production = output[j]
  idx_in_table2 = np.argwhere(table2[:, 0] == production)
  # find those input casts
  idx_input = idx_in_table2[:-1]  # sliced to prevent production from counting itself in
  input = table2[idx_input, 1][np.char.count(table2[idx_input, 1], '-') == 0]

  idx = np.in1d(input, table2[:, 0])  # here's the in1d that confuses me

  j += 1
  

It takes about 0.00423 s each time.

But when I tried a similar standalone example, I found that np.in1d() ran almost an order of magnitude faster than it did in my project (about 0.000563 s each time). Here is my example:

import time
import numpy as np

arr1 = np.random.randint(1, 3_000_000_000, (1, 5), dtype=np.int64)     # average of 5 codes per search
arr2 = np.random.randint(1, 3_000_000_000, (1, 1170), dtype=np.int64)  # len(TABLE2) = 1170 in my project
arr1, arr2 = arr1.astype(str), arr2.astype(str)
cost = 0
n_trials = 1000  # number of timed repetitions
for i in range(n_trials):
  s = time.perf_counter()  # for timing
  idx = np.in1d(arr1, arr2)
  cost += time.perf_counter() - s

print(cost / n_trials)

I would like to ask: what causes such a big speed difference between the two in1d() calls? Can that cause be used to optimize the code in my project to reach that speed?

Asked By: CangWangu


Answers:

Here is an answer built from previously posted comments, as requested by the OP:

The problem is almost certainly that TABLE2.to_numpy() results in a NumPy array containing pure-Python objects. Such objects are very inefficient (in both time and memory). You need to select one specific column and then convert it to a NumPy array; the operation you use will only be reasonably fast if all the dataframe columns are of the same type. Besides, note that comparing strings is expensive, as indicated by @hpaulj: to compare "1000100001" with another string, NumPy compares each of the 10 characters in a basic loop, while comparing one integer with another takes only about one instruction.
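A minimal sketch of the difference (the data here is hypothetical, mimicking TABLE2): a dataframe with mixed column types falls back to dtype=object when converted as a whole, while a single homogeneous column converts to a fast native array.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking TABLE2: one string column, one float column.
df = pd.DataFrame({
    'code_material_manufacturing': ['1000100001', '1000100001', '1000100001'],
    'qty': [100.0, 100.0, 50.0],
})

mixed = df.to_numpy()          # dtype=object: every cell is a boxed Python object
column = df['qty'].to_numpy()  # dtype=float64: contiguous machine values

print(mixed.dtype)   # object
print(column.dtype)  # float64
```

Any operation on the object array (comparisons, in1d) has to dispatch per element through the Python object machinery, which is what dominates the timing in the project.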

Besides, note that Pandas always stores strings as pure-Python objects. AFAIK, NumPy needs to update the reference count of each object and to acquire/release the GIL, not to mention that memory indirections are required and that strings are generally stored as Unicode, which tends to be more expensive (due to additional checks). All of this is far more expensive than comparing integers. Please reconsider the need to use strings. You can use a sentinel if needed (e.g. negative integers) and even map the integers to a set of predefined strings.
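As a sketch of the sentinel idea with this dataset's conventions: the leading '-' that marks half-finished products parses naturally as a sign, so the codes fit in int64 and negative values serve as the sentinel. The sample codes below are hypothetical.

```python
import numpy as np

# Hypothetical codes as read from Excel; a leading '-' marks half-finished.
codes_str = np.array(['1000100001', '-1000100002', '1000100003'])

# '-' is just a sign for int(), so the conversion keeps the sentinel for free.
codes = codes_str.astype(np.int64)

finished = codes[codes > 0]     # drop half-finished entries with one comparison
mask = np.in1d(finished, codes) # integer in1d: one machine comparison per pair

print(finished)  # [1000100001 1000100003]
print(mask)      # [ True  True]
```

This replaces both the per-character string comparisons and the `np.char.count` filtering with cheap integer operations.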

Last but not least, note that Pandas supports a dtype called category. It is usually significantly faster than plain strings when the number of unique strings is much smaller than the number of rows.
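For example, the 'input/output' column has only two unique values across ~1170 rows, so it is a natural candidate (the row counts here are illustrative):

```python
import pandas as pd

# 1170 rows, only 2 unique values: ideal for the category dtype.
s = pd.Series(['IN', 'IN', 'OUT'] * 390)
cat = s.astype('category')

# Comparisons now operate on small integer codes, not on strings.
mask = cat == 'IN'
print(mask.sum())  # 780
```

Internally a categorical column stores one small integer per row plus a lookup table of the unique strings, so equality tests like the `table2[:, 2] == 'IN'` step become integer comparisons.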

Answered By: Jérôme Richard