How do I group NAICS Industry codes under one broader two digit subgroup to get the total # of loans within that one industry in Python?

Question:

I have a dataset with a variable, NAICS Industry, represented by a six-digit number. I want to narrow this number down to its first two digits so I can combine industries for a broader view. Once the industry code is narrowed to two digits instead of six, I want to use value counts to count the total number of loans that fall within each NAICS industry code. Can someone please help? I have attached pictures for reference.

Reference of NAICS Industry codes. As you can see, some of the codes share the same first two digits; I want to group these codes under one broader subgroup to get the total number of loans within that industry.

Asked By: Nateisha


Answers:

The best approach depends on the data type of the NAICS data (which I can’t tell from the screenshot alone) and assumptions about the number of digits.

Assuming that the dataset contains only six-digit NAICS codes in integer format (that is, df['NAICS'].dtype is int64 or similar), the first two digits can be obtained by dividing the NAICS code by 10000 using integer division:

df['NAICS_sector'] = df['NAICS'] // 10000

Note that you must use // (integer division) and not / (floating-point division).
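
To see the difference between the two operators, here is a minimal sketch. The sample codes are made up for illustration, and the column name NAICS is the one assumed in this answer; your dataset may use a different name.

```python
import pandas as pd

# Hypothetical sample data; replace with your own dataframe.
df = pd.DataFrame({"NAICS": [111110, 111120, 236115, 541511]})

# Integer division drops the last four digits and keeps an integer dtype.
df["NAICS_sector"] = df["NAICS"] // 10000

# True division keeps the fractional part and returns floats instead,
# which would not group codes into sectors.
as_float = df["NAICS"] / 10000

print(df)
print(as_float)
```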

If the NAICS codes are in the dataframe in string format (that is, df['NAICS'].dtype says object), you can use string manipulation instead:

df['NAICS_sector'] = df['NAICS'].str.slice(stop=2)

Setting stop=2 means that the first two characters are returned from each entry. The parameters of the slice method are explained in the official Pandas documentation.

Finally, if your dataset contains integers but you cannot guarantee that they all have the same number of digits, use string manipulation anyway: convert the column to strings (for example with astype(str)) and then apply the string slice shown above.
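
A rough sketch of that conversion, assuming the column holds integers with no missing values (with missing values pandas usually stores the column as floats, and the string form would then include a decimal point). The sample codes are invented for illustration.

```python
import pandas as pd

# Hypothetical data mixing 2-, 4- and 6-digit codes stored as integers.
df = pd.DataFrame({"NAICS": [11, 2361, 541511, 541512]})

# Convert to strings first, then keep the first two characters.
# Note that the resulting sector column holds strings, not integers.
df["NAICS_sector"] = df["NAICS"].astype(str).str.slice(stop=2)

print(df)
```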

After all this is done, you can group using the new NAICS_sector column.
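
As one possible way to do the counting, the sketch below assumes that each row of the dataframe is one loan, so counting rows per sector gives the number of loans. The data is made up, and value_counts is used because the question mentions it; groupby with size gives the same totals.

```python
import pandas as pd

# Hypothetical data: one row per loan, six-digit integer NAICS codes.
df = pd.DataFrame({"NAICS": [111110, 111120, 236115, 541511, 541512, 541611]})
df["NAICS_sector"] = df["NAICS"] // 10000

# Number of loans per two-digit sector, sorted from most to fewest loans.
loans_per_sector = df["NAICS_sector"].value_counts()

# Same totals via groupby, sorted by sector code instead.
loans_per_sector_gb = df.groupby("NAICS_sector").size()

print(loans_per_sector)
print(loans_per_sector_gb)
```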

Answered By: nanofarad