How do I group NAICS industry codes under a broader two-digit subgroup to get the total number of loans within each industry in Python?
Question:
I have a dataset with a variable, NAICS Industry, represented by a six-digit number. I want to narrow this number down to its first two digits so I can combine industries for a broader view. Once the industry code is narrowed down to two digits instead of six, I want to use value counts to count the total number of loans that fall within each NAICS industry code. Can someone please help? I have attached pictures for reference.
Answers:
The best approach depends on the data type of the NAICS data (which I can’t tell from the screenshot alone) and assumptions about the number of digits.
Assuming that the dataset contains only six-digit NAICS codes in integer format (that is, df['NAICS'].dtype is int64 or similar), the first two digits can be obtained by dividing the NAICS code by 10000 using integer division:
df['NAICS_sector'] = df['NAICS'] // 10000
Note that you must use // (integer division) and not / (floating-point division), which would return a float such as 54.1511 instead of 54.
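As a minimal sketch of the integer-division approach, assuming every code is a six-digit integer. The sample values below are made up for illustration, and the column name NAICS simply follows the snippet above:

```python
import pandas as pd

# Hypothetical sample data; replace with your own dataframe.
df = pd.DataFrame({"NAICS": [541511, 522110, 236220, 541611]})

# Integer division by 10000 drops the last four digits,
# leaving only the leading two-digit sector.
df["NAICS_sector"] = df["NAICS"] // 10000

print(df)
```

The new column stays an integer column, so it can be sorted or compared numerically later on.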
If the NAICS codes are in the dataframe in string format (that is, df['NAICS'].dtype says object), you can use string manipulation instead:
df['NAICS_sector'] = df['NAICS'].str.slice(stop=2)
Setting stop=2 means that the first two characters are returned from each entry. The parameters of the slice method are explained in the official pandas documentation.
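A small sketch of the string approach, assuming the codes are stored as strings. The sample values are hypothetical:

```python
import pandas as pd

# Hypothetical sample data with NAICS codes stored as strings.
df = pd.DataFrame({"NAICS": ["541511", "522110", "236220"]})

# Keep only the first two characters of each code.
df["NAICS_sector"] = df["NAICS"].str.slice(stop=2)

# The indexing shorthand .str[:2] gives the same result.
same_result = df["NAICS"].str[:2]

print(df)
```

Here the sector column holds strings such as "54", not integers, which matters if you later compare it against numeric values.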
Finally, if your dataset contains integers but you cannot guarantee they all have the same length, you’ll want to use string manipulation anyway: convert the column to strings with astype(str) and then apply the string slicing shown above.
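To illustrate why mixed lengths matter, here is a sketch with made-up integer codes of two, four, and six digits. Integer division by 10000 only works for the six-digit ones, while converting to strings first handles all of them:

```python
import pandas as pd

# Hypothetical sample data: integer codes of varying length.
df = pd.DataFrame({"NAICS": [541511, 5221, 23, 811111]})

# Integer division breaks for the shorter codes (5221 // 10000 is 0).
wrong = df["NAICS"] // 10000

# Converting to strings first and slicing works regardless of length.
df["NAICS_sector"] = df["NAICS"].astype(str).str.slice(stop=2)

print(wrong.tolist())
print(df)
```

This assumes the column has no missing values. With missing values, pandas would typically store the column as floats, and the string conversion would need extra care.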
After all this is done, you can group or count using the new NAICS_sector column, for example with df['NAICS_sector'].value_counts().
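For the counting step the question asks about, a minimal sketch follows. It assumes each row of the dataframe is one loan. The sample codes are hypothetical:

```python
import pandas as pd

# Hypothetical sample data: one row per loan.
df = pd.DataFrame({"NAICS": [541511, 522110, 236220, 541611, 522291]})
df["NAICS_sector"] = df["NAICS"] // 10000

# Count loans per two-digit sector, most frequent first.
loans_per_sector = df["NAICS_sector"].value_counts()

# An equivalent count via groupby, ordered by sector code instead.
loans_per_sector_gb = df.groupby("NAICS_sector").size()

print(loans_per_sector)
print(loans_per_sector_gb)
```

value_counts is the shorter option when all you need is the number of rows per sector. groupby is more useful if you also want other aggregates per sector, such as sums or means of another column.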