How should I approach classifying these text fields into numeric?

Question

I have data that has a bunch of text entries that look like this: 7-8 business days, 10 days, 10-12 days, 2-3 weeks, 1 year, 8-12 days, 2 weeks, etc.

I want to train a model to convert these text entries to numeric. I assume this is a fairly easy problem for an NLP model to handle, but I am not too confident about which models to use.

I would like to take the larger of the two numbers if a range is specified. For instance, 2-3 weeks should become 21 and 8-10 days should become 10.

I figure I can manually code around 100 records to train the model. Can someone recommend an NLP model to use or even a script that I can edit? If this isn’t a good use case for NLP, please also advise.

(I am only practiced at the R programming language, but I can tinker around in Python if needed).

If all else fails, my fallback is to break the data into separate columns, using the hyphen as a delimiter with strsplit(as.character(df$column), "-"), and then evaluate them with if-statements. For example:

date_conversion = if_else(is.numeric(columnA) & columnB == "weeks", columnA*7, if_else(is.numeric(columnA) & columnB == "days", columnA, NA_real_))

But ideally, I would like to train a model since I have a lot of data.

Some data using dput:

text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days", 
                 "10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days", 
                 "7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks", 
                 "10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days", 
                 "2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days", 
                 "10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days", 
                 "2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days", 
                 "10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days", 
                 "5-7 business days", "5-7 days", "5-7 business days", "7 days", 
                 "2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days", 
                 "10 days", "10 days", "10 days", "14", "2 weeks", "10 business days", 
                 "2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks", 
                 "3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5", 
                 "5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days", 
                 "10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)

Asked By: Stephen Poole


Answer 1

If your input is a simple list, you can work out the number of days with some simple logic and no NLP:

input_list = ["10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days"]

for inp in input_list:
    relevantPart = inp.split("-")[-1]
    relevantPart = relevantPart.strip().replace("  ", " ").split(" ")

    if "day" in relevantPart[-1].lower():
        print(float(relevantPart[0]))
    elif "week" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 7)
    elif "month" in relevantPart[-1].lower():
        print(float(relevantPart[0]) * 31)
    else:
        print("No valid period of time.")

Basically, split("-")[-1] discards everything in front of the last hyphen, so only the upper end of a range is kept. With some string handling, clean-up and simple math on top of that, you get the number of days you want.

You might want to add more specific cases (e.g. year) and also handle month differently (count with 28, 29, 30 or 31 days?).
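The extra cases mentioned above could be folded into a small lookup table. This is only a sketch: the function name to_days, the DAYS_PER_UNIT table, the 30-day month, the 365-day year, and treating a bare number such as "14" as days are all assumptions to adjust for your data.

```python
# Hypothetical helper; the unit table and month/year lengths are assumptions.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}


def to_days(text, default_unit="day"):
    """Return the upper bound of a duration string in days, or None."""
    # Keep only what follows the last hyphen, e.g. "2-3 weeks" -> "3 weeks".
    parts = text.split("-")[-1].strip().lower().split()
    if not parts:
        return None
    try:
        number = float(parts[0])
    except ValueError:
        return None
    # The unit is the last word; a bare number such as "14" has no unit,
    # so it falls back to default_unit (an assumption, not a rule).
    unit_word = parts[-1] if len(parts) > 1 else default_unit
    for unit, factor in DAYS_PER_UNIT.items():
        if unit in unit_word:
            return number * factor
    return None


for entry in ["10 - 14 days", "2-3 Weeks", "1 year", "14", "3-5"]:
    print(entry, "->", to_days(entry))
```

Returning None instead of printing a message makes it easy to filter out the entries that still need a rule.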

Answered By: ucczs

Answer 2

For something like this, where all the kinds of values for columnA are quite predictable, it’s highly unlikely that you would need any deep learning, if that’s what you mean by NLP.

So, in my experience, what you are suggesting about if-statements is actually the right general idea.

Here is a procedure that uses the tidyverse library and regular expressions:

library(tidyverse)

text_fields <- c("10 - 14 days", "2-3 Weeks", "10-12 days", "8 days", "8-12 days", 
                 "10 days", "7-10 days", "5 days", "7-10 days", "5-7 days", "10 days", 
                 "7 days", "7 - 10 days", "1 week", "5-7 days", "2 weeks", "2-4 weeks", 
                 "10 days", "7-10 days", "8-10 days", "1 week", "10 days", "8-10 days", 
                 "2 weeks", "10-12 days", "7-10 days", "2-3 weeks", "7-10 days", 
                 "10 days", "2 weeks", "8-12 days", "12 days", "10 Days", "7 Days", 
                 "2 weeks", "5-8 days", "8-12 days", "8-12 days", "10 days", "12-14 days", 
                 "10-12 days", "7 days", "5-7 days", "2 weeks", "2-3 weeks", "5-7 days", 
                 "5-7 business days", "5-7 days", "5-7 business days", "7 days", 
                 "2-3 weeks", "7-10 days", "8-12 Days", "10 days", "10 days", 
                 "10 days", "10 days", "10 days", "14", "2 weeks", "10 business days", 
                 "2-3 weeks", "4 days", "1 month", "7-10 days", "8-12 days", "2-3 weeks", 
                 "3-5 days", "10 days", "3-5 days", "2-3 days", "2-3 days", "3-5", 
                 "5-7 days", "7-10 days", "5-7 days", "8-12 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2.5 weeks", "2 Weeks", "10-12 days", 
                 "10-12 days", "7-10 days", "7-10 days", "7-10 days", "7-10 days", 
                 "7-10 days", "7-10 days", "2 weeks", "1 month", "1 month", "1 week"
)

# putting the values in a dataframe. `tibble()` also works
df <- data.frame(text = text_fields)

# The %>% is a special operator that pipes the result into the first argument of the next function. I use it to keep things clean.
df <- df %>% 
  # I capture the last instance of a number in each value
  # then save the captured values to a new column called days
  mutate(days = str_match_all(text, "\\d*\\.?\\d+") %>% 
           # take the last match only
           lapply(tail, 1) %>% 
           # collapse this list into a simple vector
           unlist() %>% 
           # change text to number
           as.numeric()
         ) %>% 
  # I update the days column according to the type of unit
  mutate(days = case_when(
    # (?i) makes the search case insensitive.
    str_detect(text, "(?i)business") ~ days + 2 * ceiling(days / 7),
    str_detect(text, "(?i)week") ~ days * 7,
    str_detect(text, "(?i)month") ~ ceiling(days * 30.5),
    str_detect(text, "(?i)year") ~ ceiling(days * 365.25),
    TRUE ~ days
  ))

After doing this, you can check to see if there are edge cases you missed and you can expand the function to support those cases. Applying what you suggested about writing training examples, you can instead make a bunch of complicated cases and test your code against them. If your code fails, you update it so that it can handle that case.
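That test-against-cases workflow might look roughly like the sketch below, written in Python since the asker mentioned being open to it. The names parse_days, CASES and failing_cases are made up for illustration, the 30-day month is an assumption, and the parser is a simplified stand-in for the R pipeline above (it ignores business days and years).

```python
import re


def parse_days(text):
    """Hypothetical parser: last number in the string, scaled by its unit."""
    numbers = re.findall(r"\d*\.?\d+", text)
    if not numbers:
        return None
    value = float(numbers[-1])
    lowered = text.lower()
    if "week" in lowered:
        return value * 7
    if "month" in lowered:
        return value * 30  # assumed month length
    return value  # days, or no unit given


# Hand-written (input, expected) pairs; extend with any odd entries you find.
CASES = [
    ("10 - 14 days", 14),
    ("2-3 Weeks", 21),
    ("2.5 weeks", 17.5),
    ("14", 14),
    ("3-5", 5),
    ("1 month", 30),
]


def failing_cases(parser, cases):
    """Return (text, expected, actual) for every case the parser gets wrong."""
    return [(text, expected, parser(text))
            for text, expected in cases
            if parser(text) != expected]


print(failing_cases(parse_days, CASES))
```

An empty list means every case passes. Adding a pair such as ("1 year", 365) makes the list non-empty, which is the signal to add a rule for years.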

Here’s a good resource for testing and learning about regular expressions: https://regex101.com/

Note that when you put a regular expression in an R string, you have to escape the escape character: in place of \, use \\.

Answered By: Cubic Infinity
