Python detect dates containing commas and remove comma from text file

Question:

I have a txt file that contains dates in columns like below. The comma between the day and year is making it hard to import the data into pandas using pd.read_csv(). This is contained within a text file that has other data that should be ignored, so I can’t perform some action on the entire document. I need to go through the file, find the dates with this formatting, and remove the commas within the dates, leaving the commas between dates. What’s a simple way to accomplish this?

May 15, 2023, May 22, 2023
August 14, 2023, August 21, 2023
November 14, 2023, November 21, 2023
February 14, 2024, February 22, 2024
Asked By: Troy D


Answers:

A possible solution:

  1. Read the CSV file in a way that you get a dataframe with a single column, say, named a. (You can, for instance, use a separator that does not exist in the file.)

  2. Use the following to remove the comma from dates:

df['a'] = df['a'].str.replace(r'(?<=\d),(?=\s\d{4})', '', regex=True)
  3. Save the dataframe as a text file.

  4. Open the new CSV file with pd.read_csv.
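The four steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the sample lines from the question stand in for the real file, a tab is assumed to be a separator that never occurs in the data, in-memory buffers replace the intermediate text file, and the column names start and end are made up for the example.

```python
import io

import pandas as pd

# Sample lines from the question, standing in for the real text file.
raw = io.StringIO(
    "May 15, 2023, May 22, 2023\n"
    "August 14, 2023, August 21, 2023\n"
)

# Step 1: read every line into a single column 'a'. The tab separator is an
# assumption; it has to be a character that never appears in the file.
df = pd.read_csv(raw, sep="\t", header=None, names=["a"])

# Step 2: remove a comma only when it follows a digit and precedes a 4-digit year.
df["a"] = df["a"].str.replace(r"(?<=\d),(?=\s\d{4})", "", regex=True)

# Steps 3 and 4: write the cleaned lines out (here to a buffer instead of a
# file) and parse them again as ordinary comma-separated data.
cleaned = io.StringIO("\n".join(df["a"]))
dates = pd.read_csv(
    cleaned, header=None, names=["start", "end"], skipinitialspace=True
)

print(dates)
```

With a real file, the two StringIO buffers would be replaced by file paths, and step 3 would write the cleaned lines to disk before step 4 reads them back.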

Answered By: PaulS

You can also use re.findall() to locate the dates, remove the comma from each one, and write the result to a new file.

import re

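# 'my_text_file' and 'output_file' are placeholder paths.
with open('my_text_file', 'r') as infile, open('output_file', 'w') as outfile:
    for line in infile:
        # Find dates such as "May 15, 2023" (month name, day, comma, year).
        dates = re.findall(r"\b[A-Za-z]+\s\d+,\s\d+\b", line)
        for date in dates:
            # Drop the comma inside each date; commas between dates stay.
            line = line.replace(date, date.replace(",", ""))
        outfile.write(line)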

This approach produces the output below, which I think is what you are looking for.


May 15 2023, May 22 2023 
August 14 2023, August 21 2023 
November 14 2023, November 21 2023 
February 14 2024, February 22 2024 

Test Code:

import re

file_data = """May 15, 2023, May 22, 2023
August 14, 2023, August 21, 2023
November 14, 2023, November 21, 2023
February 14, 2024, February 22, 2024
"""

result = ""
file_lines = file_data.split('\n')

for line in file_lines:
    dates = re.findall(r"\b[A-Za-z]+\s\d+,\s\d+\b", line)
    for date in dates:
        line = line.replace(date, date.replace(",", ""))
    result += f"{line} \n"
        
print(result)
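
As a possible variant, the find-then-replace loop could be collapsed into a single re.sub() call that captures the text on either side of the comma and joins it back together. This is a sketch only; the names DATE_COMMA and strip_date_commas are made up for illustration.

```python
import re

# Word, day, comma, 4-digit year. The two groups capture everything except
# the comma, so substituting "\1\2" removes just that comma.
DATE_COMMA = re.compile(r"\b([A-Za-z]+\s\d{1,2}),(\s\d{4})\b")


def strip_date_commas(line):
    """Return line with the comma inside 'Month D, YYYY' dates removed."""
    return DATE_COMMA.sub(r"\1\2", line)


print(strip_date_commas("May 15, 2023, May 22, 2023"))
```

Like the pattern above, this matches any word followed by a day and a year, not only month names. If the surrounding data can contain text such as "item 12, 3456", the [A-Za-z]+ part would need to be narrowed to an explicit list of month names.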
Answered By: Jamiu S.