How to remove a comma and format a date within a web-scraped dataframe for printing to csv?

Question:

I am fairly new to web scraping and am attempting to pull dividend history from yahoo finance for analysis.

How can I reformat a date I scraped? It includes a comma, which causes the year to end up in the same column as the dividend when printed to a csv file. I need the date in one column and the dividend in another. The date needs to be stored in a format that Python can work with, such as MM-DD-YYYY. I’ve been at this for hours and can’t seem to figure it out. Do I need the csv module?

Here are a few relevant snippets from my code, and the output is at the end.

import os
from datetime import datetime, timedelta
import time, requests, pandas, lxml
from lxml import html
from yahoofinancials import YahooFinancials

def scrape_page(url, header):
     page = requests.get(url, headers=header)
     element_html = html.fromstring(page.content)
     table = element_html.xpath('//table')
     table_tree = lxml.etree.tostring(table[0], method='xml')
     panda = pandas.read_html(table_tree)
     return panda
     
def clean_dividends(symbol, dividends):
     index = len(dividends)     
     dividends = dividends.drop(index-1) #Drop the last row of the dataframe
     dividends = dividends.set_index('Date') #Set the row index to the column labelled Date
     dividends = dividends['Dividends'] #Store only the dividend column indexed by date into a variable
     dividends = dividends.str.replace('Dividend', '', regex = True) #Remove all the strings in the dividend column
     dividends = dividends.astype(float) #Convert the dividend amounts to float values from strings
     dividends.name = symbol #Change the name of the resulting pandas series object to the symbol
     return dividends
if __name__ == '__main__':
     start = datetime.today() - timedelta(days=3650)
     end = datetime.today()
     #properly format the date to epoch time
     start = format_date(start)
     end = format_date(end)
     #format the subdomain
     sub = subdomain(ticker, start, end)
     #customize the request header
     hdrs = header(sub)
     
     #concatenate the subdomain with the base URL
     base_url = "https://finance.yahoo.com/quote/"
     url = base_url + sub
     dividends = scrape_page(url, hdrs) #scrape the dividend history table from Yahoo Finance
     clean_div = clean_dividends(ticker, dividends[0]) #clean the dividend history table
     print(clean_div)

#print to csv file
# check whether the file exists
if os.path.exists(ticker+"_div_hist.csv"):
    # delete the file
    os.remove(ticker+"_div_hist.csv")
    
f = open(ticker+"_div_hist.csv", "w")
f.write(str(clean_div))
f.close()

Here is a shortened example of the output in both terminal and csv format:

Date
Nov 04, 2022    0.51
Jul 29, 2022    0.51
Apr 29, 2022    0.51
Feb 04, 2022    0.51
Name: C, dtype: float64

CSV file output

Thank you for any help that you can provide.

Edit: Removed less relevant code

Asked By: user14894283

||

Answers:

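You can parse the "MonthName Day, Year" format into a datetime object and then print it in whatever format you want with the following code snippet (the example prints DD-MM-YYYY; use '%m-%d-%Y' instead for MM-DD-YYYY):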

from datetime import datetime

dateString = "Nov 04, 2022"
datetimeObject = datetime.strptime(dateString, '%b %d, %Y')
print(datetimeObject.strftime('%d-%m-%Y'))

#Output string is '04-11-2022'
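
The same format codes can be applied to a whole column rather than one string at a time. The following is a minimal sketch, assuming the scraped dates sit in a pandas column; the column names and sample values here are illustrative (copied from the question's output), not the asker's actual dataframe.

```python
import pandas as pd

# Illustrative data copied from the question's output; column names are assumptions.
df = pd.DataFrame(
    {"Date": ["Nov 04, 2022", "Jul 29, 2022"], "Dividends": [0.51, 0.51]}
)

# Parse every "Mon DD, YYYY" string at once, using the same codes as strptime.
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")

# Render the parsed dates as MM-DD-YYYY strings (month first, as the question asks).
df["Date_str"] = df["Date"].dt.strftime("%m-%d-%Y")

print(df["Date_str"].tolist())  # ['11-04-2022', '07-29-2022']
```

Keeping the parsed datetime column alongside the string version leaves the dates available for sorting or arithmetic, while the string column is only for display.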

If you want an easier way to convert the date between formats, you can use timestamps. Call datetimeObject.timestamp() to obtain a timestamp; later, datetime.fromtimestamp(timestamp) gives you a datetime object back from that timestamp at any time.
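
As a rough illustration of that round trip, here is a small sketch using only the standard library. The variable names are made up for the example; note that a datetime without timezone information is interpreted as local time in both directions.

```python
from datetime import datetime

parsed = datetime.strptime("Nov 04, 2022", "%b %d, %Y")

# timestamp() returns seconds since the epoch as a float
# (a naive datetime is treated as local time).
ts = parsed.timestamp()

# fromtimestamp() converts the float back into a naive local datetime.
restored = datetime.fromtimestamp(ts)

print(restored.strftime("%m-%d-%Y"))  # 11-04-2022
```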

After that, you can use datetime formatting to get any datetime string you like.

You can learn more about datetime formatting here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

Answered By: Refet

Let's take the dataframe returned from your scrape and turn it into the CSV you want. In the following code I replicate the dataframe, which has an awkwardly formatted date column (the comma is what pushes the year into a separate column in the CSV file). To change every date in the dataframe's Dates column, we use apply, which runs a formatting function on each value in that column. We then make Dates the index and write the dataframe to CSV:

import pandas as pd


def convert_date_string(date_str):
    mon, day, year = date_str.split()
    return ' '.join([day.replace(',', ''), mon, year])


# Replicate the scraped data; the dates are strings in "Mon DD, YYYY" form
df = pd.DataFrame(
    [['Nov 04, 2022', '0.51'], ['Jul 29, 2022', '0.51'], ['Jul 29, 2022', '0.51'],
     ['Apr 29, 2022', '0.51'], ['Feb 04, 2022', '0.51']],
    columns=['Dates', 'Divs'])

# Rewrite each date as "DD Mon YYYY" (no comma), then index by date and save
df['Dates'] = df['Dates'].apply(convert_date_string)
df = df.set_index('Dates')
df.to_csv('filename.csv')

Note that we do not use any datetime functionality here. If you are merely copying the date to a CSV file, there is no need to convert the string to a datetime object; you just reformat the string. If you DO need datetime functionality, you will need to convert the date string to a datetime object that Python understands. I hope this is what you are after.
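
If datetime functionality is needed, one possible approach is sketched below. It assumes the cleaned data is a pandas Series indexed by the scraped date strings, like the one printed in the question; the `clean_div` object built here is a hypothetical stand-in, not the asker's real data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dividend Series from the question.
clean_div = pd.Series(
    [0.51, 0.51],
    index=pd.Index(["Nov 04, 2022", "Jul 29, 2022"], name="Date"),
    name="C",
)

# Convert the string index to real datetimes so sorting and date arithmetic work.
clean_div.index = pd.to_datetime(clean_div.index, format="%b %d, %Y")
clean_div.index.name = "Date"

# With no path given, to_csv returns the CSV text; date_format controls
# how the datetime index is written.
csv_text = clean_div.to_csv(date_format="%m-%d-%Y")
print(csv_text)
```

Passing a file name such as `ticker + "_div_hist.csv"` as the first argument writes the same text to disk. Writing `str(clean_div)` to a file, as in the question, saves the printed representation of the Series rather than comma-separated values, which is why the columns do not line up.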

Answered By: Galo do Leste