How to efficiently convert an array to a pandas DataFrame of OHLCV data with a datetime index, and also divide column values by 100?
Question:
The following is the JSON output I am getting from the API:
{
"data": [
[
1594373520,
43625,
43640,
43565,
43600,
59561
],
[
1594373820,
43600,
43650,
43505,
43565,
127844
],
[
1594374120,
43560,
43680,
43515,
43660,
74131
]
],
"message": "",
"status": "success"
}
I want to convert this JSON array into OHLCV data with a DateTime index. The OHLC values must be divided by 100, or sometimes by 10000, depending on the tick size.
The final output should look something like this:
date open high low close volume
0 2018-04-12 09:15:00+05:30 295.00 295.75 293.25 293.80 55378
1 2018-04-12 09:20:00+05:30 293.75 293.75 292.55 292.95 32219
2 2018-04-12 09:25:00+05:30 292.95 293.40 292.65 292.80 23643
3 2018-04-12 09:30:00+05:30 292.80 293.00 292.75 292.80 12313
4 2018-04-12 09:35:00+05:30 292.75 292.85 291.50 291.55 32198
I know the answer is available on SO, but I want to do it efficiently, with less code and faster execution.
Moreover, the current data is at 5-minute intervals. In case I get 1-minute data, I would like to create a function to resample it accordingly.
I will try to update the question with my current code soon.
Code for division by 100. I want to do this for 4 columns (o, h, l, c). Looking for a one-liner.
df['A'] = df['A'].div(100).round(2)
Update: can this be done in a more efficient way?
My current code:
import pandas as pd
import pytz
# `data` is the parsed JSON response shown above
records = data['data']
df = pd.DataFrame(records, columns=['datetime', 'open', 'high', 'low', 'close', 'volume'])
# Convert epoch seconds to timezone-aware timestamps, one row at a time
df['datetime'] = df['datetime'].apply(pd.Timestamp, unit='s', tzinfo=pytz.timezone("Asia/Kolkata"))
# Scale each price column separately
df['open'] = df['open'].astype(float).div(100)
df['high'] = df['high'].astype(float).div(100)
df['low'] = df['low'].astype(float).div(100)
df['close'] = df['close'].astype(float).div(100)
df.set_index('datetime', inplace=True)
print(df)
Output:
open high low close volume
datetime
2020-08-12 09:00:00+05:30 3124.0 3124.0 3120.0 3121.0 168
2020-08-12 09:05:00+05:30 3121.0 3124.0 3121.0 3123.0 163
2020-08-12 09:10:00+05:30 3123.0 3124.0 3122.0 3123.0 133
2020-08-12 09:15:00+05:30 3123.0 3125.0 3122.0 3122.0 154
2020-08-12 09:20:00+05:30 3122.0 3125.0 3122.0 3125.0 131
... ... ... ... ... ...
2020-08-13 23:05:00+05:30 3159.0 3162.0 3157.0 3159.0 432
2020-08-13 23:10:00+05:30 3159.0 3161.0 3155.0 3156.0 483
2020-08-13 23:15:00+05:30 3156.0 3160.0 3154.0 3159.0 1344
2020-08-13 23:20:00+05:30 3159.0 3167.0 3156.0 3165.0 284
2020-08-13 23:25:00+05:30 3165.0 3167.0 3162.0 3164.0 166
[348 rows x 5 columns]
Answers:
If you want to do the division in a single statement, I think you can select all four columns at once and assign the result back:
df[['open','high','low','close']] = df[['open','high','low','close']].astype(float).div(100)
datetime open high low close volume
0 2020-07-10 15:02:00+05:30 436.25 436.4 435.65 436.00 59561
1 2020-07-10 15:07:00+05:30 436.00 436.5 435.05 435.65 127844
2 2020-07-10 15:12:00+05:30 435.60 436.8 435.15 436.60 74131
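The one-liner above hardcodes 100, while the question says the divisor may be 100 or 10000 depending on the tick size. One way to handle that is a small helper that takes the divisor as a parameter. This is only a sketch: `scale_prices`, `PRICE_COLUMNS`, and the `decimals` default are illustrative names chosen here, not part of pandas or of the original answer.

```python
import pandas as pd

# Assumed column names, matching the ones used in the question.
PRICE_COLUMNS = ["open", "high", "low", "close"]


def scale_prices(df, divisor=100, decimals=2):
    """Return a copy of df with the OHLC columns divided by `divisor`."""
    out = df.copy()
    # Division converts the integer columns to float, so no astype(float) is needed.
    out[PRICE_COLUMNS] = out[PRICE_COLUMNS].div(divisor).round(decimals)
    return out


raw = pd.DataFrame(
    [
        [1594373520, 43625, 43640, 43565, 43600, 59561],
        [1594373820, 43600, 43650, 43505, 43565, 127844],
    ],
    columns=["datetime", "open", "high", "low", "close", "volume"],
)
scaled = scale_prices(raw, divisor=100)
print(scaled)
```

For an instrument quoted with four decimal places, the same call with `divisor=10000, decimals=4` would apply. The volume column is left untouched because it is not listed in `PRICE_COLUMNS`.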
df = pd.DataFrame(data['data'], columns=['datetime', 'open', 'high', 'low', 'close', 'volume'])
# This will be a more efficient method of getting your time zone correct.
df.datetime = pd.to_datetime(df.datetime, unit='s', utc=True).dt.tz_convert("Asia/Kolkata")
# Let's set the index earlier:
df = df.set_index('datetime')
# Sometimes dropping what you don't want is
# less typing than selecting what you want.
# Also, you don't need to convert to float;
# division will do that for you.
df = df.drop('volume', axis=1).div(100).combine_first(df)
print(df)
Output:
open high low close volume
datetime
2020-07-10 15:02:00+05:30 436.25 436.4 435.65 436.00 59561
2020-07-10 15:07:00+05:30 436.00 436.5 435.05 435.65 127844
2020-07-10 15:12:00+05:30 435.60 436.8 435.15 436.60 74131
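The question also asks about resampling 1-minute data to 5-minute bars, which the answers above do not cover. A possible approach, assuming the frame already has a DatetimeIndex and the column names used above, is `DataFrame.resample` with a per-column aggregation. The names `resample_ohlcv` and `OHLCV_AGG` are hypothetical, and the sample data is made up purely for illustration.

```python
import pandas as pd

# How each column is combined when several bars are merged into one.
OHLCV_AGG = {
    "open": "first",
    "high": "max",
    "low": "min",
    "close": "last",
    "volume": "sum",
}


def resample_ohlcv(df, rule="5min"):
    """Resample OHLCV bars (DatetimeIndex required) to a coarser interval."""
    out = df.resample(rule).agg(OHLCV_AGG)
    # Intervals without any source bars have NaN prices; drop them.
    return out.dropna(subset=["open"])


# Ten synthetic 1-minute bars, for illustration only.
idx = pd.date_range("2020-07-10 15:00", periods=10, freq="1min", tz="Asia/Kolkata")
one_min = pd.DataFrame(
    {
        "open": list(range(100, 110)),
        "high": list(range(101, 111)),
        "low": list(range(99, 109)),
        "close": list(range(100, 110)),
        "volume": [10] * 10,
    },
    index=idx,
)
five_min = resample_ohlcv(one_min, "5min")
print(five_min)
```

Dropping empty intervals is a design choice made for this sketch; if gaps such as market closures should stay visible in the output, the `dropna` call can be removed.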