Store df.info() method output in a DataFrame or CSV

Question:

I have a giant DataFrame (df) whose dimensions are (42,— x 135). I’m running df.info() on it, but the output is unreadable. Is there any way to dump it into a DataFrame or CSV? I think it has something to do with this parameter from the docs:

    buf : writable buffer, defaults to sys.stdout
        Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

But when I pass buf=buffer, the output is just each word of the output followed by a new line, which is very hard to read or work with. My goal is to better understand which columns are in the DataFrame and to be able to sort them by type.

Asked By: Adam Safi


Answers:

You need to open a file then pass the file handle to df.info:

with open('info_output.txt','w') as file_out:
  df.info(buf=file_out)
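Answered By: mechanical_meat

import io
import pandas as pd

df = pd.read_csv('/content/house_price.csv')

# Capture the df.info() text in an in-memory buffer instead of printing it
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()

# Keep only the per-column rows: everything after the dashed header rule
# and before the "dtypes" summary line
with open("df_info.csv", "w", encoding="utf-8") as f:
    f.write(s.split(" -----  ")[1].split("dtypes")[0])

# The rows are separated by runs of whitespace, hence the regex separator
di = pd.read_csv('df_info.csv', sep=r"\s+", header=None)
di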

Just to build on mechanical_meat’s and Adam Safi’s combined solution, the following code will convert the info output into a dataframe with no manual intervention:

with open('info_output.txt','w') as file_out:
    df.info(buf=file_out)

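# sep must be the regex r"\s+" (runs of whitespace); skiprows/skipfooter drop
# the 5 header lines and the 2 summary lines that df.info() prints
info_output_df = pd.read_csv('info_output.txt', sep=r"\s+", header=None, index_col=0, engine='python', skiprows=5, skipfooter=2)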

Note that according to the docs, the ‘skipfooter’ option is only compatible with the python engine.
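As a variation, the temporary file can probably be skipped by parsing the captured text straight from an in-memory buffer. The sketch below is an illustration, not part of the original answer: the sample DataFrame and the labels passed to names= are made up, and it assumes column names without spaces plus the five-line header and two-line footer that df.info() prints for a default index.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the real DataFrame
df = pd.DataFrame({
    "price": [1.5, 2.0, None],
    "rooms": [3, 4, 2],
    "city": ["A", "B", None],
})

# Capture the df.info() text in memory, then rewind so it can be read back
buffer = io.StringIO()
df.info(buf=buffer)
buffer.seek(0)

# The labels in names= are assumptions chosen for readability; each info row
# has five whitespace-separated fields: "#", column name, count, "non-null", dtype
info_output_df = pd.read_csv(
    buffer,
    sep=r"\s+",
    header=None,
    names=["#", "column", "non_null_count", "non_null_label", "dtype"],
    index_col=0,
    engine="python",
    skiprows=5,
    skipfooter=2,
)

# Sort the columns by their dtype name, which was the goal in the question
sorted_by_type = info_output_df.sort_values("dtype")
print(sorted_by_type)
```

Column names that contain spaces would split into extra fields here, so this only suits frames with simple column names.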

Answered By: DPD91

You could avoid pandas.DataFrame.info() altogether and instead build the information that you need as a pandas.DataFrame:

import pandas as pd


def get_info(df: pd.DataFrame):
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.apply(lambda srs: len(srs.dropna()))
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info

Then write the result to CSV with get_info(df).to_csv('info_output.csv').
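To show how this might be used to sort columns by type, here is a minimal sketch. The sample DataFrame is hypothetical, and converting the dtypes to strings before sorting is my own addition, since dtype objects do not order in an obvious way.

```python
import pandas as pd


def get_info(df: pd.DataFrame) -> pd.DataFrame:
    # Same summary as in the answer: one row per column of df
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.apply(lambda srs: len(srs.dropna()))
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info


# Hypothetical sample data standing in for the real DataFrame
df = pd.DataFrame({
    "price": [1.5, 2.0, None],
    "rooms": [3, 4, 2],
    "city": ["A", "B", "C"],
})

info = get_info(df)

# Turn the dtype objects into their names so they sort alphabetically
info_sorted = info.assign(dtypes=info['dtypes'].astype(str)).sort_values('dtypes')

# Write the summary, one row per original column, to a CSV file
info_sorted.to_csv('info_output.csv')
```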

The memory usage information may also be useful, so you could do:

df.memory_usage().sum()
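
If a per-column breakdown is more useful than the total, something along these lines should work. It is only a sketch with a made-up DataFrame; index=False and deep=True are optional arguments of DataFrame.memory_usage that I chose here, not something the answer above uses.

```python
import pandas as pd

# Hypothetical sample data standing in for the real DataFrame
df = pd.DataFrame({
    "price": [1.5, 2.0, None],
    "rooms": [3, 4, 2],
    "city": ["A", "B", "C"],
})

# index=False leaves out the index so the result has one entry per column;
# deep=True also counts the contents of object (e.g. string) columns
per_column = df.dtypes.astype(str).to_frame('dtypes')
per_column['memory_bytes'] = df.memory_usage(index=False, deep=True)

# Total footprint in bytes, this time including the index
total_bytes = df.memory_usage(deep=True).sum()
print(per_column)
print(total_bytes)
```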
Answered By: alh