Pandas: How to store cProfile output in a pandas DataFrame?

Question:

There already exist some posts discussing python profiling using cProfile, as well as the challenges of analyzing the output due to the fact that the output file restats from the sample code below is not a plain text file. The snippet below is only a sample from docs.python.org/2/library/profile, and not directly reproducible.

import cProfile
import re
cProfile.run('re.compile("foo|bar")', 'restats')

There is one discussion here: Profile a python script using cProfile into an external file, and on the docs.python.org there are more details on how to analyze the output using pstats.Stats (still only a sample, and not reproducible):

import pstats
p = pstats.Stats('restats')
p.strip_dirs().sort_stats(-1).print_stats()

I may be missing some very important details here, but I would really like to store the output in a pandas DataFrame and do further analysis from there.

I thought it would be pretty simple since the output in iPython from running cProfile.run() looks fairly tidy:

In[]:
cProfile.run('re.compile("foo|bar")'

Out[]:

enter image description here

Any suggestions on how to get this into a pandas DataFrame in the same format?

Asked By: vestland

||

Answers:

It looks like https://github.com/ssanderson/pstats-view might do what you want (albeit with unnecessary dependencies related to visualising the data and making it interactive):

>>> from pstatsviewer import StatsViewer
>>> sv = StatsViewer("/path/to/profile.stats")
>>> sv.timings.columns
Index(['lineno', 'ccalls', 'ncalls', 'tottime', 'cumtime'], dtype='object')
Answered By: joeln

I know this already has an answer, but for anyone who doesn’t want to go to the trouble of downloading another module, here’s a rough and ready script that should get close:

%%capture profile_results    ## uses %%capture magic to send stdout to variable
cProfile.run("your_function( **run_parms )")

Run the above first, to populate profile_results with the contents of stout, which contains the usual printed output of cProfile.

## Parse the stdout text and split it into a table
data=[]
started=False

for l in profile_results.stdout.split("n"):
    if not started:
        if l=="   ncalls  tottime  percall  cumtime  percall filename:lineno(function)":
            started=True
            data.append(l)
    else:
        data.append(l)
content=[]
for l in data:
    fs = l.find(" ",8)
    content.append(tuple([l[0:fs] , l[fs:fs+9], l[fs+9:fs+18], l[fs+18:fs+27], l[fs+27:fs+36], l[fs+36:]]))
prof_df = pd.DataFrame(content[1:], columns=content[0])

It wont win any prizes for elegance or pleasant style, but it does force that table of results into a filterable dataframe format.

prof_df 

enter image description here

Answered By: Thomas Kimber

If your doing this in the cmd with
python -m cProfile your_script.py

you can push the output to a csv file and then parse with pandas
python -m cProfile your_script.py >> output.txt

Then parse the output with pandas

df = pd.read_csv('output.txt', skiprows=5, sep='    ', names=['ncalls','tottime','percall','cumti    me','percall','filename:lineno(function)'])
df[['percall.1', 'filename']] = df['percall.1'].str.split(' ', expand=True, n=1)
df = df.drop('filename:lineno(function)', axis=1)
Answered By: Kenan

You can use this function to accomplish this task

def convert_to_df(path, offset=6):
    """
    path: path to file
    offset: line number from where the columns start
    """
    with open(path, "r") as f:
        core_profile = f.readlines()
    core_profile = core_profile[offset:]
    cols = core_profile[0].split()
    n = len(cols[:-1])
    data = [_.split() for _ in core_profile[1:]]
    data = [_ if len(_)==n+1 else _[:n]+[" ".join(_[n+1:])] for _ in data]
    data_ = pd.DataFrame(data, columns=cols)
    return data_
Answered By: Akhil Vempali

In case people don’t want to use %%capture or go through a CSV, below a cobbled-together solution, in this case comparing multiple cProfiles in the same folder by (1) sorting each cProfile by cumulative time and (2) adding only the top result (pstats.Stats(f, stream = p_output).sort_stats("cumulative").print_stats(1)) from each .prof to a data frame (along with a portion of the .prof filename to identify which profile the measurement came from).

See here for some of the original code (which does use CSVs as the intermediary).

import io
import pstats
import pandas as pd
import glob

all_files = glob.glob(profiledir + "/*.prof")

li = []

for f in all_files:
    
    p_output = io.StringIO()

    prof_stats = pstats.Stats(f, stream = p_output).sort_stats("cumulative").print_stats(1)

    p_output = p_output.getvalue()
    p_output = 'ncalls' + p_output.split('ncalls')[-1]
    result = 'n'.join([','.join(line.rstrip().split(None,5)) for line in p_output.split('n')])

    df = pd.read_csv(io.StringIO(result), sep=",", header=0)
    df['module name'] = f.split(' ')[0].split('\')[1] # differs depending on your file naming convention
    li.append(df) 

df = pd.concat(li, axis=0, ignore_index=True)
Answered By: grg

I know this question is a few old, but I found an easy way to solve it.

import cProfile
import pandas as pd

with cProfile.Profile() as pr:
    # run something

df = pd.DataFrame(
    pr.getstats(),
    columns=['func', 'ncalls', 'ccalls', 'tottime', 'cumtime', 'callers']
)

Here is my recipe:

python -m cProfile -o profile.pstats some.py

And then

from pstats import Stats
import pandas as pd

st = Stats('./profile.pstats')

keys_from_k = ['file', 'line', 'fn']
keys_from_v = ['cc', 'ncalls', 'tottime', 'cumtime', 'callers']
data = {k: [] for k in keys_from_k + keys_from_v}

s = st.stats

for k in s.keys():
    for i, kk in enumerate(keys_from_k):
        data[kk].append(k[i])

    for i, kk in enumerate(keys_from_v):
        data[kk].append(s[k][i])

pd.DataFrame(data).to_csv('./profile.csv')
Answered By: cnstntn.kndrtv
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.