Windows processes XML file to pandas DataFrame?


I would like to convert the results of the following Windows command to a pandas DataFrame.

The raw data is generated with this command on a Windows machine:

wmic process get Caption, Processid, ParentProcessId, CommandLine,
CreationDate, KernelModeTime, UserModeTime, ThreadCount, HandleCount,
WorkingSetSize, PeakWorkingSetSize, VirtualSize, PeakVirtualSize,
PageFaults, PageFileUsage, PeakPageFileUsage, ReadOperationCount,
WriteOperationCount, OtherOperationCount /format:rawxml

I am parsing the resulting file with the following code:

import xml.etree.ElementTree as et
import pandas

with open("RunningProcess.xml") as praw:
    etree = et.parse(praw)

xroot = etree.getroot()
nprop = []

# Collect the NAME attribute of every PROPERTY element
for property in xroot.iter("PROPERTY"):
    xnames = property.get("NAME")
    nprop.append(xnames)

npropf = pandas.DataFrame(index=nprop)
rprows = []
data = []
inner = {}

# Build a one-row DataFrame for every VALUE element found
for child in xroot.iter("PROPERTY"):
    for gchild in child.iterfind('VALUE'):
        inner[gchild.tag] = gchild.text
        data.append(inner)
        rprows.append(pandas.DataFrame(data))
    data = []; inner = {}

finaldf = pandas.concat(rprows, sort=False).reset_index(drop=True)

finaldf.index = nprop

rpdfhtml = finaldf.to_html(index=True, header=True, border=1)

I get this result

first 39 lines of output

I would like to

  • turn the first 20 index rows into columns (caption to ...)
  • turn the values column into rows instead.

like this example
first 9 columns of desired output

Asked By: digikwondo



Welcome! This was an interesting question. This isn’t perfect, but hopefully it helps.

I wanted to try to avoid hard coding any columns of interest.

Assumptions – This file will have a predictable pattern of field names.

I used xml.etree.ElementTree, which I find to be a straightforward library, together with pandas

import xml.etree.ElementTree as ET
import pandas as pd

Reference the XML file

file = '/location/to/file/RunningProcess.xml'

Create a flattened DataFrame. I personally find this easier than doing all of the reshaping while walking the XML

First create a flattened list

tree = ET.parse(file)
root = tree.getroot()

# One [name, type, value] entry per PROPERTY element
ls_processes = []

for COMMAND in root.iter('COMMAND'):
    for RESULTS in COMMAND.iter('RESULTS'):
        for PROPERTY in RESULTS.iter('PROPERTY'):

            VALUE = PROPERTY.find('VALUE')

            if VALUE is not None:
                print(PROPERTY.attrib['NAME'],'|',PROPERTY.attrib['TYPE'],'|', VALUE.text )
                ls_processes.append([PROPERTY.attrib['NAME'],PROPERTY.attrib['TYPE'], VALUE.text])
            else:
                # Keep a placeholder so every process contributes the same fields
                print(PROPERTY.attrib['NAME'],'|',PROPERTY.attrib['TYPE'],'|', "NO VALUE")
                ls_processes.append([PROPERTY.attrib['NAME'],PROPERTY.attrib['TYPE'], 'NO VALUE'])

This will produce something which looks a bit like this

Caption | string | System Idle Process
CommandLine | string | NO VALUE
CreationDate | datetime | 20191002111400.978894+060
HandleCount | uint32 | 0
KernelModeTime | uint64 | 159488690156250
OtherOperationCount | uint64 | 0 
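
The CreationDate value in that output is a CIM-style timestamp (date and time, microseconds, then a UTC offset in minutes). If you want a real datetime out of it, here is a minimal sketch. The helper name parse_cim_datetime is my own, and it assumes a fully populated timestamp like the one shown above (no wildcard characters).

```python
from datetime import datetime, timedelta, timezone


def parse_cim_datetime(text):
    """Parse a timestamp such as '20191002111400.978894+060' (hypothetical helper)."""
    # The first 21 characters are yyyymmddHHMMSS.ffffff
    stamp, sign, offset = text[:21], text[21], text[22:]
    naive = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    # The trailing three digits are the UTC offset in minutes
    minutes = int(offset)
    if sign == "-":
        minutes = -minutes
    return naive.replace(tzinfo=timezone(timedelta(minutes=minutes)))


created = parse_cim_datetime("20191002111400.978894+060")
print(created.isoformat())
```

Once the wide DataFrame exists, the same function could be applied to the CreationDate column with something like df['CreationDate'].map(parse_cim_datetime), skipping any 'NO VALUE' placeholders first.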

Transform into a DataFrame

df_processes = pd.DataFrame(ls_processes)

Rename columns to make the DataFrame easier to work with

df_processes.columns = ['data','type','value']
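
Everything in the value column is still a string at this point. Since the type column is carried along, it can be used to convert values later. The sketch below is only an illustration: convert_value is a hypothetical helper, it only handles the unsigned integer types that appear in the sample output (uint32, uint64), and it maps the 'NO VALUE' placeholder to None.

```python
def convert_value(wmi_type, text):
    """Convert one value string based on its TYPE attribute (hypothetical helper)."""
    # Treat the placeholder used for missing VALUE elements as "no data"
    if text is None or text == "NO VALUE":
        return None
    # uint32 / uint64 counters become Python ints
    if wmi_type.startswith("uint"):
        return int(text)
    # Strings and datetimes are left untouched here
    return text


print(convert_value("uint64", "159488690156250"))
print(convert_value("string", "NO VALUE"))
```

With the DataFrame above this could be applied row by row, for example with a list comprehension over zip(df_processes['type'], df_processes['value']).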

Create a list of columns of interest

ls_columns = ['Caption', 'ProcessId', 'ParentProcessId', 'CommandLine', 'CreationDate', 'KernelModeTime', 'UserModeTime', 'ThreadCount', 'HandleCount', 'WorkingSetSize', 'PeakWorkingSetSize', 'VirtualSize', 'PeakVirtualSize', 'PageFaults', 'PageFileUsage', 'PeakPageFileUsage', 'ReadOperationCount', 'WriteOperationCount', 'OtherOperationCount']
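
If you would rather not type the column list by hand, it can be derived from the flattened list, and the "predictable pattern" assumption can be checked at the same time. This is a rough sketch on made-up sample rows in the same [name, type, value] shape as the flattened list; field_names and has_regular_pattern are names I invented for the example.

```python
from collections import Counter

# Hypothetical sample in the same shape as the flattened list
sample_rows = [
    ["Caption", "string", "System Idle Process"],
    ["HandleCount", "uint32", "0"],
    ["Caption", "string", "System"],
    ["HandleCount", "uint32", "1234"],
]


def field_names(rows):
    """Return the distinct field names in first-seen order."""
    return list(dict.fromkeys(name for name, _type, _value in rows))


def has_regular_pattern(rows):
    """True when every field name occurs the same number of times."""
    counts = Counter(name for name, _type, _value in rows)
    return len(set(counts.values())) <= 1


print(field_names(sample_rows))
print(has_regular_pattern(sample_rows))
```

If has_regular_pattern returns False, some process is missing a field and the column-by-column approach would misalign rows.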

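Create a single-column DataFrame for each column of interest

ls_processes = []
for column in ls_columns:
    # Collect every value recorded for this field, in file order
    ls_row = []
    for index, row in df_processes.iterrows():
        if row['data'] == column:
            ls_row.append(row['value'])

    df = pd.DataFrame(ls_row)
    ls_processes.append(df)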

Concatenate the DataFrames side by side (by columns)

df_processes_flat = pd.concat(ls_processes, axis = 1 ) 

Add the column names using the list previously created

df_processes_flat.columns = ls_columns

You’ll end up with a DataFrame that has one row per process and one column per entry in ls_columns.
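
For what it's worth, the same long-to-wide reshape can probably be done without iterrows. The sketch below uses made-up sample rows and relies on the same assumption as the rest of this answer: the n-th occurrence of each field belongs to the n-th process. The names df_long, df_wide and the helper column process are mine.

```python
import pandas as pd

# Hypothetical sample with the same data/type/value columns as above
df_long = pd.DataFrame(
    [
        ["Caption", "string", "System Idle Process"],
        ["HandleCount", "uint32", "0"],
        ["Caption", "string", "System"],
        ["HandleCount", "uint32", "1234"],
    ],
    columns=["data", "type", "value"],
)

# Number the occurrences of each field: 0 for the first process, 1 for the next, ...
df_long["process"] = df_long.groupby("data").cumcount()

# Spread the field names out into columns, one row per process
df_wide = df_long.pivot(index="process", columns="data", values="value")
df_wide.columns.name = None

print(df_wide)
```

pivot sorts the resulting columns alphabetically, so df_wide[ls_columns] would be needed afterwards to restore a particular column order.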

I would say these steps possibly aren’t the most elegant, but hopefully it’s clear what’s going on.
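
One possible shortcut, offered as an untested-against-your-file sketch: if each process in the rawxml output is wrapped in its own INSTANCE element (that is an assumption on my part, and the sample XML below is made up to match it), you can build one dict per process directly and skip the long-format step. instances_to_records is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the INSTANCE wrapper is an assumption about the file layout
SAMPLE_XML = """
<COMMAND>
  <RESULTS>
    <CIM>
      <INSTANCE CLASSNAME="Win32_Process">
        <PROPERTY NAME="Caption" TYPE="string"><VALUE>System Idle Process</VALUE></PROPERTY>
        <PROPERTY NAME="CommandLine" TYPE="string"></PROPERTY>
        <PROPERTY NAME="HandleCount" TYPE="uint32"><VALUE>0</VALUE></PROPERTY>
      </INSTANCE>
      <INSTANCE CLASSNAME="Win32_Process">
        <PROPERTY NAME="Caption" TYPE="string"><VALUE>System</VALUE></PROPERTY>
        <PROPERTY NAME="CommandLine" TYPE="string"></PROPERTY>
        <PROPERTY NAME="HandleCount" TYPE="uint32"><VALUE>1234</VALUE></PROPERTY>
      </INSTANCE>
    </CIM>
  </RESULTS>
</COMMAND>
"""


def instances_to_records(root):
    """Return one {field name: value text} dict per INSTANCE element."""
    records = []
    for instance in root.iter("INSTANCE"):
        record = {}
        for prop in instance.iter("PROPERTY"):
            # findtext returns None when the PROPERTY has no VALUE child
            record[prop.get("NAME")] = prop.findtext("VALUE")
        records.append(record)
    return records


records = instances_to_records(ET.fromstring(SAMPLE_XML))
print(records)
```

Passing records to pd.DataFrame should then give one row per process, with missing values as None rather than the 'NO VALUE' placeholder.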

Answered By: the_good_pony