Windows processes XML file to pandas DataFrame?
Question:
I would like to convert the results of the following Windows command to a pandas DataFrame.
The raw data is generated with this command on a Windows machine:
wmic process get Caption, Processid, ParentProcessId, CommandLine,
CreationDate, KernelModeTime, UserModeTime, ThreadCount, HandleCount,
WorkingSetSize, PeakWorkingSetSize, VirtualSize, PeakVirtualSize,
PageFaults, PageFileUsage, PeakPageFileUsage, ReadOperationCount,
WriteOperationCount, OtherOperationCount /format:rawxml
I am parsing it with the following code:
import xml.etree.ElementTree as et
import pandas

with open("RunningProcess.xml") as praw:
    etree = et.parse(praw)
xroot = etree.getroot()
# Collect the NAME attribute of every PROPERTY element
nprop = []
for property in xroot.iter("PROPERTY"):
    xnames = property.get("NAME")
    nprop.append(xnames)
npropf = pandas.DataFrame(index=nprop)
rprows = []
data = []
inner = {}
# Build a one-row DataFrame per PROPERTY from its VALUE child
for child in xroot.iter("PROPERTY"):
    for gchild in child.iterfind('VALUE'):
        inner[gchild.tag] = gchild.text
    data.append(inner)
    rprows.append(pandas.DataFrame(data))
    data = []; inner = {}
finaldf = pandas.concat(rprows, sort=False).reset_index(drop=True)
finaldf.index = nprop
rpdfhtml = finaldf.to_html(index=True, header=True, border=1)
I get this result.
I would like to
- make the first 20 index rows into columns (Caption to WriteOperationCount)
- make the values column into rows instead
like this example:
[Image: first 9 columns of desired output]
Answers:
Welcome! This was an interesting question. This isn’t perfect, but hopefully it helps.
I wanted to try to avoid hard-coding any columns of interest.
Assumption: this file will have a predictable pattern of field names.
I used xml.etree.ElementTree, which I find to be a straightforward library.
import xml.etree.ElementTree as ET
import pandas as pd
Reference the XML file:
file = '/location/to/file/RunningProcess.xml'
Create a flattened DataFrame. I personally find this easier to work with than doing all of the reshaping while still inside the XML.
First, create a flattened list:
tree = ET.parse(file)
root = tree.getroot()
ls_processes = []
# iter() matches descendants at any depth, so PROPERTY elements nested below RESULTS are found
for COMMAND in root.iter('COMMAND'):
    for RESULTS in COMMAND.iter('RESULTS'):
        for PROPERTY in RESULTS.iter('PROPERTY'):
            VALUE = PROPERTY.find('VALUE')
            if VALUE is not None:
                print(PROPERTY.attrib['NAME'], '|', PROPERTY.attrib['TYPE'], '|', VALUE.text)
                ls_processes.append([PROPERTY.attrib['NAME'], PROPERTY.attrib['TYPE'], VALUE.text])
            else:
                # A PROPERTY without a VALUE child gets a placeholder so every field still yields a row
                print(PROPERTY.attrib['NAME'], '|', PROPERTY.attrib['TYPE'], '|', "NO VALUE")
                ls_processes.append([PROPERTY.attrib['NAME'], PROPERTY.attrib['TYPE'], 'NO VALUE'])
This will print output which looks a bit like this:
Caption | string | System Idle Process
CommandLine | string | NO VALUE
CreationDate | datetime | 20191002111400.978894+060
HandleCount | uint32 | 0
KernelModeTime | uint64 | 159488690156250
OtherOperationCount | uint64 | 0
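The middle column above is the TYPE attribute, and it could be used later to turn the value strings into real Python values. The following is only a sketch: convert_value is a hypothetical helper, not part of the original answer, and it assumes that the datetime strings always follow the layout shown in the sample (yyyymmddHHMMSS.ffffff followed by a signed UTC offset in minutes).

```python
from datetime import datetime, timedelta, timezone


def convert_value(cim_type, text):
    """Convert one VALUE string according to its TYPE attribute (hypothetical helper)."""
    # The flattening step stores 'NO VALUE' when a PROPERTY has no VALUE child
    if text is None or text == "NO VALUE":
        return None
    if cim_type in ("uint32", "uint64"):
        return int(text)
    if cim_type == "datetime":
        # Assumed layout: 21 characters of timestamp, then the offset in minutes
        stamp = datetime.strptime(text[:21], "%Y%m%d%H%M%S.%f")
        offset = timedelta(minutes=int(text[21:]))
        return stamp.replace(tzinfo=timezone(offset))
    # Strings and any unrecognised type are passed through unchanged
    return text


handle_count = convert_value("uint32", "0")
kernel_time = convert_value("uint64", "159488690156250")
created = convert_value("datetime", "20191002111400.978894+060")
command_line = convert_value("string", "NO VALUE")
print(handle_count, kernel_time, created, command_line)
```

Applied to the 'type' and 'value' columns created in the next steps, a helper like this would leave missing command lines as None instead of the 'NO VALUE' placeholder.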
Transform the list into a DataFrame:
df_processes = pd.DataFrame(ls_processes)
Rename the columns to make the DataFrame easier to work with:
df_processes.columns = ['data','type','value']
Create a list of columns of interest:
ls_columns = ['Caption', 'ProcessId', 'ParentProcessId', 'CommandLine', 'CreationDate', 'KernelModeTime', 'UserModeTime', 'ThreadCount', 'HandleCount', 'WorkingSetSize', 'PeakWorkingSetSize', 'VirtualSize', 'PeakVirtualSize', 'PageFaults', 'PageFileUsage', 'PeakPageFileUsage', 'ReadOperationCount', 'WriteOperationCount', 'OtherOperationCount']
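If typing the list by hand is undesirable, the field names could probably be derived from the flattened data instead. This is a minimal sketch on a small stand-in frame; df_sample and ls_columns_derived are illustrative names, and it assumes every field of interest appears at least once in the 'data' column.

```python
import pandas as pd

# Small stand-in for df_processes, with the same three column names
df_sample = pd.DataFrame(
    [
        ["Caption", "string", "System Idle Process"],
        ["HandleCount", "uint32", "0"],
        ["Caption", "string", "System"],
        ["HandleCount", "uint32", "1234"],
    ],
    columns=["data", "type", "value"],
)

# unique() keeps the order of first appearance, so the fields stay in file order
ls_columns_derived = df_sample["data"].unique().tolist()
print(ls_columns_derived)
```

The derived list would contain every field present in the file, so a hand-written list is still the simpler choice when only a subset is wanted.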
Create a single-column DataFrame for each column of interest:
ls_processes = []
for column in ls_columns:
    print(column)
    ls_row = []
    # Collect every value recorded for this field, in file order
    for index, row in df_processes.iterrows():
        if row['data'] == column:
            ls_row.append(row['value'])
    df = pd.DataFrame(ls_row)
    ls_processes.append(df)
Concatenate the DataFrames together by columns:
df_processes_flat = pd.concat(ls_processes, axis = 1 )
Add the column names using the list created previously:
df_processes_flat.columns = ls_columns
You’ll end up with a DataFrame with one column per field of interest and one row per process.
These steps possibly aren’t the most elegant, but hopefully it’s clear what’s going on.
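For comparison, the same long-to-wide reshaping can be sketched without the row-by-row loop, using groupby().cumcount() and pivot(). This is an alternative under the same assumption as above (every process lists every field, in the same order); df_sample, record and df_wide are illustrative names, and the sample rows are made up to keep the block self-contained.

```python
import pandas as pd

# Small stand-in for df_processes: two processes, three fields each
df_sample = pd.DataFrame(
    [
        ["Caption", "string", "System Idle Process"],
        ["ProcessId", "uint32", "0"],
        ["HandleCount", "uint32", "0"],
        ["Caption", "string", "System"],
        ["ProcessId", "uint32", "4"],
        ["HandleCount", "uint32", "1234"],
    ],
    columns=["data", "type", "value"],
)

# Number each repetition of a field name: 0 for the first process, 1 for the second, ...
df_sample["record"] = df_sample.groupby("data").cumcount()

# One row per record, one column per field name
df_wide = df_sample.pivot(index="record", columns="data", values="value")

# pivot() sorts the column labels, so restore the order in which the fields appeared
df_wide = df_wide[df_sample["data"].unique().tolist()]
df_wide.columns.name = None
print(df_wide)
```

If a process were missing a field, the counter would drift and values would land in the wrong row, so this shortcut is only as reliable as the "predictable pattern" assumption.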