extract and filter values from string list python

Question:

I have a list that looks like the one below. The part I want from each entry always starts with the special character "‘", so I was able to get just the errors with something like this:

a = [' 276ARDUINO_i2c.c:70:27: error: ‘ARDUINO_I2C_nI2C', ' 248rpy_i2c.h:76:40: error: ‘RPY_I2C_BASE_ADDR_LIST', ' 452rpy_i2c.c:79:77: error: ‘RPY_I2C_IRQ_LIST']
newlist = [x.split('‘')[1] for x in a]
print(newlist)

and the output would look like this

['ARDUINO_I2C_nI2C', 'RPY_I2C_BASE_ADDR_LIST', 'RPY_I2C_IRQ_LIST']  

But now I also need to get the name of the file related to each error. The file name always starts with a numeric substring that I also need to remove. The output I want would look like this:

   ['ARDUINO_i2c.c', 'ARDUINO_I2C_nI2C'], ['rpy_i2c.h', 'RPY_I2C_BASE_ADDR_LIST'], ['rpy_i2c.c','RPY_I2C_IRQ_LIST']

I’ll appreciate any suggestions. Thanks.

Asked By: pekoms


Answers:

You could use a regular expression to capture the required parts of your string. For example, the following regex:

\d+([^:]+):.*‘(.*)$

Explanation:
-----------
\d+                     : One or more digits
   (     )    (  )      : Capturing groups
    [^:]+               : One or more non-colon characters (in capturing group 1)
          :             : One colon
           .*           : Any number of any character
             ‘          : The ‘ character
               .*       : Any number of any character (in capturing group 2)
                  $     : End of string

To use it:

import re

regex = re.compile(r"\d+([^:]+):.*‘(.*)$")

newlist = [regex.search(s).groups() for s in a]

which gives a list of tuples:

[('ARDUINO_i2c.c', 'ARDUINO_I2C_nI2C'),
 ('rpy_i2c.h', 'RPY_I2C_BASE_ADDR_LIST'),
 ('rpy_i2c.c', 'RPY_I2C_IRQ_LIST')]

If you really want a list of lists, you can convert the result of .groups() to a list:

newlist = [list(regex.search(s).groups()) for s in a]
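One caveat with the comprehension: regex.search() returns None for a line that does not match, so calling .groups() on it raises AttributeError. Below is a minimal sketch of one way to skip such lines. The function name extract_pairs and the extra "make: ..." line are made up for illustration, and the digit class is written as \d+.

```python
import re

# Pattern from this answer: leading digits, file name up to the first colon,
# then everything after the ‘ character.
pattern = re.compile(r"\d+([^:]+):.*‘(.*)$")

def extract_pairs(lines):
    """Return [file_name, symbol] lists, skipping lines that do not match."""
    pairs = []
    for line in lines:
        match = pattern.search(line)
        if match is None:
            # Not an error line in the expected shape; ignore it.
            continue
        pairs.append(list(match.groups()))
    return pairs

a = [' 276ARDUINO_i2c.c:70:27: error: ‘ARDUINO_I2C_nI2C',
     ' 248rpy_i2c.h:76:40: error: ‘RPY_I2C_BASE_ADDR_LIST',
     ' 452rpy_i2c.c:79:77: error: ‘RPY_I2C_IRQ_LIST']

# Hypothetical unrelated line mixed into the input.
sample = a + ['make: *** [all] Error 1']

print(extract_pairs(sample))
```

Whether skipping silently is right depends on your data; raising or logging on unmatched lines may be preferable if every line is expected to match.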
Answered By: Pranav Hosangadi

I wrote this code to produce exactly the result you want, though there may be more efficient ways. It splits each string and uses a regex to strip the leading digits from the file name.

import re
a = [' 276ARDUINO_i2c.c:70:27: error: ‘ARDUINO_I2C_nI2C', '248rpy_i2c.h:76:40: error: ‘RPY_I2C_BASE_ADDR_LIST', ' 452rpy_i2c.c:79:77: error: ‘RPY_I2C_IRQ_LIST']
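r = []
for x in a:
    # Split into the part before ": error: ‘" and the symbol name after it
    d = x.split(": error: ‘")
    # Keep the text before the first colon, then remove only the leading digits
    name = re.sub(r"^[0-9]+", "", d[0].split(":")[0].strip())
    r.append([name, d[1]])
print(r)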
Answered By: Jeson Pun

This is not easy to do in a list comprehension. It’s better to use a for loop here.

Like this:

# Your data
a = [' 276ARDUINO_i2c.c:70:27: error: ‘ARDUINO_I2C_nI2C', ' 248rpy_i2c.h:76:40: error: ‘RPY_I2C_BASE_ADDR_LIST', ' 452rpy_i2c.c:79:77: error: ‘RPY_I2C_IRQ_LIST']

# A list to hold either lists or dicts
new = []

# For loop
for i in a: 
    
    # We can split on ': ' because it appears consistently in all the data.
    # The only problem is that we also get the word 'error', so we ignore it by assigning it to '_'.
    # The strings also have a space at the start, so .strip() removes it first.

    name, _, error = i.strip().split(': ')

    # You don't need the number at the start of the name, so we use .lstrip() and pass it all the digits.

    name = name.lstrip('0123456789') # lstrip() removes every leading character that appears in the passed string.

    # If you want list
    new.append([name, error])

    # Or if you want dict -> uncomment below & comment above
##    new.append({name: error}) 

print(new)

# output:

[['ARDUINO_i2c.c:70:27', '‘ARDUINO_I2C_nI2C'], ['rpy_i2c.h:76:40', '‘RPY_I2C_BASE_ADDR_LIST'], ['rpy_i2c.c:79:77', '‘RPY_I2C_IRQ_LIST']]
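
The output above still carries the line and column numbers after the file name and the leading ‘ on the error name. A possible variation on the same split-based idea that trims both is sketched below. The helper name parse_line is made up for illustration, and it assumes every string splits into exactly three parts on ': '.

```python
a = [' 276ARDUINO_i2c.c:70:27: error: ‘ARDUINO_I2C_nI2C',
     ' 248rpy_i2c.h:76:40: error: ‘RPY_I2C_BASE_ADDR_LIST',
     ' 452rpy_i2c.c:79:77: error: ‘RPY_I2C_IRQ_LIST']

def parse_line(line):
    """Return [file_name, symbol] for one line (hypothetical helper)."""
    # 'location' is e.g. '276ARDUINO_i2c.c:70:27'; the middle part is 'error'.
    location, _, symbol = line.strip().split(': ')
    # Keep only the text before the first colon, then drop the leading digits.
    file_name = location.split(':')[0].lstrip('0123456789')
    # Drop the leading ‘ from the symbol name.
    return [file_name, symbol.lstrip('‘')]

result = [parse_line(s) for s in a]
print(result)
```

Note that lstrip('0123456789') cannot tell the numeric prefix apart from a file name that itself begins with a digit; with this input format that ambiguity is unavoidable.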
Answered By: Prabhas Kumar