Regex for extracting the keyboard keys from strings with keyboard shortcuts

Question:

This should be very simple, but maybe it is just not my day: I cannot come up with the right code.

Below is one of my many attempts:

import re
regex = r"(.*?)\+((.*?)\+)*(.*$)"
print(
  re.findall(regex, 'Alt+Super+Right'),
  re.findall(regex, 'Alt+Super+='),
  re.findall(regex, 'Shift+Super++'),
  sep='\n')

which instead of:

['Alt','Super','Right']
['Alt','Super','=']
['Shift','Super','+']

prints:

[('Alt', 'Super+', 'Super', 'Right')]
[('Alt', 'Super+', 'Super', '=')]
[('Shift', '+', '', '')]

I have no idea why it does not work, especially in the last case. Any hints to put me on the right track?


UPDATE: the question needs to cover special cases on which a regex can fail. The regex below, taken from one of the answers to my question, works for the cases listed above but not for the following ones:

import re
regex = r"([^+]+|.+)\+?"
print(
  re.findall(regex, '++Super+Alt'),
  re.findall(regex, 'Super+++Alt'),
  sep='\n'
)

giving:

['++Super+Alt']
['Super', '++Alt']

instead of:

['+', 'Super', 'Alt']
['Super', '+', 'Alt']

This quite strange notation of keyboard shortcuts is how the shortcuts are shown in the Keyboard dialog of Linux Mint Cinnamon:

[Image: screenshot of the Keyboard dialog]

I suppose that the + will always be listed as the last one and never somewhere between. So there will actually be no Super+++Alt, but the notation as such suggests that it would ‘allow’ it.
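
If that assumption holds (a literal + key only ever appears as the last key), no regex is needed at all. A minimal sketch under exactly that assumption; the function name split_plus_last is made up for illustration:

```python
def split_plus_last(shortcut):
    """Split a shortcut, ASSUMING a literal '+' key can only be the last key."""
    if shortcut == '+':
        return ['+']                      # the '+' key on its own
    if shortcut.endswith('++'):
        # separator '+' followed by the '+' key: cut both off, split the rest
        return shortcut[:-2].split('+') + ['+']
    return shortcut.split('+')            # no '+' key involved

print(split_plus_last('Alt+Super+Right'))  # ['Alt', 'Super', 'Right']
print(split_plus_last('Shift+Super++'))    # ['Shift', 'Super', '+']
```

Strings such as Super+++Alt are outside the assumption and are not handled by this sketch.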

A textual dump of the settings database shows that this notation is not used there. The dump uses another one, in which <Primary><Super>Left means the left Windows key along with the arrow key, and Shift+Super++ appears as <Shift><Super>plus.
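
The dump notation is unambiguous, because the modifiers are wrapped in angle brackets and the + key is spelled plus. A small sketch of splitting it, based only on the two examples above; the function name split_dump_notation and the assumption that every modifier is bracketed are not from the original:

```python
import re

def split_dump_notation(binding):
    """Split e.g. '<Shift><Super>plus' into ['Shift', 'Super', 'plus'].

    ASSUMPTION: modifiers are always written as <Name> and the final key
    follows without brackets, as in the two examples from the dump.
    """
    # a token is either a bracketed modifier or a run of non-bracket chars
    tokens = re.findall(r"<[^<>]+>|[^<>]+", binding)
    return [token.strip('<>') for token in tokens]

print(split_dump_notation('<Primary><Super>Left'))  # ['Primary', 'Super', 'Left']
print(split_dump_notation('<Shift><Super>plus'))    # ['Shift', 'Super', 'plus']
```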

Asked By: Claudio

||

Answers:

One of the problems is that ((.*?)\+)* only captures the last iteration of the repetition created by the final *. Secondly, you should require at least one character to be captured before the +. Also be aware that findall returns a result for each capture group, so make sure to have only one, as you expect only one result per part.
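
Both effects can be seen in isolation with a few lines. This is only an illustration with simplified patterns (\w+ instead of the patterns from the question); the variable names are arbitrary:

```python
import re

# A repeated group keeps only the text of its LAST iteration.
m = re.match(r"(?:(\w+)\+)*", "Alt+Super+")
last_iteration = m.group(1)
print(last_iteration)   # Super   ('Alt' was overwritten)

# With several groups, findall returns one tuple per match ...
two_groups = re.findall(r"(\w+)\+(\w+)", "Alt+Super")
print(two_groups)       # [('Alt', 'Super')]

# ... with exactly one group it returns a flat list of strings.
one_group = re.findall(r"(\w+)", "Alt+Super")
print(one_group)        # ['Alt', 'Super']
```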

You could use this regex:

regex = r"(.[^+]*)\+?"

So:

  • . – The first character in the capture group can be any character (so also a +)
  • [^+]* – The captured group can include more characters as long as they are not +
  • \+? – If there is a + that follows the capture group, then match it before continuing with a next match. This would be the case when not yet at the end of the input.
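
Applied to the strings from the question, this pattern can be tried out as follows. The wrapper function split_keys is only there for illustration:

```python
import re

def split_keys(shortcut):
    # one capture group only, so findall returns a flat list of keys
    return re.findall(r"(.[^+]*)\+?", shortcut)

print(
  split_keys('Alt+Super+Right'),   # ['Alt', 'Super', 'Right']
  split_keys('Alt+Super+='),       # ['Alt', 'Super', '=']
  split_keys('Shift+Super++'),     # ['Shift', 'Super', '+']
  split_keys('++Super+Alt'),       # ['+', 'Super', 'Alt']
  split_keys('Super+++Alt'),       # ['Super', '+', 'Alt']
  sep='\n')
```
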
Answered By: trincot

If you are fine with not using a regex, below is a function which does the job:

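def splitKeybShortcut(keybShortcut):
    # Note: well-formed input is assumed. An empty string or a trailing
    # separator (e.g. 'Super+') raises an IndexError.
    lstKeys = []
    max_i = len(keybShortcut)
    i = 1
    key = keybShortcut[0]  # the first char always starts a key, even a '+'
    while i < max_i:
        chr_i = keybShortcut[i]
        if chr_i != '+':
            key += chr_i  # still inside the current key name
        else:
            lstKeys.append(key)  # this '+' is a separator: the key is complete
            i += 1
            key = keybShortcut[i]  # the char after a separator starts the next key
        i += 1
    if key:
        lstKeys.append(key)
    return lstKeys
# -------------------------------------
print(splitKeybShortcut('++Super+Alt'))  # ['+', 'Super', 'Alt']
print(splitKeybShortcut('Super+++Alt'))  # ['Super', '+', 'Alt']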

The advantage of this approach over a regular expression is that you have more control over what is done and how. A regex ‘hides’ the machinery under the hood: you must trust that it will always do what you expect, with the risk that your understanding of what it does and how is wrong.

To demonstrate this, below is some more code, which I feed with malformed keyboard shortcut definition strings:

print( "=====================")
import re
regex = r"(.[^+]*)\+?"
regexExplanation = r"""
  "(.[^+]*)\+?"
   (.[^+]*)     1st capturing group (.[^+]*)
    .           matches exactly one char, which can be anything (except a newline)
     [^+]       matches ANY single character EXCEPT a '+'
         *        ^--- zero or more times, as many as possible, giving back as needed (greedy)
           \+   matches the character '+' with code 43 decimal (2B hex, 53 octal) literally
             ?  matches the previous token between zero and one times (greedy)
It matches a leading '+', if any, then finds match after match for the
capturing group pattern. The optional '\+?' that follows consumes one
separating '+' if there is one, and matches nothing at the end of the string.
"""

print(
  re.findall(regex, '+++Super+Alt'),
  re.findall(regex, '++Super++Alt'),
  re.findall(regex, 'Super+Alt+++'),
  re.findall(regex, 'Super+++Alt'),
  sep='\n'
)

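print( "================================")
def splitKeybShortcut(keybShortcut):
    lstKeys = []
    max_i = len(keybShortcut)
    i = 1
    key = keybShortcut[0]
    while i < max_i:
        # A key that starts with '+' and grows beyond one char means that
        # there is one '+' too many (or a separator is missing).
        if len(key)>1 and key[0] =='+':
            return f'''
Malformed Shortcut: {keybShortcut} 
''' + f'''
                 {i*' '}--^'''
        chr_i = keybShortcut[i]
        if chr_i != '+':
            key += chr_i
        else:
            lstKeys.append(key)
            i+=1
            if i < max_i:
                key = keybShortcut[i]
            else:
                # The string ends with a separator; the check after the
                # loop catches the '+++' ending. Note that a single
                # trailing separator as in 'Super+' is not detected.
                break
        i += 1
    # A '+' key directly after another '+' key at the end is malformed.
    # lstKeys is empty for a one-character shortcut, hence the guard.
    if lstKeys and key == '+' == lstKeys[-1]:
            return f'''
Malformed Shortcut: {keybShortcut} 
''' + f'''
                 {i*' '}--^'''
    else:
        lstKeys.append(key)
    return lstKeys
# -------------------------------------
print(
  splitKeybShortcut('+++Super+Alt'),
  splitKeybShortcut('++Super++Alt'),
  splitKeybShortcut('Super+Alt+++'),
  splitKeybShortcut('Super+++Alt'),
  sep='\n'
)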

which prints:

=====================
['+', '+Super', 'Alt']
['+', 'Super', '+Alt']
['Super', 'Alt', '+']
['Super', '+', 'Alt']
================================
Malformed Shortcut: +++Super+Alt 
                     --^
Malformed Shortcut: ++Super++Alt 
                           --^
Malformed Shortcut: Super+Alt+++ 
                             --^
['Super', '+', 'Alt']

Maybe it is possible to tweak the regex, along with the code using it, so that it detects problems such as typos in the shortcut strings and points exactly to where they occur, but I don’t see a way that is as easy as writing your own function. Anyway, even a short regex usually needs a long explanation, because it is not as easy to read and understand as a simple flow of code with a loop and if statements.
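
For comparison, one possible way to get error positions with a regex is to let it match one key at a time and to track the position in a small loop. This is a rough sketch, not a drop-in replacement for the function above: the names KEY and split_or_locate_error are invented here, and the rule for what counts as malformed (a key must be followed by a + separator or by the end of the string) is an assumption.

```python
import re

# One key: either a lone '+' or a run of characters other than '+'.
KEY = re.compile(r"\+|[^+]+")

def split_or_locate_error(shortcut):
    """Return (keys, None) if well-formed, else (keys_so_far, error_index)."""
    keys, pos = [], 0
    while True:
        m = KEY.match(shortcut, pos)
        if m is None:                 # a key was expected, but the string ended
            return keys, pos
        keys.append(m.group())
        pos = m.end()
        if pos == len(shortcut):      # everything consumed: well-formed
            return keys, None
        if shortcut[pos] != '+':      # a separator was expected here
            return keys, pos
        pos += 1                      # skip the separator

for s in ('Super+++Alt', '+++Super+Alt', 'Super+Alt+++'):
    keys, err = split_or_locate_error(s)
    if err is None:
        print(keys)
    else:
        # 20 == len('Malformed Shortcut: '), so the caret lands under index err
        print(f"Malformed Shortcut: {s}\n{' ' * (20 + err)}^")
```

The loop and the if statements are still there, so this mostly supports the point made above: the regex only describes what a single key looks like, and the error handling stays in plain code.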

In an attempt to make regex patterns more readable, the Python re module comes with the flag re.X == re.VERBOSE (see comment by AKX below). Using this flag, it is possible to explain the regular expression pattern from ‘inside’:

import re
regex = re.compile(r"""
# Below how the regex engine will work given the regex pattern
#                        r"(.[^+]*)\+?"
  (        # remember the begin of a capturing group
    .      # look for exactly one char of any kind (except a newline)
    [^+]   # a list of characters: any character except a plus sign '+'
           # ('^' means here NOT); the list must stay on one line because
           # re.X keeps whitespace and '#' inside it as literal characters
    *      # ... any number of times (including zero), that is
           # until bumping into a '+' char or the end of the string
  )        # end of the capturing group
           # if the group pattern was found, add it to the list of matches
  \+       # continue with the next character and look for a plus sign '+'
  ?        # skip it if it is there, but it is also OK if it is not
           # then restart the search for a match with the capturing group,
           # which means: look for exactly one char of any kind
           # ... and so on ...
""", re.X)
print(
  re.findall(regex, '+++Super+Alt'),
  re.findall(regex, '++Super++Alt'),
  re.findall(regex, 'Super+Alt+++'),
  re.findall(regex, 'Super+++Alt' ),
  sep='\n'
)

Looking at the code above, you can judge for yourself whether the regular expression pattern is more readable this way or whether it gets ‘buried’ under the huge volume of explanation text.

Answered By: Claudio