Regex match all words except those between quotes

Question:

In this example I want to select all words, except those between quotes (i.e. "results", "items", "packages", "settings" and "build_type", but not "compiler.version").

results[0].items[0].packages[0].settings["compiler.version"] 
results[0].items[0].packages[0].settings.build_type

Here’s what I know: I can target all words with

[a-z_]+

and then target what’s in between quotes with this:

(?<=")[\w.]+(?=")

Is there any way to match the difference between the results of the first and second regex? (i.e. words except if they are surrounded by double quotes)

Here’s a regex playground with the example for convenience.

Asked By: jlo


Answers:

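You can match double-quoted strings without capturing them, and capture words (optionally followed by dot-separated words) in a group. re.findall then returns an empty string for every quoted match, and filter(None, ...) discards those: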

list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I)))

See the regex demo. Details:

  • "[^"]*" – a " char, zero or more chars other than " and then a " char
  • | – or
  • ([a-z_]\w*(?:\.[a-z_]\w*)*) – Group 1: a letter or underscore followed by zero or more word chars and then zero or more sequences of a . and then a letter or underscore followed by zero or more word chars.

See the Python demo:

import re
text = 'results[0].items[0].packages[0].settings["compiler.version"] '
print(list(filter(None, re.findall(r'"[^"]*"|([a-z_]\w*(?:\.[a-z_]\w*)*)', text, re.ASCII | re.I))))
# => ['results', 'items', 'packages', 'settings']

The re.ASCII option is used to make \w match [a-zA-Z0-9_] without accounting for Unicode chars.

Answered By: Wiktor Stribiżew

Here is a simpler version which works with the example you provided.

(?<!")\b[a-z_]+\b(?!")

Here’s a demo

Edit: This does work for the example you provided. However, it has some flaws because it only avoids matching words that are touching a ". Therefore, if you have several words within the quotes, it will match any inner words that are not touching a ".
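
The flaw described in the edit can be reproduced with a short Python sketch. The two input strings are hypothetical, made up for illustration (they are not from the question), and Python's re module is assumed as the regex engine:

```python
import re

# Pattern from this answer: skip words that touch a double quote.
pattern = re.compile(r'(?<!")\b[a-z_]+\b(?!")')

# A single word inside the quotes is skipped as intended.
print(pattern.findall('settings["version"].build_type'))
# => ['settings', 'build_type']

# Hypothetical input with three words inside the quotes: the middle
# word touches neither quote, so it is matched by mistake.
print(pattern.findall('settings["gcc compiler version"]'))
# => ['settings', 'compiler']
```

The lookarounds only inspect the characters immediately next to the word, so they cannot tell whether the word sits inside a longer quoted string.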

Working on improving this solution and will edit this post if new updates develop.

Answered By: user17038038

A word is not within a double-quoted substring if and only if it is followed in the string by an even number of double-quotes (assuming the string is properly formatted and therefore contains an even number of double-quotes). You can use the following regular expression to match words that are not contained within double-quoted substrings.

[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*$)

Demo

The regular expression can be broken down as follows (alternatively, hover the cursor over each part of the expression at the link to obtain an explanation of its function).

[a-z_]+         # match one or more of the indicated characters
(?=             # begin a positive lookahead
  (?:           # begin an outer non-capture group
    (?:         # begin an inner non-capture group
      [^"\n]*   # match zero or more characters other than " and \n
      "         # match "
    ){2}        # end inner non-capture group and execute twice
  )*            # end outer non-capture group and execute zero or more times
  [^"\n]*       # match zero or more characters other than " and \n
  $             # match end of string
)               # end positive lookahead
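
As a rough check, the expression can be run in Python against the two lines from the question. This is only a sketch: it assumes Python's re module and feeds each line separately, so that $ refers to the end of that line without needing the multiline flag.

```python
import re

# Even-quote lookahead from this answer: a word matches only if an
# even number of double quotes follows it on the same line.
pattern = re.compile(r'[a-z_]+(?=(?:(?:[^"\n]*"){2})*[^"\n]*$)')

lines = [
    'results[0].items[0].packages[0].settings["compiler.version"]',
    'results[0].items[0].packages[0].settings.build_type',
]
for line in lines:
    print(pattern.findall(line))
# => ['results', 'items', 'packages', 'settings']
# => ['results', 'items', 'packages', 'settings', 'build_type']
```

If both lines were held in a single string, the re.MULTILINE flag would be needed so that $ also matches before each newline.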
Answered By: Cary Swoveland