How to do a CamelCase split in Python

Question

What I was trying to achieve was something like this:

>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']

So I searched and found this perfect regular expression:

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

As the next logical step I tried:

>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']

Why does this not work, and how do I achieve the result from the linked question in Python?

Edit: Solution summary

I tested all provided solutions with a few test cases:

string:                 ''
AplusKminus:            ['']
casimir_et_hippolyte:   []
two_hundred_success:    []
kalefranz:              string index out of range # with modification: either [] or ['']

string:                 ' '
AplusKminus:            [' ']
casimir_et_hippolyte:   []
two_hundred_success:    [' ']
kalefranz:              [' ']

string:                 'lower'
all algorithms:         ['lower']

string:                 'UPPER'
all algorithms:         ['UPPER']

string:                 'Initial'
all algorithms:         ['Initial']

string:                 'dromedaryCase'
AplusKminus:            ['dromedary', 'Case']
casimir_et_hippolyte:   ['dromedary', 'Case']
two_hundred_success:    ['dromedary', 'Case']
kalefranz:              ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']

string:                 'CamelCase'
all algorithms:         ['Camel', 'Case']

string:                 'ABCWordDEF'
AplusKminus:            ['ABC', 'Word', 'DEF']
casimir_et_hippolyte:   ['ABC', 'Word', 'DEF']
two_hundred_success:    ['ABC', 'Word', 'DEF']
kalefranz:              ['ABCWord', 'DEF']

In summary, the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, which violates the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is the result for an empty input string: my solution returns a list containing the empty string, while the solution by @200_success returns an empty list.
I don’t know where the Python community stands on that issue, so I am fine with either one. And since 200_success’s solution is simpler, I accepted it as the correct answer.

Asked By: AplusKminus

Answer 1

The documentation for Python’s re.split (in versions before 3.7) says:

Note that split will never split a string on an empty pattern match.

When seeing this:

>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']

it becomes clear why the split does not work as expected. The re module finds empty matches, just as intended by the regular expression.

Since the documentation states that this is not a bug, but rather intended behavior, you have to work around that when trying to create a camel case split:

from re import finditer

def camel_case_split(identifier):
    matches = finditer(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
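
A side note on versions: the restriction quoted above applies to Python before 3.7. From Python 3.7 on, re.split() does split on empty matches, so the pattern from the question can be passed to it directly. The following is only a minimal sketch, assuming Python 3.7 or later; the names _BOUNDARY and camel_case_split_37 are made up for illustration.

```python
import re

# Zero-width boundaries: lower->UPPER, or UPPER followed by an "Ul" pair.
_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def camel_case_split_37(identifier):
    # Assumption: Python >= 3.7, where re.split accepts empty matches.
    return _BOUNDARY.split(identifier)

print(camel_case_split_37("CamelCaseXYZ"))  # ['Camel', 'Case', 'XYZ']
print(camel_case_split_37("XYZCamelCase"))  # ['XYZ', 'Camel', 'Case']
```

Like the finditer-based function, this returns [''] for an empty input string.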

Answered By: AplusKminus

Answer 2

Here’s another solution that requires less code and no complicated regular expressions:

def camel_case_split(string):
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]

Edit

The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like

def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()

    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr

Answered By: kalefranz

Answer 3

As @AplusKminus has explained, re.split() never splits on an empty pattern match. Therefore, instead of splitting, you should try finding the components you are interested in.

Here is a solution using re.finditer() that emulates splitting:

from re import finditer

def camel_case_split(identifier):
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]
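
To see how this behaves on the edge cases listed in the question's summary, a small check like the one below can be run. It simply repeats the function above and prints its result for a few of those inputs; the choice of inputs is mine.

```python
from re import finditer

def camel_case_split(identifier):
    # Lazily consume characters up to the next case boundary or end of string.
    matches = finditer(r'.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

for s in ['', ' ', 'lower', 'dromedaryCase', 'ABCWordDEF']:
    print(repr(s), '->', camel_case_split(s))
```

Because .+? needs at least one character, an empty input produces no match at all, which is why the result is [] rather than [''].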

Answered By: 200_success

Answer 4

Most of the time, when you don’t need to check the format of a string, a global search is simpler than a split (for the same result):

re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')

returns

['Camel', 'Case', 'XYZ']

To deal with dromedary too, you can use:

re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')

Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])
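
As a quick sanity check of that note, the two spellings can be compared side by side. This is just a sketch on a few sample inputs I picked (the constant names are illustrative), not a proof that the patterns agree on every string.

```python
import re

WITH_ALTERNATION = r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)'
WITH_DOUBLE_NEGATION = r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])'

for s in ['camelCaseXYZ', 'XYZCamelCase', 'CamelCase']:
    a = re.findall(WITH_ALTERNATION, s)
    b = re.findall(WITH_DOUBLE_NEGATION, s)
    # Print the words found and whether both patterns agree.
    print(s, a, a == b)
```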

Answered By: Casimir et Hippolyte

Answer 5

Use re.sub() and split()

import re

name = 'CamelCaseTest123'
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()

Result

'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
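
To make the two passes easier to follow, here is the same idea unrolled into a small function. It is a sketch only; the function name split_with_sub and the intermediate variable names are hypothetical.

```python
import re

def split_with_sub(name):
    # Pass 1: put a space in front of every run of upper-case letters.
    spaced_runs = re.sub('([A-Z]+)', r' \1', name)
    # Pass 2: put a space in front of every capitalized word, which also
    # detaches a word from an upper-case run directly before it.
    spaced_words = re.sub('([A-Z][a-z]+)', r' \1', spaced_runs)
    # split() without arguments drops the repeated spaces.
    return spaced_words.split()

for name in ['CamelCaseTest123', 'XYZCamelCase', 'IPAddress']:
    print(name, '->', split_with_sub(name))
```

Note that split() also splits on any whitespace already present in the input, so this variant does not preserve spaces.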

Answered By: Jossef Harush Kadouri

Answer 6

I just stumbled upon this case and wrote a regular expression to solve it. It should work for any group of words, actually.

import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |  # Capitalized words / all lower case
    [A-Z]+ |  # All upper case
    \d+  # Numbers
''', re.VERBOSE)

The key here is the lookahead on the first possible case. It will match (and preserve) uppercase words before capitalized ones:

assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
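
A few more inputs show how the alternatives interact. The sketch below repeats the pattern so that it runs on its own; the sample strings are my own choice.

```python
import re

RE_WORDS = re.compile(r'''
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)

for s in ['FOOBar', 'fooBar42Baz', 'HTTPResponseCode']:
    print(s, '->', RE_WORDS.findall(s))
```

Since findall only returns what matches, characters outside letters and digits (spaces, underscores, punctuation) are silently dropped.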

Answered By: emyller

Answer 7

I think the solution below is optimal:

import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('please enter your string'))

print(count_word())

Answered By: Ahmoody

Answer 8

I know the question is tagged regex, but I always try to stay as far away from regex as possible. So, here is my solution without regex:

def split_camel(text, char):
    if len(text) <= 1: # To avoid adding a wrong space in the beginning
        return text+char
    if char.isupper() and text[-1].islower(): # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else: # Do nothing part
        return text + char

from functools import reduce  # needed on Python 3, harmless on Python 2

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"

EDIT:
As suggested, here is the code to put the functionality in a single function.

from functools import reduce

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1: # To avoid adding a wrong space in the beginning
            return text+char
        if char.isupper() and text[-1].islower(): # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ": # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else: # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']

Answered By: thiruvenkadam

Answer 9

Putting a more comprehensive approach out there. It takes care of several issues such as numbers, strings starting with a lower-case letter, single-letter words, etc.

import re

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)

        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words
    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words

Answered By: datarpit

Answer 10

My requirements were a bit more specific than the OP’s. In addition to handling all of the OP’s cases, I needed the following, which the other solutions do not provide:
– treat all non-alphanumeric input (e.g. !@#$%^&*() etc.) as a word separator
– handle digits as follows:
    – cannot be in the middle of a word
    – cannot be at the beginning of a word unless the phrase starts with a digit

import re

def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',                  # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', r'\1 \2',           # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])', r'\1 \2',         # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', r'\1 \2', # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]

Output:

for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))

:[]
 :[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']

Answered By: mwag

Answer 11

Working solution, without regexp

I am not that good at regexp. I like to use them for search/replace in my IDE but I try to avoid them in programs.

Here is a quite straightforward solution in pure python:

def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]

And some tests

TESTS = [
    ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
    ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
    ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ("XYZCamelCaseXYZ", ['XYZ', 'Camel', 'Case', 'XYZ']),
    ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
    ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
    ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
    ("Ta", ['Ta']),
    ("aT", ['a', 'T']),
    ("a", ['a']),
    ("T", ['T']),
    ("", []),
]

def test():
    for (q,a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()

Edit: a solution which streams data in one pass

This solution leverages the fact that the decision whether to split a word can be made locally, by considering only the current character and the previous one.

def camel_case_split(s):
    u = True  # case of previous char
    w = b = ''  # current word, buffer for last uppercase letter
    for c in s:
        o = c.isupper()
        if u and o:
            w += b
            b = c
        elif u and not o:
            if len(w)>0:
                yield w
            w = b + c
            b = ''
        elif not u and o:
            yield w
            w = ''
            b = c
        else:  # not u and not o:
            w += c
        u = o
    if len(w)>0 or len(b)>0:  # flush
        yield w + b

In theory, it is faster and uses less memory.

The same test suite applies, but the list must be built by the caller:

def test():
    for (q,a) in TESTS:
        r = list(camel_case_split(q))
        print(q,a,r)
        assert r == a

Answered By: Setop

Answer 12

This solution also supports numbers and spaces, and automatically removes underscores:

import re

def camel_terms(value):
    return re.findall(r'[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)

Some tests:

tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]

for test in tests:
    print(test, "=>", camel_terms(test))

results:

XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']

Answered By: mnesarco

Answer 13

Simple solution:

re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
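
For illustration, here is what that substitution produces, written with the backreferences \1 and \2. The wrapper name space_camel_case is hypothetical. Note the last line: a leading run of capitals stays attached to the following word, so this does not cover the XYZCamelCase case from the question.

```python
import re

def space_camel_case(text):
    # Insert a space between a lower-case letter or digit and an upper-case letter.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))

print(space_camel_case("CamelCaseXYZ"))          # Camel Case XYZ
print(space_camel_case("CamelCaseXYZ").split())  # ['Camel', 'Case', 'XYZ']
print(space_camel_case("XYZCamelCase").split())  # ['XYZCamel', 'Case']
```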

Answered By: vbfh

Answer 14

import re

re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result (on Python 3.7+, where re.split splits on empty matches)

# '(?<=[a-z])'         --> means preceding lowercase char (group A)
# '(?=[A-Z])'          --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on

Answered By: endusol

Answer 15

Maybe this will be enough for some people:

a = "SomeCamelTextUpper"
def camelText(val):
    return ''.join([' ' + i if i.isupper() else i for i in val]).strip()
print(camelText(a))

It doesn’t work with strings like "CamelXYZ", but for the ‘typical’ CamelCase scenario it should work just fine.
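
To make that limitation concrete, the snippet below repeats the function and runs it on both kinds of input. Every capital letter gets its own leading space, so a run of capitals is broken into single letters.

```python
def camelText(val):
    # Prefix each upper-case character with a space, then trim the ends.
    return ''.join([' ' + i if i.isupper() else i for i in val]).strip()

print(camelText("SomeCamelTextUpper"))  # Some Camel Text Upper
print(camelText("CamelXYZ"))            # Camel X Y Z
```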

Answered By: Adrian Najmrodzki

Answer 16

Based on @Setop’s answer, I added support for numbers, whitespaces, underscores and dots:

from typing import Iterable


def _camel_case_split_iter(string: str) -> Iterable[str]:
    previous_char_upper = True
    previous_char_digit = True
    curr_word = ""
    upper_buffer = ""  # buffer for last uppercase letter
    for c in string:
        curr_char_upper = c.isupper()
        curr_char_digit = c.isdigit()
        if c.isspace() or c in ["_", "."]:
            if len(curr_word) > 0 or len(upper_buffer) > 0:
                yield curr_word + upper_buffer
                curr_word = upper_buffer = ""
        elif previous_char_upper and curr_char_upper:
            curr_word += upper_buffer
            upper_buffer = c
        elif previous_char_upper and not curr_char_upper and not curr_char_digit:
            if len(curr_word) > 0:
                yield curr_word
            curr_word = upper_buffer + c
            upper_buffer = ""
        elif not previous_char_upper and curr_char_upper:
            if len(curr_word) > 0:
                yield curr_word
                curr_word = ""
            upper_buffer = c
        elif (not previous_char_digit and curr_char_digit) or (previous_char_digit and not curr_char_digit):
            if len(curr_word) > 0 or len(upper_buffer) > 0:
                yield curr_word + upper_buffer
                upper_buffer = ""
            curr_word = c
        else:
            curr_word += c
        previous_char_upper = curr_char_upper
        previous_char_digit = curr_char_digit
    if len(curr_word) > 0 or len(upper_buffer) > 0:  # flush
        yield curr_word + upper_buffer


def camel_case_split(string: str) -> list[str]:
    """
    Split CamelCase string to words.

    >>> camel_case_split("XYZCamelCaseXYZ")
    ['XYZ', 'Camel', 'Case', 'XYZ']
    >>> camel_case_split("Ta")
    ['Ta']
    >>> camel_case_split("aT")
    ['a', 'T']
    >>> camel_case_split("_aAa_bBb__CCC__")
    ['a', 'Aa', 'b', 'Bb', 'CCC']
    >>> camel_case_split("10Camel20CaseXYZ30")
    ['10', 'Camel', '20', 'Case', 'XYZ', '30']
    >>> camel_case_split(" CamelCase camel case ")
    ['Camel', 'Case', 'camel', 'case']
    """
    return list(_camel_case_split_iter(string))

All tests:

import pytest


@pytest.mark.parametrize(
    "string,expected",
    [
        ("XYZCamelCase", ["XYZ", "Camel", "Case"]),
        ("CamelCaseXYZ", ["Camel", "Case", "XYZ"]),
        ("CamelCaseXYZa", ["Camel", "Case", "XY", "Za"]),
        ("XYZCamelCaseXYZ", ["XYZ", "Camel", "Case", "XYZ"]),
        ("aCamelCaseWordT", ["a", "Camel", "Case", "Word", "T"]),
        ("CamelCaseWordT", ["Camel", "Case", "Word", "T"]),
        ("CamelCaseWordTa", ["Camel", "Case", "Word", "Ta"]),
        ("aCamelCaseWordTa", ["a", "Camel", "Case", "Word", "Ta"]),
        ("Ta", ["Ta"]),
        ("aT", ["a", "T"]),
        ("a", ["a"]),
        ("T", ["T"]),
        ("", []),
        ("A_B", ["A", "B"]),
        ("a_b", ["a", "b"]),
        ("Camel_CaseXYZ", ["Camel", "Case", "XYZ"]),
        ("aAa_bBb", ["a", "Aa", "b", "Bb"]),
        ("aAaTTT_b", ["a", "Aa", "TTT", "b"]),
        ("__CCcCccc__DDD__eee_fGG__", ["C", "Cc", "Cccc", "DDD", "eee", "f", "GG"]),
        ("__a", ["a"]),
        ("__A", ["A"]),
        ("a__", ["a"]),
        ("A__", ["A"]),
        ("____", []),
        ("3DCamelCase", ["3", "D", "Camel", "Case"]),
        ("330DCamelCase", ["330", "D", "Camel", "Case"]),
        ("330CamelCase", ["330", "Camel", "Case"]),
        ("Camel5Case", ["Camel", "5", "Case"]),
        ("Camel50Case", ["Camel", "50", "Case"]),
        ("Camel501Case", ["Camel", "501", "Case"]),
        ("CamelCase501", ["Camel", "Case", "501"]),
        ("CamelCaseA501", ["Camel", "Case", "A", "501"]),
        ("CamelCaseAA501", ["Camel", "Case", "AA", "501"]),
        ("CamelCase501a", ["Camel", "Case", "501", "a"]),
        ("Camel5Case5D", ["Camel", "5", "Case", "5", "D"]),
        ("Camel5Case50DC", ["Camel", "5", "Case", "50", "DC"]),
        ("Camel5Case50DCCase", ["Camel", "5", "Case", "50", "DC", "Case"]),
        ("camel.case", ["camel", "case"]),
        ("Camel Case XYZ", ["Camel", "Case", "XYZ"]),
        (" Camel Case 1 3XYZ _ AA ", ["Camel", "Case", "1", "3", "XYZ", "AA"]),
        ("camel\ncase", ["camel", "case"]),
    ],
)
def test_camel_case_split(string, expected):
    res = camel_case_split(string)
    assert res == expected

But I believe @mnesarco’s answer is also very good: it is about 5x faster and behaves almost the same.

The only difference I know of is how numbers next to uppercase letters are handled:

"3DAndD3ARESoComplicated" -> 
# My answer:
['3', 'D', 'And', 'D', '3', 'ARE', 'So', 'Complicated'] 
# mnesarco's answer:
['3D', 'And', 'D3ARE', 'So', 'Complicated']
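
The second result can be reproduced with @mnesarco’s regular expression on its own. The sketch below repeats that function so that it runs standalone; in this pattern digits are grouped with upper-case letters ([0-9A-Z]), which is why '3D' and 'D3ARE' stay together.

```python
import re

def camel_terms(value):
    # Digits are part of the upper-case character class here.
    return re.findall(
        r'[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]',
        value,
    )

print(camel_terms("3DAndD3ARESoComplicated"))
# ['3D', 'And', 'D3ARE', 'So', 'Complicated']
```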

Answered By: Noam Nol