Mass string replace in python?

Question:

Say I have a string that looks like this:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

You’ll notice a lot of locations in the string where there is an ampersand, followed by a character (such as “&y” and “&c”). I need to replace these characters with an appropriate value that I have in a dictionary, like so:

dict = {"&y":"33[0;30m",
        "&c":"33[0;31m",
        "&b":"33[0;32m",
        "&Y":"33[0;33m",
        "&u":"33[0;34m"}

What is the fastest way to do this? I could manually find all the ampersands, then loop through the dictionary to change them, but that seems slow. Doing a bunch of regex replaces seems slow as well (I will have a dictionary of about 30-40 pairs in my actual code).

Any suggestions are appreciated, thanks.

Edit:

As has been pointed out in comments throught this question, my dictionary is defined before runtime, and will never change during the course of the applications life cycle. It is a list of ANSI escape sequences, and will have about 40 items in it. My average string length to compare against will be about 500 characters, but there will be ones that are up to 5000 characters (although, these will be rare). I am also using Python 2.6 currently.

Edit #2
I accepted Tor Valamos answer as the correct one, as it not only gave a valid solution (although it wasn’t the best solution), but took all others into account and did a tremendous amount of work to compare all of them. That answer is one of the best, most helpful answers I have ever come across on StackOverflow. Kudos to you.

Asked By: Mike Trpcic

||

Answers:

mydict = {"&y":"33[0;30m",
          "&c":"33[0;31m",
          "&b":"33[0;32m",
          "&Y":"33[0;33m",
          "&u":"33[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

for k, v in mydict.iteritems():
    mystr = mystr.replace(k, v)

print mystr
The ←[0;30mquick ←[0;31mbrown ←[0;32mfox ←[0;33mjumps over the ←[0;34mlazy dog

I took the liberty of comparing a few solutions:

mydict = dict([('&' + chr(i), str(i)) for i in list(range(65, 91)) + list(range(97, 123))])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print 'Running %d times with string length %d and ' 
      'random inserts of lengths 0-20' % (rep, len(mystr))

# My solution
t = time()
for x in range(rep):
    for k, v in mydict.items():
        mystr.replace(k, v)
    #print(mystr)
print '%-30s' % 'Tor fixed & variable dict', time()-t

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(1)s', mystr) % mydict
print '%-30s' % 'Peter fixed & variable dict', time()-t

# Claudiu
def multiple_replace(dict, text): 
    # Create a regular expression  from the dictionary keys
    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

t = time()
for x in range(rep):
    multiple_replace(mydict, mystr)
print '%-30s' % 'Claudio variable dict', time()-t

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print '%-30s' % 'Claudio fixed dict', time()-t

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print '%-30s' % 'Andrew Y variable dict', time()-t

# Andrew Y - fixed
def repl(s):
  return mydict["&"+s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print '%-30s' % 'Andrew Y fixed dict', time()-t

Results in Python 2.6

Running 10000 times with string length 490 and random inserts of lengths 0-20
Tor fixed & variable dict      1.04699993134
Peter fixed & variable dict    0.218999862671
Claudio variable dict          2.48400020599
Claudio fixed dict             0.0940001010895
Andrew Y variable dict         0.0309998989105
Andrew Y fixed dict            0.0310001373291

Both claudiu’s and andrew’s solutions kept going into 0, so I had to increase it to 10 000 runs.

I ran it in Python 3 (because of unicode) with replacements of chars from 39 to 1024 (38 is ampersand, so I didn’t wanna include it). String length up to 10.000 including about 980 replacements with variable random inserts of length 0-20. The unicode values from 39 to 1024 causes characters of both 1 and 2 bytes length, which could affect some solutions.

mydict = dict([('&' + chr(i), str(i)) for i in range(39,1024)])

# random inserts between keys
from random import randint
rawstr = ''.join(mydict.keys())
mystr = ''
for i in range(0, len(rawstr), 2):
    mystr += chr(randint(65,91)) * randint(0,20) # insert between 0 and 20 chars

from time import time

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and ' 
      'random inserts of lengths 0-20' % (rep, len(mystr)))

# Tor Valamo - too long
#t = time()
#for x in range(rep):
#    for k, v in mydict.items():
#        mystr.replace(k, v)
#print('%-30s' % 'Tor fixed & variable dict', time()-t)

from re import sub, compile, escape

# Peter Hansen
t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', r'%(1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    sub(r'(&[a-zA-Z])', dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time()-t)

# Claudiu - too long
#def multiple_replace(dict, text): 
#    # Create a regular expression  from the dictionary keys
#    regex = compile("(%s)" % "|".join(map(escape, dict.keys())))
#
#    # For each match, look-up corresponding value in dictionary
#    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)
#
#t = time()
#for x in range(rep):
#    multiple_replace(mydict, mystr)
#print('%-30s' % 'Claudio variable dict', time()-t)

# Claudiu - Precompiled
regex = compile("(%s)" % "|".join(map(escape, mydict.keys())))

t = time()
for x in range(rep):
    regex.sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time()-t)

# Separate setup for Andrew and gnibbler optimized dict
mydict = dict((k[1], v) for k, v in mydict.items())

# Andrew Y - variable dict
def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

def mysubst2(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict[arg[0:1]] + arg[1:], subs[1:]))

t = time()
for x in range(rep):
    mysubst(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict', time()-t)
t = time()
for x in range(rep):
    mysubst2(mystr, mydict)
print('%-30s' % 'Andrew Y variable dict 2', time()-t)

# Andrew Y - fixed
def repl(s):
  return mydict[s[0:1]] + s[1:]

t = time()
for x in range(rep):
    subs = mystr.split("&")
    res = subs[0] + "".join(map(repl, subs[1:]))
print('%-30s' % 'Andrew Y fixed dict', time()-t)

# gnibbler
t = time()
for x in range(rep):
    myparts = mystr.split("&")
    myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
    "".join(myparts)
print('%-30s' % 'gnibbler fixed & variable dict', time()-t)

Results:

Running 10000 times with string length 9491 and random inserts of lengths 0-20
Tor fixed & variable dict      0.0 # disqualified 329 secs
Peter fixed & variable dict    2.07799983025
Peter fixed dict               1.53100013733 
Claudio variable dict          0.0 # disqualified, 37 secs
Claudio fixed dict             1.5
Andrew Y variable dict         0.578000068665
Andrew Y variable dict 2       0.56299996376
Andrew Y fixed dict            0.56200003624
gnibbler fixed & variable dict 0.530999898911

(** Note that gnibbler’s code uses a different dict, where keys don’t have the ‘&’ included. Andrew’s code also uses this alternate dict, but it didn’t make much of a difference, maybe just 0.01x speedup.)

Answered By: Tor Valamo

Not sure about the speed of this solution either, but you could just loop through your dictionary and repeatedly call the built-in

str.replace(old, new)

This might perform decently well if the original string isn’t too long, but it would obviously suffer as the string got longer.

Answered By: Michael Patterson

If you really want to dig into the topic take a look at this: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

The obvious solution by iterating over the dictionary and replacing each element in the string takes O(n*m) time, where n is the size of the dictionary, m is the length of the string.

Whereas the Aho-Corasick-Algorithm finds all entries of the dictionary in O(n+m+f) where f is the number of found elements.

Answered By: WolfgangP

This seems like it does what you want – multiple string replace at once using RegExps. Here is the relevant code:

def multiple_replace(dict, text): 
    # Create a regular expression  from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, dict.keys())))

    # For each match, look-up corresponding value in dictionary
    return regex.sub(lambda mo: dict[mo.string[mo.start():mo.end()]], text)

print multiple_replace(dict, str)
Answered By: Claudiu

Try this, making use of regular expression substitution, and standard string formatting:

# using your stated values for str and dict:
>>> import re
>>> str = re.sub(r'(&[a-zA-Z])', r'%(1)s', str)
>>> str % dict
'The x1b[0;30mquick x1b[0;31mbrown x1b[0;32mfox x1b[0;33mjumps over the x1b[0;34mlazy dog'

The re.sub() call replaces all sequences of ampersand followed by single letter with the pattern %(..)s containing the same pattern.

The % formatting takes advantage of a feature of string formatting that can take a dictionary to specify the substitution, rather than the more commonly occurring positional arguments.

An alternative can do this directly in the re.sub, using a callback:

>>> import re
>>> def dictsub(m):
>>>    return dict[m.group()]
>>> str = re.sub(r'(&[a-zA-Z])', dictsub, str)

This time I’m using a closure to reference the dictionary from inside the callback function. This approach could give you a little more flexibility. For example, you could use something like dict.get(m.group(), '??') to avoid raising exceptions if you had strings with unrecognized code sequences.

(By the way, both “dict” and “str” are builtin functions, and you’ll get into trouble if you use those names in your own code much. Just in case you didn’t know that. They’re fine for a question like this of course.)

Edit: I decided to check Tor’s test code, and concluded that it’s nowhere near representative, and in fact buggy. The string generated doesn’t even have ampersands in it (!). The revised code below generates a representative dictionary and string, similar to the OP’s example inputs.

I also wanted to verify that each algorithm’s output was the same. Below is a revised test program, with only Tor’s, mine, and Claudiu’s code — because the others were breaking on the sample input. (I think they’re all brittle unless the dictionary maps basically all possible ampersand sequences, which Tor’s test code was doing.) This one properly seeds the random number generator so each run is the same. Finally, I added a minor variation using a generator which avoids some function call overhead, for a minor performance improvement.

from time import time
import string
import random
import re

random.seed(1919096)  # ensure consistent runs

# build dictionary with 40 mappings, representative of original question
mydict = dict(('&' + random.choice(string.letters), 'x1b[0;%sm' % (30+i)) for i in range(40))
# build simulated input, with mix of text, spaces, ampersands in reasonable proportions
letters = string.letters + ' ' * 12 + '&' * 6
mystr = ''.join(random.choice(letters) for i in range(1000))

# How many times to run each solution
rep = 10000

print('Running %d times with string length %d and %d ampersands'
    % (rep, len(mystr), mystr.count('&')))

# Tor Valamo
# fixed from Tor's test, so it actually builds up the final string properly
t = time()
for x in range(rep):
    output = mystr
    for k, v in mydict.items():
        output = output.replace(k, v)
print('%-30s' % 'Tor fixed & variable dict', time() - t)
# capture "known good" output as expected, to verify others
expected = output

# Peter Hansen

# build charset to use in regex for safe dict lookup
charset = ''.join(x[1] for x in mydict.keys())
# grab reference to method on regex, for speed
patsub = re.compile(r'(&[%s])' % charset).sub

t = time()
for x in range(rep):
    output = patsub(r'%(1)s', mystr) % mydict
print('%-30s' % 'Peter fixed & variable dict', time()-t)
assert output == expected

# Peter 2
def dictsub(m):
    return mydict[m.group()]

t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter fixed dict', time() - t)
assert output == expected

# Peter 3 - freaky generator version, to avoid function call overhead
def dictsub(d):
    m = yield None
    while 1:
        m = yield d[m.group()]

dictsub = dictsub(mydict).send
dictsub(None)   # "prime" it
t = time()
for x in range(rep):
    output = patsub(dictsub, mystr)
print('%-30s' % 'Peter generator', time() - t)
assert output == expected

# Claudiu - Precompiled
regex_sub = re.compile("(%s)" % "|".join(mydict.keys())).sub

t = time()
for x in range(rep):
    output = regex_sub(lambda mo: mydict[mo.string[mo.start():mo.end()]], mystr)
print('%-30s' % 'Claudio fixed dict', time() - t)
assert output == expected

I forgot to include benchmark results before:

    Running 10000 times with string length 1000 and 96 ampersands
    ('Tor fixed & variable dict     ', 2.9890000820159912)
    ('Peter fixed & variable dict   ', 2.6659998893737793)
    ('Peter fixed dict              ', 1.0920000076293945)
    ('Peter generator               ', 1.0460000038146973)
    ('Claudio fixed dict            ', 1.562000036239624)

Also, snippets of the inputs and correct output:

mystr = 'lTEQDMAPvksk k&z Txp vrnhQ GHaO&GNFY&&a...'
mydict = {'&p': 'x1b[0;37m', '&q': 'x1b[0;66m', '&v': ...}
output = 'lTEQDMAPvksk k←[0;57m Txp vrnhQ GHaO←[0;67mNFY&&a P...'

Comparing with what I saw from Tor’s test code output:

mystr = 'VVVVVVVPPPPPPPPPPPPPPPXXXXXXXXYYYFFFFFFFFFFFFEEEEEEEEEEE...'
mydict = {'&p': '112', '&q': '113', '&r': '114', '&s': '115', ...}
output = # same as mystr since there were no ampersands inside
Answered By: Peter Hansen

If the number of keys in the list is large, and the number of the occurences in the string is low (and mostly zero), then you could iterate over the occurences of the ampersands in the string, and use the dictionary keyed by the first character of the substrings. I don’t code often in python so the style might be a bit off, but here is my take at it:

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

dict = {"&y":"33[0;30m",
        "&c":"33[0;31m",
        "&b":"33[0;32m",
        "&Y":"33[0;33m",
        "&u":"33[0;34m"}

def rep(s):
  return dict["&"+s[0:1]] + s[1:]

subs = str.split("&")
res = subs[0] + "".join(map(rep, subs[1:]))

print res

Of course there is a question what happens when there is an ampersand that is coming from the string itself, you would need to escape it in some way before feeding through this process, and then unescape after this process.

Of course, as is pretty much usual with the performance issues, timing the various approaches on your typical (and also worst-case) dataset and comparing them is a good thing to do.

EDIT: place it into a separate function to work with arbitrary dictionary:

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0] + "".join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))

EDIT2: get rid of an unneeded concatenation, seems to still be a bit faster than the previous on many iterations.

def mysubst(somestr, somedict):
  subs = somestr.split("&")
  return subs[0].join(map(lambda arg: somedict["&" + arg[0:1]] + arg[1:], subs[1:]))
Answered By: Andrew Y
>>> a=[]
>>> str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"
>>> d={"&y":"33[0;30m",                                                              
... "&c":"33[0;31m",                                                                 
... "&b":"33[0;32m",                                                                 
... "&Y":"33[0;33m",                                                                 
... "&u":"33[0;34m"}     
>>> for item in str.split():
...  if item[:2] in d:
...    a.append(d[item[:2]]+item[2:])
...  else: a.append(item)
>>> print ' '.join(a)
Answered By: ghostdog74

A general solution for defining replacement rules is to use regex substitution using a function to provide the map (see re.sub()).

import re

str = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

dict = {"&y":"33[0;30m",
        "&c":"33[0;31m",
        "&b":"33[0;32m",
        "&Y":"33[0;33m",
        "&u":"33[0;34m"}

def programmaticReplacement( match ):
    return dict[ match.group( 1 ) ]

colorstring = re.sub( '(&.)', programmaticReplacement, str )

This is particularly nice for non-trivial substitutions (e.g anything requiring mathmatical operations to create the substitute).

Answered By: charstar

The problem with doing this mass replace in Python is immutability of the strings: every time you will replace one item in the string then entire new string will be reallocated again and again from the heap.

So if you want the fastest solution you either need to use mutable container (e.g. list), or write this machinery in the plain C (or better in Pyrex or Cython). In any case I’d suggest to write simple parser based on simple finite-state machine, and feed symbols of your string one by one.

Suggested solutions based on regexps working in similar way, because regexp working using fsm behind the scene.

Answered By: bialix

Here is a version using split/join

mydict = {"y":"33[0;30m",
          "c":"33[0;31m",
          "b":"33[0;32m",
          "Y":"33[0;33m",
          "u":"33[0;34m"}
mystr = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog"

myparts = mystr.split("&")
myparts[1:]=[mydict[x[0]]+x[1:] for x in myparts[1:]]
print "".join(myparts)

In case there are ampersands with invalid codes you can use this to preserve them

myparts[1:]=[mydict.get(x[0],"&"+x[0])+x[1:] for x in myparts[1:]]

Peter Hansen pointed out that this fails when there is double ampersand. In that case use this version

mystr = "The &yquick &cbrown &bfox &Yjumps over the &&ulazy dog"
myparts = mystr.split("&")
myparts[1:]=[mydict.get(x[:1],"&"+x[:1])+x[1:] for x in myparts[1:]]
print "".join(myparts)
Answered By: John La Rooy

Since someone mentioned using a simple parser, I thought I’d cook one up using pyparsing. By using pyparsing’s transformString method, pyparsing internally scans through the source string, and builds a list of the matching text and intervening text. When all is done, transformString then ”.join’s this list, so there is no performance problem in building up strings by increments. (The parse action defined for ANSIreplacer does the conversion from the matched &_ characters to the desired escape sequence, and replaces the matched text with the output of the parse action. Since only matching sequences will satisfy the parser expression, there is no need for the parse action to handle undefined &_ sequences.)

The FollowedBy(‘&’) is not strictly necessary, but it shortcuts the parsing process by verifying that the parser is actually positioned at an ampersand before doing the more expensive checking of all of the markup options.

from pyparsing import FollowedBy, oneOf

escLookup = {"&y":"33[0;30m",
            "&c":"33[0;31m",
            "&b":"33[0;32m",
            "&Y":"33[0;33m",
            "&u":"33[0;34m"}

# make a single expression that will look for a leading '&', then try to 
# match each of the escape expressions
ANSIreplacer = FollowedBy('&') + oneOf(escLookup.keys())

# add a parse action that will replace the matched text with the 
# corresponding ANSI sequence
ANSIreplacer.setParseAction(lambda toks: escLookup[toks[0]])

# now use the replacer to transform the test string; throw in some extra
# ampersands to show what happens with non-matching sequences
src = "The &yquick &cbrown &bfox &Yjumps over the &ulazy dog & &Zjumps back"
out = ANSIreplacer.transformString(src)
print repr(out)

Prints:

'The x1b[0;30mquick x1b[0;31mbrown x1b[0;32mfox x1b[0;33mjumps over 
 the x1b[0;34mlazy dog & &Zjumps back'

This will certainly not win any performance contests, but if your markup starts to get more complicated, then having a parser foundation will make it easier to extend.

Answered By: PaulMcG

Here is the C Extensions Approach for python

const char *dvals[]={
    //"0-64
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","","","","","","",
    "","","","","",
    //A-Z
    "","","","","",
    "","","","","",
    "","","","","",
    "","","","","",
    "","","","","33",
    "",
    //
    "","","","","","",
    //a-z
    "","32","31","","",
    "","","","","",
    "","","","","",
    "","","","","",
    "34","","","","30",
    ""
};

int dsub(char*d,char*s){
    char *ofs=d;
    do{
        if(*s=='&' && s[1]<='z' && *dvals[s[1]]){

            //33[0;
            *d++='\',*d++='0',*d++='3',*d++='3',*d++='[',*d++='0',*d++=';';

            //consider as fixed 2 digits
            *d++=dvals[s[1]][0];
            *d++=dvals[s[1]][1];

            *d++='m';

            s++; //skip

        //non &,invalid, unused (&) ampersand sequences will go here.
        }else *d++=*s;

    }while(*s++);

    return d-ofs-1;
}

Python codes I have tested

from mylib import *
import time

start=time.time()

instr="The &yquick &cbrown &bfox &Yjumps over the &ulazy dog, skip &Unknown.n"*100000
x=dsub(instr)

end=time.time()

print "time taken",end-start,",input str length",len(x)
print "first few lines"
print x[:1100]

Results

time taken 0.140000104904 ,input str length 11000000
first few lines
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.
The 33[0;30mquick 33[0;31mbrown 33[0;32mfox 33[0;33mjumps over the 33[0;34mlazy dog, skip &Unknown.

Its suppose to able to run at O(n), and
Only took 160 ms (avg) for 11 MB string in My Mobile Celeron 1.6 GHz PC

It will also skip unknown characters as is, for example &Unknown will return as is

Let me know If you have any problem with compiling, bugs, etc…

Answered By: YOU

try this

tr.replace(“&y”,dict[“&y”])

tr.replace(“&c”,dict[“&c”])

tr.replace(“&b”,dict[“&b”])

tr.replace(“&Y”,dict[“&Y”])

tr.replace(“&u”,dict[“&u”])

Answered By: fahad