Regexp finding longest common prefix of two strings

Question

Is there a regexp which would find the longest common prefix of two strings? And if this is not solvable by one regexp, what would be the most elegant piece of code or one-liner using regexps (Perl, Ruby, Python, anything)?

PS: I can do this easily programmatically; I am asking out of curiosity, because it seems to me that this should be solvable by a regexp.

PPS: Extra bonus for O(n) solution using regexps. Come on, it should exist!

Asked By: gorn

Answer 1

If there’s some character that neither string contains (say, \0), you could write

"$first\0$second" =~ m/^(.*).*\0\1/s;

and the longest common prefix would be saved as $1.

Edited to add: This is obviously very inefficient. I think that if efficiency is a concern, then this simply isn’t the approach we should be using; but we can at least improve it by changing .* to [^\0]* to prevent useless greediness that will just have to be backtracked again, and wrapping the second [^\0]* in (?>…) to prevent backtracking that can’t help. This:

"$first\0$second" =~ m/^([^\0]*)(?>[^\0]*)\0\1/s;

This will yield the same result, but much more efficiently. (But still not nearly as efficiently as a straightforward non–regex-based approach. If the strings both have length n, I’d expect its worst case to take at least O(n²) time, whereas the straightforward non–regex-based approach would take O(n) time in its worst case.)
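
For readers who want to try the idea outside Perl, here is a minimal Python sketch of the same delimiter-plus-backreference trick. The function name lcp_backref and the sep parameter are made up for illustration, it assumes sep occurs in neither string, and it does not attempt the atomic-group optimization.

```python
import re

def lcp_backref(first, second, sep="\0"):
    """Longest common prefix via a single regex match (hypothetical helper)."""
    if sep in first or sep in second:
        raise ValueError("separator must not occur in either string")
    # Group 1 is a prefix of `first`; right after the separator the same
    # text must reappear at the start of `second`. Greedy matching tries
    # the longest candidate first and backtracks to shorter ones.
    pattern = r"(.*).*" + re.escape(sep) + r"\1"
    match = re.match(pattern, first + sep + second, re.DOTALL)
    return match.group(1)

print(lcp_backref("stackoverflow", "stackofpancakes"))  # stacko
```

Like the Perl original, this relies on backtracking, so it should not be expected to run in linear time.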

Answered By: ruakh

Answer 2

Here’s a Python one-liner:

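>>> # the appended False avoids a ValueError when one string is a prefix of the other
>>> a = 'stackoverflow'
>>> b = 'stackofpancakes'
>>> a[:([x == y for x, y in zip(a, b)] + [False]).index(False)]
'stacko'
>>> a = 'nothing in'
>>> b = 'common'
>>> a[:([x == y for x, y in zip(a, b)] + [False]).index(False)]
''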

Answered By: Tebbe

Answer 3

A non-regexp solution that does not duplicate the string at each iteration:

def common_prefix(a, b):
    # sort the strings so that we loop over the shorter one
    a, b = sorted((a, b), key=len)
    for index, letter in enumerate(a):
        if letter != b[index]:
            # everything before the first mismatch is the common prefix
            return a[:index]
    return a

Answered By: jsbueno

Answer 4

The problem you’re going to have is that a regular expression matches against one string at a time so isn’t intended for comparing two strings.

If there’s a character that you can be sure isn’t in either string, you can use it to separate them in a single string and then search using back references to groups.

So in the example below I’m using whitespace as the separator:

>>> import re
>>> pattern = re.compile(r"(?P<prefix>\S*)\S*\s+(?P=prefix)")
>>> pattern.match("stack stable").group('prefix')
'sta'
>>> pattern.match("123456 12345").group('prefix')
'12345'

Answered By: Dave Webb

Answer 5

I suspect this is about as inefficient as it gets. No error checking, etc.

#!/usr/bin/perl
use strict;
use warnings;

my($s1,$s2)=(@ARGV);
#find the shortest string put it into s1, if you will.

my $n=0;
my $reg;

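# open one group per character; \Q..\E escapes regex metacharacters
foreach my $c (split(//,$s1)) { $reg .="(\Q$c\E"; $n++;}

# close every group and make it optional
$reg .= ")?" x $n;

$s2 =~ /$reg/;

print $&,"\n";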
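
The same nested-optional-groups construction can be sketched in Python. The name lcp_nested is hypothetical, and non-capturing groups are used only to keep the pattern small. Because the pattern nests one level per character, I would expect long inputs to run into the regex compiler's limits, so treat this as an illustration rather than something to use on large strings.

```python
import re

def lcp_nested(first, second):
    """Hypothetical helper: turn 'abc' into (?:a(?:b(?:c)?)?)? and match it."""
    # One group is opened per character of `first`, and every group is
    # optional, so the match stops at the first character that disagrees.
    opened = "".join("(?:" + re.escape(ch) for ch in first)
    pattern = opened + ")?" * len(first)
    # re.match anchors at the start of `second`; the match may be empty.
    return re.match(pattern, second).group(0)

print(lcp_nested("abcdef", "abcxef"))  # abc
```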

Answered By: Alien Life Form

Answer 6

I second ruakh’s answer for the regexp (with my suggested optimization in the comments). Simple to write, but not simple and efficient to run if the first string is long.

Here is an efficient, non-regexp, readable, one-line answer:

$ perl -E '($n,$l)=(0,length $ARGV[0]); while ($n < $l) { $s = substr($ARGV[0], $n, 1); last if $s ne substr($ARGV[1], $n, 1); $n++ } say substr($ARGV[0], 0, $n)' abce abcdef
abc

Answered By: dolmen

Answer 7

simple and efficient

def common_prefix(a, b):
  i = 0
  for i, (x, y) in enumerate(zip(a, b)):
    if x != y: break
  else:
    # no mismatch found: the shorter string is the whole prefix
    i = min(len(a), len(b))
  return a[:i]

Answered By: ptitpoulpe

Answer 8

Inspired by ruakh’s answer, here is the O(n) regexp solution:

"$first \0$second" =~ m/^(.*?)(.).*\0\1(?!\2)/s;

Notes:
1. neither string contains \0
2. the longest common prefix would be saved as $1
3. the space is important!

Edit: well, it is not correct, as ruakh mentions, but the idea is right: we should push the regexp machine not to check the beginning letters repeatedly. The basic idea can also be rewritten as this Perl one-liner:

perl -e ' $_="$first\0$second\n"; while(s/^(.)(.*?)\0\1/$2\0/gs) {print $1;}; '

I wonder if it can be incorporated back into regexp solution.
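
A rough Python rendering of that loop, assuming a separator character (here \0) that occurs in neither string. The helper name lcp_peel is made up for illustration. Each step strips one matching character from the front of both halves, and since every step copies the remaining text, this sketch is quadratic overall rather than O(n).

```python
import re

def lcp_peel(first, second, sep="\0"):
    """Hypothetical helper: peel off one common leading character per step."""
    # Assumption: `sep` occurs in neither input string.
    step = re.compile(r"(.)(.*?)" + re.escape(sep) + r"\1", re.DOTALL)
    text = first + sep + second
    prefix = []
    while True:
        m = step.match(text)
        if m is None:
            break
        prefix.append(m.group(1))
        # Drop the matched character from the front of both halves.
        text = m.group(2) + sep + text[m.end():]
    return "".join(prefix)

print(lcp_peel("abcdef", "abcxef"))  # abc
```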

Answered By: gorn

Answer 9

Here’s one fairly efficient way which uses a regexp. The code is in Perl, but the principle should be adaptable to other languages:

my $xor = "$first" ^ "$second";    # quotes force string xor even for numbers
$xor =~ /^\0*/;                    # match leading null characters
my $common_prefix_length = $+[0];  # get length of match

(A subtlety worth noting is that Perl’s string XOR operator (^) in effect pads the shorter string with nulls to match the length of the longer one. Thus, if the strings might contain null characters, and if the shorter string happens to be a prefix of the longer one, the common prefix length calculated with this code might exceed the length of the shorter string.)
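
The principle carries over to Python roughly as follows. This is a sketch on bytes objects with a hypothetical function name. Python has no string XOR operator, so the bytes are XORed pairwise, and because zip() stops at the shorter input, the null-padding subtlety described above does not arise here.

```python
import re

def lcp_length_xor(first: bytes, second: bytes) -> int:
    """Hypothetical helper: common prefix length via XOR plus a regex."""
    # zip() stops at the shorter input, so nothing is padded with nulls.
    xored = bytes(x ^ y for x, y in zip(first, second))
    # Equal bytes XOR to 0, so the run of leading NULs is the common prefix.
    return re.match(rb"\0*", xored).end()

print(lcp_length_xor(b"stackoverflow", b"stackofpancakes"))  # 6
```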

Answered By: Ilmari Karonen

Answer 10

Another attempt at an O(n) solution:

$x=length($first); $_="$first\0$second"; s/((.)(?!.{$x}\2)).*//s;

It depends on whether .{n} is considered O(1) or O(n); I do not know how efficiently this is implemented.

Notes: 1. \0 should not be in either string; it is used as a delimiter. 2. The result is in $_.
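
The fixed-offset lookahead can be tried in Python as well. This is a sketch with a hypothetical function name, assuming a single-character separator that occurs in neither string. It makes no claim about how the regex engine implements .{n}, so the complexity question above stays open.

```python
import re

def lcp_lookahead(first, second, sep="\0"):
    """Hypothetical helper: cut the joined string at the first mismatch."""
    # Assumption: `sep` is one character that occurs in neither string.
    offset = len(first)
    # The character at position i of `first` is compared with the character
    # offset + 1 positions later, which is position i of `second`.
    pattern = r"(.)(?!.{%d}\1).*" % offset
    # Delete everything from the first mismatching character onwards.
    return re.sub(pattern, "", first + sep + second, count=1, flags=re.DOTALL)

print(lcp_lookahead("abcdef", "abcxef"))  # abc
```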

Answered By: gorn

Answer 11

Using extended regular expressions as in Foma or Xfst.

def range(x) x.l;
def longest(L) L - range(range(L ∘ [[Σ:ε]+ [Σ:a]*]) ∘ [a:Σ]*); 
def prefix(W) range(W ∘ [Σ* Σ*:ε]);
def lcp(A,B) longest(prefix(A) ∩ prefix(B));

The hardest part here is to define “longest”. Generally speaking, to
optimize, you construct the set of non-optimal strings (worsening) and
then remove these (filtering).

This is really a purist approach, which avoids non-regular operations
such as capturing.
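
The worsening-and-filtering idea can be mimicked with finite Python sets standing in for regular languages. This is only an analogy, not Foma code, and the function names simply mirror the definitions above.

```python
def prefixes(word):
    # Analogue of prefix(W): every prefix of `word`, from "" up to `word`.
    return {word[:i] for i in range(len(word) + 1)}

def longest(language):
    # Worsening: collect the members that are strictly shorter than
    # some other member of the set.
    worse = {s for s in language if any(len(s) < len(t) for t in language)}
    # Filtering: remove those non-optimal strings.
    return language - worse

def lcp(a, b):
    # Common prefixes all have different lengths, so exactly one survives.
    (result,) = longest(prefixes(a) & prefixes(b))
    return result

print(lcp("stackoverflow", "stackofpancakes"))  # stacko
```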

Answered By: Dale Gerdemann

Answer 12

Here’s an O(N) solution with Foma-like pseudocode regular expressions over triples (for lcp, you have two inputs and an output). To keep it simple, I assume a binary alphabet {a,b}:

def match {a:a:a, b:b:b};
def mismatch {a:b:ε, b:a:ε};
def lcp match* ∪ (match* mismatch (Σ:Σ:ε)*)

Now you just need a language that implements multi-tape transducers.
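
Lacking such a language, the behaviour of that three-tape machine can be simulated by hand. The Python sketch below uses a hypothetical function name and a two-state loop: it reads both inputs in lockstep and writes to the output only while the symbols match. Note that zip() stops at the shorter input, which is a simplification on my part for inputs of unequal length.

```python
def lcp_three_tape(a, b):
    """Hypothetical helper simulating the match* / mismatch machine."""
    out = []
    matching = True               # state before any mismatch was seen
    for x, y in zip(a, b):        # read tapes 1 and 2 in lockstep
        if matching and x == y:
            out.append(x)         # "match": copy the symbol to tape 3
        else:
            matching = False      # "mismatch", then consume without output
    return "".join(out)

print(lcp_three_tape("abba", "abab"))  # ab
```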

Answered By: Dale Gerdemann

Answer 13

Could be useful in some remote cases so here it goes:

RegEx only solution in 3 steps (couldn’t create a RegEx in one go):

String A: abcdef
String B: abcxef

1st pass: create RegEx from String A (part 1):
Match: /(.)/g
Replace: \1(
Result: a(b(c(d(e(f(
Explained demo: http://regex101.com/r/aJ4pY7

2nd pass: create RegEx from 1st pass result
Match: /^(.\()(?=(.*)$)|\G.\(/g
Replace: \1\2)?+
Result: a(b(c(d(e(f()?+)?+)?+)?+)?+)?+
Explained demo: http://regex101.com/r/xJ7bK7

3rd pass: test String B against RegEx created in 2nd pass
Match: /a(b(c(d(e(f()?+)?+)?+)?+)?+)?+/
Result: abc (explained demo)

And here’s the glorified one-liner in PHP:

preg_match('/^'.preg_replace('/^(.\()(?=(.*)$)|\G.\(/','\1\2)?+',preg_replace('/(.)/','\1(',$a)).'/',$b,$longest);

Code live at: http://codepad.viper-7.com/dCrqLa

Answered By: CSᵠ

Answer 14

Here is a solution I implemented for a leetcode problem:

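def max_len(strs):
    """
    :type strs: List[str]
    :rtype: int
    """
    # Despite the name, this returns the length of the shortest string,
    # i.e. the maximum possible length of the common prefix.
    min_s = len(strs[0])
    for s in strs:
        if len(s) < min_s:
            min_s = len(s)
    return min_s


class Solution2:
    def longestCommonPrefix(self, strs):
        """
        :type strs: List[str]
        :rtype: str
        """
        if not strs:
            return ""
        acc = -1
        test_len = max_len(strs)
        for i in range(test_len):
            t = strs[0][i]
            acc2 = 0
            # count how many strings agree with the first one at column i
            for j in range(len(strs)):
                if strs[j][i] == t:
                    acc2 += 1
            if acc2 == len(strs):
                acc += 1
            else:
                # stop at the first column where the strings disagree
                break

        if acc == -1:
            return ""
        else:
            return strs[0][:acc + 1]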

Hope this helps

Answered By: Sam Rothstein

Answer 15

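string1=input()
string2=input()
string1=string1.lower()
string2=string2.lower()
l1=len(string1)
l2=len(string2)
min_len=min(l1,l2)
i=0
for i in range(min_len):
    if string1[i]!=string2[i]:
        break
else:
    # no mismatch: the whole shorter string is the common prefix
    i=min_len
# print -1 when there is no common prefix at all
if i==0:
    print(-1)
else:
    print(string2[:i])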

Answered By: Naga Saranya
