Find the longest common starting substring in a set of strings

Question

This is a challenge to come up with the most elegant JavaScript, Ruby or other solution to a relatively trivial problem.

This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array. This greatly simplifies the problem.

For example, the longest common starting substring in [interspecies, interstelar, interstate] is “inters”. However, I don’t need to find “ific” in [specifics, terrific].

I’ve solved the problem by quickly coding up a solution in JavaScript as a part of my answer about shell-like tab-completion (test page here). Here is that solution, slightly tweaked:

function common_substring(data) {
  var i, ch, memo, idx = 0
  do {
    memo = null
    for (i=0; i < data.length; i++) {
      ch = data[i].charAt(idx)
      if (!ch) break
      if (!memo) memo = ch
      else if (ch != memo) break
    }
    // the data.length check only guards against looping forever on an empty array
  } while (i == data.length && data.length > 0 && ++idx)

  return (data[0] || '').slice(0, idx)
}

This code is available in this Gist along with a similar solution in Ruby. You can clone the gist as a git repo to try it out:

$ git clone git://gist.github.com/257891.git substring-challenge

I’m not very happy with those solutions. I have a feeling they might be solved with more elegance and less execution complexity—that’s why I’m posting this challenge.

I’m going to accept as an answer the solution I find the most elegant or concise. Here is, for instance, a crazy Ruby hack I came up with, which defines the & operator on String:

# works with Ruby 1.8.7 and above
class String
  def &(other)
    difference = other.to_str.each_char.with_index.find { |ch, idx|
      self[idx].nil? or ch != self[idx].chr
    }
    difference ? self[0, difference.last] : self
  end
end

class Array
  def common_substring
    self.inject(nil) { |memo, str| memo.nil? ? str : memo & str }.to_s
  end
end

Solutions in JavaScript or Ruby are preferred, but you can show off clever solutions in other languages as long as you explain what’s going on. Only code from the standard library, please.

Update: my favorite solutions

I’ve chosen the JavaScript sorting solution by kennebec as the “answer” because it struck me as both unexpected and genius. If we disregard the complexity of actual sorting (let’s imagine it’s infinitely optimized by the language implementation), the complexity of the solution is just comparing two strings.

Ruby, 71 chars

def f(a)b=a[0];b[0,(0..b.size).find{|n|a.any?{|i|i[0,n]!=b[0,n]}}-1]end

If you want that unrolled it looks like this:

def f(words)
  first_word = words[0];
  first_word[0, (0..(first_word.size)).find { |num_chars|
    words.any? { |word| word[0, num_chars] != first_word[0, num_chars] }
  } - 1]
end

Answered By: Jordan Running

Answer 6

Ruby one-liner:

l=strings.inject{|l,s| l=l.chop while l!=s[0...l.length];l}

Answered By: AShelly

Answer 7

You just need to traverse all strings until they differ, then take the substring up to this point.

Pseudocode:

loop for i upfrom 0
     while all strings[i] are equal
     finally return substring[0..i]

Common Lisp:

(defun longest-common-starting-substring (&rest strings)
  (loop for i from 0 below (apply #'min (mapcar #'length strings))
     while (apply #'char=
                  (mapcar (lambda (string) (aref string i))
                          strings))
     finally (return (subseq (first strings) 0 i))))

Answered By: Svante

Answer 8

I would do the following:

Take the first string of the array as the initial starting substring.
Take the next string of the array and compare the characters until the end of one of the strings is reached or a mismatch is found. If a mismatch is found, reduce starting substring to the length where the mismatch was found.
Repeat step 2 until all strings have been tested.

Here’s a JavaScript implementation:

var array = ["interspecies", "interstelar", "interstate"],
    prefix = array[0],
    len = prefix.length,
    i, j;
for (i = 1; i < array.length; i++) {
    // never compare past the end of the shorter string
    len = Math.min(len, array[i].length);
    for (j = 0; j < len; j++) {
        if (prefix[j] != array[i][j]) {
            len = j;
            break;
        }
    }
}
// cut the prefix down to the length that survived every comparison
prefix = prefix.substr(0, len);

Answered By: Gumbo

Answer 9

In Python:

>>> from os.path import commonprefix
>>> commonprefix('interspecies interstelar interstate'.split())
'inters'

Answered By: Roberto Bonvallet

Answer 10

In Python I wouldn’t use anything but the existing commonprefix function I showed in another answer, but I couldn’t help reinventing the wheel :P. This is my iterator-based approach:

>>> a = 'interspecies interstelar interstate'.split()
>>>
>>> from itertools import takewhile, chain, izip as zip, imap as map
>>> ''.join(chain(*takewhile(lambda s: len(s) == 1, map(set, zip(*a)))))
'inters'

Edit: Explanation of how this works.

zip generates tuples of elements taking one of each item of a at a time:

In [6]: list(zip(*a))  # here I use list() to expand the iterator
Out[6]:
[('i', 'i', 'i'),
 ('n', 'n', 'n'),
 ('t', 't', 't'),
 ('e', 'e', 'e'),
 ('r', 'r', 'r'),
 ('s', 's', 's'),
 ('p', 't', 't'),
 ('e', 'e', 'a'),
 ('c', 'l', 't'),
 ('i', 'a', 'e')]

By mapping set over these items, I get a series of unique letters:

In [7]: list(map(set, _))  # _ means the result of the last statement above
Out[7]:
[set(['i']),
 set(['n']),
 set(['t']),
 set(['e']),
 set(['r']),
 set(['s']),
 set(['p', 't']),
 set(['a', 'e']),
 set(['c', 'l', 't']),
 set(['a', 'e', 'i'])]

takewhile(predicate, items) takes elements from this while the predicate is True; in this particular case, when the sets have one element, i.e. all the words have the same letter at that position:

In [8]: list(takewhile(lambda s: len(s) == 1, _))
Out[8]:
[set(['i']),
 set(['n']), 
 set(['t']), 
 set(['e']), 
 set(['r']), 
 set(['s'])]

At this point we have an iterable of sets, each containing one letter of the prefix we were looking for. To construct the string, we chain them into a single iterable, from which we get the letters to join into the final string.

The magic of using iterators is that all items are generated on demand, so when takewhile stops asking for items, the zipping stops at that point and no unnecessary work is done. Each function call in my one-liner has an implicit for and an implicit break.
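For readers following along in JavaScript, the same on-demand idea can be sketched with generator functions. This is only an illustration of the approach described above, not a translation endorsed by the answer's author; the names columns, takeWhile and lazyCommonPrefix are made up for this sketch.

```javascript
// Yield, for each position, the characters found at that position in every
// string. Stops at the end of the shortest string, like Python's zip.
function* columns(strings) {
  if (strings.length === 0) return;
  for (let i = 0; strings.every(s => i < s.length); i++) {
    yield strings.map(s => s[i]);
  }
}

// Yield items only while the predicate holds, then stop pulling from the source.
function* takeWhile(predicate, iterable) {
  for (const item of iterable) {
    if (!predicate(item)) return;
    yield item;
  }
}

// Hypothetical helper: join the letters of every column that holds one distinct letter.
function lazyCommonPrefix(strings) {
  const sameLetter = column => new Set(column).size === 1;
  let prefix = '';
  for (const column of takeWhile(sameLetter, columns(strings))) {
    prefix += column[0];
  }
  return prefix;
}

console.log(lazyCommonPrefix(['interspecies', 'interstelar', 'interstate'])); // inters
```

Because generators are pulled one item at a time, no column past the first mismatch is ever built.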

Answered By: Roberto Bonvallet

Answer 11

JavaScript clone of AShelly’s excellent answer.

Requires Array#reduce, which was supported only in Firefox at the time of writing.

var strings = ["interspecies", "intermediate", "interrogation"]
var sub = strings.reduce(function(l,r) { 
    while(l!=r.slice(0,l.length)) {  
        l = l.slice(0, -1);
    }
    return l;
});

Answered By: Chetan S

Answer 12

It’s a matter of taste, but this is a simple javascript version:
It sorts the array, and then looks just at the first and last items.

//longest common starting substring in an array

function sharedStart(array){
    if (!array.length) return '';  // an empty array has no common prefix
    var A= array.concat().sort(), 
    a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
    while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
    return a1.substring(0, i);
}

DEMOS

sharedStart(['interspecies', 'interstelar', 'interstate'])  //=> 'inters'
sharedStart(['throne', 'throne'])                           //=> 'throne'
sharedStart(['throne', 'dungeon'])                          //=> ''
sharedStart(['cheese'])                                     //=> 'cheese'
sharedStart([])                                             //=> ''
sharedStart(['prefix', 'suffix'])                           //=> ''

Answered By: kennebec

Answer 13

It’s not code golf, but you asked for somewhat elegant, and I tend to think recursion is fun. Java.

/** Recursively find the common prefix. */
public String findCommonPrefix(String[] strings) {

    int minLength = findMinLength(strings);

    // Stop once the shortest string is exhausted; otherwise charAt(0) would throw.
    if (minLength > 0 && isFirstCharacterSame(strings)) {
        return strings[0].charAt(0) + findCommonPrefix(removeFirstCharacter(strings));
    } else {
        return "";
    }
}

/** Get the minimum length of a string in strings[]. */
private int findMinLength(final String[] strings) {
    int length = strings[0].length();
    for (String string : strings) {
        if (string.length() < length) {
            length = string.length();
        }
    }
    return length;
}

/** Compare the first character of all strings. */
private boolean isFirstCharacterSame(String[] strings) {
    char c = strings[0].charAt(0);
    for (String string : strings) {
        if (c != string.charAt(0)) return false;
    }

    return true;
}

/** Remove the first character of each string in the array, 
    and return a new array with the results. */
private String[] removeFirstCharacter(String[] source) {
    String[] result = new String[source.length];
    for (int i=0; i<result.length; i++) {
        result[i] = source[i].substring(1); 
    }
    return result;
}

Answered By: Dean J

Answer 14

My Haskell one-liner:

import Data.List

commonPre :: [String] -> String
commonPre = map head . takeWhile (\(x:xs) -> all (==x) xs) . transpose

EDIT: barkmadley gave a good explanation of the code below. I’d also add that Haskell uses lazy evaluation, so we can be lazy about our use of transpose; it will only transpose our lists as far as necessary to find the end of the common prefix.

Answered By: jberryman

Answer 15

A ruby version based on @Svante’s algorithm. Runs ~3x as fast as my first one.

 def common_prefix set 
   i=0
   rest=set[1..-1]
   set[0].each_byte{|c|
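     # getbyte (Ruby 1.9+) returns an Integer, or nil past the end, so it compares cleanly with each_byte
     rest.each{|e|return set[0][0...i] if e.getbyte(i)!=c}
     i+=1
   }
   set[0]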
 end

Answered By: AShelly

Answer 16

This one is very similar to Roberto Bonvallet’s solution, except in ruby.

chars = %w[interspecies interstelar interstate].map {|w| w.split('') }
chars[0].zip(*chars[1..-1]).map { |c| c.uniq }.take_while { |c| c.size == 1 }.join

The first line replaces each word with an array of chars. Next, I use zip to create this data structure:

[["i", "i", "i"], ["n", "n", "n"], ["t", "t", "t"], ...

map and uniq reduce this to [["i"],["n"],["t"], ...

take_while pulls the chars off the array until it finds one where the size isn’t one (meaning not all chars were the same). Finally, I join them back together.

Answered By: Ben Marini

Answer 17

Just for the fun of it, here’s a version written in (SWI-)PROLOG:

common_pre([[C|Cs]|Ss], [C|Res]) :-
  maplist(head_tail(C), [[C|Cs]|Ss], RemSs), !,
  common_pre(RemSs, Res).
common_pre(_, []).

head_tail(H, [H|T], T).

Running:

?- S=["interspecies", "interstelar", "interstate"], common_pre(S, CP), string_to_list(CPString, CP).

Gives:

CP = [105, 110, 116, 101, 114, 115],
CPString = "inters".

Explanation:

(SWI-)PROLOG treats strings as lists of character codes (numbers). All the predicate common_pre/2 does is recursively pattern-match to select the first code (C) from the head of the first list (string, [C|Cs]) in the list of all lists (all strings, [[C|Cs]|Ss]), and appends the matching code C to the result iff it is common to all (remaining) heads of all lists (strings), else it terminates.

Nice, clean, simple and efficient… 🙂
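The head/tail recursion described above carries over to other languages too. Here is a rough JavaScript sketch of the same idea, assuming plain arrays of strings rather than lists of character codes; commonPrefixRecursive is a name invented for this illustration.

```javascript
// Peel off the first character while every string starts with it,
// then recurse on the tails (mirrors the head/tail pattern match).
function commonPrefixRecursive(strings) {
  // Nothing to compare, or the first string is used up: stop.
  if (strings.length === 0 || strings[0].length === 0) return '';
  const head = strings[0][0];
  // An exhausted string yields undefined here, which never equals head.
  if (!strings.every(s => s[0] === head)) return '';
  return head + commonPrefixRecursive(strings.map(s => s.slice(1)));
}

console.log(commonPrefixRecursive(['interspecies', 'interstelar', 'interstate'])); // inters
```

Unlike the Prolog version, this copies every tail on each step, so it is meant to show the shape of the recursion rather than to be fast.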

Answered By: user206428

Answer 18

Yet another way to do it: use regex greed.

words = %w(interspecies interstelar interstate)
j = '='
str = ['', *words].join(j)
re = "[^#{j}]*"

str =~ /\A
    (?: #{j} ( #{re} ) #{re} )
    (?: #{j}   \1      #{re} )*
\z/x

p $1

And the one-liner, courtesy of mislav (50 characters):

p ARGV.join(' ').match(/^(\w*)\w*(?: \1\w*)*$/)[1]
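The trick relies on a greedy capture followed by backreferences: the first group grabs as much of the first word as it can, and the engine backtracks until every later word also starts with that same captured text. A JavaScript sketch of the same idea follows. It assumes no word contains a newline (used here as the separator), and the function name commonPrefixByRegex is made up for the example.

```javascript
// Assumption: no input word contains '\n', which serves as the separator.
function commonPrefixByRegex(words) {
  if (words.length === 0) return '';
  const joined = words.join('\n');
  // (.*)  greedy candidate prefix taken from the first word
  // .*    rest of the first word
  // (?:\n\1.*)*  every following word must start with the captured prefix
  const match = joined.match(/^(.*).*(?:\n\1.*)*$/);
  return match ? match[1] : '';
}

console.log(commonPrefixByRegex(['interspecies', 'interstelar', 'interstate'])); // inters
```

The backtracking can get expensive on long inputs, so this is more of a curiosity than a recommendation.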

Answered By: FMc

Answer 19

A javascript version based on @Svante’s algorithm:

function commonSubstring(words){
    var iChar, iWord,
        refWord = words[0],
        lRefWord = refWord.length,
        lWords = words.length;
    for (iChar = 0; iChar < lRefWord; iChar += 1) {
        for (iWord = 1; iWord < lWords; iWord += 1) {
            if (refWord[iChar] !== words[iWord][iChar]) {
                return refWord.substring(0, iChar);
            }
        }
    }
    return refWord;
}

Answered By: Mariano Desanze

Answer 20

Instead of sorting, you could just get the min and max of the strings.

To me, elegance in a computer program is a balance of speed and simplicity.
It should not do unnecessary computation, and it should be simple enough to make its correctness evident.

I could call the sorting solution “clever”, but not “elegant”.
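As a rough illustration of the min/max idea (this is not code from the answer itself, and commonPrefixMinMax is a hypothetical name), a single pass can pick out the lexicographically smallest and largest strings, and only those two need to be compared:

```javascript
function commonPrefixMinMax(strings) {
  if (strings.length === 0) return '';
  // One pass to find the lexicographic minimum and maximum; no sorting needed.
  let min = strings[0], max = strings[0];
  for (const s of strings) {
    if (s < min) min = s;
    if (s > max) max = s;
  }
  // Every other string sorts between min and max, so it shares their common prefix.
  let i = 0;
  while (i < min.length && min[i] === max[i]) i++;
  return min.slice(0, i);
}

console.log(commonPrefixMinMax(['interspecies', 'interstelar', 'interstate'])); // inters
```

This keeps the "compare just two strings" property of the sorting solution while doing a linear scan instead of a sort.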

Answered By: Florian F

Answer 21

Combining answers by kennebec, Florian F and jberryman yields the following Haskell one-liner:

commonPrefix l = map fst . takeWhile (uncurry (==)) $ zip (minimum l) (maximum l)

With Control.Arrow one can get a point-free form:

commonPrefix = map fst . takeWhile (uncurry (==)) . uncurry zip . (minimum &&& maximum)

Answered By: Bolo

Answer 22

Oftentimes it’s more elegant to use a mature open source library instead of rolling your own. Then, if it doesn’t completely suit your needs, you can extend it or modify it to improve it, and let the community decide if that belongs in the library.

diff-lcs is a good Ruby gem for longest common subsequence.

Answered By: Grant Hutchins

Answer 23

My solution in Java:

public static String compute(Collection<String> strings) {
    if(strings.isEmpty()) return "";
    Set<Character> v = new HashSet<Character>();
    int i = 0;
    try {
        while(true) {
            for(String s : strings) v.add(s.charAt(i));
            if(v.size() > 1) break;
            v.clear();
            i++;
        }
    } catch(StringIndexOutOfBoundsException ex) {}
    return strings.iterator().next().substring(0, i);
}

Answered By: fferri

Answer 24

The accepted solution is broken (for example, it returns a for strings like ['a', 'ba']). The fix is very simple, you literally have to change only 3 characters (from indexOf(tem1) == -1 to indexOf(tem1) != 0) and the function would work as expected.

Unfortunately, when I tried to edit the answer to fix the typo, SO told me that “edits must be at least 6 characters”. I could change more than those 3 chars by improving naming and readability, but that feels like a little bit too much.

So, below is a fixed and improved (at least from my point of view) version of kennebec’s solution:

function commonPrefix(words) {
  var max_word = words.reduce(function(a, b) { return a > b ? a : b });
  var prefix   = words.reduce(function(a, b) { return a > b ? b : a }); // min word

  while(max_word.indexOf(prefix) != 0) {
    prefix = prefix.slice(0, -1);
  }

  return prefix;
}

(on jsFiddle)

Note that it uses the reduce method (JavaScript 1.8) to find the lexicographic max / min instead of sorting the array and then fetching its first and last elements.

Answered By: Alexis

Answer 25

Golfed JS solution just for fun:

w=["hello", "hell", "helen"];
c=w.reduce(function(p,c){
    for(r="",i=0;p[i]&&p[i]==c[i];r+=p[i],i++){} // p[i]&& stops at the end of p, so identical strings don't loop forever
    return r;
});

Answered By: Dan Prince

Answer 26

Here’s an efficient solution in Ruby. I based the idea on the strategy for a hi/lo guessing game, where you iteratively zero in on the longest prefix.

Someone correct me if I’m wrong, but I think the complexity is O(n log n), where n is the length of the shortest string and the number of strings is considered a constant.

def common(strings)
  lo = 0
  hi = strings.map(&:length).min - 1
  return '' if hi < lo

  guess, last_guess = lo, hi

  while guess != last_guess
    last_guess = guess
    guess = lo + ((hi - lo) / 2.0).ceil

    if strings.map { |s| s[0..guess] }.uniq.length == 1
      lo = guess
    else
      hi = guess
    end
  end

  strings.map { |s| s[0...guess] }.uniq.length == 1 ? strings.first[0...guess] : ''
end

And some checks that it works:

>> common %w{ interspecies interstelar interstate }
=> "inters"

>> common %w{ dog dalmation }
=> "d"

>> common %w{ asdf qwerty }
=> ""

>> common ['', 'asdf']
=> ""
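For comparison, here is how the same hi/lo guessing strategy might look in JavaScript. This is a sketch under my own assumptions rather than a port of the Ruby above: it tracks prefix lengths instead of indices, and the name commonPrefixBinary is invented for the example.

```javascript
function commonPrefixBinary(strings) {
  if (strings.length === 0) return '';
  const first = strings[0];
  // lo: a length already known to be a common prefix; hi: the largest length still possible.
  let lo = 0;
  let hi = Math.min(...strings.map(s => s.length));
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2); // round up so the range always shrinks
    const candidate = first.slice(0, mid);
    if (strings.every(s => s.startsWith(candidate))) {
      lo = mid;      // the guess works, try something longer
    } else {
      hi = mid - 1;  // the guess is too long
    }
  }
  return first.slice(0, lo);
}

console.log(commonPrefixBinary(['interspecies', 'interstelar', 'interstate'])); // inters
```

Each guess still compares up to mid characters per string, which is where the extra log factor in the complexity estimate comes from.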

Answered By: Ben Lee

Answer 27

Fun alternative Ruby solution:

def common_prefix(*strings)
  chars  = strings.map(&:chars)
  length = chars.first.zip( *chars[1..-1] ).index{ |a| a.uniq.length>1 }
  strings.first[0,length]
end

p common_prefix( 'foon', 'foost', 'forlorn' ) #=> "fo"
p common_prefix( 'foost', 'foobar', 'foon'  ) #=> "foo"
p common_prefix( 'a','b'  )                   #=> ""

It might help speed if you used chars = strings.sort_by(&:length).map(&:chars), since the shorter the first string, the shorter the arrays created by zip. However, if you cared about speed, you probably shouldn’t use this solution anyhow. 🙂

Answered By: Phrogz

Answer 28

While reading these answers with all the fancy functional programming, sorting and regexes and whatnot, I just thought: what’s wrong with a little bit of C? So here’s a goofy looking little program.

#include <stdio.h>

int main (int argc, char *argv[])
{
  int i = -1, j, c;

  if (argc < 2)
    return 1;

  while (c = argv[1][++i])
    for (j = 2; j < argc; j++)
      if (argv[j][i] != c)
        goto out;

 out:
  printf("Longest common prefix: %.*s\n", i, argv[1]);
}

Compile it, run it with your list of strings as command line arguments, then upvote me for using goto!

Answered By: skagedal

Answer 29

My Javascript solution:

IMO, using sort is too tricky.
My solution compares letter by letter while looping through the array.
It returns the string as soon as a letter does not match.

This is my solution:

var longestCommonPrefix = function(strs){
    if(strs.length < 1){
        return '';
    }

    var p = 0, i = 0, c = strs[0][0];

    while(p < strs[i].length && strs[i][p] === c){
        i++;
        if(i === strs.length){
            i = 0;
            p++;
            c = strs[0][p];
        }
    }

    return strs[0].substr(0, p);
};

Answered By: Eric Chen

Answer 30

Realizing the risk of this turning into a match of code golf (or is that the intention?), here’s my solution using sed, copied from my answer to another SO question and shortened to 36 chars (30 of which are the actual sed expression). It expects the strings (each on a separate line) to be supplied on standard input or in files passed as additional arguments.

sed 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

A script with sed in the shebang line weighs in at 45 chars:

#!/bin/sed -f
N;s/^\(.*\).*\n\1.*$/\1\n\1/;D

A test run of the script (named longestprefix), with strings supplied as a “here document”:

$ ./longestprefix <<EOF
> interspecies
> interstelar
> interstate
> EOF
inters
$

Answered By: ack

Answer 31

Ruby

require 'abbrev'

ar = ["interspecies", "interstelar", "interstate"]
ar.abbrev.keys.min_by(&:size).chop # => "inters"

Given a set of strings, abbrev calculates the set of unambiguous abbreviations for those strings and returns a hash where the keys are all the possible abbreviations (and the values are the full strings). The shortest key minus the last char will be the common prefix.

Answered By: steenslag
