Calculating BLEU and ROUGE scores as fast as possible

Question:

I have around 200 candidate sentences, and for each candidate I want to measure the BLEU score by comparing it against thousands of reference sentences. These references are the same for all candidates. Here is how I’m doing it right now:

ref_for_all = [reference] * len(sents)
score = corpus_bleu(ref_for_all, [i.split() for i in sents], weights=(0, 1, 0, 0))

The reference contains the whole corpus I want to compare each sentence with, and sents are my sentences (the candidates). Unfortunately, this takes too long, and given the experimental nature of my code, I cannot wait that long for the results. Is there any other way (for example, using regex) to get these scores faster? I also have this problem with ROUGE, so any suggestion for that is highly appreciated too!
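
For context, here is a minimal, self-contained version of that snippet with toy sentences (corpus_bleu is NLTK's):

from nltk.translate.bleu_score import corpus_bleu

# toy data standing in for the real reference corpus and candidates
reference = [s.split() for s in ["the cat sat on the mat", "a dog ran home"]]
sents = ["the cat is on the mat", "my dog went home"]

ref_for_all = [reference] * len(sents)  # the same references for every candidate
score = corpus_bleu(ref_for_all, [i.split() for i in sents], weights=(0, 1, 0, 0))
print(score)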

Asked By: mitra mirshafiee


Answers:

After searching and experimenting with different packages and measuring the time each one needed to calculate the scores, I found NLTK's corpus_bleu and PyRouge to be the most efficient. Just keep in mind that in my data each record has multiple hypotheses, which is why I first take the mean for each record and then average over all records.
This is how I did it for BLEU:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

cc = SmoothingFunction()

# ref is the list of reference sentences (strings); tokenize them once up front
reference = [[i.split() for i in ref]]

def find_my_bleu(text, w):
    # score one generated text against the shared references with the given weights
    candidates_ = [text.split()]
    return corpus_bleu(reference, candidates_, weights=w,
                       smoothing_function=cc.method4)

import numpy as np

def get_final_bleu(output_df):
    print('Started calculating the bleu scores...')
    # individual 1-, 2- and 3-gram BLEU for every generated text of each record
    output_df.loc[:, 'bleu_1'] = output_df.loc[:, 'final_predicted_verses'].apply(lambda x: [find_my_bleu(t, (1, 0, 0, 0)) for t in x])
    output_df.loc[:, 'bleu_2'] = output_df.loc[:, 'final_predicted_verses'].apply(lambda x: [find_my_bleu(t, (0, 1, 0, 0)) for t in x])
    output_df.loc[:, 'bleu_3'] = output_df.loc[:, 'final_predicted_verses'].apply(lambda x: [find_my_bleu(t, (0, 0, 1, 0)) for t in x])

    print('Now the average score...')
    # mean over the generated texts of each record
    output_df.loc[:, 'bleu_3_mean'] = output_df.loc[:, 'bleu_3'].apply(lambda x: np.mean(x))
    output_df.loc[:, 'bleu_2_mean'] = output_df.loc[:, 'bleu_2'].apply(lambda x: np.mean(x))
    output_df.loc[:, 'bleu_1_mean'] = output_df.loc[:, 'bleu_1'].apply(lambda x: np.mean(x))

    # overall mean across all records
    print('mean bleu_3 score: ', np.mean(output_df.loc[:, 'bleu_3_mean']))
    print('mean bleu_2 score: ', np.mean(output_df.loc[:, 'bleu_2_mean']))
    print('mean bleu_1 score: ', np.mean(output_df.loc[:, 'bleu_1_mean']))
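
A hypothetical usage of the above, assuming reference and cc are set up as shown and that final_predicted_verses holds a list of generated strings per record (the toy DataFrame below is only illustrative):

import pandas as pd

output_df = pd.DataFrame({
    'final_predicted_verses': [
        ["the cat is on the mat", "a cat was on the mat"],
        ["my dog went home", "the dog ran home quickly"],
    ],
})
get_final_bleu(output_df)  # adds the bleu_1/2/3 columns plus their per-record means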

For ROUGE:

# assuming PyRouge here is the class from the rouge-metric package
from rouge_metric import PyRouge

rouge = PyRouge(rouge_n=(1, 2), rouge_l=True, rouge_w=False, rouge_s=False, rouge_su=False)

def find_my_rouge(text):
    # score one generated text against the shared tokenized reference corpus
    hypotheses = [[text.split()]]
    score = rouge.evaluate_tokenized(hypotheses, [[reference_rouge]])
    return score

Then, to take the mean over all the generated texts of each record:

def get_short_rouge(list_dicts):
    """Get the mean of all generated texts for each record."""
    l_r = l_p = l_f = 0
    one_r = one_p = one_f = 0
    two_r = two_p = two_f = 0

    for d in list_dicts:
        one_r += d['rouge-1']['r']
        one_p += d['rouge-1']['p']
        one_f += d['rouge-1']['f']

        two_r += d['rouge-2']['r']
        two_p += d['rouge-2']['p']
        two_f += d['rouge-2']['f']

        l_r += d['rouge-l']['r']
        l_p += d['rouge-l']['p']
        l_f += d['rouge-l']['f']

    length = len(list_dicts)

    return {'rouge-1': {'r': one_r/length, 'p': one_p/length, 'f': one_f/length},
            'rouge-2': {'r': two_r/length, 'p': two_p/length, 'f': two_f/length},
            'rouge-l': {'r': l_r/length, 'p': l_p/length, 'f': l_f/length}}
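
To see the shape it works with, here is an illustrative call with two hand-made score dictionaries (the numbers are made up; the structure mirrors what rouge.evaluate_tokenized returns above):

example_scores = [
    {'rouge-1': {'r': 0.50, 'p': 0.40, 'f': 0.44},
     'rouge-2': {'r': 0.20, 'p': 0.15, 'f': 0.17},
     'rouge-l': {'r': 0.45, 'p': 0.35, 'f': 0.39}},
    {'rouge-1': {'r': 0.60, 'p': 0.50, 'f': 0.55},
     'rouge-2': {'r': 0.30, 'p': 0.25, 'f': 0.27},
     'rouge-l': {'r': 0.55, 'p': 0.45, 'f': 0.49}},
]
print(get_short_rouge(example_scores))  # averaged r/p/f per metric for this record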

def get_overal_rouge_mean(output_df):
    print('Started getting the overall rouge of each record...')
    output_df.loc[:, 'rouge_mean'] = output_df.loc[:, 'rouge'].apply(lambda x: get_short_rouge(x))

    print('Started getting the overall rouge of all records...')
    l_r = l_p = l_f = 0
    one_r = one_p = one_f = 0
    two_r = two_p = two_f = 0

    # note: this assumes output_df has a default RangeIndex (0..len-1)
    for i in range(len(output_df)):
        d = output_df.loc[i, 'rouge_mean']

        one_r += d['rouge-1']['r']
        one_p += d['rouge-1']['p']
        one_f += d['rouge-1']['f']

        two_r += d['rouge-2']['r']
        two_p += d['rouge-2']['p']
        two_f += d['rouge-2']['f']

        l_r += d['rouge-l']['r']
        l_p += d['rouge-l']['p']
        l_f += d['rouge-l']['f']

    length = len(output_df)
    print('overall rouge scores: ')
    print({'rouge-1': {'r': one_r/length, 'p': one_p/length, 'f': one_f/length},
           'rouge-2': {'r': two_r/length, 'p': two_p/length, 'f': two_f/length},
           'rouge-l': {'r': l_r/length, 'p': l_p/length, 'f': l_f/length}})
    return output_df
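
For completeness, a hypothetical way to build the rouge column this function expects (one score dict per generated text, using find_my_rouge and the same final_predicted_verses column as in the BLEU part) and then run the averaging:

output_df.loc[:, 'rouge'] = output_df.loc[:, 'final_predicted_verses'].apply(
    lambda texts: [find_my_rouge(t) for t in texts])
output_df = get_overal_rouge_mean(output_df)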

I hope it helps anyone who’s had this problem.

Answered By: mitra mirshafiee

First of all, I recommend running these kinds of experiments with multiprocessing. You can configure the multiprocessing arguments in different ways. For instance, you can put one candidate sentence plus the sentences from the reference list into a single element of the work list. That way, when you call the pool, each element of the list is processed by a dedicated CPU. Below is a pseudo example:

import os
from multiprocessing import Pool

mp_list = []
large_list = [...]            # the shared reference sentences
candidate_sentences = [...]   # the ~200 candidate sentences
for i, cand_sent in enumerate(candidate_sentences):
    mp_list.append((i, cand_sent, large_list))

if __name__ == '__main__':    # needed when the start method is 'spawn' (e.g. Windows/macOS)
    results = {}
    with Pool(os.cpu_count()) as pool:
        for i, out_scores in pool.imap_unordered(scoring_fn, mp_list):
            results[i] = out_scores   # store the returned scores, keyed by candidate index

To keep track of which scores belong to which candidate sentence (one worker may finish earlier than another, so results come back out of order), you will need to return the i index from the scoring function.
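
A hypothetical scoring_fn along those lines, here computing BLEU-2 with NLTK's sentence_bleu purely for illustration (the work-item layout matches the sketch above):

from nltk.translate.bleu_score import sentence_bleu

def scoring_fn(item):
    # one work item = (index, candidate sentence, shared reference list)
    i, cand_sent, large_list = item
    refs = [r.split() for r in large_list]  # tokenize the references
    score = sentence_bleu(refs, cand_sent.split(), weights=(0, 1, 0, 0))
    return i, score  # return the index together with the score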

As for the packages: at some point I benchmarked this across different ROUGE wrappers, and based on those experiments the rouge-score package (implemented by Google Research) seems to be the fastest bet.
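
If you go that route, usage looks roughly like this (the sentences are illustrative):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score('the cat sat on the mat',        # reference
                      'the cat is sitting on the mat') # hypothesis
print(scores['rouge1'].fmeasure, scores['rougeL'].fmeasure)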

To get quick ROUGE estimates, though, I usually use the implementation from the PreSumm package. Its script gives very fast estimated scores for the ROUGE-1 and ROUGE-2 metrics (recall, precision, and F1); sadly, it doesn't include ROUGE-L. Although I don't recommend using it to report final system performance, it is handy for oracle experiments such as oracle sentence labeling.
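
As a rough illustration of the idea only (this is not PreSumm's actual code), such a fast estimate can be computed directly from n-gram overlap counts:

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def estimated_rouge_n(hypothesis, reference, n=1):
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(reference.split(), n)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {'r': recall, 'p': precision, 'f': f1}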

Answered By: inverted_index