Python enumerate() tqdm progress-bar when reading a file?

Question:

I can’t see the tqdm progress bar when I use this code to iterate my opened file:

with open(file_path, 'r') as f:
    for i, line in enumerate(tqdm(f)):
        if i >= start and i <= end:
            print("line #: %s" % i)
            for i in tqdm(range(0, line_size, batch_size)):
                # pause if a file named pause exists in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)

what’s the right way to use tqdm here?

Asked By: Wei Wu

||

Answers:

You’re on the right track, with two problems: the print inside the loop interferes with tqdm’s output, and nesting tqdm calls creates competing bars. Drop the print and apply tqdm only to your first for loop, not the inner ones, like so:

with open(file_path, 'r') as f:
    for i, line in enumerate(tqdm(f)):
        if i >= start and i <= end:
            for k in range(0, line_size, batch_size):  # renamed from i to avoid shadowing the line counter
                # pause if a file named pause exists in the current dir
                re_batch = {}
                for j in range(batch_size):
                    re_batch[j] = re.search(line, last_span)

Some notes on using enumerate with tqdm can be found here.
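To illustrate why the bar may look different depending on how you combine the two (a minimal sketch; data.txt is a placeholder name, and a tiny sample file is created just so the snippet runs):

```python
from tqdm import tqdm

path = 'data.txt'  # placeholder name; create a tiny sample for illustration
with open(path, 'w') as f:
    f.write('one\ntwo\nthree\n')

# A bare file object has no len(), so tqdm can only show a running
# count and rate here, not a percentage bar:
with open(path) as f:
    for i, line in enumerate(tqdm(f)):
        pass

# Supplying total= restores the percentage bar; the wrapping order
# tqdm(enumerate(f), ...) works just as well as enumerate(tqdm(f)):
with open(path) as f:
    num_lines = sum(1 for _ in f)
with open(path) as f:
    for i, line in tqdm(enumerate(f), total=num_lines):
        pass
```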

I ran into this as well – tqdm is not displaying a progress bar, because the number of lines in the file object has not been provided.

The for loop will iterate over lines, reading until the next newline character is encountered.

To get an actual progress bar, first scan the file and count its lines, then pass that count to tqdm as the total argument:

from tqdm import tqdm

with open('myfile.txt', 'r') as f:
    num_lines = sum(1 for _ in f)
with open('myfile.txt', 'r') as f:
    for line in tqdm(f, total=num_lines):
        print(line)
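If the counting pass itself is too slow on large files, a common speed-up (a sketch, not part of the original answer) is to count newline bytes in large binary chunks rather than iterating line by line:

```python
def count_lines(path):
    # Count newline bytes in 1 MiB binary chunks. Note a file without
    # a trailing newline will come out one short.
    with open(path, 'rb') as f:
        return sum(chunk.count(b'\n')
                   for chunk in iter(lambda: f.read(1 << 20), b''))
```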
Answered By: user1446308

I’m trying to do the same thing on a file containing all Wikipedia articles, so I don’t want to count the total lines before starting to process. It’s also a bz2-compressed file, so the length of each decompressed line overestimates the number of bytes actually read in that iteration. Instead, track progress against the file size on disk:

import bz2
from pathlib import Path
from tqdm import tqdm

with tqdm(total=Path(filepath).stat().st_size) as pbar:
    with bz2.open(filepath) as fin:
        for i, line in enumerate(fin):
            if not i % 1000:
                pbar.update(fin.tell() - pbar.n)
            # do something with the decompressed line
    # Debug-by-print to see the attributes of `pbar`:
    # print(vars(pbar))
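An alternative worth considering is tqdm’s wrapattr helper, which instruments the raw stream’s read() calls so the bar advances by compressed bytes with no manual update() bookkeeping. A sketch (sample.bz2 is a placeholder; a small sample file is created just so the snippet runs):

```python
import bz2
import os
from tqdm import tqdm

filepath = 'sample.bz2'  # placeholder; any .bz2 text file works
with bz2.open(filepath, 'wt') as f:  # create a small sample for illustration
    f.writelines(f'line {n}\n' for n in range(1000))

lines_seen = 0
# wrapattr wraps raw's read() so tqdm advances by *compressed* bytes
# read from disk, matching the total from os.path.getsize():
with tqdm.wrapattr(open(filepath, 'rb'), 'read',
                   total=os.path.getsize(filepath)) as raw:
    with bz2.open(raw) as fin:
        for line in fin:
            lines_seen += 1  # process the decompressed line here
```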

Thank you Yohan Kuanke for your deleted answer. If moderators undelete it you can crib mine.

Answered By: hobs

If you read the file with readlines(), the following can be used:

from tqdm import tqdm
with open(filename) as f:
    sentences = tqdm(f.readlines(), unit='MB')

The unit='MB' can be changed to 'B', 'KB', or 'GB' accordingly. Note, however, that unit is only a display label: tqdm is still counting the lines returned by readlines(), not bytes.

Answered By: Ashwin Geet D'Sa

If you are reading from a very large file, try this approach:

from tqdm import tqdm
import os
import sys

file_size = os.path.getsize(filename)
lines_read = []
pbar = tqdm(total=file_size, unit='B', unit_scale=True)  # scale bytes to KB/MB/GB
with open(filename, 'r', encoding='UTF-8') as file:
    while (line := file.readline()):
        lines_read.append(line)
        pbar.update(sys.getsizeof(line) - sys.getsizeof('\n'))
pbar.close()

I left out the processing you might want to do before the append(line).

EDIT:

I changed len(line) to sys.getsizeof(line) - sys.getsizeof('\n') because len(line) is not an accurate representation of how many bytes were actually read (see other posts about this). Even this is not 100% accurate, since sys.getsizeof(line) is not the real number of bytes read, but it is a "close enough" hack if the file is very large.

I did try using f.tell() instead and subtracting a file pos delta in the while loop but f.tell with non-binary files is very slow in Python 3.8.10.

As per the link below, I also tried using f.tell() with Python 3.10 but that is still very slow.

If anyone has a better strategy, please feel free to edit this answer, but please provide some performance numbers before you do. Remember that counting the number of lines before the loop is not acceptable for very large files and defeats the purpose of showing a progress bar in the first place (try a 30 GB file with 300 million lines, for example).
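One strategy worth benchmarking (a sketch, not measured here): open the file in binary mode, where iterating still splits on newlines and len(raw_line) is exactly the number of bytes consumed, avoiding both f.tell() and the getsizeof approximation; you then decode each line yourself. In this sketch, big.txt is a placeholder and a small sample file is created just so the snippet runs:

```python
import os
from tqdm import tqdm

filename = 'big.txt'  # placeholder; create a small sample for illustration
with open(filename, 'w', encoding='UTF-8') as f:
    f.writelines(f'row {n}\n' for n in range(1000))

bytes_seen = 0
file_size = os.path.getsize(filename)
with open(filename, 'rb') as f, \
        tqdm(total=file_size, unit='B', unit_scale=True) as pbar:
    for raw in f:
        line = raw.decode('UTF-8')  # decode each line yourself
        pbar.update(len(raw))       # exact bytes consumed, no guesswork
        bytes_seen += len(raw)
```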

Why f.tell() is slow in Python when reading a file in non-binary mode
https://bugs.python.org/issue11114

Answered By: ejkitchen