Creating random binary files

Question:

I’m trying to use python to create a random binary file. This is what I’ve got already:

import random, struct, sys

f = open(filename,'wb')
for i in xrange(size_kb):
    for ii in xrange(1024/4):
        f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))

f.close()

But it’s terribly slow (0.82 seconds for size_kb=1024 on my 3.9 GHz machine with an SSD). A big bottleneck seems to be the random-integer generation (replacing randint() with a constant 0 drops the running time from 0.82 s to 0.14 s).

Now I know there are more efficient ways of creating random data files (namely dd if=/dev/urandom), but I’m trying to figure this out for the sake of curiosity: is there an obvious way to improve this?
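
For reference, a minimal timing sketch that reproduces the measurement described above (Python 3 syntax, with 2**32 - 1 standing in for the Python 2 expression sys.maxint*2+1, and 'test.bin' as a placeholder file name):

import random, struct, time

def write_random(path, size_kb):
    # Same structure as the loop above: 4-byte unsigned ints, one at a time.
    with open(path, 'wb') as f:
        for _ in range(size_kb):
            for _ in range(1024 // 4):
                f.write(struct.pack("=I", random.randint(0, 2**32 - 1)))

start = time.perf_counter()
write_random('test.bin', 1024)   # 'test.bin' is a placeholder name
print("took %.2f s" % (time.perf_counter() - start))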

Asked By: gardarh


Answers:

IMHO – the following is completely redundant:

f.write(struct.pack("=I",random.randint(0,sys.maxint*2+1)))

There’s absolutely no need to use struct.pack, just do something like:

import os

fileSizeInBytes = 1024
with open('output_filename', 'wb') as fout:
    fout.write(os.urandom(fileSizeInBytes))  # os.urandom returns fileSizeInBytes random bytes in one call

Then, if you later need to read the file back as integers, use struct.unpack.
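
A minimal sketch of that read-back step, reusing the 'output_filename' from above and assuming you want 4-byte unsigned integers (struct.iter_unpack requires the byte count to be a multiple of 4):

import struct

with open('output_filename', 'rb') as fin:
    data = fin.read()

# iter_unpack yields one tuple per 4-byte record in the buffer
values = [v for (v,) in struct.iter_unpack("=I", data)]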

(My use case is generating a file for a unit test, so I just need a file that isn’t identical to other generated files.)

Another option is to just write a UUID4 to the file, but since I don’t know the exact use case, I’m not sure that’s viable.
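For completeness, a sketch of that UUID4 variant (each uuid4() call yields exactly 16 random bytes, so repeat it if a larger file is needed; the file name below is just a placeholder):

import uuid

with open('uuid_output.bin', 'wb') as fout:
    fout.write(uuid.uuid4().bytes)   # 16 random bytes per call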

Answered By: Jon Clements

The Python code you should write depends entirely on how you intend to use the random binary file. If you just need “rather good” randomness for general purposes, then the code of Jon Clements is probably the best.

However, on Linux at least, os.urandom relies on /dev/urandom, which is described in the Linux kernel source (drivers/char/random.c) as follows:

The /dev/urandom device […] will return as many bytes as are
requested. As more and more random bytes are requested without giving
time for the entropy pool to recharge, this will result in random
numbers that are merely cryptographically strong. For many
applications, however, this is acceptable.

So the question is: is this acceptable for your application? If you prefer a more secure RNG, you could read bytes from /dev/random instead. The main inconvenience of this device is that it can block indefinitely if the Linux kernel is not able to gather enough entropy. There are also other cryptographically secure RNGs such as EGD.
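
A minimal sketch of reading from the blocking device instead, assuming Linux (the read may stall until the kernel has gathered enough entropy; 'secure_output.bin' is a placeholder name):

n_bytes = 1024

with open('/dev/random', 'rb') as dev:
    data = dev.read(n_bytes)   # may block until enough entropy is available

with open('secure_output.bin', 'wb') as fout:
    fout.write(data)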

Alternatively, if your main concern is execution speed and you just need some “light randomness” for a Monte Carlo method (i.e. unpredictability doesn’t matter, but a uniform distribution does), you could consider generating your random binary file once and reusing it many times, at least during development.
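
A sketch of that generate-once-and-reuse idea, with a hypothetical cache file name and size:

import os

CACHE_FILE = 'random_testdata.bin'   # hypothetical cache file
SIZE_BYTES = 1024 * 1024             # 1 MiB; adjust as needed

# Generate the random file only if it does not already exist with the right size.
if not (os.path.exists(CACHE_FILE) and os.path.getsize(CACHE_FILE) == SIZE_BYTES):
    with open(CACHE_FILE, 'wb') as fout:
        fout.write(os.urandom(SIZE_BYTES))

# Subsequent runs simply reuse the cached data.
with open(CACHE_FILE, 'rb') as fin:
    data = fin.read()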

Answered By: tvuillemin

Here’s a complete script, based on the accepted answer, that creates random files.

import sys, os
def help(error: str = None) -> None:
    if error and error != "help":
        print("***", error, "\n\n", file=sys.stderr, sep=' ', end='')
        sys.exit(1)

    print("""\tCreates binary files with random content""", end='\n')
    print("""Usage:""")
    print(os.path.split(__file__)[1], """ "name1" "1TB" "name2" "5kb"
        Accepted units: MB, GB, KB, TB, B""")
    sys.exit(2)

# https://stackoverflow.com/a/51253225/1077444
def convert_size_to_bytes(size_str):
    """Convert human filesizes to bytes.
    ex: 1 tb, 1 kb, 1 mb, 1 pb, 1 eb, 1 zb, 3 yb
    To reverse this, see hurry.filesize or the Django filesizeformat template
    filter.

    :param size_str: A human-readable string representing a file size, e.g.,
    "22 megabytes".
    :return: The number of bytes represented by the string.
    """
    multipliers = {
        'kilobyte':  1024,
        'megabyte':  1024 ** 2,
        'gigabyte':  1024 ** 3,
        'terabyte':  1024 ** 4,
        'petabyte':  1024 ** 5,
        'exabyte':   1024 ** 6,
        'zettabyte': 1024 ** 7,
        'yottabyte': 1024 ** 8,
        'kb': 1024,
        'mb': 1024**2,
        'gb': 1024**3,
        'tb': 1024**4,
        'pb': 1024**5,
        'eb': 1024**6,
        'zb': 1024**7,
        'yb': 1024**8,
    }

    size_str = size_str.lower().strip().strip('s')
    for suffix in multipliers:
        if size_str.endswith(suffix):
            return int(float(size_str[0:-len(suffix)]) * multipliers[suffix])
    else:
        if size_str.endswith('b'):
            size_str = size_str[0:-1]
        elif size_str.endswith('byte'):
            size_str = size_str[0:-4]
    return int(size_str)


if __name__ == "__main__":
    input = {} #{ file: byte_size }
    if (len(sys.argv)-1) % 2 != 0:
        print("-- Provide even number of arguments --")
        print(f'--\tGot: {len(sys.argv)-1}: "' + r'" "'.join(sys.argv[1:]) + '"')
        sys.exit(2)
    elif len(sys.argv) == 1:
        help()

    try:
        for file, size_str in zip(sys.argv[1::2], sys.argv[2::2]):
            input[file] = convert_size_to_bytes(size_str)
    except ValueError as ex:
        print(f'Invalid size: "{size_str}"', file=sys.stderr)
        sys.exit(1)

    for file, size_bytes in input.items():
        print(f"Writing: {file}")
        #https://stackoverflow.com/a/14276423/1077444
        with open(file, 'wb') as fout:
            while size_bytes > 0:
                wrote = min(size_bytes, 1024)   # write in chunks of at most 1 KiB
                fout.write(os.urandom(wrote))
                size_bytes -= wrote
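
A hedged usage note: assuming the script above is saved as make_random_files.py (a hypothetical name), it takes name/size pairs on the command line, e.g.:

python make_random_files.py data1.bin 10MB data2.bin 512KB

This creates data1.bin (10 MiB) and data2.bin (512 KiB) filled with bytes from os.urandom, writing in 1 KiB chunks so very large files do not have to fit in memory.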
Answered By: L8Cod3r