How to check if a file contains plain text?

Question:

I have a folder full of files and I want to search for a string inside them. The problem is that some of the files may be zip, exe, ogg, etc.
Is there a way to check what kind of file each one is, so that I only open and search through txt, PHP, etc. files?
I can’t rely on the file extension.

Asked By: daniels


Answers:

If you’re on Linux, you can parse the output of the file command-line tool.
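A minimal sketch of that idea, assuming the file utility is installed and on the PATH. The helper names mime_type_from_file_cmd and is_text_mime are made up for illustration; the only flags used are file's --brief and --mime-type options.

```python
import shutil
import subprocess


def mime_type_from_file_cmd(path):
    """Return the MIME type that `file` reports for path, or None if unavailable."""
    # Bail out quietly when the `file` utility is not installed.
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    # Output is a single line such as "text/plain"; empty output maps to None.
    return result.stdout.strip() or None


def is_text_mime(mime):
    """True for MIME types in the text/ family (text/plain, text/x-php, ...)."""
    return mime is not None and mime.startswith("text/")
```

Usage would be along the lines of is_text_mime(mime_type_from_file_cmd(path)). Checking the text/ prefix rather than an exact 'text/plain' match is a design choice here, since source files are often reported with more specific text subtypes.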

Answered By: jdizzle

You can use the Python interface to libmagic to identify file formats.

>>> import magic
>>> f = magic.Magic(mime=True)
>>> f.from_file('testdata/test.txt')
'text/plain'

For more examples, see the repo.

Answered By: Sinan Ünür

Use Python’s mimetypes library:

import mimetypes
# Note: guess_type() looks only at the file name, not at the file's contents.
if mimetypes.guess_type('full path to document here')[0] == 'text/plain':
    # file is plaintext
    pass

Answered By: Mike Cialowicz

Try something like this:

def is_binary_file(filepathname):
    # Bytes that commonly occur in text: a few control characters,
    # printable ASCII, and the high (0x80-0xFF) range.
    textchars = bytearray([7,8,9,10,12,13,27]) + bytearray(range(0x20, 0x7f)) + bytearray(range(0x80, 0x100))

    # Read only the first 1024 bytes and close the file afterwards.
    with open(filepathname, 'rb') as f:
        sample = f.read(1024)

    # Anything left after deleting the text bytes marks the file as binary.
    return bool(sample.translate(None, textchars))

Use the function like this:

is_binary_file('<your file path name>')

This returns True if the file looks binary and False if it looks like text. It should be easy to adapt to your needs, e.g. by writing an is_text_file function; I leave that up to you.
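For the original use case (searching a folder while skipping binary files), the byte-sampling check could be combined with a directory walk roughly as follows. This is only a sketch: is_text_file, find_in_text_files and the sample_size parameter are illustrative names, and decoding as UTF-8 with errors="replace" is an assumption about the files' encoding.

```python
import os

# Same "text-like" byte set as in the answer above: a few control
# characters, printable ASCII, and the 0x80-0xFF range.
TEXT_CHARS = bytearray(
    {7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100))
)


def is_text_file(path, sample_size=1024):
    """Heuristic: the first sample_size bytes contain only text-like bytes."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    # translate() deletes every text-like byte; leftovers mean binary data.
    return not sample.translate(None, TEXT_CHARS)


def find_in_text_files(folder, needle):
    """Return sorted paths of text-like files under folder that contain needle."""
    matches = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                if not is_text_file(path):
                    continue
                with open(path, "r", encoding="utf-8", errors="replace") as f:
                    if needle in f.read():
                        matches.append(path)
            except OSError:
                # Unreadable files are skipped rather than aborting the search.
                continue
    return sorted(matches)
```

Because only the first bytes are sampled, a file with binary data further in would still be treated as text; that trade-off keeps the check cheap on large folders.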

Answered By: serup

The example

import mimetypes
if mimetypes.guess_type('full path to document here')[0] == 'text/plain':
    # file is plaintext
    pass

does not work. I tried mimetypes.guess_type('full path to document here')[0] on some random files and got this:

text/markdown
None
text/html
None
text/x-python

There’s no 'text/plain'. Perhaps checking that the string begins with 'text' would work.
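A sketch of that prefix check, with the None results handled explicitly. The function name looks_like_text is hypothetical. Note that mimetypes.guess_type() decides from the file name alone, so this still depends on the extension.

```python
import mimetypes


def looks_like_text(path):
    """True if the guessed MIME type is in the text/ family."""
    mime, _encoding = mimetypes.guess_type(path)
    # guess_type() returns (None, None) when it cannot guess, e.g. when
    # the name has no extension or an unknown one.
    return mime is not None and mime.startswith("text/")
```

With this, names that produce text/markdown, text/html or text/x-python would all count as text, while a None result counts as not text.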

Answered By: Skinny Pete