Antiword can't open 'C:\?????? ????????\info.doc' for reading in Windows
Question:
Description
I am using the textract Python library to extract text from Word documents. The problem is that if the path contains Arabic characters, antiword reports that it can't read the document.
Example
import textract

# Raw strings keep the backslashes literal (otherwise '\t' would become a tab)
# path = r'C:\test-docs\info.doc'
path = r'C:\مجلدات اختبارية\info.doc'
text = textract.process(path, encoding='UTF-8')
print(text)
Error
Traceback (most recent call last):
  File "c:\test-extract-doc.py", line 5, in <module>
    text = textract.process(path, encoding='UTF-8')
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 100, in run
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\مجلدات اختبارية\info.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"I can't find the name of your HOME directory\r\nI can't open 'C:\?????? ????????\info.doc' for reading\r\n"
Notes
- The process works fine if I use .docx documents.
- If I use a directory name without Arabic characters, it also works for .doc documents.
Answers:
After digging into the source code of textract, it becomes clear that for extraction from .doc the (ancient) command line tool antiword is used.
class Parser(ShellParser):
    """Extract text from doc files using antiword.
    """

    def extract(self, filename, **kwargs):
        stdout, stderr = self.run(['antiword', filename])
        return stdout
Python does everything properly, but apparently antiword itself has issues with the way it parses its arguments, at least on Windows, so passing a Unicode path results in breakage.
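The question marks in antiword's error message are consistent with the path having been squeezed into a legacy single-byte code page before antiword opened the file. A minimal sketch of that effect follows; cp1252 is an assumed code page chosen purely for illustration, since the question does not say which one the system uses.

```python
# Illustration only: cp1252 is an assumed legacy code page, not taken from the question.
path = r'C:\مجلدات اختبارية\info.doc'

# Characters the code page cannot represent are replaced with '?',
# while the ASCII parts of the path survive unchanged.
mangled = path.encode('cp1252', errors='replace').decode('cp1252')
print(mangled)
```

Under that assumption, each Arabic letter turns into one question mark, which gives the same 6 + 8 pattern seen in the stderr output.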
Luckily Windows offers a way of converting any path into a backwards-compatible form of ANSI-only 8.3 filenames – the so-called "short" paths, which can be requested from the system with a Win32 API call. Short paths and regular ("long") paths are interchangeable, but legacy software might like short paths better.
This provides a workaround: retrieve the short path for any .doc file and give that to antiword instead. Win32 API calls are available in Python through the win32api module (part of the pywin32 package):
import textract
from win32api import GetShortPathName

def extract_text(path):
    # antiword is only used for .doc files, so only those need the short path
    if path.lower().endswith(".doc"):
        path = GetShortPathName(path)
    return textract.process(path, encoding='UTF-8')

text = extract_text(r'C:\مجلدات اختبارية\info.doc')
print(text)
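One caveat: 8.3 short-name generation can be turned off on a Windows volume, and in that case the short path may still contain the original non-ASCII characters. The following is a sketch of a more defensive variant, not part of textract or antiword. It calls the Win32 function GetShortPathNameW through ctypes (so pywin32 is not needed) and, if the result is still not ASCII, copies the file to a temporary ASCII-named location instead. The helper names `ascii_safe_path` and `extract_with` are hypothetical, and the sketch assumes the temp directory itself has an ASCII-only path.

```python
import os
import shutil
import sys
import tempfile


def ascii_safe_path(path):
    """Return (usable_path, temp_dir); temp_dir is None unless a copy was made."""
    if path.isascii():
        return path, None

    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        get_short = ctypes.windll.kernel32.GetShortPathNameW
        get_short.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
        get_short.restype = wintypes.DWORD
        buf = ctypes.create_unicode_buffer(32768)
        # A return value of 0 means failure; otherwise buf holds the short path.
        if get_short(path, buf, len(buf)) and buf.value.isascii():
            return buf.value, None

    # Fallback: copy the file to a temporary location with an ASCII-only name.
    # Assumption: the temp directory path itself is ASCII-only.
    temp_dir = tempfile.mkdtemp(prefix="doc_")
    ext = os.path.splitext(path)[1]
    if not ext.isascii():
        ext = ""
    safe = os.path.join(temp_dir, "input" + ext)
    shutil.copyfile(path, safe)
    return safe, temp_dir


def extract_with(extractor, path):
    """Call extractor(safe_path) and remove any temporary copy afterwards."""
    safe, temp_dir = ascii_safe_path(path)
    try:
        return extractor(safe)
    finally:
        if temp_dir is not None:
            shutil.rmtree(temp_dir, ignore_errors=True)
```

With textract installed, this could be used as extract_with(lambda p: textract.process(p, encoding='UTF-8'), path). Passing the extractor in as a callable keeps the path handling independent of textract.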