R Reticulate – Reading a file path with Arabic (Persian) UTF-8 characters in the file name

Question:

I’m trying to read files that have Arabic characters in the file name.

The problem is that I can’t read the file, because the Arabic letters in the file name cause an issue.

The posts linked below recommend using Python to handle such Unicode names, so I’ve decided to use Python via reticulate just to read the files.
It would also be nice to know how to read files with Arabic file names in R itself.

This topic has been raised in the links below:

Read in file with UTF-8 character in path in R

Manipulating files with non-English names in R

See the reticulate environment details in the ‘My Environment’ section below.

Switching to Python to read such files with reticulate::repl_python():

import locale
import os
import pandas as pd  # needed for the pd.read_excel calls below
locale.setlocale(locale.LC_ALL,'persian')

Below are my two test files:

1- The first one is a test Excel file I’ve created.

2- The second one is the file that I can’t read. This file has a Persian file name with Arabic letters.

>>> os.chdir(r'C:\Data')
>>> os.listdir()
['File ملف.xlsx', 'توزیع نقدی 1401.xls']

# Reading the first file
>>> df = pd.read_excel('File ملف.xlsx')
>>>

# Reading the second file
>>> df2 = pd.read_excel('توزیع نقدی 1401.xls')
FileNotFoundError: [Errno 2] No such file or directory: 'طھظˆط²ظٹط¹ ظ†ظ‚ط¯ظٹ 1401.xls'
>>> 

It seems that the Python session in RStudio/reticulate is unable to read this Arabic file name, although it read the test file I created.
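A possible explanation, offered as an assumption rather than a confirmed diagnosis: the session locale uses code page 1256, and the second file name appears to contain the Persian letter yeh (U+06CC), which Windows-1256 does not include, whereas every letter in the first file name is representable there. The garbled name in the error message also looks like UTF-8 bytes displayed as cp1256. The sketch below checks both points using only Python's built-in codecs. The helper names unencodable and shown_as are hypothetical, and the spelling of the second name is assumed.

```python
def unencodable(name, codec="cp1256"):
    """Return the characters of `name` that `codec` cannot represent."""
    missing = []
    for ch in name:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


def shown_as(name, wrong_codec="cp1256"):
    """Show how the UTF-8 bytes of `name` look when decoded with `wrong_codec`."""
    return name.encode("utf-8").decode(wrong_codec, errors="replace")


# Names are written with escapes so the example does not depend on the
# encoding of the console it is pasted into.
first = "File \u0645\u0644\u0641.xlsx"
# Assumed spelling of the second name, with the Persian yeh U+06CC.
second = "\u062a\u0648\u0632\u06cc\u0639 \u0646\u0642\u062f\u06cc 1401.xls"

# Characters of each name that cp1256 cannot hold, printed as code points.
print([hex(ord(c)) for c in unencodable(first)])
print([hex(ord(c)) for c in unencodable(second)])

# The first letter of the second name, as it would appear if its UTF-8
# bytes were read as cp1256 (printed with escapes only).
print(ascii(shown_as("\u062a")))
```

If the second list is non-empty, a round trip through cp1256 cannot preserve the name exactly, which would be consistent with FileNotFoundError appearing for that file only.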

I also tried reading it from the Windows cmd prompt, and I was able to read the file using the same Python that reticulate uses:

(py-reticulate) C:\mohamed_elsayyad\venv\py-reticulate\Scripts>python
Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, pandas as pd
>>> os.chdir(r'C:\Data')
>>> df2 = pd.read_excel('توزیع نقدی 1401.xls')
>>> df2.shape
(8, 14)
>>>

The only workaround that worked for me is the one below, but I want to pass the Arabic file name directly to pd.read_excel inside reticulate.

>>> df2 = pd.read_excel(os.listdir()[1])
>>> df2.shape
(8, 14)
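The index-based workaround depends on the order in which os.listdir() happens to return the names. A slightly more robust variant, sketched here under the assumption that an ASCII-only part of the name (such as the suffix '1401.xls') is enough to identify the file, lets the operating system supply the exact name. The helper find_by_suffix is hypothetical. Calling ascii() on a result prints an escape-only literal, which could be pasted into the session without typing any Arabic letters.

```python
import os


def find_by_suffix(directory, suffix):
    """Return the names in `directory` that end with `suffix`, sorted."""
    return sorted(n for n in os.listdir(directory) if n.endswith(suffix))


# List candidate files in the current directory and print each name using
# ASCII escapes only, so the output survives any console code page.
for name in find_by_suffix(os.curdir, "1401.xls"):
    print(ascii(name))
    # The name returned by the OS can then be passed on unchanged, e.g.
    # df2 = pd.read_excel(name)
```

Because the name never passes through a typed string literal, this avoids whatever conversion the console applies to the Arabic text.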

My Environment:

Reticulate

py_config()
python:         C:/mohamed_elsayyad/venv/py-reticulate/Scripts/python.exe
libpython:      C:/Users/mohamed.elsayyad/AppData/Local/Programs/Python/Python39/python39.dll
pythonhome:     C:/mohamed_elsayyad/venv/py-reticulate
version:        3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/mohamed_elsayyad/venv/py-reticulate/Lib/site-packages/numpy
numpy_version:  1.21.2

NOTE: Python version was forced by RETICULATE_PYTHON
> 

Session Info

> Sys.setlocale("LC_ALL", "persian")
[1] "LC_COLLATE=Persian_Iran.1256;LC_CTYPE=Persian_Iran.1256;LC_MONETARY=Persian_Iran.1256;LC_NUMERIC=C;LC_TIME=Persian_Iran.1256"
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Persian_Iran.1256  LC_CTYPE=Persian_Iran.1256    LC_MONETARY=Persian_Iran.1256 LC_NUMERIC=C                  LC_TIME=Persian_Iran.1256    
system code page: 1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DT_0.23         echarts4r_0.4.4 readxl_1.4.0    magrittr_2.0.1  scales_1.2.0    rmarkdown_2.14  gridExtra_2.3   knitr_1.33      stringi_1.7.6   skimr_2.1.4     reticulate_1.20
[12] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6    ggplot2_3.3.6   tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] httr_1.4.3        jsonlite_1.7.2    viridisLite_0.4.0 modelr_0.1.8      shiny_1.7.1       assertthat_0.2.1  cellranger_1.1.0  pillar_1.7.0      backports_1.4.1  
[10] lattice_0.20-44   glue_1.6.2        digest_0.6.27     promises_1.2.0.1  rvest_1.0.2       colorspace_2.0-2  htmltools_0.5.2   httpuv_1.6.5      Matrix_1.3-4     
[19] pkgconfig_2.0.3   broom_1.0.0       haven_2.5.0       xtable_1.8-4      webshot_0.5.3     svglite_2.0.0     later_1.3.0       tzdb_0.3.0        generics_0.1.3   
[28] ellipsis_0.3.2    withr_2.5.0       repr_1.1.4        lazyeval_0.2.2    cli_3.3.0         crayon_1.5.1      mime_0.12         evaluate_0.15     fs_1.5.2         
[37] fansi_0.5.0       xml2_1.3.3        data.table_1.14.2 tools_4.1.1       hms_1.1.1         lifecycle_1.0.1   plotly_4.10.0     munsell_0.5.0     reprex_2.0.1     
[46] kableExtra_1.3.4  compiler_4.1.1    systemfonts_1.0.2 rlang_1.0.3       grid_4.1.1        rstudioapi_0.13   htmlwidgets_1.5.4 base64enc_0.1-3   gtable_0.3.0     
[55] DBI_1.1.3         R6_2.5.1          lubridate_1.8.0   fastmap_1.1.0     utf8_1.2.2        Rcpp_1.0.8.3      vctrs_0.4.1       png_0.1-7         dbplyr_2.2.1     
[64] tidyselect_1.1.2  xfun_0.25 

Answers:

The issue was resolved, as per JosefZ’s comment, by setting Sys.setlocale("LC_ALL", "persian.65001"), which selects code page 65001 (UTF-8) instead of 1256.
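To see whether a locale change like this one is visible from the Python side, the encodings reported by the session can be printed before and after the change. This is a minimal sketch using only the standard library. The function name encoding_report is hypothetical, and which of these values actually change under reticulate is not confirmed here.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings the current Python session reports."""
    return {
        # Calling setlocale without a locale argument only queries the setting.
        "LC_CTYPE": locale.setlocale(locale.LC_CTYPE),
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

Running it inside reticulate::repl_python() and again in the plain cmd Python session would show whether the two sessions differ in any of these settings.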