Puking blood: how to make a scrapy, python, and postgres ecosystem inside Windows 7 that can deal with Unicode

Question:

I would really appreciate it if anyone could help me either fix the problems I describe below or (worst case) suggest an alternative environment that would work (although I’m loath to upgrade to Windows 10).


I am scraping mostly-english webpages from a Japanese website. A few required fields have kanji in them.

I’m using scrapy, postgres 9.5, and python 2.7 on a Windows 7 installation.

Scrapy has to run in a cmd.exe shell, and I’m examining the database results in a psql.exe instance also running in a cmd.exe shell. I’ve been using the Console2 application to host cmd.exe.

It’s a horrible experience to debug in this setup:

scrapy shell

I’m unable to print any diagnostic messages because the kanji cause an exception:

> print st['kanji_name']
>   File "C:\Users\mds\Anaconda2\lib\encodings\cp437.py", line 12, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>
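
One way to keep diagnostic prints from raising, sketched under the assumption that seeing escape codes instead of the actual glyphs is acceptable while debugging: escape whatever the console encoding cannot represent before printing. `console_safe` is a hypothetical helper name, not part of scrapy or the standard library.

```python
from __future__ import print_function
import sys


def console_safe(text, encoding=None):
    """Return text with characters the console cannot encode shown as backslash escapes."""
    # Fall back to ASCII when stdout reports no encoding (for example when piped).
    enc = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # backslashreplace never raises; it substitutes an escape sequence instead.
    return text.encode(enc, "backslashreplace").decode(enc)


# The first two kanji of the school name, written as escapes so this file stays ASCII.
print(console_safe(u"\u6771\u4eac", "cp437"))
```

In a spider this would be used as `print(console_safe(st['kanji_name']))`, assuming the field holds a unicode string.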

I’ve seen solutions that suggest changing the active code page with chcp 65001, but scrapy apparently doesn’t understand cp65001:

C:Users_pythonj_school>chcp 65001
Active code page: 65001

Running the crawl then throws this error:

C:Users_pythonj_school>scrapy crawl j_school

Traceback (most recent call last):
  File "C:UserssAnaconda2libsite-packagestwistedinternetdefer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:Users_pythonj_schoolj_schoolspidersj_school_spider.py", line 141, in parse
    print(st['english_name'])
LookupError: unknown encoding: cp65001
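
The `LookupError` comes from Python itself: Python 2.7 has no codec named cp65001. A commonly used workaround is to register a codec search function that maps that name to UTF-8. The sketch below assumes it runs before anything is printed (for example at the top of the spider module); `_cp65001_as_utf8` is a hypothetical name. It only removes the lookup failure and does not guarantee that the console renders the characters correctly.

```python
import codecs


def _cp65001_as_utf8(name):
    """Codec search function: treat Windows code page 65001 as UTF-8."""
    if name.lower() == "cp65001":
        return codecs.lookup("utf-8")
    return None  # let other search functions handle every other name


codecs.register(_cp65001_as_utf8)

# With the alias in place, encoding to cp65001 yields plain UTF-8 bytes.
encoded = u"\u6771\u4eac".encode("cp65001")
```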

PSQL

PSQL already warns me on startup:

C:\Program Files\PostgreSQL\9.5\bin>psql m_experiment postgres
psql (9.5rc1)
WARNING: Console code page (437) differs from Windows code page (1252)
         8-bit characters might not work correctly. See psql reference
         page "Notes for Windows users" for details.

Regardless of whether I try chcp 65001, psql still will not print these rows:

m_experiment=# select * from schools limit 1;
ERROR:  character with byte sequence 0xe6 0x9d 0xb1 in encoding "UTF8" has no equivalent in encoding "WIN1252"
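
The byte sequence in that error is the UTF-8 encoding of a single kanji, and Windows-1252 has no code point for it, which is why the conversion to the client encoding fails. A small Python sketch of the same check, independent of the database:

```python
raw = b"\xe6\x9d\xb1"        # the bytes quoted in the psql error
char = raw.decode("utf-8")   # a single character, U+6771

try:
    char.encode("cp1252")
    fits_in_win1252 = True
except UnicodeEncodeError:
    # Windows-1252 is a single-byte Western European code page with no kanji.
    fits_in_win1252 = False

print(repr(char), fits_in_win1252)
```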

I’ve also tried setting client_encoding, but that breaks something else and psql insists I’m out of memory!

m_experiment=# SET client_encoding = 'UTF8';
SET
m_experiment=# show client_encoding;
Not enough memory.
m_experiment=#

I found multiple bug reports about this issue from around 2011, but it apparently was never fixed. Anyway, I found a manual workaround: the \pset pager off incantation solves the issue.

Now psql can at least spit out a response, although it doesn’t render the kanji correctly.

m_experiment=# select english_name, kanji_name from schools limit 1;
            english_name             |     kanji_name
-------------------------------------+--------------------
 TOKYO INTERNATIONAL JAPANESE SCHOOL | æ±京国際日本語学院
(1 row)
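
That output is consistent with psql sending UTF-8 bytes and the console drawing each byte as a separate single-byte character. The sketch below reproduces the first two visible characters, assuming Latin-1 as a stand-in for the console's code page (Windows-1252 agrees with Latin-1 for the bytes 0xE6 and 0xB1).

```python
utf8_bytes = u"\u6771".encode("utf-8")    # three bytes: e6 9d b1
misread = utf8_bytes.decode("latin-1")    # one character per byte

# Dropping the non-printable C1 control character leaves what the console showed.
visible = u"".join(ch for ch in misread if not 0x80 <= ord(ch) <= 0x9F)

# The damage is reversible as long as the original bytes are intact.
recovered = misread.encode("latin-1").decode("utf-8")
```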

One hack solution was to change my locale to Japanese. Now the console shows my kanji properly, but it screws up the display thereafter (the > prompt shows up strangely and the cursor graphic doesn’t line up with where the cursor actually is!).


Asked By: user3556757


Answers:

Judging from your error message, your console is using cp437, the default encoding of the US Windows console. You could try temporarily switching your system locale to “Japanese (Japan)” so you can print kanji to the console. Go to Control Panel, Region and Language, Administrative tab, and click “Change system locale…”. After rebooting, the default Windows console encoding should be one suitable for Japanese.

I’ve done this before to print Chinese to the console. The setting only affects non-Unicode programs, and most programs are fully Unicode nowadays.
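
To illustrate why the locale change helps, here is a sketch that assumes the Japanese system locale gives the console code page 932 (a Shift-JIS variant), and checks which of the code pages involved can represent the kanji at all. `can_encode` is a hypothetical helper.

```python
name = u"\u6771\u4eac"   # first two kanji of the school name


def can_encode(text, encoding):
    """Return True when every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp437 and cp1252 are the code pages from the psql warning; cp932 is the assumed Japanese one.
results = {enc: can_encode(name, enc) for enc in ("cp437", "cp1252", "cp932")}
print(results)
```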

Answered By: Mark Tolonen