git-p4 message and author encoding

Question:

Today I am in the position of migrating some pretty old Perforce repositories to Git. While this is really interesting, one thing caught my eye: all special characters in the commit messages, and even in the author names, are not in the correct encoding.

So I tried to investigate where the problem comes from.

  • First of all, the Perforce server does not support Unicode, so setting P4CHARSET does not help, because Unicode clients require a Unicode-enabled server.
  • Then I checked the output of simple commands like p4 users, which was indeed in ANSI (according to Notepad++; file -bi on the redirected output reports ISO-8859-1).
  • The locale command says LANG=en_US.UTF-8 …

All in all, my guess is that all p4 client output is in ISO-8859-1, but git-p4 assumes UTF-8 instead.

I tried rewriting the commit messages with

git filter-branch --msg-filter 'iconv -f iso-8859-1 -t utf-8' -- --all

but that doesn't fix the issue, especially as it is not intended to rewrite the author names.

Does anyone have an idea how to force the output to be translated to UTF-8 before git-p4 receives it?

Update:

I tried to “override” the default p4 command's output with a simple shell script that I prepended to PATH:

/usr/bin/p4 "$@" | iconv -f iso-8859-1 -t utf-8

but that destroys the marshalled Python objects that are evidently used:

  File "/usr/local/bin/git-p4", line 2467, in getBranchMapping
    for info in p4CmdList(command):
  File "/usr/local/bin/git-p4", line 480, in p4CmdList
    entry = marshal.load(p4.stdout)
ValueError: bad marshal data
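
A minimal sketch of why the iconv wrapper corrupts the stream, assuming the p4 output consumed by git-p4 is a binary marshal stream of dictionaries (as the marshal.load call in the traceback suggests). The record below and the helper names are hypothetical. Marshalled strings carry a byte-length prefix, so re-encoding the whole stream makes the payload longer than the length recorded for it:

```python
import marshal

def reencode_stream(raw: bytes) -> bytes:
    """Mimic `iconv -f iso-8859-1 -t utf-8` applied to the whole byte stream."""
    return raw.decode("iso-8859-1").encode("utf-8")

def try_load(data: bytes):
    """Return the unmarshalled object, or the exception raised while loading."""
    try:
        return marshal.loads(data)
    except Exception as exc:  # the exact exception type depends on where parsing breaks
        return exc

# Hypothetical record with an ISO-8859-1 description (0xc4 is "Ä").
record = {b"desc": b"\xc4rger"}

# Marshal format version 0 is used here only to keep the byte layout simple.
stream = marshal.dumps(record, 0)
mangled = reencode_stream(stream)

print(len(stream), len(mangled))  # the re-encoded stream is one byte longer
print(try_load(stream))           # round-trips to the original record
print(try_load(mangled))          # no longer yields the original record
```

The conversion therefore has to happen on the individual string values after unmarshalling, not on the raw stream.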

Update 2:

As suggested in "Changing default encoding of Python?", I tried to set the Python encoding to ASCII:

export PYTHONIOENCODING="ascii"
python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

Output:

('ascii', 'ascii')

but the messages and authors are still not migrated correctly.

Update 3:

Even patching the commit(self, details, files, branch, parent = "") function in git-p4.py did not help.
Changing

self.gitStream.write(details["desc"])

to either of these

self.gitStream.write(details["desc"].encode('utf8', 'replace'))
self.gitStream.write(unicode(details["desc"],'utf8'))

just raised:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 29: ordinal not in range(128)

As I am no Python developer, I have no idea what to try next.

Asked By: dag


Answers:

I suspect the type of details["desc"] is a byte string (str in Python 2).

Therefore you need to decode it to Unicode before you encode it.

To find out the type, print it:

print type(details["desc"])

details["desc"].decode("iso-8859-1").encode("UTF-8")

might help to convert from iso-8859-1 to UTF-8.
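
A short sketch of that decode-then-encode step, written for Python 3 so it can be run as-is. The sample bytes and the helper name latin1_to_utf8 are made up for illustration. It also reproduces the likely cause of the UnicodeDecodeError from the question: in Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which fails on byte 0xc4.

```python
# Hypothetical sample: "Änderung" as ISO-8859-1 bytes, as a non-Unicode
# Perforce server might return it.
raw_desc = b"\xc4nderung"

def latin1_to_utf8(data: bytes) -> bytes:
    """Decode ISO-8859-1 bytes to text, then encode that text as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")

# Make the implicit ASCII decode of Python 2 explicit: it cannot handle 0xc4.
try:
    raw_desc.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

utf8_desc = latin1_to_utf8(raw_desc)

print(ascii_error)  # describes the failing byte and position
print(utf8_desc)    # b'\xc3\x84nderung'
```

Pure ASCII input passes through unchanged, so applying the conversion to every description is harmless as long as the input really is ISO-8859-1.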

Answered By: Douglas Leeder

Git 2.38 (Q3 2022) should improve the situation: before 2.38, git p4 did not handle non-ASCII client names well, which has now been corrected.

See commit d205483 (21 Jul 2022), and commit 34f67c9 (08 Jul 2022) by Kilian Kilger (cohomology).
(Merged by Junio C Hamano — gitster in commit e59acea, 01 Aug 2022)

git-p4: fix bug with encoding of p4 client name

Signed-off-by: Kilian Kilger
Reviewed-by: Tao Klerks

The Perforce client name can contain arbitrary characters which do not decode to UTF-8.
Use the fallback strategy implemented in metadata_stream_to_writable_bytes() also for the client name.
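
To illustrate the general idea of such a fallback strategy, here is a minimal sketch. It is not the code of metadata_stream_to_writable_bytes(); the function name to_utf8_bytes and the choice of ISO-8859-1 as fallback encoding are assumptions made for this example.

```python
def to_utf8_bytes(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it from fallback."""
    try:
        raw.decode("utf-8")  # validation only; the decoded text is discarded
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8: interpret the bytes in the assumed legacy encoding.
        return raw.decode(fallback).encode("utf-8")

print(to_utf8_bytes(b"\xc4rger"))                # legacy bytes are converted
print(to_utf8_bytes("Ärger".encode("utf-8")))    # valid UTF-8 is left alone
```

Trying UTF-8 first means already-correct metadata is never double-encoded; only byte sequences that fail UTF-8 validation are reinterpreted.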

Answered By: VonC