coq-club AT inria.fr

Subject: The Coq mailing list

List archive

Re: [Coq-Club] Unicode tokens?

From: Makarius <makarius AT sketis.net>
To: coq-club AT inria.fr
Subject: Re: [Coq-Club] Unicode tokens?
Date: Wed, 8 Mar 2017 22:44:21 +0100
Authentication-results: mail2-smtp-roc.national.inria.fr; spf=None smtp.pra=makarius AT sketis.net; spf=Pass smtp.mailfrom=makarius AT sketis.net; spf=None smtp.helo=postmaster AT mxf960.netcup.net
Ironport-phdr: 9a23:YIr0eRxoNs1X6KzXCy+O+j09IxM/srCxBDY+r6Qd0uwRIJqq85mqBkHD//Il1AaPBtSGra4YwLOK7+igATVGusnR9ihaMdRlbFwst4Y/p0QYGsmLCEn2frbBThcRO4B8bmJj5GyxKkNPGczzNBX4q3y26iMOSF2kbVImbre9JomHhMOukuu25pf7YgNShTP7b6khAg+xqFD0v8ofm4p/Me5lzxLKq2FJZv9+3mBlOVOI2RDx+pHjr9ZY7y1Mtqd5pIZ7WqLgcvFgQA==

On 08/03/17 19:58, Vadim Zaliva wrote:
>
> On Wed, Mar 8, 2017 at 10:23 AM, Matej Kosik
> <5764c029b688c1c0d24a2e97cd764f AT gmail.com
> <mailto:5764c029b688c1c0d24a2e97cd764f AT gmail.com>>
> wrote:
>
> I guess that it is not the responsibility of the "grep" and friends
> to enable you to type unicode characters.
> (it is the job of programs providing "input method", I guess, such
> as "xim", "uim" and many others).
>
>
> Of course. As somebody who worked on Unicode support for one of Java
> compilers, I have also seen some other issues. For example, what is the
> default file encoding of .v files? Is it ASCII of Unicode? If Unicode
> what is the encoding (UTF-8/UTF-16/etc.). Could BOM be present? Is it
> mandatory? Is compiler output properly preserves Unicode in error
> messages? If in XML protocol used by IDE character offsets to source
> files are used, it should be understood that even in UCS-2 some
> characters can take more than 2 bytes (see "surrogates"), etc.

I know what you mean. Early Unicode platforms like MS Windows or Java
introduced many problems with encodings, which turned out unnecessary
some decades later. This is nicely explained in
http://utf8everywhere.org -- it also says that one and only one Unicode
encoding should be used: UTF-8. It reduces the potential of what can go
wrong, but that is still greater than zero.

For example, there are semantic problems with Unicode, due to strange
things like "Combining Characters" and "Bidirectional Text". The latter
problem can be seen here (the last lemma at the bottom):

http://isabelle.in.tum.de/website-Isabelle2016/dist/library/HOL/HOL-ex/Hebrew.html
http://isabelle.in.tum.de/website-Isabelle2016-1/dist/library/HOL/HOL-ex/Hebrew.html

The first version looks wrong (135 left of "=" instead of right of it),
iff the browser implements Unicode Bidi correctly (!). Many editors
still ignore Bidi, so the result might look right by accident: Java 7
was still wrong, but Java 8 is now right concerning the Unicode Bidi
standard.

In the second version above I have inserted Unicode Bidi-override
markers like this: "\u200E\u202D" + str + "\u202C". This helps to make
it more reliable, but copy-paste of the annotation version from Firefox
to Thunderbird still produces something that looks wrong to me:

lemma "mispar ק + mispar ל + mispar ה = 135"

It might look right to you, if your mail client ignores the Unicode Bidi
standard!

Here is some more non-sense that can be done with Unicode to deceive the
reader what is actually in the text:
https://en.wikipedia.org/wiki/IDN_homograph_attack

My conclusion: formal prover sources should be stored as plain ASCII
only. Front-ends might show that with the help of Unicode, but in case
of doubts, it should be always possible to look at the plain 7-bit ASCII
text without any special tricks.

Makarius

Re: [Coq-Club] Unicode tokens?, (continued)