[cvsnt] Re: UTF conversion issues after upgrade to 2.0.34

Wed Apr 7 19:45:33 BST 2004

Hmm, I do not agree, but maybe we simply have a communication problem
because I'm not a native English speaker. Let me try to spell out what I meant:

There are four variations of UTF-16 files: UTF-16 and UTF-16 (one is BE, one
is LE), UTF-16BE and UTF-16LE. The first two contain a BOM, which signals
the byte order; the other two don't. They don't need one because they are
explicitly called UTF-16BE and UTF-16LE. There are no strict rules on when to
use a BOM; it depends on the protocol that uses the text stream. For
example, Microsoft declared that txt files must have a BOM. On the other
hand, the use of a BOM can be forbidden (see
http://www.unicode.org/faq/utf_bom.html#28). To give an example from my
work: We save the content of a database in CVS. In the DB the Unicode string
has no BOM, and when we save it to a file and put it into CVS we don't add a
BOM. All four variations are fully legal.
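
To make the four variations concrete, here is a minimal sketch in Python
(chosen only for illustration; it has nothing to do with the cvsnt code).
It uses the standard codecs and follows the Unicode convention that the
byte pair 0xFF 0xFE is the little-endian BOM and 0xFE 0xFF the big-endian
one. The variable names are made up for this example.

```python
import codecs

text = "This"

# Without a BOM the byte order is fixed by the encoding name itself.
le_no_bom = text.encode("utf-16-le")   # 54 00 68 00 69 00 73 00
be_no_bom = text.encode("utf-16-be")   # 00 54 00 68 00 69 00 73

# With a BOM: prepend the matching byte order mark by hand.
le_with_bom = codecs.BOM_UTF16_LE + le_no_bom   # FF FE 54 00 ...
be_with_bom = codecs.BOM_UTF16_BE + be_no_bom   # FE FF 00 54 ...

# The generic "utf-16" decoder reads the BOM and picks the byte order.
decoded_le = le_with_bom.decode("utf-16")
decoded_be = be_with_bom.decode("utf-16")
```

Both decoded strings come back as the same text, which is the point of the
BOM: the reader does not have to know the byte order in advance.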

What cvsnt seems to do during commit is to cut off the first two bytes, where
it expects to find the BOM. And during checkout/update, it adds "0xFF 0xFE"
in front of the stream. What does this mean for the four variations:
1) UTF-16 (LE with BOM)
Input  file: 0xFF 0xFE 0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00    => "This"
Output file: 0xFF 0xFE 0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00    => "This"

2) UTF-16 (BE with BOM)
Input  file: 0xFE 0xFF 0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73    => "This"
Output file: 0xFF 0xFE 0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73    => ----

3) UTF-16LE
Input  file: 0x54 0x00 0x68 0x00 0x69 0x00 0x73 0x00    => "This"
Output file: 0xFF 0xFE 0x68 0x00 0x69 0x00 0x73 0x00    => "his"

4) UTF-16BE
Input  file: 0x00 0x54 0x00 0x68 0x00 0x69 0x00 0x73    => "This"
Output file: 0xFF 0xFE 0x00 0x68 0x00 0x69 0x00 0x73    => ----

In case 1) everything is ok.
In case 2) the BOM now says it is an LE file, but the content is BE => damage
In case 3) the first character (two bytes) of content is lost and a BOM is
added, which might be undesired => damage
In case 4) the first character (two bytes) is lost and an LE BOM is added to
a BE stream => damage

So of the four variations, only one was intact after commit/update.
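
The comparison above can be replayed with a small Python sketch. The
function cvsnt_roundtrip is only a model of the behaviour as observed from
the outside (drop the first two bytes on commit, prepend 0xFF 0xFE on
checkout); it is an assumption, not the actual cvsnt implementation. The
labels follow the Unicode convention that 0xFF 0xFE is the little-endian
BOM.

```python
def cvsnt_roundtrip(data: bytes) -> bytes:
    """Hypothetical model: commit strips two bytes, checkout adds FF FE."""
    stored = data[2:]             # commit: the assumed BOM is cut off
    return b"\xff\xfe" + stored   # checkout/update: a fixed BOM is prepended

variants = {
    "LE with BOM": b"\xff\xfe" + "This".encode("utf-16-le"),
    "BE with BOM": b"\xfe\xff" + "This".encode("utf-16-be"),
    "LE without BOM": "This".encode("utf-16-le"),
    "BE without BOM": "This".encode("utf-16-be"),
}

# True means the file comes back byte-for-byte identical.
results = {name: cvsnt_roundtrip(data) == data
           for name, data in variants.items()}

for name, intact in results.items():
    print(f"{name}: {'intact' if intact else 'damaged'}")
```

Under this model only the file that already starts with 0xFF 0xFE comes
back unchanged.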

To make one point clear: In my opinion it is 100 percent ok to support only
UTF-16 files that start with the BOM 0xFF 0xFE, but it should be documented
more precisely that this is the only supported format. In particular because
of the following: In my experience it is very difficult to delete something
using cvs. cvs works very defensively, which is a very, very fine thing.
Whenever something is about to change, cvs makes backup files. To be
consistent with this, I propose to defuse the behaviour described above by
rejecting the commit of "Unicode" files when they don't start with
0xFF 0xFE.
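
As a rough illustration of the proposed check, a Python sketch could look
like this. The function name check_unicode_commit and the error text are
hypothetical; how and where cvsnt would hook in such a test is not something
this sketch claims to know.

```python
REQUIRED_BOM = b"\xff\xfe"  # the only prefix that survives commit/update

def check_unicode_commit(data: bytes) -> None:
    """Hypothetical pre-commit check: refuse instead of silently damaging."""
    if not data.startswith(REQUIRED_BOM):
        raise ValueError(
            "file does not start with 0xFF 0xFE; "
            "refusing to commit it as a Unicode file"
        )

# A file with the expected BOM passes the check.
check_unicode_commit(b"\xff\xfe" + "This".encode("utf-16-le"))

# A file without the BOM is rejected before anything is cut off.
try:
    check_unicode_commit("This".encode("utf-16-le"))
    rejected = False
except ValueError:
    rejected = True
```

The design choice is the same defensive one cvs makes elsewhere: fail loudly
rather than change the user's data.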

Hope that clears the fog
Olaf

Tony Hoyle wrote:

> Olaf Groeger wrote:
> 
>> 
>> But be aware that this must be UTF-16 LE including BOM (0xff 0xfe). All
>> other UTF-16 (BE and/or no BOM) will be silently damaged.
>> 
> BE isn't common on intel systems (in fact it's basically unheard of).  The
> file is still a perfectly valid Unicode file - the BOM is part of the
> standard, precisely to avoid the problems distinguishing between LE and
> BE.
> 
> If you want the exact file use binary mode... you lose merging though.
> 
> Tony