[From nobody Sun Jun  3 07:07:48 2007
Subject: Bug #148: shp2pgsql has problems with certain codepoints
From: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
To: postgis-devel@refractions.net
Cc: Bruce Rusk <br79@cornell.edu>
Content-Type: text/plain
Message-Id: <1180786029.5210.22.camel@mca-desktop>
Mime-Version: 1.0
X-Mailer: Evolution 2.6.1 
Date: Sat, 02 Jun 2007 13:07:12 +0100
Content-Transfer-Encoding: 7bit

Hi everyone,

After playing around this morning, I've discovered the cause of bug 148;
it's related to the automatic trimming of spaces from strings by
shapelib.

One of the problem shape files I have contains the following data in a
UTF-8 encoded string field (note that it is padded with spaces):

e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
8c 97 e6 9d 9c e5 9f 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
20 20 20 20 20 20 20 20 20 20 20 20 20 20

However, the resulting output in the SQL file looked like this:

e5 ae 89 e5 be bd e6 bd 9c e5 b1 b1 e5 8e bf e5
8c 97 e6 9d 9c e5 9f

So it's fairly easy to see what's going on: since TRIM_DBF_WHITESPACE is
defined in shapefil.h, shapelib trims trailing spaces from the fields.
However, it eats one byte too many from the end, since the final
character should read "e5 9f 20". This is because it naively removes
every trailing 0x20 byte from the string without realising that the
final 0x20 is part of a UTF-8 character.
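
As a rough illustration of the above (Python is used purely for brevity;
this is not shapelib code, and the helper names sequence_length and
trailing_shortfall are hypothetical), the sketch below rebuilds the
62-byte field from the hex dump, trims it naively, and checks whether
the last lead byte still has all the bytes it announces:

```python
# Field contents copied from the hex dump: 23 data bytes plus 39 spaces.
DATA = bytes.fromhex(
    "e5ae89e5bebde6bd9ce5b1b1e58ebfe5"
    "8c97e69d9ce59f"
)
PADDED = DATA + b" " * 39


def sequence_length(lead: int) -> int:
    """Bytes announced by a UTF-8 lead byte; 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def trailing_shortfall(buf: bytes) -> int:
    """How many bytes the final sequence in buf is missing (0 if complete)."""
    i = len(buf) - 1
    # Walk back over continuation bytes (10xxxxxx), at most three of them.
    while i >= 0 and len(buf) - 1 - i < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    if i < 0:
        return 0
    need = sequence_length(buf[i])
    have = len(buf) - i
    return max(need - have, 0)


# Naive trim: drop every trailing 0x20 byte, as described above.
naive = PADDED.rstrip(b" ")

print(naive.hex(" "))
print("bytes missing from final sequence:", trailing_shortfall(naive))
```

The trimmed result matches the bytes seen in the SQL file, and the final
lead byte 0xe5 announces a three-byte sequence of which only two bytes
remain.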

I've had a look at the code, and I think the easiest fix is to disable
TRIM_DBF_WHITESPACE in shapefil.h, and then alter make_good_string() so
that it strips whitespace itself in a UTF8-aware fashion just after the
input string has been converted by the utf8() function. I'll see if I
can commit a fix for this over the next few days.
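
One possible reading of "UTF8-aware" stripping is sketched below in
Python, only to show the idea; the eventual change to make_good_string()
in C may well look different. The function names are hypothetical. The
approach assumed here is: strip trailing spaces, then hand back just
enough of them to complete a final multi-byte sequence that the lead
byte says is too short. Note that 0x20 is not a valid continuation byte
in well-formed UTF-8, so this sketch simply preserves the byte count
announced by the lead byte, following the reading of the data given
above.

```python
def utf8_sequence_length(lead: int) -> int:
    """Bytes announced by a UTF-8 lead byte; 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def rstrip_spaces_utf8(buf: bytes) -> bytes:
    """Strip trailing 0x20 bytes, but keep any that the final lead byte
    still needs to reach its announced sequence length (assumed policy)."""
    trimmed = buf.rstrip(b" ")
    removed = len(buf) - len(trimmed)

    # Find the lead byte of the last sequence by skipping back over
    # at most three continuation bytes (10xxxxxx).
    i = len(trimmed) - 1
    steps = 0
    while i >= 0 and steps < 3 and (trimmed[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    if i < 0:
        return trimmed

    missing = utf8_sequence_length(trimmed[i]) - (len(trimmed) - i)
    if missing > 0:
        # Never give back more bytes than were actually stripped.
        return trimmed + b" " * min(missing, removed)
    return trimmed


# Tail of the problem field: two sequences followed by the space padding.
field = bytes.fromhex("e69d9ce59f") + b" " * 39
print(rstrip_spaces_utf8(field).hex(" "))
```

Plain ASCII fields and fields ending in a complete multi-byte character
are trimmed exactly as before; only a short final sequence gets bytes
handed back.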

Finally, I noticed that the usage text displayed when shp2pgsql is run
without any options says that the default encoding is ASCII. However,
this is not the case: an iconv-enabled shp2pgsql always issues a
"SET client_encoding = UTF8", so I'll change the options text to
reflect this.


Kind regards,

Mark.

-- 
ILande - Open Source Consultancy
http://www.ilande.co.uk

]