Author |
Message
|
marko.pitkanen |
Posted: Tue Nov 18, 2014 1:13 am Post subject: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Chevalier
Joined: 23 Jul 2008 Posts: 440 Location: Jamsa, Finland
|
Hi All,
I haven't checked thoroughly whether this subject has already been covered here. If it has, please feel free to point me to the appropriate thread.
The question is: which characters / languages are supported by the broker through MQ, the broker-wide HTTP listener and the file nodes? This is not obvious from the documentation. For example:
The IIB 9 documentation page says that UTF-8, for example, has built-in support (no restrictions mentioned).
The MQ Unicode conversion support document says that the support for UTF-16 and UTF-8 in WebSphere MQ is limited to those Unicode characters that can be encoded in UCS-2.
Are there any restrictions on which characters can be used in the application messages processed by / with IIB 9?
--
Marko |
|
|
|
smdavies99 |
Posted: Tue Nov 18, 2014 1:28 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
marko.pitkanen wrote: |
Are there any restrictions on which characters can be used in the application messages processed by / with IIB 9?
|
AFAIK, no, there aren't.
BUT there is a big IF. The IF is reflected in the several hundred posts here (my guess) about this sort of thing.
The biggest issue I've seen is the CCSID of the contents differing from the CCSID in the message header/descriptor.
This causes no end of problems.
For example, it is no use having
Code: |
<?xml version="1.0" encoding="ISO8859-1"?>
|
when the rest of the XML data is actually UTF-8 encoded.
Equally, it is no use having MQMD.CodedCharSetId=1208 when the message body is encoded as 923.
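The mismatch described above can be reproduced outside the broker. The following is a minimal Python sketch, not broker code: it assumes CCSID 1208 corresponds to UTF-8 and CCSID 923 to ISO 8859-15 (Python codec `iso8859_15`), and the sample text is purely illustrative.

```python
# Assumption: CCSID 1208 is UTF-8 and CCSID 923 is ISO 8859-15.
text = "Jämsä"                      # illustrative payload, not from a real message
body = text.encode("utf-8")         # the bytes that are actually in the message body

# The header claims CCSID 923, so the receiver decodes with the wrong table.
misread = body.decode("iso8859_15")

# Decoding with the encoding the bytes were really written in works.
correct = body.decode("utf-8")

print(repr(misread), len(misread))  # garbled text, longer than the original
print(repr(correct), len(correct))
```

Note that the wrong decode does not raise an error: every byte is a valid ISO 8859-15 character, so the corruption is silent until someone looks at the data.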
Finally, there are times when we see CCSID Conversion done by a channel when it need not have been done.
It is surprising how many architects and developers simply don't understand this.
I first started working on multinational character sets around 1981 and there are still times when I get it wrong. _________________ WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995
Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions. |
|
|
|
marko.pitkanen |
Posted: Tue Nov 18, 2014 1:46 am Post subject: |
|
|
Chevalier
Joined: 23 Jul 2008 Posts: 440 Location: Jamsa, Finland
|
Thanks,
In theory, how would, for example, the 4-byte UTF-8 Kanji characters
Code: |
Normal Kanji ==> UTF-8 octets: 3 bytes ==> Unicode code point: up to 0xFFFF (2 bytes)
Rare Kanji ==> UTF-8 octets: 4 bytes ==> Unicode code point: above 0xFFFF (4 bytes) |
work in IIB's HTTP or file interface?
--
Marko |
|
|
|
smdavies99 |
Posted: Tue Nov 18, 2014 2:45 am Post subject: |
|
|
Jedi Council
Joined: 10 Feb 2003 Posts: 6076 Location: Somewhere over the Rainbow this side of Never-never land.
|
The Kanji characters are not UTF-8, they are UTF-16 or UTF-32.
The UTF stream uses BOMs (Byte Order Marks)
http://en.wikipedia.org/wiki/Byte_order_mark
to switch between the different types. Remember to get the endianness correct though.
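For anyone who wants to look at BOMs directly, here is a small Python sketch using the standard `codecs` constants. The helper name `sniff_bom` is hypothetical. A BOM, when present, sits at the start of a stream and identifies the encoding form and byte order; it is optional, so a missing BOM proves nothing.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (codec name or None, data with any leading BOM removed)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

sample = codecs.BOM_UTF16_BE + "漢".encode("utf-16-be")
codec, payload = sniff_bom(sample)
print(codec, payload.decode(codec))
```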
Why don't you try it for yourself?
Look at the message tree. |
|
|
|
Vitor |
Posted: Tue Nov 18, 2014 6:20 am Post subject: |
|
|
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
marko.pitkanen wrote: |
In theory, how would, for example, the 4-byte UTF-8 Kanji characters
Code: |
Normal Kanji ==> UTF-8 octets: 3 bytes ==> Unicode code point: up to 0xFFFF (2 bytes)
Rare Kanji ==> UTF-8 octets: 4 bytes ==> Unicode code point: above 0xFFFF (4 bytes) |
work in IIB's HTTP or file interface?
|
They would (like all data) be passed into IIB over HTTP from a system that supported those code points, or read from a file on an OS that supported those code points, and then stored in the IIB message tree as UTF-16.
If you then tried to serialise the data into an HTTP put, or write it to a file, using a CCSID that doesn't support all the code points in the message tree at the time of serialisation (and note that this CCSID has no connection with the CCSID of the inbound data), then IIB will abend in the traditional manner & roll back. _________________ Honesty is the best policy.
Insanity is the best defence. |
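The serialisation failure described above is easy to imitate in plain Python. This is only an analogy under stated assumptions: `serialise` is a hypothetical stand-in for writing the message tree out in a target CCSID, and `iso8859_15` stands in for a CCSID that cannot represent the character.

```python
rare = "\U00020BB7"                    # U+20BB7, a rare Kanji above U+FFFF
inbound = rare.encode("utf-8")         # 4 octets arriving over HTTP or from a file
tree_value = inbound.decode("utf-8")   # parsed into an internal Unicode string

def serialise(value: str, codec: str) -> bytes:
    """Hypothetical stand-in for writing the tree out in a target code page."""
    return value.encode(codec)         # strict mode: raises on unmappable characters

# A target that covers all of Unicode round-trips the data unchanged.
ok = serialise(tree_value, "utf-8")

# A target code page without that character fails at serialisation time,
# regardless of how the data was encoded when it arrived.
try:
    serialise(tree_value, "iso8859_15")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot serialise:", exc.reason)
```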
|
|
|
marko.pitkanen |
Posted: Tue Nov 18, 2014 6:51 am Post subject: |
|
|
Chevalier
Joined: 23 Jul 2008 Posts: 440 Location: Jamsa, Finland
|
Thanks,
We will try to reproduce a problem. The question is rather theoretical; I just want to find out whether the same restriction as for MQ
Code: |
The support for UTF-16 and UTF-8 in WebSphere MQ is therefore limited to those Unicode characters that can be encoded in UCS-2. |
does or doesn't apply to the broker.
--
Marko |
|
|
|
Vitor |
Posted: Tue Nov 18, 2014 7:01 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Grand High Poobah
Joined: 11 Nov 2005 Posts: 26093 Location: Texas, USA
|
marko.pitkanen wrote: |
The IIB 9 documentation page says that UTF-8, for example, has built-in support (no restrictions mentioned). |
It also lists all of the UTF-16 and UTF-32 variants, and some ISO code pages I've never heard of. |
|
|
|
kimbert |
Posted: Tue Nov 18, 2014 8:10 am Post subject: |
|
|
Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
WMB and IIB can handle all of Unicode. No restrictions, no qualifications.
In particular
- it can handle any valid character in the UTF-8 and UTF-16 encodings, regardless of the number of bytes that it occupies
- it can handle any valid character in UTF-32
- it is not limited to the range of characters in UCS-2. That would restrict the product to the Basic Multilingual Plane (BMP) which would be very limiting.
smdavies99 said:
Quote: |
The Kanji characters are not UTF-8, they are UTF-16 or UTF-32. |
That statement could be misinterpreted.
UTF-8, UTF-16 and UTF-32 encode *all* of Unicode. So any character that is valid in one of those encodings is also valid in both of the others. But the character will be *encoded* differently (it will be represented by a different byte sequence) in each case.
Rare Kanji characters, and any other characters outside of the BMP, will be decoded from their original code page/encoding and will appear in the message tree as a UTF-16 'surrogate pair'. Note that the original code page does not need to be UTF-8 or UTF-16; the character could come in as Shift-JIS or some other non-Unicode encoding. _________________ Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too. |
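To make the "same character, different byte sequence" point concrete, here is a short Python sketch. It encodes one character from outside the BMP (U+20BB7, chosen only as an example) in all three encoding forms and derives the UTF-16 surrogate pair by hand. The helper `surrogate_pair` is a hypothetical name, not a broker function.

```python
from typing import Tuple

def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000              # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)             # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)            # bottom 10 bits
    return high, low

ch = "\U00020BB7"                              # example character outside the BMP

# One character, three different byte sequences.
for codec in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{codec:10} {ch.encode(codec).hex(' ')}")

high, low = surrogate_pair(ord(ch))
print(f"surrogates: {high:04X} {low:04X}")
```

The two surrogate values printed at the end match the four bytes of the big-endian UTF-16 encoding, which is the form kimbert describes for the message tree.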
|
|
|
marko.pitkanen |
Posted: Tue Nov 18, 2014 10:35 am Post subject: |
|
|
Chevalier
Joined: 23 Jul 2008 Posts: 440 Location: Jamsa, Finland
|
Thanks everyone for your input.
--
Marko |
|
|
|
rekarm01 |
Posted: Wed Nov 19, 2014 5:19 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
WMB/IIB supports conversion for the entire Unicode character set, but WMQ does not. WMQ only supports conversion for the UCS-2 subset.
kimbert wrote: |
WMB and IIB can handle all of Unicode. No restrictions, no qualifications. |
Maybe one little qualification for IIB's MQRFH2 header parser? Specifically for the NameValueCCSID field, does the value 1200 indicate UCS-2, or UTF-16?
kimbert wrote: |
In particular
- ... it is not limited to the range of characters in UCS-2 ... |
Then at least some of the WMB/IIB documentation may still be out of date, where it refers to the broker/bus using UCS-2 internally. |
|
|
|
PeterPotkay |
Posted: Thu Jun 04, 2020 4:17 am Post subject: |
|
|
Poobah
Joined: 15 May 2001 Posts: 7717
|
kimbert wrote: |
WMB and IIB can handle all of Unicode. No restrictions, no qualifications.
In particular
- it can handle any valid character in the UTF-8 and UTF-16 encodings, regardless of the number of bytes that it occupies
- it can handle any valid character in UTF-32
- it is not limited to the range of characters in UCS-2. That would restrict the product to the Basic Multilingual Plane (BMP) which would be very limiting.
|
6 years later the IIB 10.0.0.20 KC says:
Quote: |
Integration nodes complete string operations in Universal Character Set coded in 2 octets (UCS-2). If incoming strings are not encoded in UCS-2, they are converted to UCS-2 on arrival. |
https://www.ibm.com/support/knowledgecenter/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ac30180_.html
If IIB is not limited to the range of characters in UCS-2, how does it deal with characters outside that range given the above reference? _________________ Peter Potkay
Keep Calm and MQ On |
|
|
|
rekarm01 |
Posted: Thu Jun 04, 2020 4:14 pm Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
PeterPotkay wrote: |
6 years later the IIB 10.0.0.20 KC says:
Quote: |
Integration nodes complete string operations in Universal Character Set coded in 2 octets (UCS-2). If incoming strings are not encoded in UCS-2, they are converted to UCS-2 on arrival. |
|
Some of the IIB/ACE documentation may still be out of date, where it refers to "UCS-2". The term itself has been obsolete for a while now, according to recent versions of the Unicode standard:
Quote: |
UCS-2 ... was documented in earlier editions of [ISO/IEC] 10646 ... This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions ... It no longer refers to an encoding form in either 10646 or the Unicode Standard. |
The documentation should refer to "UTF-16" instead, and should indicate in some other way where it might only support a subset of the Unicode Character Set.
PeterPotkay wrote: |
If IIB is not limited to the range of characters in UCS-2, how does it deal with characters outside that range given the above reference? |
If IIB were still using UCS-2, then it would probably have to throw a conversion exception when trying to convert characters outside that range. |
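As an illustration of that last point, the sketch below shows what a strict UCS-2 encoder would have to do. It is a hypothetical Python function (`encode_ucs2_be` is an invented name) and says nothing about how IIB is actually implemented; it only shows why a true UCS-2 implementation could not accept characters above U+FFFF.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Hypothetical strict UCS-2 encoder: exactly one 16-bit unit per character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no surrogate mechanism, so there is nothing to emit here.
            raise ValueError(f"U+{cp:04X} is outside the UCS-2 range")
        out += cp.to_bytes(2, "big")
    return bytes(out)

# BMP-only text gives the same bytes as UTF-16.
print(encode_ucs2_be("A漢") == "A漢".encode("utf-16-be"))

# A character above U+FFFF has to be rejected.
try:
    encode_ucs2_be("\U00020BB7")
except ValueError as exc:
    print("conversion error:", exc)
```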
|
|
|
timber |
Posted: Fri Jun 05, 2020 1:44 am Post subject: |
|
|
Grand Master
Joined: 25 Aug 2015 Posts: 1290
|
Just to confirm (as if there was any doubt)...that rekarm01 is correct. All character data in the IIB/ACE message tree is in UTF-16, not UCS-2. All Unicode characters can be represented in the message tree - no restrictions at all.
In an ideal world, IBM would correct that page in the Knowledge Center. |
|
|
|
PeterPotkay |
Posted: Fri Jun 05, 2020 8:35 am Post subject: |
|
|
Poobah
Joined: 15 May 2001 Posts: 7717
|
Thanks guys, I thought I remembered reading / hearing that the broker uses UTF-16 internally; I just couldn't find anything official in the KC. |
|
|
|
rekarm01 |
Posted: Fri Jun 05, 2020 3:49 pm Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 1415
|
IBM MQ v9.0 or later can now handle all of Unicode too, no restrictions:
Quote: |
Before Version 9.0, previous versions of the product did not support conversion of data containing Unicode code points beyond the Basic Multilingual Plane (code points above U+FFFF) ... From Version 9.0, IBM MQ supports all Unicode characters ... |
|
|
|
|
|