MQSeries.net :: View topic - IIB 9 Unicode and UTF-8 support clarification needed


IIB 9 Unicode and UTF-8 support clarification needed

marko.pitkanen

Posted: Tue Nov 18, 2014 1:13 am Post subject: IIB 9 Unicode and UTF-8 support clarification needed

Chevalier

Joined: 23 Jul 2008
Posts: 440
Location: Jamsa, Finland

Hi All,

I haven't properly checked whether this subject has already been covered here. If it has, please feel free to point me to the appropriate thread.

The question is: which characters / languages are supported by the broker through MQ, the broker-wide HTTP listener and the file nodes? This is not obvious from the documentation because, for example:

In the documentation page for IIB9 it says that for example UTF-8 has built-in support (no restrictions mentioned).

In the document for MQ Unicode Conversion support it says that the support for UTF-16 and UTF-8 in WebSphere MQ is limited to those Unicode characters that can be encoded in UCS-2.

Are there any restrictions on which characters can be used in the application messages processed by / with IIB 9?

--
Marko

smdavies99

Posted: Tue Nov 18, 2014 1:28 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

marko.pitkanen wrote:

Are there any restrictions on which characters can be used in the application messages processed by / with IIB 9?

AFAIK, no there aren't.

BUT there is a big IF though. The IF is reflected in what I'd guess are several hundred posts here about this sort of thing.

The biggest issue I've seen is the CCSID of the contents differing from the CCSID in the message header/descriptor.
This causes no end of problems.
For example

It is no use having

Code:

<?xml version="1.0" encoding="ISO-8859-1"?>

When the rest of the XML data is actually UTF-8 encoded.

It is no use having MQMD.CodedCharSetId=1208 when the message body is actually encoded as 923.
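A minimal sketch of that kind of mismatch in Python. It assumes the Python codecs `utf-8` and `iso8859_15` as stand-ins for CCSID 1208 and CCSID 923; the sample string is made up for illustration, and none of this is IIB or MQ code:

```python
# Hypothetical sample text; any non-ASCII string shows the same effect.
text = "J\u00e4ms\u00e4"  # "Jämsä"

# Case 1: the body really is UTF-8 (CCSID 1208) ...
utf8_body = text.encode("utf-8")

# ... but the header claims ISO 8859-15 (CCSID 923), so the reader decodes it wrongly.
# No error is raised; the text is silently corrupted.
misread = utf8_body.decode("iso8859_15")
print(repr(text), "->", repr(misread))

# Case 2: the body is ISO 8859-15 but the header claims UTF-8.
latin9_body = text.encode("iso8859_15")
try:
    latin9_body.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # 0xE4 starts a multi-byte UTF-8 sequence, but the next byte is not a continuation byte.
    decode_failed = True
print("strict UTF-8 decode failed:", decode_failed)
```

The first case is the nastier one in practice, because nothing fails at the point where the damage is done.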

Finally, there are times when we see CCSID Conversion done by a channel when it need not have been done.

It is surprising how many architects and developers simply don't understand this.
I first started working on multinational character sets around 1981 and there are still times when I get it wrong.

_________________
WMQ User since 1999
MQSI/WBI/WMB/'Thingy' User since 2002
Linux user since 1995

Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.

marko.pitkanen

Posted: Tue Nov 18, 2014 1:46 am Post subject:

Chevalier

Joined: 23 Jul 2008
Posts: 440
Location: Jamsa, Finland

Thanks,

In theory how would for example the 4-byte UTF-8 Kanji characters

Code:

Normal Kanji ==> UTF-8 octets: 3 bytes ==> Unicode code point: up to U+FFFF (2 bytes in UTF-16)
Rare Kanji ==> UTF-8 octets: 4 bytes ==> Unicode code point: above U+FFFF (4 bytes in UTF-16 or UTF-32)

work in IIB's http or file interface?
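For reference, the byte counts behind that question can be checked with a few lines of Python. This is only a sketch: the two sample characters (U+6F22 and U+2000B) and the helper name `encoded_lengths` are picked here for illustration and say nothing about how IIB behaves:

```python
# Two illustrative characters: one inside the BMP, one outside it.
samples = {
    "common kanji": "\u6f22",      # U+6F22, inside the Basic Multilingual Plane
    "rare kanji": "\U0002000B",    # U+2000B, CJK Extension B, outside the BMP
}

def encoded_lengths(ch):
    """Return the number of bytes one character occupies in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

for label, ch in samples.items():
    print(f"{label}: U+{ord(ch):04X} {encoded_lengths(ch)}")
```

The same code point is valid in all three encoding forms; only the byte sequence differs.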

--
Marko

smdavies99

Posted: Tue Nov 18, 2014 2:45 am Post subject:

Jedi Council

Joined: 10 Feb 2003
Posts: 6076
Location: Somewhere over the Rainbow this side of Never-never land.

The Kanji characters are not UTF-8, they are UTF-16 or UTF-32

The UTF stream uses BOMs (Byte Order Marks)
http://en.wikipedia.org/wiki/Byte_order_mark

to switch between the different types. Remember to get the endianness correct though.
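A BOM, when present, sits at the very start of a stream and signals the encoding form and byte order. A rough sketch of detecting one in Python, using the BOM constants from the standard `codecs` module; the function name `sniff_bom` is made up for this example:

```python
import codecs

# Check the longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding, payload without BOM), or (None, data) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

# Usage: a big-endian UTF-16 stream carrying one kanji.
stream = codecs.BOM_UTF16_BE + "\u6f22".encode("utf-16-be")
encoding, payload = sniff_bom(stream)
print(encoding, payload.decode(encoding) == "\u6f22")
```

A missing BOM proves nothing either way, so the CCSID in the header still has to be right.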

Why don't you try it for yourself?
Look at the message tree

Vitor

Posted: Tue Nov 18, 2014 6:20 am Post subject:

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

marko.pitkanen wrote:

In theory how would for example the 4-byte UTF-8 Kanji characters

Code:

Normal Kanji ==> UTF-8 octets: 3 bytes ==> Unicode code point: up to U+FFFF (2 bytes in UTF-16)
Rare Kanji ==> UTF-8 octets: 4 bytes ==> Unicode code point: above U+FFFF (4 bytes in UTF-16 or UTF-32)

work in IIB's http or file interface?

They would (like all data) be passed into IIB over http from a system that supported those code points, or read from a file on an OS that supported those code points, and then stored in the IIB message tree as UTF-16.

If you then tried to serialise the data into an http put or write it to a file using a CCSID that doesn't support all the code points which are in the message tree at the time of serialisation (and note that this CCSID has no connection with the CCSID of the inbound data) then IIB will abend in the traditional manner & roll back.
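The same failure mode can be illustrated outside the broker. A minimal Python sketch, assuming `iso8859_15` as an example of a target code page that cannot hold every code point; `serialise` and `tree_value` are hypothetical names, and returning `None` here merely stands in for the exception and rollback IIB would give you:

```python
def serialise(text, target_encoding):
    """Encode text for output; return None when the target encoding cannot represent it."""
    try:
        return text.encode(target_encoding)
    except UnicodeEncodeError:
        return None

# Stand-in for a message tree value: one BMP kanji plus one kanji outside the BMP.
tree_value = "\u6f22\U0002000B"

print(serialise(tree_value, "utf-8"))       # works: UTF-8 covers all of Unicode
print(serialise(tree_value, "iso8859_15"))  # None: Latin-9 has no kanji at all
print(serialise("abc", "iso8859_15"))       # works: plain ASCII fits everywhere
```

Note that the outcome depends only on the output encoding and the characters present at that moment, not on how the data arrived.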
_________________
Honesty is the best policy.
Insanity is the best defence.

marko.pitkanen

Posted: Tue Nov 18, 2014 6:51 am Post subject:

Chevalier

Joined: 23 Jul 2008
Posts: 440
Location: Jamsa, Finland

Thanks,

We will try to produce a problem. The question is rather theoretical; I just want to find out whether the same restriction as for MQ

Code:

The support for UTF-16 and UTF-8 in WebSphere MQ is therefore limited to those Unicode characters that can be encoded in UCS-2.

is or isn't true for the broker.

--
Marko

Vitor

Posted: Tue Nov 18, 2014 7:01 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed

Grand High Poobah

Joined: 11 Nov 2005
Posts: 26093
Location: Texas, USA

marko.pitkanen wrote:

In the documentation page for IIB9 it says that for example UTF-8 has built-in support (no restrictions mentioned).

It also lists all of the UTF-16 and UTF-32 variants, and some ISO encodings I've never heard of.

kimbert

Posted: Tue Nov 18, 2014 8:10 am Post subject:

Jedi Council

Joined: 29 Jul 2003
Posts: 5543
Location: Southampton

WMB and IIB can handle all of Unicode. No restrictions, no qualifications.
In particular
- it can handle any valid character in the UTF-8 and UTF-16 encodings, regardless of the number of bytes that it occupies
- it can handle any valid character in UTF-32
- it is not limited to the range of characters in UCS-2. That would restrict the product to the Basic Multilingual Plane (BMP) which would be very limiting.

smdavies99 said:

Quote:

The Kanji characters are not UTF-8, they are UTF-16 or UTF-32

That statement could be misinterpreted.

UTF-8, UTF-16 and UTF-32 encode *all* of Unicode. So any character that is valid in one of those encodings is also valid in both of the others. But the character will be *encoded* differently (it will be represented by a different byte sequence) in each case.

Rare Kanji characters, and any other characters outside of the BMP, will be decoded from their original code page/encoding and will appear in the message tree as a UTF-16 'surrogate pair'. Note that the original code page does not need to be UTF-8 or UTF-16; the character could come in as Shift-JIS or some other non-Unicode encoding.
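To make the surrogate pair concrete, here is a small Python sketch of the standard UTF-16 calculation. The function name `to_surrogate_pair` and the sample character U+2000B are chosen for illustration, and GB18030 is swapped in as the legacy encoding only because Python's codec for it is known to cover code points beyond the BMP; nothing here is taken from the IIB implementation:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

rare = "\U0002000B"
high, low = to_surrogate_pair(ord(rare))
print(f"U+{ord(rare):04X} -> {high:04X} {low:04X}")

# Python's own UTF-16 encoder produces the same two 16-bit units.
utf16 = rare.encode("utf-16-be")
print(utf16.hex())

# The character survives a round trip through a non-UTF legacy encoding as well.
legacy = rare.encode("gb18030")
print(legacy.decode("gb18030") == rare)
```

Anything that counts 16-bit units rather than characters will see such a character as having length 2.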
_________________
Before you criticize someone, walk a mile in their shoes. That way you're a mile away, and you have their shoes too.

marko.pitkanen

Posted: Tue Nov 18, 2014 10:35 am Post subject:

Chevalier

Joined: 23 Jul 2008
Posts: 440
Location: Jamsa, Finland

Thanks everyone for your input.

--
Marko

rekarm01

Posted: Wed Nov 19, 2014 5:19 am Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed

Grand Master

Joined: 25 Jun 2008
Posts: 1415

marko.pitkanen wrote:

In the documentation page for IIB9 it says that for example UTF-8 has built-in support (no restrictions mentioned).

In the document for MQ Unicode Conversion support it says that the support for UTF-16 and UTF-8 in WebSphere MQ is limited to those Unicode characters that can be encoded in UCS-2.

WMB/IIB supports conversion for the entire Unicode character set, but WMQ does not. WMQ only supports conversion for the UCS-2 subset.

kimbert wrote:

WMB and IIB can handle all of Unicode. No restrictions, no qualifications.

Maybe one little qualification for the IIB's MQRFH2 header parser? Specifically for the NameValueCCSID field, does the value 1200 indicate UCS-2, or UTF-16?

kimbert wrote:

In particular
- ... it is not limited to the range of characters in UCS-2 ...

Then at least some of the WMB/IIB documentation may still be out of date, where it refers to the broker/bus using UCS-2 internally.

PeterPotkay

Posted: Thu Jun 04, 2020 4:17 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7722

kimbert wrote:

6 years later the IIB 10.0.0.20 KC says:

Quote:

Integration nodes complete string operations in Universal Character Set coded in 2 octets (UCS-2). If incoming strings are not encoded in UCS-2, they are converted to UCS-2 on arrival.

https://www.ibm.com/support/knowledgecenter/SSMKHH_10.0.0/com.ibm.etools.mft.doc/ac30180_.html

If IIB is not limited to the range of characters in UCS-2, how does it deal with characters outside that range given the above reference?
_________________
Peter Potkay
Keep Calm and MQ On

rekarm01

Posted: Thu Jun 04, 2020 4:14 pm Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed

Grand Master

Joined: 25 Jun 2008
Posts: 1415

PeterPotkay wrote:

6 years later the IIB 10.0.0.20 KC says:

Quote:

Integration nodes complete string operations in Universal Character Set coded in 2 octets (UCS-2). If incoming strings are not encoded in UCS-2, they are converted to UCS-2 on arrival.

Some of the IIB/ACE documentation may still be out of date, where it refers to "UCS-2". The term itself has been obsolete for a while now, according to recent versions of the Unicode standard:

Quote:

UCS-2 ... was documented in earlier editions of [ISO/IEC] 10646 ... This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions ... It no longer refers to an encoding form in either 10646 or the Unicode Standard.

The documentation should refer to "UTF-16" instead, and should indicate in some other way where it might only support a subset of the Unicode Character Set.

PeterPotkay wrote:

If IIB is not limited to the range of characters in UCS-2, how does it deal with characters outside that range given the above reference?

If the IIB were still using UCS-2, then it would probably have to throw a conversion Exception when trying to convert characters outside that range.
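As a thought experiment, a strict UCS-2 encoder would look something like the Python sketch below. `encode_ucs2_be` is a hypothetical function written for this post, not anything from IIB; it only shows why a true UCS-2 implementation has to reject characters above U+FFFF, whereas UTF-16 represents them with two 16-bit units:

```python
import struct

def encode_ucs2_be(text):
    """Encode text as big-endian UCS-2: exactly one 16-bit unit per character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no surrogate mechanism, so there is nothing valid to emit.
            raise ValueError(f"U+{cp:04X} cannot be represented in UCS-2")
        out += struct.pack(">H", cp)
    return bytes(out)

# Inside the BMP, UCS-2 and UTF-16 produce identical bytes.
print(encode_ucs2_be("\u6f22") == "\u6f22".encode("utf-16-be"))

# Outside the BMP, UCS-2 fails while UTF-16 uses four bytes.
try:
    encode_ucs2_be("\U0002000B")
except ValueError as err:
    print("UCS-2:", err)
print("UTF-16:", "\U0002000B".encode("utf-16-be").hex())
```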

timber

Posted: Fri Jun 05, 2020 1:44 am Post subject:

Grand Master

Joined: 25 Aug 2015
Posts: 1292

Just to confirm (as if there was any doubt)...that rekarm01 is correct. All character data in the IIB/ACE message tree is in UTF-16, not UCS-2. All Unicode characters can be represented in the message tree - no restrictions at all.

In an ideal world, IBM would correct that page in the Knowledge Center.

PeterPotkay

Posted: Fri Jun 05, 2020 8:35 am Post subject:

Poobah

Joined: 15 May 2001
Posts: 7722

Thanks guys, I thought I remembered reading / hearing that the broker uses UTF-16 internally, I just couldn't find anything official in the KC.

rekarm01

Posted: Fri Jun 05, 2020 3:49 pm Post subject: Re: IIB 9 Unicode and UTF-8 support clarification needed

Grand Master

Joined: 25 Jun 2008
Posts: 1415

marko.pitkanen wrote:

In the document for MQ Unicode Conversion support it says that the support for UTF-16 and UTF-8 in WebSphere MQ is limited to those Unicode characters that can be encoded in UCS-2.

IBM MQ v9.0 or later can now handle all of Unicode too, no restrictions:

Quote:

Before Version 9.0, previous versions of the product did not support conversion of data containing Unicode code points beyond the Basic Multilingual Plane (code points above U+FFFF) ... From Version 9.0, IBM MQ supports all Unicode characters ...
