Unicode in ABAP

Having worked with ABAP most of my career, I have not had to care a lot (if ever) about Unicode, maybe due to the fact that I work in a country where all SAP systems only use English (with very, very rare exceptions).

Only in the last few years have I come across the need to worry about Unicode on the rare occasion, and of course, as a developer, you find that it is actually necessary to know more about it.

Before I launch into this discussion, I must first point to this brilliant article by Joel Spolsky, which, as its title suggests, is “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”. That article is a very insightful read, starting with the history of encodings. If you have not yet had a chance to do so, do yourself a favour and read it.

Those of us who were around when SAP first introduced Unicode-enabled systems may still labour under the misconception that Unicode is simply about storing every character in two bytes, thus doubling your storage requirements. That is not true in general, even though it was the original intent of the people who came up with the idea, and even though on a particular system it may technically turn out to be the case. Unicode in fact encompasses a whole family of encodings with different mechanisms for storing characters.

As a developer, you can go through your career trying to sidestep the matter of encodings entirely, but at some point you will be confronted with it, and then you can choose to fight or flee. Of course, you don’t have to flee. As Joel points out in his article: “IT’S NOT THAT HARD.” Really, it is neither scary nor difficult.

As an example, yesterday I was working on a CRM survey, when in the debugger, I saw this:

Now the first reaction might be that this is binary data. However, I know that what I am looking at is supposed to be an HTML template. In fact, if you take the hexadecimal representation of this string (it is a STRING variable), which is available in the new debugger in the next field, or in the old debugger by clicking the icon next to the field, and paste it as hexadecimal into a hex editor like XVI32 (Edit -> Clipboard -> Paste from hex string), you will see that it is in fact, as suspected, an HTML document.

So what went wrong?

Well, the answer is that on this Unicode ABAP system, the encoding used is UTF-16LE, the result being that each character in a string is in fact stored in two bytes. (The LE stands for Little Endian, which simply refers to the order in which the two bytes are stored: least significant byte first. Unicode systems running on big-endian hardware use UTF-16BE instead.) A way we can see this on a Unicode system is by looking at the hex value in the debugger of a character string containing only Latin characters in a variable.

What gives it away is that every second byte is a null byte in this case. The Latin characters themselves require only one byte to store, and the fact that the second byte is the null byte tells us it is little endian. We can take this hex string and put it into XVI32, which will show the Latin characters with every second byte being a null byte.
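
To reproduce this outside the debugger, a small sketch along the following lines should do. It assumes a Unicode system on which CL_ABAP_CONV_OUT_CE accepts the SAP code page number as its encoding, and the variable names are made up for illustration. Encoded as UTF-16LE, the text ABC is the byte sequence 41 00 42 00 43 00, which is what the last WRITE is meant to show.

      data: lr_out  type ref to cl_abap_conv_out_ce,
            lv_text type string,
            lv_hex  type xstring.

      lv_text = `ABC`.

      " Size of one character in bytes and byte order (L or B) of this system
      write: / cl_abap_char_utilities=>charsize,
             / cl_abap_char_utilities=>endian.

      " Encode explicitly as UTF-16LE (code page 4103) and look at the raw bytes
      lr_out = cl_abap_conv_out_ce=>create( encoding = '4103' ).
      lr_out->convert( exporting data   = lv_text
                       importing buffer = lv_hex ).
      write: / lv_hex.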

When we compare this to the string we had above, which looked like complete gobbledygook in the debugger and which, for that matter, cannot be processed by ABAP string processing commands, we realize that the other string did not have a null (00) byte in every second position. The reason: the string was stored with a different encoding. We can guess that it is probably UTF-8, and in fact, that is a pretty safe bet. The reason we see a lot of unexpected characters in the string is that the ABAP kernel expects it to be in UTF-16LE, and as a result, it interprets each sequence of two bytes as one character.
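
The effect can be imitated with a handful of bytes. The sketch below is only an illustration, not the code from the survey program: it takes the six UTF-8 bytes of <html> and reads them once as UTF-16LE and once as UTF-8 using CL_ABAP_CONV_IN_CE. The variable names are hypothetical, and it again assumes the class accepts SAP code page numbers as the encoding.

      data: lr_in    type ref to cl_abap_conv_in_ce,
            lv_utf8  type xstring,
            lv_wrong type string,
            lv_right type string.

      " The six bytes of <html> as UTF-8 (same as ASCII for these characters)
      lv_utf8 = '3C68746D6C3E'.

      " Read as UTF-16LE (4103): each pair of bytes becomes one character,
      " namely U+683C, U+6D74 and U+3E6C instead of <html>
      lr_in = cl_abap_conv_in_ce=>create( encoding = '4103'
                                          input    = lv_utf8 ).
      lr_in->read( importing data = lv_wrong ).

      " Read as UTF-8 (4110): the six characters we expected
      lr_in = cl_abap_conv_in_ce=>create( encoding = '4110'
                                          input    = lv_utf8 ).
      lr_in->read( importing data = lv_right ).

Three characters from the CJK ranges where six Latin ones were expected is exactly the kind of gobbledygook seen in the debugger.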

So now that we know that, we need a way to convert the strings. Firstly, we need to convert our HTML template, which we assume is stored in UTF-8, to UTF-16LE, so that we can process it in ABAP, and afterwards convert it back, because we see further in the code that the program actually tells the browser to expect the response in UTF-8. (We could change that too, but I didn’t feel like messing too much with the original code).

The tools that SAP provides for handling the conversions in ABAP come in the form of some classes, one of which is CL_ABAP_CONV_OBJ. The way to do this is to instantiate the class, giving it the source and target encodings. The only trick is that you have to know the codes that SAP uses to refer to each encoding. You can get a list of these with transaction SCP. You can also see what the code page of the application server is with transaction SNL1.
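
If you would rather not look the code up by hand, the number can probably also be determined at runtime. The following is a sketch, assuming the function module SCP_CODEPAGE_BY_EXTERNAL_NAME is available on your release with the parameters shown; for UTF-8 it should return the 4110 used in this article.

      data: lv_codepage type cpcodepage.

      " Translate the external encoding name into the SAP code page number
      call function 'SCP_CODEPAGE_BY_EXTERNAL_NAME'
        exporting
          external_name = 'UTF-8'
        importing
          sap_codepage  = lv_codepage
        exceptions
          not_found     = 1
          others        = 2.
      if sy-subrc = 0.
        write: / lv_codepage.
      endif.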

The following code snippet shows you the basics of doing conversion of one encoding to another for a string:

      " svy_content holds the UTF-8 template from the surrounding program
      data: lv_content type string.
      data: lr_conv type ref to cl_abap_conv_obj.
      create object lr_conv
        exporting
          incode  = '4110'                                  "UTF-8
          outcode = '4103'.                                 "UTF-16LE
      call method lr_conv->convert
        exporting
          inbuff    = svy_content
          outbufflg = 0
        importing
          outbuff   = lv_content.

Fortunately for you, you don’t have to calculate the output buffer length, even though it is a required parameter. You can simply pass 0 to get back the whole converted string.

At least after the conversion to UTF-16LE, you are able to use ABAP string processing commands like FIND and REPLACE on the string. For more information, refer to this article on SDN which describes CL_ABAP_CONV_OBJ, as well as this one (“Character encoding conversion”).
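
For completeness, here is a sketch of the remaining steps: processing the text and converting it back to UTF-8. It continues with the lv_content and svy_content variables from the conversion snippet and assumes that CL_ABAP_CONV_OBJ works the same way in the opposite direction. The %NAME% placeholder and the lr_back variable are made up for illustration and do not come from the original program.

      data: lr_back type ref to cl_abap_conv_obj.

      " Work on the converted text with normal string commands
      " ('%NAME%' is a made-up placeholder, not part of the real template)
      replace all occurrences of '%NAME%' in lv_content with 'World'.

      " Convert back from UTF-16LE to UTF-8 before handing the result on
      create object lr_back
        exporting
          incode  = '4103'                                  "UTF-16LE
          outcode = '4110'.                                 "UTF-8
      call method lr_back->convert
        exporting
          inbuff    = lv_content
          outbufflg = 0
        importing
          outbuff   = svy_content.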

Well, that’s all from my side. If you are a Unicode expert or have paid better attention to Joel Spolsky’s article, you will probably find some glaring inaccuracies in some of my statements above. I am, after all, not an expert on the matter, but knowing what is out there already puts you miles ahead in terms of problem solving.

Martin Ceronio is a husband, father and SAP ABAP programmer by day with a diverse interest in all things that go bleep or bloop.
