Vocabulary/UnicodeCodePoint

Unicode Code Point (UCP)

A Unicode Code Point (UCP) is a number in the code space covered by the Unicode standard, which attempts to define a universal character set for computers. The home website for this standard is Unicode.org.

Originally a UCP was a number from 0 to 65535, i.e. representable by a 16-bit code, but further frames were added later. The original code space (0 to 65535), which is (i.2^16), has been renamed Frame 0 (the Unicode standard itself calls the frames planes, and Frame 0 the Basic Multilingual Plane). The full Unicode code space now covers up to 1114111 (16b10ffff), an additional 16 frames.
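As a cross-check outside J, the arithmetic above can be sketched in Python. The names PLANE_SIZE, PLANE_COUNT and plane_of are illustrative only and do not come from J or from the Unicode standard.

```python
# Unicode code space: 17 planes ("frames" in this article) of 65536 code points each.
PLANE_SIZE = 2 ** 16       # size of Frame 0, i.e. the original 16-bit code space
PLANE_COUNT = 17           # Frame 0 plus the 16 additional frames
MAX_CODE_POINT = PLANE_SIZE * PLANE_COUNT - 1


def plane_of(code_point):
    """Return the frame (plane) number, 0 to 16, that holds code_point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point // PLANE_SIZE


print(MAX_CODE_POINT, hex(MAX_CODE_POINT))  # 1114111 0x10ffff
print(plane_of(960), plane_of(128512))      # 0 1
```

So pi (960) lives in Frame 0, while the happy face emoji (128512) used later on this page lives in the first additional frame.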

J has three datatypes for UCPs:

  1. unicode4 type, which can store any UCP in a single code unit.
  2. unicode type, which covers Frame 0 unicode characters with single code units and other UCPs with surrogate pairs.
  3. literal type, which can also display unicode characters using utf-8 encoding, producing between 1 and 4 code units (integers ranging from 0 to 255) that specify the UCP, e.g. 240 159 152 128 { a.

Byte-precision characters are still available to store ASCII characters and general bytes for interacting with external hardware and software. We will call a noun with one of the extended precisions a unicode, just as we call a noun with the simple 8-bit precision a byte.

The Unicode.org convention for a UCP is to show it not as a decimal number, e.g. 960, but as a string based on its hexadecimal representation, viz. U+03C0.
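The conversion between the decimal number and the U+ string is simple enough to sketch. The following is a minimal Python illustration (not J), and the helper names to_u_notation and from_u_notation are hypothetical.

```python
def to_u_notation(code_point):
    """Format a code point as U+ followed by at least four uppercase hex digits."""
    return f"U+{code_point:04X}"


def from_u_notation(text):
    """Parse a string such as 'U+03C0' back into its decimal code point."""
    if not text.upper().startswith("U+"):
        raise ValueError("expected a string such as U+03C0")
    return int(text[2:], 16)


print(to_u_notation(960))         # U+03C0
print(to_u_notation(128512))      # U+1F600
print(from_u_notation("U+03C0"))  # 960
```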

U+03C0 is the UCP for the symbol pi (π) as you'll find it in most mathematical texts on the web. It also happens to be the Greek letter π -- but this double-usage cannot be taken for granted. For instance, it is not true for the engineering symbol µ -- which is not the Greek letter μ but a special character from a heritage 8-bit character set, now called Latin 1.


To look up a given UCP at Unicode.org

Suppose you want to look up the UCP U+03C0 at Unicode.org, to find its glyph, plus the writing system or character set it belongs to.

Go to the webpage titled Unicode 6.3 Character Code Charts. Near the top you'll see a field labelled Find chart by hex code:

Enter the hex code 03C0 (upper- or lowercase will do) and click Go (or press Enter). This will display a choice of links. Click the first, which (currently) downloads a file: U0370.pdf titled Greek and Coptic / Range: 0370–03FF. There you will discover π (U+03C0) in column 03C, row 0.


Using a given UCP in a J noun

Once you've discovered your symbol, and have opened its code chart U0370.pdf, you will find that the glyphs as displayed can be copy/pasted into other documents, including the J session (IJX) or a J window (IJS).

For instance you can paste your copied symbol into a J string to represent the well-known mathematical formula for the circumference of a circle: C = 2 π r

   ] z=: 'C=2πr'
C=2πr
   datatype z  NB. shows the precision of z.
literal
   NB. ..."literal" as returned by stdlib verb: datatype means "byte".
   $z
6

WARNING: As you see above, z does not automatically become a unicode simply because the symbol Ο€ has been pasted into it. Rather it stays as a byte, just as it would if you omitted Ο€.

Notice also that z contains 6 atoms, not 5, as you'd expect from counting the glyphs in the formula. The reason is because Ο€ occupies two-byte atoms, not one.

   3{.z
C=2
   4{.z
C=2οΏ½
   5{.z
C=2Ο€
   6{.z
C=2Ο€r

But how can a non-ASCII (or non-Latin 1) symbol such as π be stored as a list of bytes? The answer is, by encoding the byte-list in the utf-8 standard. The J session, and the IJS window, always use this standard when displaying a unicode symbol of the literal type.

A utf-8 string is another way of displaying a UCP. Each ASCII character resides in just 1 byte, but a UCP outside the ASCII code space requires from 1 to 3 additional bytes. The symbol π happens to occupy 2 bytes. You can see this clearly if you box each atom of z ...

   <"0 z
+-+-+-+-+-+-+
|C|=|2|οΏ½|οΏ½|r|
+-+-+-+-+-+-+

Bug: an invalid UTF-8 sequence (viz. οΏ½ here) corrupts the box structure.
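The same byte counts can be confirmed outside J. This is a minimal Python sketch (not J code) in which the names formula and encoded are illustrative. It shows the 5 glyphs, the 6 utf-8 bytes, and what happens when the byte list is cut in the middle of π, as in 4{.z above.

```python
formula = "C=2\u03c0r"              # \u03c0 is the pi symbol, U+03C0
encoded = formula.encode("utf-8")   # the byte list that a J literal would hold

print(len(formula))    # 5
print(len(encoded))    # 6
print(list(encoded))   # [67, 61, 50, 207, 128, 114]

# Taking only the first 4 bytes splits the two-byte encoding of pi, so the
# incomplete tail decodes to the placeholder U+FFFD (65533).
cut = encoded[:4].decode("utf-8", errors="replace")
print([ord(ch) for ch in cut])  # [67, 61, 50, 65533]
```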

How can J distinguish a utf-8 encoded symbol (u-symbol) from an ordinary ASCII character? ASCII characters are encoded with a single byte, characters beyond the range of ASCII are represented by multiple bytes.

The utf-8 standard ensures that if the first byte code of the u-symbol lies within the ASCII code-space, viz. by holding a value less than 128(16b80), then it will only require a single byte to encode the code unit. If the first code unit is from 194 (16bc2) to 223 (16bdf), that means there will be a second code unit in the range of 128 (16b80) to 191 (16bbf). If the first code unit is in the range of 224 (16be0) to 239 (16bef) then there will be two code units following it and if the code unit is in the range of 240 (16bf0) to 244 (16bf4) then there will be three code units following. Notice that the possible ranges for the first code units and the trailing code units are all disjoint, so that missing code units can be immediately detected - either by not having a valid lead code unit or by not having the right number of trailing code units. Encodings that are not well-formed are shown with a placeholder: the non-displayable character: οΏ½. Thus when individually boxed, or extracted using { or {. , the first character (and often all the characters) of a u-symbol appear as οΏ½.

   8 u: 65
A
   3 u: 8 u: 65  NB. ASCII equivalent
65
   datatype 8 u: 65
literal
   8 u: 295
ħ
   3 u: 8 u: 295
196 167
   datatype 8 u: 295
literal
   8 u: 3101
ఝ
   3 u: 8 u: 3101
224 176 157
   datatype 8 u: 3101
literal
   8 u: 128512
😀
   3 u: 8 u: 128512
240 159 152 128
   datatype 8 u: 128512
literal
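The lead code unit ranges described above can be turned into a small lookup. What follows is a hedged Python sketch rather than anything J does internally, and utf8_length is a hypothetical helper name. It reports how many code units an encoding should have, judged from its first code unit alone, and checks that against the four code points used in the session above.

```python
def utf8_length(lead):
    """Number of utf-8 code units announced by a lead code unit, or 0 if it cannot lead."""
    if 0 <= lead <= 127:
        return 1
    if 194 <= lead <= 223:
        return 2
    if 224 <= lead <= 239:
        return 3
    if 240 <= lead <= 244:
        return 4
    return 0  # 128 to 193 and 245 to 255 never start a well-formed encoding


for code_point in (65, 295, 3101, 128512):
    data = list(chr(code_point).encode("utf-8"))
    print(code_point, data, utf8_length(data[0]))
# 65 [65] 1
# 295 [196, 167] 2
# 3101 [224, 176, 157] 3
# 128512 [240, 159, 152, 128] 4
```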

If you want to make a unicode noun, not bytes, to hold the u-symbol π, then you must explicitly convert the (utf-8 encoded) bytes to a unicode using the J primitive (u:), together with the appropriate x-argument, in this case 7.

   ] zz=: 7 u: z
C=2πr
   $zz
5
   datatype zz
unicode

Notice that zz now consists of 5 atoms, as you'd originally hoped for, π being represented by a single atom. You can see this clearly if you box each atom of zz ...

   <"0 zz
+-+-+-+-+-+
|C|=|2|π|r|
+-+-+-+-+-+

Just as numbers having different precisions can be combined under addition, etc., the result having the highest of the precisions, so unicode and bytes can be combined using (,). Thus:

   ] pi=: u: 960
π
   datatype pi
unicode
   datatype each 'C=2' ; 'r'
+-------+-------+
|literal|literal|
+-------+-------+
   NB. ..."literal" as returned by stdlib verb: datatype means "byte"
   ] zzz=: 'C=2' , pi , 'r'
C=2πr
   datatype zzz
unicode

Surrogate Pairs in unicode and unicode4

First, some background. The Unicode standard has a codespace of 0 to 1114111 (16b10ffff) with a gap from 55296 (16bd800) to 57343 (16bdfff) (which is reserved for the surrogate pairs).
This means that only codepoints from 0 to 55295 (16bd7ff) and 57344 (16be000) to 1114111 (16b10ffff) can represent characters.
Three common encoding schemes for Unicode are utf-8, utf-16 and utf-32 (the number indicating the number of bits in the code unit).

The encoding utf-32 has enough bits to represent all of the Unicode code points as single integers, but J's unicode4 type is just an approximation of utf-32, since unicode4 encodes integers between 55296 (16bd800) and 57343 (16bdfff), while utf-32 does not.
This is convenient because it means that unicode4 can work with surrogate pairs the same way as utf-16 (corresponding to the J unicode type), but it also means that there are times that the unicode4 representation could have a two integer encoding when utf-32 would always be a one integer encoding.
Also, unlike utf-32, J unicode4 accepts integers greater than 1114111 (16b10ffff), although the results have no meaning, as there are no characters associated with these code points.

   9 u: 55357 56832  NB. A surrogate pair representing the happy face emoji
😀
   $ 9 u: 55357 56832  NB. utf-32 would always be one integer, unicode4 allows two
2
   3 u: 9 u: 55357 56832  NB. Keeps result as a surrogate pair
55357 56832
   datatype 9 u: 55357 56832
unicode4

   9 u: 128512  NB. Proper utf-32 for 😀
😀
   $ 9 u: 128512  NB. J unicode4 returns an atom (shape is empty) for a single character

   datatype 9 u: 128512  NB. Unicode4 type
unicode4


So, why have surrogate pairs? That concept is motivated by utf-16. In order to cover the entire codespace up to 1114111 (16b10ffff) by using at most two code units, utf-16 uses surrogate pairs, integers ranging from 55296 (16bd800) to 57343 (16bdfff).
The first integer of the pair is a value from 55296 (16bd800) to 56319 (16bdbff) and the second integer ranges from 56320 (16bdc00) to 57343 (16bdfff).
The ranges of the first and second integers are disjoint, providing an encoding scheme that can easily validate the surrogate pair.
The 1024 x 1024 possible values of surrogate pairs allow a mapping of the code points from U+10000 to U+10FFFF, which are beyond the reach of a single 16-bit code unit but are covered by two 16-bit code units.

  3 u: 7 u: 16bffff   NB. Top of range for one 16 bit code unit
65535
  3 u: 7 u: 16b10000  NB. Maps to surrogate pairs
55296 56320

  7 u: 128512  NB. Using our happy face emoji example
πŸ˜€
  $ 7 u: 128512 NB. For J type unicode, a conversion to a surrogate pair is required
2
  3 u: 7 u: 128512
55357 56832
  datatype 7 u: 128512  
unicode

   7 u:  55357 56832  NB. Surrogate pairs are also accepted as input directly
πŸ˜€
    $ 7 u:  55357 56832
2
    3 u: 7 u:  55357 56832
55357 56832
   datatype 7 u:  55357 56832
unicode
   7 u:  55357  NB. A single component of a surrogate pair is an error
οΏ½οΏ½οΏ½
   
   7 u: 3101  NB. Characters with encodings below 16b10000 are represented by one integer encoding
ఝ
   $ 7 u: 3101  NB. J unicode type always returns lists unlike J unicode4 (see $ 9 u: 128512 result above)
1
   3 u: 7 u: 3101
3101
   datatype 7 u: 3101
unicode
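The mapping from a code point above 16bffff to its surrogate pair is plain integer arithmetic. Here is a minimal Python sketch of the standard utf-16 rule (the function name to_surrogates is illustrative, and this is not how J is implemented internally, only a way to reproduce the numbers shown above).

```python
def to_surrogates(code_point):
    """Split a code point from U+10000 to U+10FFFF into (first, second) utf-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # a 20-bit number
    first = 0xD800 + (offset >> 10)        # top 10 bits, 55296 to 56319
    second = 0xDC00 + (offset & 0x3FF)     # low 10 bits, 56320 to 57343
    return first, second


print(to_surrogates(0x10000))  # (55296, 56320)
print(to_surrogates(128512))   # (55357, 56832)
```

The two 10-bit halves are where the 1024 x 1024 figure comes from.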

Verbs to convert literal, unicode and unicode4 encodings to valid Unicode code points

utfbox

utfbox is a monadic verb that returns the boxed encodings of any literal, unicode, unicode4, integer or binary argument. It does not do any error checking or Unicode code point verification, but it returns the boxed encodings that the ucp verb would evaluate.

utfbox=: ((1 (0) } (128&> +. (193&< *. 16bdbff&>:) +. 16bdfff&< )) < ;. 1 ] ) "1 @: ((3&u:)^:(1 4 -.@e.~ 3!:0))

    utfbox 240 160 190 190 75 55357 56832 236 190 190 128512 3101
┌───────────────┬──┬───────────┬───────────┬──────┬────┐
│240 160 190 190│75│55357 56832│236 190 190│128512│3101│
└───────────────┴──┴───────────┴───────────┴──────┴────┘
    utfbox 'ðK😀ఝ'
┌───────┬──┬───────────────┬───────────┐
│195 176│75│240 159 152 128│224 176 157│
└───────┴──┴───────────────┴───────────┘
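The partitioning idea behind utfbox can be sketched in Python for readers who find the tacit J definition dense. This is an illustration of the technique, not a translation of the verb: the names starts_encoding and utf_groups are hypothetical, and it assumes that trailing integers are those from 128 to 193 or from 56320 (16bdc00) to 57343 (16bdfff).

```python
def starts_encoding(n):
    """True if n can begin an encoding; false for trailing utf-8 units and second surrogates."""
    return n < 128 or 193 < n < 0xDC00 or n > 0xDFFF


def utf_groups(values):
    """Cut a list of integers into groups, starting a new group at every lead integer."""
    groups = []
    for index, n in enumerate(values):
        if index == 0 or starts_encoding(n):
            groups.append([n])      # the first position always starts a group
        else:
            groups[-1].append(n)    # a trailing integer joins the current group
    return groups


print(utf_groups([240, 160, 190, 190, 75, 55357, 56832, 236, 190, 190, 128512, 3101]))
# [[240, 160, 190, 190], [75], [55357, 56832], [236, 190, 190], [128512], [3101]]
```

Like utfbox, this only groups the integers; it does not check that each group is well-formed.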

ucp

ucp is a monadic verb that returns the Unicode code points for any literal, unicode, unicode4, integer or binary argument. It is useful because, once converted to Unicode code points, every literal and unicode character is represented by a single integer. This makes operations such as Shape $ much more consistent, because the extra integers created by the utf-8 and utf-16 encodings no longer need to be taken into account. When manipulating code point values it is not possible to break in the middle of an encoding, which eliminates the non-displayable characters created by rows ending in the middle of well-formed encodings.

      3 3 $ ": 9 u: 128512      NB. Display of literal in 3 X 3 Matrix
  οΏ½οΏ½οΏ½
  οΏ½οΏ½οΏ½
  οΏ½οΏ½οΏ½
      3 u: 3 3 $ ": 9 u: 128512  NB. Utf-8 encoding of literal in 3 X 3 Matrix
  240 159 152
  128 240 159
  152 128 240
      3 3 $ ucp ": 9 u: 128512  NB. Utf-8 converted to Unicode code point
  128512 128512 128512
  128512 128512 128512
  128512 128512 128512
      9 u:"1 [ 3 3 $ ucp ": 9 u: 128512  NB. Utf-8 converted to Unicode code point then unicode4
  πŸ˜€πŸ˜€πŸ˜€
  πŸ˜€πŸ˜€πŸ˜€
  πŸ˜€πŸ˜€πŸ˜€
  
      3 u: 7 u: 128512
  55357 56832             NB. utf-16 encoding (surrogate pair)
      3 u: ": 9 u: 128512 
  240 159 152 128         NB. utf-8 encoding
      ucp ": 9 u: 128512
  128512
      ucp 3 u: 7 u: 128512
  128512
      ucp 240 159 152 128  NB. also allows direct integer entry of encoding
  128512

This also allows unicode4 in J, which supports surrogate pairs, to be converted into true Unicode code points which can only be single integers.
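Going from a surrogate pair back to a single code point is the reverse of the utf-16 split. A minimal Python sketch of that standard rule follows (the name combine_surrogates is illustrative only).

```python
def combine_surrogates(first, second):
    """Combine a utf-16 surrogate pair into the single code point it stands for."""
    if not 0xD800 <= first <= 0xDBFF:
        raise ValueError("first integer must be from 55296 to 56319")
    if not 0xDC00 <= second <= 0xDFFF:
        raise ValueError("second integer must be from 56320 to 57343")
    # Each surrogate carries 10 bits; the pair addresses code points from U+10000 upwards.
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)


print(combine_surrogates(55357, 56832))  # 128512
print(combine_surrogates(55296, 56320))  # 65536
```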

The ucp verb first establishes that the arguments are either literal, unicode, unicode4, integer or binary types. If they are not then the message: 'Valid arguments must be integer, literal, unicode or unicode4.' is returned.

If the arguments are literal, unicode or unicode4 then 3&u: is applied to return numerical representation. Binary and integer types are already in that form so they are passed straight through.

The integers are boxed a row at a time, then opened and processed through utf separately to avoid issues with padding. The final result is a rank 1 list of valid unicode code points, where any encodings that are not well-formed are replaced with 65533 (the error symbol).

ucp=:  ('Valid arguments must be integer, literal, unicode or unicode4.'"_) ` (; @:(utf each @: <"1 @: ((3&u:)^:(1 4 -.@e.~ 3!:0)))) @. (1 2 4 131072 262144 e.~ 3!:0 )
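For comparison, Python's built-in utf-8 decoder follows a similar policy of substituting 65533 (U+FFFD) for input that is not well-formed. The sketch below is only a point of reference, not an equivalent of ucp: the helper name code_points is hypothetical, it handles utf-8 bytes only, and the number of 65533 values Python emits for a given malformed run may differ from what the J verbs on this page return.

```python
def code_points(data):
    """Decode utf-8 bytes to a list of code points, using 65533 where decoding fails."""
    return [ord(ch) for ch in data.decode("utf-8", errors="replace")]


print(code_points(bytes([240, 159, 152, 128])))       # [128512]
print(code_points(bytes([67, 61, 50, 207, 128, 114])))  # [67, 61, 50, 960, 114]

# A truncated encoding yields the replacement value instead of raising an error.
broken = code_points(bytes([75, 240, 159, 152]))
print(broken[0], 65533 in broken)                     # 75 True
```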

utf

The utf monadic verb partitions its argument so that any integer from 128 to 193, or from 16bdc00 to 16bdfff (the second half of a surrogate pair), is marked with a 0 and all other integers are marked with a 1, indicating the start of an encoding. The first position is forced to be a 1 since it is the start of the line. The argument is cut at the 1 values, which separates the integers into their individual encodings. Each partition is then fed to the check verb, which returns the Unicode code point if the encoding is well-formed, or the error value 65533 if it is not.


utf=: ; @: ; @: ((1 (0) } (128&> +. (193&< *. 16bdbff&>:) +. 16bdfff&< )) check ;. 1 ] )

check

check is a monadic verb that takes the individual encodings and returns the unicode code point if it is a well-formed encoding and 65533 if it is not well-formed.

The process is based on the value of the first integer in the encoding, as this determines the allowable trailing integers. Results are returned in boxed form, since an error can be represented as a valid Unicode code point followed by an error code. Boxing avoids padding results with 0's, which could be confusing since 0 is also a valid Unicode code point.

The process is described line by line for the different cases, as conversion between J unicode types and Unicode code point requires careful inspection.

check=: 3 : 0
 select.  127 193 223 224 236 237 239 240 243 244 16bd7ff 16bdbff 16bdfff 16b10ffff I.{.y
 case.    0     do. if. (0 <: y)                            NB. Single integer less than 128 is a valid ucp if it is not negative
                        do. if. (1 = # y)                   NB. Check if encoding is one integer
                                do. < y                     NB. Encoding is one integer long and is valid ucp
                                else. ({. y) ; 65533 end.   NB. First integer is valid 0-127, trailing integers are error      
                        else. <65533 end.                   NB. Negative integer is always an error

 case. 1;12;14  do. <65533          NB. lead integer cannot be (between 128 and 193) or (between 16bdc00 and 16bdfff) or (greater than 16b10ffff) - error is returned

 case.    2     do. if. (2 = # y)                                               NB. Check to see if the encoding is 2 integers as it should be with lead integer between 194 and 223
                        do. if. (1 = 127 191 I. {:y )                           NB. Is the second integer between 128 and 191 
                                do. < 3 u: 9 u: y{a.                            NB. If 2 integers and the second integer is between 128 and 191 then create the ucp
                                else. <65533 end.                               NB. If 2 integers and the second integer is not between 128 and 191 then error
                        else. if. (2 < # y)                                     NB. Check for more than 2 integers 
                                  do. t;(65533 ~: t=.; ucp (2{.y){a.) # 65533   NB. If more than 2 integers create a ucp from first two if you can and append an error for the extra trailing values, otherwise just an error
                                  else. <65533 end. end.                        NB. Less than 2 integers in the encoding is an error

 case.    3     do. if. (3 = # y)                                               NB. Check to see if the encoding is 3 integers as it should be with lead integer of 224
                        do. if. (1 = 159 191 I. 1{y) *. 1 = 127 191 I. {:y      NB. Is the second integer between 160 and 191 and the third integer between 128 and 191
                                do. < 3 u: 9 u: y{a.                            NB. If 3 integers and other conditions are met then create the ucp
                                else. <65533 end.                               NB. If 3 integers and other conditions are not met then an error
                        else. if. (3 < # y)                                     NB. Check for more than 3 integers
                                  do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533   NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
                                  else. <65533 end. end.                        NB. Less than 3 integers in the encoding is an error

 case.   4;6    do. if. (3 = # y)                                               NB. Check to see if the encoding is 3 integers as it should be with lead integers between 225 - 236  or 238 - 239
                        do. if. (1 = 127 191 I. 1{y) *. 1 = 127 191 I. {:y      NB. Are the second and third integers between 128 and 191
                                do. < 3 u: 9 u: y{a.                            NB. If 3 integers and other conditions are met then create the ucp
                                else. <65533 end.                               NB. If 3 integers and other conditions are not met then an error
                        else. if. (3 < # y)                                     NB. Check for more than 3 integers
                              do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533       NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
                              else. <65533 end. end.                            NB. Less than 3 integers in the encoding is an error

 case.    5     do. if. (3 = # y)                                               NB. Check to see if the encoding is 3 integers as it should be with lead integer of 237
                        do. if. (1 = 127 159 I. 1{y) *. 1 = 127 191 I. {:y      NB. Is the second integer between 128 and 159 and the third integer between 128 and 191
                                do. < 3 u: 9 u: y{a.                            NB. If 3 integers and other conditions are met then create the ucp
                                else. <65533 end.                               NB. If 3 integers and other conditions are not met then an error
                        else. if. (3 < # y)                                     NB. Check for more than 3 integers
                                  do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533   NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
                                  else. <65533 end. end.                        NB. Less than 3 integers in the encoding is an error
 
 case.    7     do. if. (4 = # y)                                                                     NB. Check to see if the encoding is 4 integers as it should be with lead integer of 240
                        do. if. (1 = 143 191 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y     NB. Is the second integer between 144 and 191 and the third and fourth integers between 128 and 191 
                                do. < 3 u: 9 u: y{a.                                                  NB. If 4 integers and other conditions are met then create the ucp
                                else. <65533 end.                                                     NB. If 4 integers and other conditions are not met then an error
                        else. if. (4 < # y)                                                           NB. Check for more than 4 integers
                                  do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533                         NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error
                                  else. <65533 end. end.                                              NB. Less than 4 integers in the encoding is an error

 case.    8     do. if. (4 = # y)                                                                     NB. Check to see if the encoding is 4 integers as it should be with lead integer between 241 -243
                        do. if. (1 = 127 191 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y     NB. Are the second, third and fourth integers between 128 and 191
                                do. < 3 u: 9 u: y{a.                                                  NB. If 4 integers and other conditions are met then create the ucp
                                else. <65533 end.                                                     NB. If 4 integers and other conditions are not met then an error
                        else. if. (4 < # y)                                                           NB. Check for more than 4 integers 
                                  do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533                         NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error 
                                  else. <65533 end. end.                                              NB. Less than 4 integers in the encoding is an error

 case.    9     do. if. (4 = # y)                                                                     NB. Check to see if the encoding is 4 integers as it should be with lead integer of 244 
                        do. if. (1 = 127 143 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y     NB. Is the second integer between 128 and 143 and the third and fourth integers between 128 and 191
                                do. < 3 u: 9 u: y{a.                                                  NB. If 4 integers and other conditions are met then create the ucp 
                                else. <65533 end.                                                     NB. If 4 integers and other conditions are not met then an error 
                        else. if. (4 < # y)                                                           NB. Check for more than 4 integers  
                                  do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533                         NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error 
                                  else. <65533 end. end.                                              NB. Less than 4 integers in the encoding is an error

 case.   10     do. if. (1=# y)                 NB. Check to see if the encoding is 1 integer as it should be with lead integer between 245 and 16bd7ff  
                        do. < y                 NB. If single integer then it is a valid ucp 
                        else. ({. y);65533 end. NB. if encoding is more than one then lead integer is ucp and trailing integers are an error
 
 case.   11     do. if. (2=# y)                                                          NB. Check to see if the encoding is 2 integers as it should be with lead integer between 16bd800 and 16bdbff
                        do. if.  (16bdbff < 1 { y)                                       NB. Is the second integer greater than 16bdbff (indicates second integer of surrogate pair)
                                 do.  < 16b10000 + ,@:(6&}."1)&.#: y                     NB. If 2 integers and other conditions are met then create the ucp
                                 else. <65533  end.                                      NB. If 2 integers and other conditions are not met then an error   
                        else. if. (2<# y)                                                NB. Check for more than 2 integers
                                  do. t;(65533 ~: t=.; ucp (9 u: 2{.y)) # 65533          NB. If more than 2 integers create a ucp from first 2 if you can and append an error for the extra trailing values, otherwise just an error 
                                  else. <65533  end. end.                                NB. Less than 2 integers in the encoding is an error 


 case.   13     do. if. (1=# y)                 NB. Check to see if the encoding is 1 integer as it should be with lead integer between 16be000 and 16b10ffff
                        do. < y                 NB. If single integer then it is a valid ucp
                        else. ({. y);65533 end. NB. Lead integer is valid Unicode code point - trailing integers are invalid 
 end.
)
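The utf-8 cases in check amount to a table: the lead integer fixes the number of integers in the encoding and the range allowed for the second integer, while any later integers must lie between 128 and 191. A table-driven Python sketch of the same well-formedness rules is given below as an aid to reading the J code. The names RULES and is_well_formed are hypothetical, and the sketch covers only the utf-8 cases, not the surrogate pair or single code point cases.

```python
# (lead low, lead high, second low, second high, total length), all in decimal
RULES = [
    (194, 223, 128, 191, 2),
    (224, 224, 160, 191, 3),
    (225, 236, 128, 191, 3),
    (237, 237, 128, 159, 3),
    (238, 239, 128, 191, 3),
    (240, 240, 144, 191, 4),
    (241, 243, 128, 191, 4),
    (244, 244, 128, 143, 4),
]


def is_well_formed(seq):
    """True if seq, a list of integers, is exactly one well-formed utf-8 encoding."""
    if not seq:
        return False
    lead = seq[0]
    if 0 <= lead <= 127:
        return len(seq) == 1
    for low, high, second_low, second_high, length in RULES:
        if low <= lead <= high:
            # The length test comes first, so seq[1] is known to exist.
            return (len(seq) == length
                    and second_low <= seq[1] <= second_high
                    and all(128 <= n <= 191 for n in seq[2:]))
    return False  # 128 to 193 and anything above 244 cannot lead


print(is_well_formed([240, 159, 152, 128]))  # True
print(is_well_formed([224, 159, 157]))       # False
print(is_well_formed([237, 160, 128]))       # False
```

The second example fails because a lead integer of 224 requires its second integer to be at least 160, and the third because a lead integer of 237 followed by 160 or more would encode a surrogate.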