# Unicode Code Point (UCP)

A **Unicode Code Point** (UCP) is a number in the *code space* covered by the Unicode standard.
This attempts to define a *universal character set* for computers.
The home website for this standard is Unicode.org.

Originally a UCP was a number from 0 to 65535, i.e. representable by a 16-bit code.
Later, further *frames* (called *planes* in the Unicode standard) were added.
The original code space (0 to 65535), which is `i.2^16`, has been renamed *Frame 0*. The full Unicode code space now extends to 1114111 (`16b10ffff`), covering an additional 16 frames.
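
As a minimal sketch of the frame arithmetic, the frame that holds a UCP can be found by integer division by 65536. The verb name `frame` is illustrative only; it is not part of J or its standard library.

```j
NB. frame: hypothetical helper, gives the 65536-wide frame holding each UCP
frame=: 3 : '<. y % 65536'

frame 960 128512 1114111    NB. 0 1 16
```

Frame 0 holds π (960), the emoji at 128512 falls in Frame 1, and the last code point 1114111 falls in Frame 16.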

J has three datatypes for UCPs:

- *unicode4* type, which can store any UCP in a single code unit.
- *unicode* type, which covers Frame 0 unicode characters with single code units and other UCPs with surrogate pairs.
- *literal* type, which can also display unicode characters using utf-8 encoding, producing between 1 and 4 code units (integers ranging from 0 to 255) that specify the UCP, e.g. `240 159 152 128 { a.`

Byte-precision characters are still available to store ASCII characters and general bytes for interacting with external hardware and software.
We will call a noun with one of the extended precisions a *unicode*, just as we call a noun with the simple 8-bit precision a byte.

The Unicode.org convention for a UCP is to show it **not** as a decimal number, e.g. `960`
but as a string based on its hexadecimal representation, viz. `U+03C0`.
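
The conversion from a decimal UCP to the `U+` notation can be sketched in J as follows. The verb name `ucpname` is an assumption made for illustration; it pads to at least four hexadecimal digits, as the Unicode.org convention does.

```j
NB. ucpname: hypothetical helper, formats one UCP in U+ notation
ucpname=: 3 : 0
d=. 16 #.^:_1 y                                  NB. base-16 digits of the code point
'U+' , '0123456789ABCDEF' {~ (- 4 >. # d) {. d   NB. left-pad with zeros to 4 digits
)

ucpname 960       NB. U+03C0
ucpname 128512    NB. U+1F600
```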

`U+03C0` is the UCP for the symbol **pi** (π) as you'll find it in most mathematical texts on the web.
It also happens to be the Greek letter π -- but this double-usage cannot be taken for granted.
For instance, it is not true for the engineering symbol µ -- which is not the Greek letter μ but a special character from a heritage 8-bit character set, now called **Latin 1**.

## To look up a given UCP at Unicode.org

Suppose you want to look up the UCP `U+03C0` at Unicode.org, to find its glyph,
plus the *writing system* or *character set* it belongs to.

Go to the webpage titled Unicode 6.3 Character Code Charts.
Near the top you'll see a field labelled **Find chart by hex code:**

Enter the hex code `03C0` (upper- or lowercase will do) and click **Go** (or press Enter).
This will display a choice of links. Click the first, which (currently) downloads a file: `U0370.pdf`
titled **Greek and Coptic / Range: 0370–03FF**.
There you will discover π (`U+03C0`) in column `03C`, row `0`.

## Using a given UCP in a J noun

Once you've discovered your symbol, and have opened its code chart `U0370.pdf`, you will find that the glyphs as displayed can be copy/pasted into other documents, including the J session (IJX) or a J window (IJS).

For instance you can paste your copied symbol into a J string to represent the well-known mathematical formula for the circumference of a circle: *C = 2 π r*

```
] z=: 'C=2πr'
C=2πr
datatype z NB. shows the precision of z.
literal
NB. ..."literal" as returned by stdlib verb: datatype means "byte".
$z
6
```

**WARNING:**
As you see above, `z` does not automatically become a unicode simply because the symbol π has been pasted into it.
Rather it stays as a list of bytes, just as it would if you omitted π.

Notice also that `z` contains 6 atoms, not 5, as you'd expect from counting the glyphs in the formula.
The reason is that π occupies **two** byte atoms, not one.
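
As a rough sketch, the number of glyphs in a well-formed utf-8 byte list can be recovered by counting the bytes that are not continuation bytes (values 128 to 191). The verb name `nglyphs` is hypothetical, and the sketch assumes its argument is well-formed utf-8.

```j
NB. nglyphs: hypothetical helper, counts characters in a utf-8 literal
nglyphs=: 3 : 0
b=. 3 u: y                      NB. byte values of the literal
+/ -. (b >: 128) *. b <: 191    NB. count bytes that are not continuation bytes
)

nglyphs 'C=2' , (207 128 { a.) , 'r'    NB. 5, although the list holds 6 bytes
```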

```
3{.z
C=2
4{.z
C=2οΏ½
5{.z
C=2Ο
6{.z
C=2Οr
```

But how can a non-ASCII (or non-Latin 1) symbol such as π be stored as a list of bytes?
The answer is, by encoding the UCP as a list of bytes using the utf-8 standard.
The J session, and the IJS window, **always** use this standard when displaying a unicode symbol of the literal type.

A utf-8 string is another way of representing a UCP.
Each ASCII character resides in just 1 byte, but a UCP outside the ASCII code space requires from 1 to 3 additional bytes.
The symbol π happens to occupy 2 bytes.
You can see this clearly if you box each atom of `z` ...

```
<"0 z
+-+-+-+-+-+-+
|C|=|2|οΏ½|οΏ½|r|
+-+-+-+-+-+-+
```

Bug: an invalid UTF-8 sequence (viz. οΏ½ here) corrupts the box structure.

How can J distinguish a utf-8 encoded symbol (**u-symbol**) from an ordinary ASCII character? ASCII characters are encoded with a single byte, whereas characters beyond the range of ASCII are represented by multiple bytes.

The utf-8 standard ensures that if the first byte code of the u-symbol lies within the ASCII code-space, viz. by holding a value less than 128(`16b80`), then it will only require a single byte to encode the code unit. If the first code unit is from 194 (`16bc2`) to 223 (`16bdf`), that means there will be a second code unit in the range of 128 (`16b80`) to 191 (`16bbf`). If the first code unit is in the range of 224 (`16be0`) to 239 (`16bef`) then there will be two code units following it and if the code unit is in the range of 240 (`16bf0`) to 244 (`16bf4`) then there will be three code units following. Notice that the possible ranges for the first code units and the trailing code units are all disjoint, so that missing code units can be immediately detected - either by not having a valid lead code unit or by not having the right number of trailing code units. Encodings that are not well-formed are shown with a placeholder: the *non-displayable character:* οΏ½.
Thus when individually boxed, or extracted using `{` or `{.` , the first character (and often all the characters) of a u-symbol appear as οΏ½.

```
8 u: 65
A
3 u: 8 u: 65 NB. ASCII equivalent
65
datatype 8 u: 65
literal
8 u: 295
ħ
3 u: 8 u: 295
196 167
datatype 8 u: 295
literal
8 u: 3101
ఝ
3 u: 8 u: 3101
224 176 157
datatype 8 u: 3101
literal
8 u: 128512
😀
3 u: 8 u: 128512
240 159 152 128
datatype 8 u: 128512
literal
```
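
The two-byte results shown above can be reproduced by arithmetic. This is only a sketch for UCPs from 128 to 2047; the verb name `enc2` is hypothetical, and longer encodings follow the same pattern with more 6-bit groups.

```j
NB. enc2: hypothetical helper, utf-8 bytes of a UCP in the range 128 to 2047
NB. lead byte = 192 plus the high bits, trailing byte = 128 plus the low 6 bits
enc2=: 3 : '(192 + <. y % 64) , 128 + 64 | y'

enc2 295    NB. 196 167, the same as 3 u: 8 u: 295
enc2 960    NB. 207 128
```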

If you want to make a unicode noun, not bytes, to hold the u-symbol π,
then you must explicitly convert the (utf-8 encoded) bytes to a unicode using the *J primitive* (`u:`),
together with the appropriate `x`-argument, in this case `7`.

```
] zz=: 7 u: z
C=2πr
$zz
5
datatype zz
unicode
```

Notice that `zz` now consists of 5 atoms, as you'd originally hoped for, π being represented by a single atom.
You can see this clearly if you box each atom of `zz` ...

```
<"0 zz
+-+-+-+-+-+
|C|=|2|π|r|
+-+-+-+-+-+
```

Just as numbers having different precisions can be combined under addition, etc.,
the result having the highest of the precisions,
so unicode and bytes can be combined using (`,`). Thus:

```
] pi=: u: 960
π
datatype pi
unicode
datatype each 'C=2' ; 'r'
+-------+-------+
|literal|literal|
+-------+-------+
NB. ..."literal" as returned by stdlib verb: datatype means "byte"
] zzz=: 'C=2' , pi , 'r'
C=2πr
datatype zzz
unicode
```

## Surrogate Pairs in unicode and unicode4

First, some background. The Unicode standard has a codespace of 0 to 1114111 (`16b10ffff`) with a gap from 55296 (`16bd800`) to 57343 (`16bdfff`) (which is reserved for the surrogate pairs).

This means that only codepoints from 0 to 55295 (`16bd7ff`) and 57344 (`16be000`) to 1114111 (`16b10ffff`) can represent characters.
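
A minimal sketch of this range test in J follows. The verb name `isscalar` is an assumption for illustration; it returns 1 for integers that can represent characters and 0 for surrogates or out-of-range values.

```j
NB. isscalar: hypothetical helper, 1 if y is in the codespace and not a surrogate
isscalar=: 3 : '(y >: 0) *. (y <: 16b10ffff) *. -. (y >: 16bd800) *. y <: 16bdfff'

isscalar 0 55295 55296 57343 57344 1114111 1114112    NB. 1 1 0 0 1 1 0
```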

Three common encoding schemes for Unicode are utf-8, utf-16 and utf-32 (the number indicating the number of bits in the code unit).

The encoding utf-32 has enough bits to represent all of the Unicode code points as single integers, but J's unicode4 type is just an approximation of utf-32, since unicode4 encodes integers between 55296 (`16bd800`) and 57343 (`16bdfff`), while utf-32 does not.

This is convenient because it means that unicode4 can work with surrogate pairs the same way as utf-16 (corresponding to the J unicode type), but it also means that there are times that the unicode4 representation could have a two integer encoding when utf-32 would always be a one integer encoding.

Also, unlike utf-32, J unicode4 accepts integers greater than 1114111 (`16b10ffff`), although the results have no meaning, as there are no characters associated with these code points.

```
9 u: 55357 56832 NB. A surrogate pair representing the happy face emoji
😀
$ 9 u: 55357 56832 NB. utf-32 would always be one integer, unicode4 allows two
2
3 u: 9 u: 55357 56832
55357 56832 NB. Keeps result as a surrogate pair
datatype 9 u: 55357 56832
unicode4
9 u: 128512 NB. Proper utf-32 for 😀
😀
$ 9 u: 128512 NB. J unicode4 returns an atom (shape is empty) for a single character
datatype 9 u: 128512 NB. Unicode4 type
unicode4
```

So, why have surrogate pairs? That concept is motivated by utf-16. In order to cover the entire codespace up to 1114111 (`16b10ffff`) by using at most two code units, utf-16 uses surrogate pairs, integers ranging from 55296 (`16bd800`) to 57343 (`16bdfff`).

The first integer of the pair is a value from 55296 (`16bd800`) to 56319 (`16bdbff`) and the second integer ranges from 56320 (`16bdc00`) to 57343 (`16bdfff`).

The ranges of the first and second integers are disjoint, providing an encoding scheme that can easily validate the surrogate pair.

The 1024 x 1024 possible values of surrogate pairs allow a mapping of the code points from `U+10000` to `U+10ffff` that would not be within reach of a single 16 bit code unit, but is covered when using two 16 bit code units.
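
The mapping can be sketched arithmetically: subtract `16b10000`, then split the remaining 20-bit value into two 10-bit halves. The verb names `tosurr` and `fromsurr` are hypothetical, and the sketch assumes valid input (a code point from `16b10000` to `16b10ffff`, or a well-formed pair).

```j
NB. tosurr: hypothetical helper, code point above 16bffff to its surrogate pair
tosurr=: 3 : 0
v=. y - 16b10000                                  NB. 20-bit offset
(16bd800 + <. v % 1024) , 16bdc00 + 1024 | v      NB. high 10 bits, low 10 bits
)

NB. fromsurr: hypothetical helper, surrogate pair back to its code point
fromsurr=: 3 : 0
'hi lo'=. y
16b10000 + (1024 * hi - 16bd800) + lo - 16bdc00
)

tosurr 128512          NB. 55357 56832
fromsurr 55357 56832   NB. 128512
```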

```
3 u: 7 u: 16bffff NB. Top of range for one 16 bit code unit
65535
3 u: 7 u: 16b10000 NB. Maps to surrogate pairs
55296 56320
7 u: 128512 NB. Using our happy face emoji example
😀
$ 7 u: 128512 NB. For J type unicode, a conversion to a surrogate pair is required
2
3 u: 7 u: 128512
55357 56832
datatype 7 u: 128512
unicode
7 u: 55357 56832 NB. Surrogate pairs are also accepted as input directly
😀
$ 7 u: 55357 56832
2
3 u: 7 u: 55357 56832
55357 56832
datatype 7 u: 55357 56832
unicode
7 u: 55357 NB. A single component of a surrogate pair is an error
οΏ½οΏ½οΏ½
7 u: 3101 NB. Characters with encodings below 16b10000 are represented by one integer encoding
ΰ°
$ 7 u: 3101 NB. J unicode type always returns lists unlike J unicode4 (see $ 9 u: 128512 result above)
1
3 u: 7 u: 3101
3101
datatype 7 u: 3101
unicode
```

## Verbs to convert literal, unicode and unicode4 encodings to valid Unicode code points

### utfbox

`utfbox` is a monadic verb that returns the boxed encodings given any literal, unicode, unicode4, integer or binary
argument. It does not do any error checking or Unicode code point verification, but it returns the boxed encodings that the
`ucp` verb would evaluate.

```
utfbox=: ((1 (0) } (128&> +. (193&< *. 16bdbff&>) +. 16bdfff&< )) < ;. 1 ] ) "1 @: ((3&u:)^:(1 4 -.@e.~ 3!:0))
utfbox 240 160 190 190 75 55357 56832 236 190 190 128512 3101
┌───────────────┬──┬───────────┬───────────┬──────┬────┐
│240 160 190 190│75│55357 56832│236 190 190│128512│3101│
└───────────────┴──┴───────────┴───────────┴──────┴────┘
utfbox 'ðK😀ఝ'
┌───────┬──┬───────────────┬───────────┐
│195 176│75│240 159 152 128│224 176 157│
└───────┴──┴───────────────┴───────────┘
```

### ucp

`ucp` is a monadic verb that returns the Unicode code points for any literal, unicode, unicode4, integer or binary argument.
It is useful because, after conversion to Unicode code points, every literal and unicode character is
represented by a single integer. This makes operations such as Shape `$` much more consistent, because the extra
integers created by the utf-8 and utf-16 encodings no longer need to be taken into account.
When manipulating these code point values it is not possible to break in the middle of
an encoding, which eliminates the non-displayable characters created by rows ending in the
middle of a well-formed encoding.

```
3 3 $ ": 9 u: 128512 NB. Display of literal in 3 X 3 Matrix
οΏ½οΏ½οΏ½
οΏ½οΏ½οΏ½
οΏ½οΏ½οΏ½
3 u: 3 3 $ ": 9 u: 128512 NB. Utf-8 encoding of literal in 3 X 3 Matrix
240 159 152
128 240 159
152 128 240
3 3 $ ucp ": 9 u: 128512 NB. Utf-8 converted to Unicode code point
128512 128512 128512
128512 128512 128512
128512 128512 128512
9 u:"1 [ 3 3 $ ucp ": 9 u: 128512 NB. Utf-8 converted to Unicode code point then unicode4
😀😀😀
😀😀😀
😀😀😀
3 u: 7 u: 128512
55357 56832 NB. utf-16 encoding (surrogate pair)
3 u: ": 9 u: 128512
240 159 152 128 NB. utf-8 encoding
ucp ": 9 u: 128512
128512
ucp 3 u: 7 u: 128512
128512
ucp 240 159 152 128 NB. also allows direct integer entry of encoding
128512
```

This also allows unicode4 in J, which supports surrogate pairs, to be converted into true Unicode code points which can only be single integers.

The ucp verb first establishes that the arguments are either literal, unicode, unicode4, integer or
binary types. If they are not then the message:
`'Valid arguments must be integer, literal, unicode or unicode4.'` is returned.

If the arguments are literal, unicode or unicode4 then `3&u:` is applied to return numerical representation.
Binary and integer types are already in that form so they are passed straight through.

The integers are boxed a row at a time, then opened and processed through `utf` separately to avoid issues
with padding. The final result is a rank 1 list of valid unicode code points, where any encodings that are
not well-formed are replaced with `65533` (the error symbol).

```
ucp=: ('Valid arguments must be integer, literal, unicode or unicode4.'"_) ` (; @:(utf each @: <"1 @: ((3&u:)^:(1 4 -.@e.~ 3!:0)))) @. (1 2 4 131072 262144 e.~ 3!:0 )
```

### utf

The `utf` monadic verb partitions its argument so that any integer between `128` and `193`, or between `16bdbff` and `16bdfff` is
marked with a `0` and all other integers are marked with `1`'s indicating the start of an encoding string. The first position is forced to be
a `1` since it is the start of the line. The partitions are done on the `1` values and this separates the
string of integers into their appropriate encodings. These partitions are then fed to the `check` verb
which returns the Unicode code point if they are well-formed or error encodings if they are not.

```
utf=: ; @: ; @: ((1 (0) } (128&> +. (193&< *. 16bdbff&>) +. 16bdfff&< )) check ;. 1 ] )
```
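
To make the partitioning step concrete, here is a sketch that computes the same start-of-encoding mask as an explicit verb and applies it with Cut (`;.1`). The name `startmask` is hypothetical; it restates the test used inside `utf` and does not replace it.

```j
NB. startmask: hypothetical helper, 1 where an encoding starts, 0 on trailing integers
startmask=: 3 : 0
m=. (y < 128) +. ((y > 193) *. y < 16bdbff) +. y > 16bdfff
1 (0) } m                        NB. the first position always starts an encoding
)

b=: 67 207 128 114               NB. utf-8 bytes of C, pi, r
startmask b                      NB. 1 1 0 1
(startmask b) <;.1 b             NB. boxes: 67, then 207 128, then 114
```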

### check

`check` is a monadic verb that takes the individual encodings and returns the unicode code point if it is a well-formed
encoding and `65533` if it is not well-formed.

The process is based on the value of the first integer in the encoding, as this determines the allowable
trailing integers. Results are returned in boxed form, since an error can be represented as a valid
Unicode code point followed by an error code. Boxing avoids padding results with `0`s, which could be confusing since `0` is also a valid
Unicode code point.

The process is described line by line for the different cases, as conversion between J unicode types and Unicode code point requires careful inspection.

```
check=: 3 : 0
select. 127 193 223 224 236 237 239 240 243 244 16bd7ff 16bdbff 16bdfff 16b10ffff I.{.y
case. 0 do. if. (0 <: y) NB. Single integer less than 128 is a valid ucp if it is not negative
do. if. (1 = # y) NB. Check if encoding is one integer
do. < y NB. Encoding is one integer long and is valid ucp
else. ({. y) ; 65533 end. NB. First integer is valid 0-127, trailing integers are error
else. <65533 end. NB. Negative integer is always an error
case. 1;12;14 do. <65533 NB. lead integer cannot be (between 128 and 193) or (between 16bdbff and 16bdfff) or (greater than 16b10ffff) - error is returned
case. 2 do. if. (2 = # y) NB. Check to see if the encoding is 2 integers as it should be with lead integer between 194 and 223
do. if. (1 = 127 191 I. {:y ) NB. Is the second integer between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 2 integers and the second integer is between 128 and 191 then create the ucp
else. <65533 end. NB. If 2 integers and the second integer is not between 128 and 191 then error
else. if. (2 < # y) NB. Check for more than 2 integers
do. t;(65533 ~: t=.; ucp (2{.y){a.) # 65533 NB. If more than 2 integers create a ucp from first two if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 2 integers in the encoding is an error
case. 3 do. if. (3 = # y) NB. Check to see if the encoding is 3 integers as it should be with lead integer of 224
do. if. (1 = 159 191 I. 1{y) *. 1 = 127 191 I. {:y NB. Is the second integer between 160 and 191 and the third integer between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 3 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 3 integers and other conditions are not met then an error
else. if. (3 < # y) NB. Check for more than 3 integers
do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533 NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 3 integers in the encoding is an error
case. 4;6 do. if. (3 = # y) NB. Check to see if the encoding is 3 integers as it should be with lead integers between 225 - 236 or 238 - 239
do. if. (1 = 127 191 I. 1{y) *. 1 = 127 191 I. {:y NB. Are the second and third integers between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 3 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 3 integers and other conditions are not met then an error
else. if. (3 < # y) NB. Check for more than 3 integers
do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533 NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 3 integers in the encoding is an error
case. 5 do. if. (3 = # y) NB. Check to see if the encoding is 3 integers as it should be with lead integer of 237
do. if. (1 = 127 159 I. 1{y) *. 1 = 127 191 I. {:y NB. Is the second integer between 128 and 159 and the third integer between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 3 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 3 integers and other conditions are not met then an error
else. if. (3 < # y) NB. Check for more than 3 integers
do. t;(65533 ~: t=.; ucp (3{.y){a.) # 65533 NB. If more than 3 integers create a ucp from first 3 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 3 integers in the encoding is an error
case. 7 do. if. (4 = # y) NB. Check to see if the encoding is 4 integers as it should be with lead integer of 240
do. if. (1 = 143 191 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y NB. Is the second integer between 144 and 191 and the third and fourth integers between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 4 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 4 integers and other conditions are not met then an error
else. if. (4 < # y) NB. Check for more than 4 integers
do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533 NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 4 integers in the encoding is an error
case. 8 do. if. (4 = # y) NB. Check to see if the encoding is 4 integers as it should be with lead integer between 241 -243
do. if. (1 = 127 191 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y NB. Are the second, third and fourth integers between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 4 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 4 integers and other conditions are not met then an error
else. if. (4 < # y) NB. Check for more than 4 integers
do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533 NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 4 integers in the encoding is an error
case. 9 do. if. (4 = # y) NB. Check to see if the encoding is 4 integers as it should be with lead integer of 244
do. if. (1 = 127 143 I. 1{y) *. (1 = 127 191 I. 2{y) *.1 = 127 191 I. {:y NB. Is the second integer between 128 and 143 and the third and fourth integers between 128 and 191
do. < 3 u: 9 u: y{a. NB. If 4 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 4 integers and other conditions are not met then an error
else. if. (4 < # y) NB. Check for more than 4 integers
do. t;(65533 ~: t=.; ucp (4{.y){a.) # 65533 NB. If more than 4 integers create a ucp from first 4 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 4 integers in the encoding is an error
case. 10 do. if. (1=# y) NB. Check to see if the encoding is 1 integer as it should be with lead integer between 245 and 16bd7ff
do. < y NB. If single integer then it is a valid ucp
else. ({. y);65533 end. NB. if encoding is more than one then lead integer is ucp and trailing integers are an error
case. 11 do. if. (2=# y) NB. Check to see if the encoding is 2 integers as it should be with lead integer between 16bd800 and 16bdbff
do. if. (16bdbff < 1 { y) NB. Is the second integer greater than 16bdbff (indicates second integer of surrogate pair)
do. < 16b10000 + ,@:(6&}."1)&.#: y NB. If 2 integers and other conditions are met then create the ucp
else. <65533 end. NB. If 2 integers and other conditions are not met then an error
else. if. (2<# y) NB. Check for more than 2 integers
do. t;(65533 ~: t=.; ucp (9 u: 2{.y)) # 65533 NB. If more than 2 integers create a ucp from first 2 if you can and append an error for the extra trailing values, otherwise just an error
else. <65533 end. end. NB. Less than 2 integers in the encoding is an error
case. 13 do. if. (1=# y) NB. Check to see if the encoding is 1 integer as it should be with lead integer between 16be000 and 16b10ffff
do. < y NB. If single integer then it is a valid ucp
else. ({. y);65533 end. NB. Lead integer is valid Unicode code point - trailing integers are invalid
end.
)
```