AIX Unicode Regular Expression Support

IBM locale definitions were used to build all locale objects until 2003, when the Unicode Consortium developed the Common Locale Data Repository (CLDR), which is used to build the new UTF-8 locales. The IBM-defined locales (i.e., the IBM legacy locales) are named language[_territory][.codeset], with the language tag written entirely in upper case.
* For example: EN_US (EN_US.UTF-8).

The new CLDR-based AIX UTF-8 locales are built from CLDR source. They are also named language[_territory][.codeset], but the language tag is written entirely in lower case.

* For example: en_US.UTF-8.

There is no short name alias for the CLDR UTF-8 locales.
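
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the convention described in this post, not an AIX API: the `LOCALE_NAME` pattern and the `classify_locale` helper are hypothetical names, and the classification relies solely on the case of the language tag and on the presence of an explicit UTF-8 codeset.

```python
import re

# Hypothetical pattern for language[_territory][.codeset], as described above.
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?$"
)


def classify_locale(name):
    """Heuristic only: classify a locale name by the case of its language tag."""
    m = LOCALE_NAME.match(name)
    if m is None:
        return "other"
    language = m.group("language")
    codeset = m.group("codeset") or ""
    # Upper case language tag: IBM legacy naming (for example, EN_US).
    if language.isupper():
        return "legacy"
    # Lower case language tag with an explicit UTF-8 codeset: CLDR naming.
    # CLDR UTF-8 locales have no short alias, so the codeset is required here.
    if language.islower() and codeset.upper() == "UTF-8":
        return "cldr"
    return "other"
```

With this sketch, `classify_locale("EN_US.UTF-8")` and `classify_locale("EN_US")` are both treated as legacy names, while `classify_locale("en_US.UTF-8")` is treated as a CLDR name.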

The data in CLDR is gathered through the Unicode Consortium's Survey Tool: http://cldr.unicode.org/index/survey-tool

Contributors from Unicode Consortium members, other organizations, and the public at large are invited to review the data for their languages and countries and to propose new translations of terms or modifications.

Locale behavior (for example, collation and date formats) varies between the older IBM locale definitions and the CLDR definitions.

Some open source products hard-code [a-z] and [A-Z] as the English lower case letters a to z (code points 97-122) and upper case letters A to Z (code points 65-90), so only ASCII characters are matched.

This does not conform to the Open Group definition of regular expressions, which states:
https://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html

*A range expression represents the set of collating elements that fall between two elements in the current collation sequence, inclusively. It is expressed as the starting point and the ending point separated by a hyphen (-). Range expressions must not be used in portable applications because their behavior is dependent on the collating sequence. Ranges will be treated according to the current collating sequence, and include such characters that fall within the range based on that collating sequence, regardless of character values. This, however, means that the interpretation will differ depending on collating sequence. If, for instance, one collating sequence defines ä as a variant of a, while another defines it as a letter following z, then the expression [ä-z] is valid in the first language and invalid in the second.*

Hard-coding these values omits the accented lower case a characters (for example, ä), which is not correct behavior according to the collation standards.
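
The effect of a hard-coded ASCII range can be shown with a small Python sketch. It uses Python's standard `re` module purely as an illustration; the sample string is an assumption. Note that `re` interprets ranges by code point rather than by the locale's collating sequence, so it does not implement the collation-based ranges quoted above either. The `[^\W\d_]` class is a common way to approximate "any Unicode letter" in that module.

```python
import re

# Sample text: a, ä (U+00E4), z, é (U+00E9), Z, Ä (U+00C4).
text = "a \u00e4 z \u00e9 Z \u00c4"

# Hard-coded ASCII ranges: only code points 97-122 and 65-90 can match.
ascii_letters = re.findall(r"[a-zA-Z]", text)

# Unicode-aware approximation: word characters minus digits and underscore.
unicode_letters = re.findall(r"[^\W\d_]", text)

print(ascii_letters)
print(unicode_letters)
```

The first list contains only a, z, and Z. The second also contains ä, é, and Ä, which is what a user working in a non-English locale would typically expect from a "letters" class.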

Some existing Unicode regex engines:
- Perl
- PCRE (Perl Compatible Regular Expressions)
- Java

This enhancement request is for a native AIX Unicode regular expression engine that provides Level 1 conformance as described in Unicode® Technical Standard #18, Unicode Regular Expressions:
https://unicode.org/reports/tr18/
RL1.1 Hex Notation
RL1.2 Properties
RL1.2a Compatibility Properties
RL1.3 Subtraction and Intersection
RL1.4 Simple Word Boundaries
RL1.5 Simple Loose Matches
RL1.6 Line Boundaries
RL1.7 Supplementary Code Points
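
To make a few of these Level 1 items concrete, here is a rough sketch that probes them with Python's standard `re` module. Python is used only as a convenient illustration and says nothing about how a native AIX engine would behave. The helper names `supports_property_escape`, `is_letter`, and `level1_checks` are hypothetical, and the checks cover only some of the requirements (RL1.3 and RL1.6 are not probed).

```python
import re
import unicodedata


def supports_property_escape():
    """Probe whether the engine accepts the \\p{L} property syntax (RL1.2)."""
    try:
        re.compile(r"\p{L}")
        return True
    except re.error:
        return False


def is_letter(ch):
    """Emulate a General_Category=Letter test with unicodedata."""
    return unicodedata.category(ch).startswith("L")


def level1_checks():
    emoji = "\U0001F600"  # a supplementary code point (outside the BMP)
    return {
        # RL1.1: hex notation for a code point inside the pattern.
        "RL1.1": re.fullmatch(r"\u00e4", "\u00e4") is not None,
        # RL1.2: native property escapes, probed rather than assumed.
        "RL1.2": supports_property_escape(),
        # RL1.4: word boundaries around words containing non-ASCII letters.
        "RL1.4": re.findall(r"\b\w+\b", "na\u00efve caf\u00e9")
        == ["na\u00efve", "caf\u00e9"],
        # RL1.5: simple case-insensitive match of ä against Ä.
        "RL1.5": re.fullmatch(r"\u00e4", "\u00c4", re.IGNORECASE) is not None,
        # RL1.7: a supplementary code point matched as a single character.
        "RL1.7": re.fullmatch(r".", emoji) is not None,
    }


if __name__ == "__main__":
    for requirement, ok in level1_checks().items():
        print(requirement, ok)
    print("is_letter('\u00e4'):", is_letter("\u00e4"))
```

Where an engine lacks native property support, a helper along the lines of `is_letter` is one possible workaround, but it has to be applied outside the pattern, which is part of why native support for RL1.2 is being requested.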

Idea priority

Medium

Guest | Oct 6, 2020

This shall be a good feature.
