IBM locale definitions were used to build all locale objects until 2003, when the Unicode Consortium developed the Common Locale Data Repository (CLDR) for building new UTF-8 locales. Those IBM-defined locales (i.e., the IBM legacy locales) were named language[_territory][.codeset], with the language tag written entirely in upper case.
* For example: EN_US (EN_US.UTF-8).
The new CLDR-based AIX UTF-8 locales are built from CLDR source. They are also named language[_territory][.codeset], but the language tag is written entirely in lower case.
* For example: en_US.UTF-8.
There is no short name alias for the CLDR UTF-8 locales.
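The naming convention above can be sketched as a small classifier. This is only an illustration based on the names described in this request: `classify_locale` and its return labels are hypothetical, and the function inspects the name only. It does not query the locales actually installed on a system.

```python
import re

# Hypothetical pattern for language[_territory][.codeset], as described above.
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:_(?P<territory>[A-Za-z]{2}))?"
    r"(?:\.(?P<codeset>[A-Za-z0-9-]+))?$"
)

def classify_locale(name):
    """Guess the locale family from the case of the language tag (assumption)."""
    m = LOCALE_NAME.match(name)
    if m is None:
        return "other"
    language = m.group("language")
    codeset = m.group("codeset")
    # Legacy IBM UTF-8 locales: upper-case language tag, codeset optional.
    if language.isupper() and codeset in (None, "UTF-8"):
        return "ibm-legacy-utf8"
    # CLDR UTF-8 locales: lower-case language tag, no short alias,
    # so the .UTF-8 codeset must be spelled out.
    if language.islower() and codeset == "UTF-8":
        return "cldr-utf8"
    return "other"

print(classify_locale("EN_US"))        # ibm-legacy-utf8
print(classify_locale("en_US.UTF-8"))  # cldr-utf8
```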
The data in CLDR is gathered through the Unicode Consortium's Survey Tool (http://cldr.unicode.org/index/survey-tool).
Contributors from Unicode Consortium members, other organizations, and the public at large
are invited to review the data for their languages and countries and to propose new translations of terms or modifications.
There are variations in locale behavior (for example, in collation and date formats) between the older IBM locale definitions and the CLDR definitions.
-Some open source products hard code [a-z] and [A-Z] as the English lower case letters a to z (code points 97-122) and upper case letters A to Z (code points 65-90), so only ASCII characters are matched.
This does not conform to the Open Group standard's definition of regular expressions, which states:
https://pubs.opengroup.org/onlinepubs/007908799/xbd/re.html
*A range expression represents the set of collating elements that fall between two elements in the current collation sequence, inclusively. It is expressed as the starting point and the ending point separated by a hyphen (-). Range expressions must not be used in portable applications because their behavior is dependent on the collating sequence. Ranges will be treated according to the current collating sequence, and include such characters that fall within the range based on that collating sequence, regardless of character values. This, however, means that the interpretation will differ depending on collating sequence. If, for instance, one collating sequence defines ä as a variant of a, while another defines it as a letter following z, then the expression [ä-z] is valid in the first language and invalid in the second.*
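Hard coding these values omits characters such as the accented lower case variants of a (for example, ä), even when the current collating sequence places them between a and z. That is not correct behavior under the collation rules quoted above.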
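The effect of a hard-coded range can be reproduced in a few lines. The sketch below uses Python's `re` module, which interprets `[a-z]` strictly by code point rather than by the locale's collating sequence, so it behaves like the hard-coded products described above. The sample string is an assumption chosen for illustration, and `str.islower()` is used only as a stand-in for a Unicode-aware lower case test, not as a collation-based range.

```python
import re

# Sample text (illustrative): two ASCII letters and two accented letters.
text = "a \u00e4 z \u00e9"  # "a ä z é"

# Code-point range 97-122 only: the accented letters are skipped.
ascii_only = re.findall(r"[a-z]", text)

# Unicode-aware lower case test: the accented letters are included.
unicode_lower = [ch for ch in text if ch.islower()]

print(ascii_only)     # ['a', 'z']
print(unicode_lower)  # ['a', 'ä', 'z', 'é']
```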
Several Unicode-aware regex engines already exist:
-Perl
-PCRE (Perl Compatible Regular Expressions)
-Java
This enhancement request is for a native AIX Unicode regular expression engine that provides Level 1 conformance as described in Unicode® Technical Standard #18, Unicode Regular Expressions
(https://unicode.org/reports/tr18/):
RL1.1 Hex Notation
RL1.2 Properties
RL1.2a Compatibility Properties
RL1.3 Subtraction and Intersection
RL1.4 Simple Word Boundaries
RL1.5 Simple Loose Matches
RL1.6 Line Boundaries
RL1.7 Supplementary Code Points
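To give a feel for what some of these requirements mean in practice, here is a small sketch using Python's standard `re` module. It is an illustration only, not a conformance test and not a proposal for the AIX interface. It covers just the four items that `re` handles out of the box; `re` does not support `\p{...}` property syntax (RL1.2), so it is not itself a Level 1 engine. The variable names and sample strings are assumptions.

```python
import re

# RL1.1 Hex notation: \uXXXX escapes name a code point directly.
hex_match = re.fullmatch(r"\u00e4", "\u00e4") is not None

# RL1.4 Simple word boundaries: \b and \w are Unicode-aware for str patterns.
words = re.findall(r"\b\w+\b", "na\u00efve caf\u00e9")

# RL1.5 Simple loose matches: case-insensitive matching beyond ASCII.
loose_match = re.fullmatch(r"\u00e4", "\u00c4", re.IGNORECASE) is not None

# RL1.7 Supplementary code points: a code point above U+FFFF is one unit,
# so a single "." matches it in full.
supplementary_match = re.fullmatch(r".", "\U0001F600") is not None

print(hex_match, words, loose_match, supplementary_match)
```

A native engine would expose the same kinds of behavior through the system regex interface, with matching driven by Unicode data rather than by hard-coded ASCII ranges.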
A native engine with this level of conformance would be a valuable addition to AIX.