Determine CCSID depending on content of the file

One of the big challenges when receiving text files via FTP etc. from external sources is determining which CCSID they are encoded in. By default 1252 or 819 is used, but often it should be 1208 (UTF-8) instead, especially for XML files. If the file has a byte order mark (BOM) and the CCSID is the default 1252, then XML-SAX in RPG will fail with invalid characters.

I would like a way to set the CCSID depending on the content of the file, so it could be 1208 (UTF-8), 1200 (UTF-16), 819 (ASCII / ISO-8859-1) etc.

An easy way to implement this could be to enhance CHGATR to allow the value *CONTENT for the CCSID.

CHGATR OBJ('/mydir/myfile.xml') ATR(*CCSID) VALUE(*CONTENT)

Use Case:

Enhance CHGATR so that the CCSID can be determined from the file's content.
If the file contains a byte order mark or byte sequences typical of UTF characters, then the file is probably in a UTF format.

Idea priority

Medium

Guest | Nov 3, 2021

After much consideration and discussion, we are going to decline this request.

The BOM is added to the file by an editor when the file is specifically saved with one. The editor is an application, not an operating system. This request is asking the IBM i operating system to analyze data content in a file and to make a guess as to the CCSID of that data. The IBM i operating system doesn't have any knowledge of where that data originated, how it was created, or even what type of data it might be. The data in a stream file can be created in many ways and could be anything. Only the user really knows what it is and whether it was brought onto the system via some method like FTP or was created on the system. Likewise, only the user knows the correct encoding of that data, and the user should set the CCSID tag on the file appropriately.

Guest | Sep 30, 2020

We will continue to evaluate this request.

Guest | Aug 18, 2020

The CAAC has reviewed this requirement and recommends that IBM view this as a high priority requirement that is important to be addressed. Homogeneous data is becoming more and more the way of the world -- a solution for every operating system will become inevitable. The BOM is bit-wise always the same, no matter what CCSID or otherwise is attached to it so, if present, can be used to help with the algorithm. The three options described below by IBM seem to be good solutions.

Background: The COMMON Americas Advisory Council (CAAC) members have a broad range of experience in working with small and medium-sized IBM i customers. CAAC has a key role in working with IBM i development to help assess the value and impact of individual RFEs on the broader IBM i community, and has therefore reviewed your RFE.

For more information about CAAC, see www.common.org/caac

For more details about CAAC's role with RFEs, see http://www.ibmsystemsmag.com/Blogs/i-Can/May-2017/COMMON-Americas-Advisory-Council-%28CAAC%29-and-RFEs/

Nancy Uthke-Schmucki - CAAC Program Manager

Guest | Jul 30, 2020

There are three RFEs with similar requests (143259, 135926, and 143226), all related to the fact that the CCSID attribute of a file does not reflect the actual contents, which include a BOM that is expected to identify the encoding.

These files are generally created on a platform such as a PC that understands only ASCII and/or ASCII-like CCSIDs such as UTF-16 or UTF-8. It is much easier for that platform and its applications to make determinations about the encoding based on the content of the file without the need for a CCSID attribute. The IBM i does not have that same environment, and the content of any file could be EBCDIC or ASCII or ASCII-like, so the CCSID attribute is extremely important when an application reads/writes data out of/into the file in text mode. The data could be any string of bits and bytes, and we certainly rely on the user/application to inform us of the encoding of that data. What is a BOM in 1208 is something entirely different in 1200, not to mention that 1208 is not the only CCSID that has a BOM defined.
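As a rough illustration of that last point, the following Python sketch (Python is used here only for demonstration and is not part of any of the RFEs) prints the BOM bytes for a few Unicode encodings and shows what the UTF-8 BOM bytes turn into when they are read under a different encoding. The pairing of CCSIDs with Python codec names in the comments (1208 with utf-8, 1200 with utf-16-be, 1252 with cp1252) is an approximation assumed for the example, and the function names are hypothetical.

```python
import codecs

# BOM byte sequences as defined by Python's standard codecs module.
BOMS = {
    "utf-8": codecs.BOM_UTF8,          # EF BB BF (roughly CCSID 1208)
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF    (roughly CCSID 1200)
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
}


def bom_hex_table():
    """Return each BOM as an upper-case hex string, keyed by codec name."""
    return {name: bom.hex().upper() for name, bom in BOMS.items()}


def utf8_bom_read_as_cp1252():
    """Decode the three UTF-8 BOM bytes as Windows-1252 text."""
    return codecs.BOM_UTF8.decode("cp1252")


def utf8_bom_prefix_read_as_utf16_be():
    """Decode the first two UTF-8 BOM bytes as a single UTF-16 BE code unit."""
    return codecs.BOM_UTF8[:2].decode("utf-16-be")


if __name__ == "__main__":
    print(bom_hex_table())
    print(repr(utf8_bom_read_as_cp1252()))
    print(repr(utf8_bom_prefix_read_as_utf16_be()))
```

Read as Windows-1252, the UTF-8 BOM is three ordinary printable characters rather than a marker, which is consistent with the "invalid characters" symptom described in the original request.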

It is important to note that when the CCSID of the file is set correctly there are no problems. As has been noted in at least one of the RFEs, the TYPE command in FTP, the Change Attribute (CHGATR) command, the Qp0lSetAttr() (Set Attributes) API, or the setccsid Qshell utility are options that can be used to set the CCSID for a file.

There are different suggestions in these requests.
- Create a new directory attribute to direct new files created and linked to be assigned the CCSID based on the BOM in the data or inherited from the parent.
- Determine the CCSID when the file is opened based on the data.
- Have an option on the CHGATR command to set the CCSID based on the *CONTENT.

Since all of these RFEs have basically the same goal, 143259 and 135926 are being marked as duplicates and 143226 will be set to Under Consideration. Any further commentary should be put under 143226.

The file system cannot be made to guess at the content, nor can we use a CCSID because the data is 'probably' UTF-8, etc. Only the users know the content of the files. The file system will need to be extremely careful not to change the current behavior for a solution to work. This means that any solution would most certainly require the users to take some steps to get their desired results.

The file system team will consider this request for future development.

Guest | Jul 29, 2020

Due to processing by IBM, this request was reassigned to have the following updated attributes:
Brand - Servers and Systems Software
Product family - Power Systems
Product - IBM i
Component - IFS (Integrated File System) and Servers
Operating system - IBM i
Source - None

For record keeping, the previous attributes were:
Brand - Servers and Systems Software
Product family - Power Systems
Product - IBM i
Component - Languages - CL (Control Language)
Operating system - IBM i
Source - None

Guest | Jun 19, 2020

The RFE posted by Niels could also be a solution. I have now voted for it too.
Perhaps even a combination of his RFE and this RFE would be a powerful solution.

Guest | Jun 19, 2020

Well, several years ago I developed a program that analyses the stream file and returns a CCSID.

First it tests for byte order marks (BOM).
If it starts with x'EFBBBF' then it is CCSID 1208 (UTF-8)
If it starts with x'FEFF' then it is CCSID 1200 (UTF-16 big endian)

If a BOM is not found, then it reads the first 16MB of the file and scans the content for UTF-8 sequences.
1st Byte  2nd Byte  3rd Byte  4th Byte
0xxxxxxx                                <---- ASCII character
110xxxxx  10xxxxxx                      <---- UTF-8
1110xxxx  10xxxxxx  10xxxxxx            <---- UTF-8
11110xxx  10xxxxxx  10xxxxxx  10xxxxxx  <---- UTF-8
If one of these sequences occurs, except for the ASCII character, then we have a UTF-8 file.
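The steps above can be sketched roughly as follows. This is not the program mentioned in this comment, only a minimal Python illustration of the same idea. The function names and the `default` fallback of 1252 are assumptions made for the example; the comment does not say what is returned when neither a BOM nor a UTF-8 sequence is found.

```python
import codecs

SCAN_LIMIT = 16 * 1024 * 1024  # scan only the first 16 MB, as described above


def has_utf8_multibyte(data: bytes) -> bool:
    """Return True if data contains at least one multi-byte sequence that
    matches the lead-byte / continuation-byte patterns in the table above."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: plain ASCII, says nothing
            i += 1
            continue
        if b & 0xE0 == 0xC0:         # 110xxxxx: one continuation byte expected
            need = 1
        elif b & 0xF0 == 0xE0:       # 1110xxxx: two continuation bytes expected
            need = 2
        elif b & 0xF8 == 0xF0:       # 11110xxx: three continuation bytes expected
            need = 3
        else:                        # stray continuation byte or invalid lead
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        if len(tail) == need and all(t & 0xC0 == 0x80 for t in tail):
            return True
        i += 1
    return False


def guess_ccsid(data: bytes, default: int = 1252) -> int:
    """Guess a CCSID: BOM first, then a scan for UTF-8 sequences.
    The default value is an assumption for this sketch."""
    if data.startswith(codecs.BOM_UTF8):      # x'EFBBBF'
        return 1208
    if data.startswith(codecs.BOM_UTF16_BE):  # x'FEFF'
        return 1200
    if has_utf8_multibyte(data[:SCAN_LIMIT]):
        return 1208
    return default


def guess_file_ccsid(path: str, default: int = 1252) -> int:
    """Read at most SCAN_LIMIT bytes from a stream file and guess its CCSID."""
    with open(path, "rb") as f:
        return guess_ccsid(f.read(SCAN_LIMIT), default)
```

A single matching sequence is treated as sufficient here, exactly as described above. That keeps the sketch short, but it also means text in a single-byte CCSID whose bytes happen to form such a pattern would be reported as UTF-8, so the result is a guess and not a guarantee.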

But I would like to have it integrated in the operating system.
Windows can handle it so why shouldn't our favorite platform do the same. :-)

Guest | Jun 18, 2020

I am afraid this will not be easy. Many texts contain the native language as well as quotes or even big extracts in other languages, so automation will become very error-prone.

Guest | Jun 18, 2020

I have posted an RFE with a general solution to your issue here:

http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=143259

Guest | Jun 18, 2020

This one is in the same group:

https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=135926

So maybe a more generic solution would be better.

Guest | Jun 18, 2020

This one goes hand in hand with my request: change the CCSID to 1208 (or Unicode, for that matter) if the file has a BOM, provided the folder that contains the file allows that.

Guest | Jun 18, 2020

You can never guess the CCSID of a file someone is sending you... "probably" isn't something that works in IT.

The sender should use the TYPE C nnnn command to tell you which CCSID his data is encoded in (if it differs from the CHGFTPA setting).

0 MERGED

Let the IFS set the CCSID depending on BOM codes when writing files

Today the CCSID on files in the IFS has no automatic connection to the content, which means that you have to change the CCSID manually with CHGATR or the setccsid command. This is not practical if files are made by FTP or NETSERVER. If you upload a file with FTP...

almost 5 years ago in IBM i / IFS (Integrated File System) and Servers 3 Not under consideration

0 MERGED

Allow UTF-8 with bom to override CCSID attribute for IFS files

Today you can include SQL in PL/SQL Stored procedures, UDTF and compound statements like this; begin include SQL '/prj/sql/NHODATA/VIEWS/KRTSPLV1.sql';end; However if the included SQL file is in UTF-8 with BOM codes you will get this error: SQL St...

over 5 years ago in IBM i / IFS (Integrated File System) and Servers 6 Not under consideration
