Autodetect encoding when reading Files

Vincentvd · 24 January 2024 11:08

Description of the limitation and why it is relevant to address

As a developer, I want the File class to detect the file encoding automatically, so that we don’t have to write our own functions for such a basic and common action.

I think this is relevant for the VIKTOR platform because all developers read files, and often we just call file.getvalue(encoding=None). However, this does not properly detect the encoding of the file.

Submitter proposed design (optional)

We currently use the function below to detect the encoding. It would be nice if encoding=None called this function in the background.

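import chardet

def _get_file_encoding(self, file) -> str:
    """
    Return the detected encoding of the file.
    Raises UnicodeDecodeError if the file cannot be decoded with it.
    """
    # Read the raw bytes and let chardet guess the encoding
    raw_data = file.getvalue_binary()
    encoding_result = chardet.detect(raw_data)
    encoding = encoding_result["encoding"]

    try:
        # Verify that the file really decodes with the detected encoding
        file.getvalue(encoding=encoding)
        return encoding
    except UnicodeDecodeError as err:
        # UnicodeDecodeError takes five arguments; reuse the original details
        raise UnicodeDecodeError(
            err.encoding, err.object, err.start, err.end,
            "The given file is encoded with an unsupported encoding",
        ) from err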
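
For anyone who cannot add chardet as a dependency, here is a rough standard-library sketch of the same idea. It is a different and much cruder technique than chardet's statistical detection: it checks for a byte-order mark and then tries a short list of candidate encodings in order. The names guess_encoding and candidates are hypothetical and are not part of VIKTOR or chardet, and the default candidate list is only an assumption about which encodings are common in practice.

```python
import codecs

# Byte-order marks mapped to a codec that consumes them.
# UTF-32 is checked before UTF-16 because its little-endian BOM
# starts with the same two bytes as the UTF-16 one.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(raw_data: bytes, candidates=("utf-8", "cp1252")) -> str:
    """Return the first encoding that decodes raw_data without error.

    Hypothetical helper: a heuristic, not a real detector.
    """
    # A byte-order mark is the strongest hint available, so check it first.
    for bom, name in _BOMS:
        if raw_data.startswith(bom):
            return name

    # Otherwise try each candidate in order. Strict encodings such as
    # UTF-8 should come first, because permissive single-byte encodings
    # accept almost any input.
    for name in candidates:
        try:
            raw_data.decode(name)
        except UnicodeDecodeError:
            continue
        return name

    raise ValueError("could not decode data with any candidate encoding")
```

With a VIKTOR File this could be combined with the binary getter used above: raw_data = file.getvalue_binary(), then raw_data.decode(guess_encoding(raw_data)). Note that a successful decode only means the bytes are valid in that encoding, not that the encoding is the one the file was written in.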

Current workarounds

Use the custom-made function above.