Page MenuHome

Fix T71273: Bad encoding of utf-8 for Text objects
ClosedPublic

Authored by Germano Cavalcante (mano-wii) on Nov 5 2019, 10:10 PM.

Details

Summary

BLI_strncpy_wchar_from_utf8 internally assumes wchar_t is 32 bits
which is not the case on windows.

I don't quite understand this area, but analyzing the code,
it seems the intent is to store an array of unicodes and not a
"platform-dependent" utf.

The solution in thie patch is to make and use functions that use uint (BLI_unicode).

Ref T71273

Diff Detail

Repository
rB Blender

Event Timeline

Not my area either but what you call unicode_array is essentially UTF32, so naming isn't great and leaves a rather large amount of ambiguity what type these variables actually are.

i'd rather see BLI_strncpy_utf32_as_utf8 rather than BLI_unicode_array_as_str_utf8 and BLI_UTF32 rather than BLI_unicode

but i'll let @Campbell Barton (campbellbarton) to be the deciding factor here. so hold off on doing any changes based on my feedback.

  • Silence Function definition not found warning

Agree the API should be explicit WRT utf8,16,32... etc.

Since C & C++ have char32_t, we could use this. If it's not defined in some cases it could be typedef'd (probably in BLI_compiler_compat.h, or one of the BLI_string headers).

Examples to uses in C/C++.

If this introduces too many problems supporting different compilers/libc versions, we could define BLI_char32_t which can be replaced by char32_t later, when compiler support has improved.


For API naming:

  • BLI_str_utf8_as_utf32 (was BLI_str_utf8_as_unicode_array)
  • BLI_str_utf32_as_utf8 (was BLI_unicode_array_as_str_utf8)
  • BLI_str_utf32_as_utf8_len (was BLI_unicode_array_utf8_len)

Some existing uses of unicode should be renamed to utf32 as well.

Following the revision:

  • API naming + char32_t use.

We still need to check the support on different compilers/libc versions.

  • Revert some unrelated changes

Couple of things that stick out:

  1. there's a bunch of functions with wchar in their name that take a char_32t type, a better name will have to be picked for those.
  2. for memory allocations sizeof(wchar_t) is still used here and there in the char32_t codepaths which with allocate a too small buffer as a result.
  • Include the header <uchar.h> instead of creating own definitions;
  • Fix other functions that use wchar assuming utf-32.
  • rename some wchar members, params and variables to utf32
  • Fix bad use of sizeof(wchar_t)
  • Fix EditFont Undo
  • Rename wchar_t_step_next and wchar_t_step_prev to cursor_step_next_utf32 and cursor_step_prev_utf32
This revision is now accepted and ready to land.Nov 22 2019, 2:36 AM

Only checked this on Linux, if this works on other platforms - LGTM.


Note that some further refactoring would be good after this commit, eg:

BLI_str_utf8_as_utf32 and BLI_str_utf8_as_unicode are confusing, BLI_str_utf8_as_unicode could be renamed to BLI_str_utf8_as_utf32_codepoint, but that can be handled separately.

  • Fix compilation error on MacOS
  • Missed in last commit
  • Fix unknown type name 'size_t'