View Issue Details

ID: 0000432
Project: LDMud 3.6
Category: General
View Status: public
Last Update: 2019-09-24 10:27
Reporter: iago3
Assigned To: Gnomi
Priority: normal
Severity: feature
Reproducibility: N/A
Status: resolved
Resolution: fixed
Product Version: 3.4.0
Fixed in Version: 3.6.0
Summary: 0000432: Add string efuns with multibyte character support
Description: This is a "parent" bug for multibyte support in string efuns IN GENERAL. The specific efuns and implementations are filed as separate bugs. "Big picture" discussions can go here. My goal is that all text handled and stored by the mudlib should be in one multibyte character set (typically UTF-8), which can then be converted on the fly as it is displayed (see bug#426 for conversion issues).

Here are the design considerations I followed when creating string efuns with multibyte character support. Some of them may be based on faulty assumptions. If so, please correct them and make any required changes to the efuns I've created.

1) "Multibyte" refers to strings encoded in the native multibyte character set specified by the driver host's locale. This may differ from host to host. If a multibyte encoding other than the native one is needed, convert_charset should be used in conjunction with the efun. Sticking to the native charset should make integration with other applications--Perl regular expressions, for example--work transparently. The driver could perhaps offer a configuration option to specify a locale for the driver, but that is not something I'd recommend.

2) I see no reason to create a new "wide character" datatype. A simple array of integers should suffice to represent a wide character string, and provide sufficient values well into the foreseeable future. Direct manipulation of wide characters from within the mudlib should be possible, but limited.

3) Old efuns should only be changed when there would be no significant difference in behavior on old systems--otherwise, new efuns should be created. For example, strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string. capitalize() can be changed because its behavior outside US-ASCII is undefined, and UTF-8 is a superset of US-ASCII. [Is this a safe assumption? We could also make a wcapitalize()...]

4) Multibyte string efuns should allow strings that contain \0 characters, like existing string efuns (this makes the code much more complicated, but more compatible).
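
Considerations 3 and 4 can be illustrated outside LPC. The following is a minimal Python sketch, not driver code: the sample string is made up, and Python's `bytes` type stands in for an LPC string that holds UTF-8 data.

```python
# Illustration only: Python bytes model an LPC string that stores UTF-8 data.
text = "Grüße\0ok"                 # sample text with an embedded \0 character
encoded = text.encode("utf-8")

byte_len = len(encoded)            # what a byte-counting strlen() would report
char_len = len(text)               # what a character-counting efun would report

# bytes.upper() only changes ASCII letters, so multibyte sequences pass through
# untouched. This mirrors the "UTF-8 is a superset of US-ASCII" argument.
ascii_upper = encoded.upper().decode("utf-8")

print(byte_len, char_len, repr(ascii_upper))
```

The embedded \0 is counted like any other character, and the ü and ß stay lowercase, which is the gap a multibyte-aware capitalize() or upper_case() would have to close.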

The efuns I've created are:

int wcslen(string|int*)
Returns the number of multibyte characters in the given multibyte string or array of wide characters (in the case of an array, it's simply the size of the array)

int wcswidth(string|int*)
Returns the number of screen columns the given multibyte string or array of wide characters will occupy. The behavior of this function is a little funny due to the funny behavior of POSIX wcwidth(). I've modified it to be slightly more useful for our purposes. Characters normally reporting a negative column width are assigned zero width. Tab characters, which normally report a width of zero (because their width is variable), are assigned a width of 8.

string wcstombs(int|int*)
Takes a single wide character or array of wide characters and returns a multibyte string. If strings are mixed into the array of wide characters, they will be inserted into the resulting string at those positions, unmodified.

int *mbstowcs(string)
Takes a multibyte string and returns an array of wide characters.

string substr(string,int,int)
Returns a substring of the given multibyte string. The second argument is the start and the third argument is the number of characters. Example: For US-ASCII strings, string[2..3] and substr(string,2,2) would be equivalent. Only the latter is safe to use on multibyte strings.
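
To make the intended semantics of the five efuns concrete, here is a rough Python model of them. It is a sketch under stated assumptions, not the submitted implementation: the encoding is fixed to UTF-8 instead of the host locale, `bytes` stands in for an LPC string, and `char_width` is a hypothetical helper that only approximates wcwidth() using Unicode character properties.

```python
import unicodedata

def mbstowcs(s):
    """Multibyte string (UTF-8 bytes) -> list of wide characters (ints)."""
    return [ord(ch) for ch in s.decode("utf-8")]

def wcstombs(wcs):
    """Wide character or list of them -> multibyte string (UTF-8 bytes)."""
    if isinstance(wcs, int):
        wcs = [wcs]
    out = bytearray()
    for item in wcs:
        if isinstance(item, bytes):
            out += item                       # strings are inserted unmodified
        else:
            out += chr(item).encode("utf-8")
    return bytes(out)

def wcslen(x):
    """Number of characters; for an array it is simply its size."""
    return len(x) if isinstance(x, list) else len(mbstowcs(x))

def char_width(cp):
    """Hypothetical helper: approximate column width of one character."""
    ch = chr(cp)
    if ch == "\t":
        return 8                              # tab is assigned a width of 8
    if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
        return 0                              # control/combining: zero width
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                              # wide East Asian characters
    return 1

def wcswidth(x):
    """Number of screen columns occupied by the string or array."""
    wcs = x if isinstance(x, list) else mbstowcs(x)
    return sum(char_width(cp) for cp in wcs)

def substr(s, start, count):
    """Substring of count characters beginning at character index start."""
    return wcstombs(mbstowcs(s)[start:start + count])

sample = "日本語\tab".encode("utf-8")
print(len(sample), wcslen(sample), wcswidth(sample))
print(substr("naïve".encode("utf-8"), 2, 2).decode("utf-8"))
```

The sample string shows three different answers for one string: bytes, characters, and screen columns. That difference is the reason wcslen and wcswidth are proposed as separate efuns.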

The efuns I've changed are: capitalize, lower_case, and upper_case.

Please keep in mind: I am not a programmer. If my ideas are flawed, or if my implementations are bad, blame my lack of schooling. Because of this, it's not a bad idea to check my code for simple mistakes any freshman CS student would make.
Tags: No tags attached.

Relationships

parent of   0000440  closed  Gnomi  LDMud 3.5  Modified efun: upper_case()
parent of   0000439  closed  Gnomi  LDMud 3.5  Modified efun: lower_case()
parent of   0000438  closed  Gnomi  LDMud 3.5  Modified efun: capitalize()
parent of   0000436  closed  Gnomi  LDMud 3.5  New efun: mbstowcs()
parent of   0000435  closed  Gnomi  LDMud 3.5  New efun: wcstombs()
parent of   0000434  closed  Gnomi  LDMud 3.5  New efun: wcswidth()
parent of   0000433  closed  Gnomi  LDMud 3.5  New efun: wcslen()
parent of   0000437  closed  Gnomi  LDMud 3.5  New efun: substr()
parent of   0000540  closed  Gnomi  LDMud 3.6  lower_case, upper_case and capitalize errors
related to  0000066  closed  Gnomi  LDMud 3.5  socket level charset conversion (unicode support)

Activities

iago3

2006-01-06 20:53

reporter   ~0000472

Added bugs 0000433-440 for each function. There are more string functions to potentially convert, but these should cover the basics.

fippo

2006-03-02 08:37

reporter   ~0000482

What about adding a 'multibyte' flag to the mstring structure to specify whether the string is a multibyte string?

E.g. if a string is supplied as input to strlen and that flag is set, strlen will act as your wcslen() does, otherwise it will retain its current behaviour.

The only new efun needed then would be something like 'toggleWCS', which sets or unsets this flag.

Note:
If people want the number of bytes in a string, they should use sizeof(), not strlen.
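
The flag idea could look roughly like this. This is a Python sketch of the concept only: `MString`, its fields, and `toggle_wcs` are hypothetical names for illustration and say nothing about how the driver's mstring structure is actually laid out.

```python
class MString:
    """Hypothetical stand-in for the driver's mstring structure."""

    def __init__(self, data, multibyte=False):
        self.data = data              # raw bytes of the string
        self.multibyte = multibyte    # the proposed per-string flag

def strlen(s):
    """Count characters if the flag is set, otherwise count bytes."""
    if s.multibyte:
        return len(s.data.decode("utf-8"))
    return len(s.data)

def toggle_wcs(s, enabled):
    """Return the same data with the multibyte flag set or cleared."""
    return MString(s.data, enabled)

plain = MString("häß".encode("utf-8"))
flagged = toggle_wcs(plain, True)
print(strlen(plain), strlen(flagged))
```

With this design the same efun gives the old byte count for unflagged strings and a character count for flagged ones, so existing code keeps its behaviour until a string is explicitly marked.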

Gnomi

2006-03-02 08:56

manager   ~0000483

> If people want the number of bytes in a string, they should use sizeof(), not strlen.

This may break things, because currently sizeof is a synonym for strlen (in regard to strings) and is used as such. (I prefer the use of sizeof, because I somehow expect strlen to be declared as obsoleted by sizeof sometime in the future.)

But I like the idea of a multibyte flag instead of additional efuns, because then you can introduce multibyte strings in a whole MUD without greater changes in the mudlib.

iago3

2006-03-02 16:04

reporter   ~0000484

The only benefit I see from using the multibyte flag is maybe being able to safely merge wcslen and strlen. We would still need four new efuns: wcswidth, wcstombs, mbstowcs, and substr. We could probably safely modify the existing capitalize, lower_case, and upper_case efuns regardless of whether we implement the multibyte flag or not.

Having existing code "just work" is nice, but in many cases it simply cannot be done. For example, as far as I can tell, all mudlibs assume a string with a length of one character takes up one column on the screen. Having wcslen and wcswidth as different efuns would break this old coding habit, but would not break old strlen-based mudlibs.

Also, I know it is very common to have code like string1[0..strlen(string2)]--which requires that strlen return the number of bytes, not characters, in a string. This is not necessarily a bad coding habit--in C, strlen also merely returns the bytes in a string, and people are used to that. If you want characters in C, you use wcslen.
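
The hazard described here shows up as soon as a character count is used as a byte index. A small Python sketch with made-up sample strings, where `bytes` again stands in for a byte-indexed LPC string (note that Python slices exclude the end index, unlike LPC ranges):

```python
line = "née Dupont".encode("utf-8")
prefix = "né".encode("utf-8")

byte_count = len(prefix)                    # bytes in the prefix
char_count = len(prefix.decode("utf-8"))    # characters in the prefix

good_cut = line[:byte_count]    # byte count used with byte indexing
bad_cut = line[:char_count]     # character count used with byte indexing

def is_valid_utf8(data):
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(good_cut), is_valid_utf8(bad_cut))
```

The second cut ends in the middle of the two-byte é, so length and indexing have to agree on the same unit, whichever unit is chosen.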

I'm not saying LPC should always emulate the way C does things, but I think there are valid reasons for separate efuns. I too don't want to break existing mudlibs.

If we do implement a multibyte flag, we would need a quick-and-easy way for people to make all strings multibyte by default if their mudlib can handle it.

Gnomi

2006-03-02 17:04

manager   ~0000485

Why should [..] applied to a multibyte string work on byte basis and not with characters? Then str[..strlen(str2)] would work fine.

iago3

2006-03-03 20:32

reporter   ~0000490

Fair enough, I suppose we could change [..] and [..<2], etc to use characters rather than bytes. This would get rid of the need for a new substr function!

So, we're talking about:
- Making a "multibyte" flag for the mstring structure, and a way to turn the flag on or off by default globally.
- Changing the following efuns to handle both "normal" and multibyte strings: strlen, capitalize, lower_case, and upper_case
- Creating the following new efuns: wcswidth, wcstombs, mbstowcs
- Changing [..]-type operators to function like the sample code originally proposed for substr()

So that's only three new efuns, plus whatever we need to manipulate the multibyte flag. Not bad!
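
A character-based range operator might behave like the following Python sketch. `char_range` is a hypothetical helper, and the handling of `<n` assumes the usual LPC convention that `<1` refers to the last element; ranges are inclusive at both ends, as in LPC.

```python
def char_range(data, first, last, last_from_end=False):
    """Return characters first..last (inclusive) of UTF-8 encoded data."""
    chars = data.decode("utf-8")
    if last_from_end:
        # Assumed LPC convention: <1 is the last character, <2 the one before.
        last = len(chars) - last
    return chars[first:last + 1].encode("utf-8")

word = "Straße".encode("utf-8")

head = char_range(word, 0, 3)             # like word[0..3] on characters
trimmed = char_range(word, 0, 2, True)    # like word[0..<2] on characters
byte_slice = word[0:5]                    # bytes 0..4 cut the ß in half

print(head, trimmed, byte_slice)
```

Counting in characters keeps the ß intact in both range forms, while the byte-based slice of the same nominal length ends on a dangling lead byte.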

zesstra

2008-07-01 11:20

administrator   ~0000647

I suggest planning multibyte char support for 3.5 as well, and no longer for 3.3.

menaures

2008-07-02 06:06

reporter   ~0000658

Is it really necessary to have a separate set of efuns for multibyte?

Quote from the original bug "strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string."

I see many more cases where strlen actually refers to the number of characters, and not the number of bytes, e.g. the maximum length of a character's name or similar player input and similar string length limits, the width of a string (in characters) for string formatting, etc. Plus, code that works with strings as bytes is likely to be broken anyway, as that code is unlikely to handle multibyte strings correctly today.

I'd hate having to replace strlen/sizeof with some new function everywhere where you work with strings as text and not bytes (and in a MUD I think that's practically everywhere).

The only efuns I can think of that work with bytes instead of characters are read_bytes, write_bytes, and file_size. And of these, only read_bytes would be prone to returning only half a character.
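
The "half a character" problem with fixed-size byte reads can be sketched in Python. The chunk boundaries below are made up for illustration, and the incremental decoder from the standard `codecs` module is just one way to carry a partial character over to the next read; it is not something the driver is known to do.

```python
import codecs

data = "añb".encode("utf-8")        # 4 bytes; ñ occupies bytes 1 and 2
chunks = [data[:2], data[2:]]       # a read boundary that splits the ñ

# Decoding the first chunk on its own fails because its last byte is only
# the first half of a two-byte character.
try:
    chunks[0].decode("utf-8")
    first_chunk_complete = True
except UnicodeDecodeError:
    first_chunk_complete = False

# An incremental decoder buffers the dangling byte until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))
text = "".join(pieces)

print(first_chunk_complete, pieces, text)
```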

Personally I'd like to see multibyte strings as the default for everything, with special functions added only for those who need to do something very unusual for a MUD environment, like working with single bytes or even binary data. The move to multibyte is likely to break something anyway.

Gnomi

2008-07-02 06:09

manager   ~0000659

Last edited: 2008-07-02 06:09

See the third comment: I'm in favor of a multibyte flag so that just the current efuns and operators behave differently.

Issue History

Date Modified Username Field Change
2006-01-06 20:32 iago3 New Issue
2006-01-06 20:53 iago3 Note Added: 0000472
2006-03-02 08:37 fippo Note Added: 0000482
2006-03-02 08:56 Gnomi Note Added: 0000483
2006-03-02 16:04 iago3 Note Added: 0000484
2006-03-02 17:04 Gnomi Note Added: 0000485
2006-03-03 20:32 iago3 Note Added: 0000490
2008-07-01 11:20 zesstra Note Added: 0000647
2008-07-02 03:13 Gnomi Project LDMud => LDMud 3.5
2008-07-02 03:15 Gnomi Relationship added parent of 0000440
2008-07-02 03:16 Gnomi Relationship added parent of 0000439
2008-07-02 03:16 Gnomi Relationship added parent of 0000438
2008-07-02 03:16 Gnomi Relationship added child of 0000437
2008-07-02 03:17 Gnomi Relationship added parent of 0000436
2008-07-02 03:17 Gnomi Relationship added parent of 0000435
2008-07-02 03:17 Gnomi Relationship added parent of 0000434
2008-07-02 03:18 Gnomi Relationship added parent of 0000433
2008-07-02 03:18 Gnomi Relationship deleted child of 0000437
2008-07-02 03:18 Gnomi Relationship added parent of 0000437
2008-07-02 06:06 menaures Note Added: 0000658
2008-07-02 06:09 Gnomi Note Added: 0000659
2008-07-02 06:09 Gnomi Note Edited: 0000659
2009-01-08 07:41 Gnomi Relationship added related to 0000066
2016-10-21 12:30 Gnomi Assigned To => Gnomi
2016-10-21 12:30 Gnomi Status new => assigned
2018-01-29 23:44 zesstra Project LDMud 3.5 => LDMud 3.6
2018-01-29 23:44 zesstra Category Efuns => General
2018-01-30 19:37 Gnomi Relationship added parent of 0000540
2019-09-24 10:27 Gnomi Status assigned => resolved
2019-09-24 10:27 Gnomi Resolution open => fixed
2019-09-24 10:27 Gnomi Fixed in Version => 3.6.0