View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0000432||LDMud 3.6||General||public||2006-01-06 20:32||2019-09-24 10:27|
|Fixed in Version||3.6.0|
|Summary||0000432: Add string efuns with multibyte character support|
|Description||This is a "parent" bug for multibyte support in string efuns IN GENERAL. The specific efuns and implementations are filed as separate bugs. "Big picture" discussions can go here. My goal is that all text handled and stored by the mudlib should be in one multibyte character set (typically UTF-8), which can then be converted on-the-fly as it is displayed (see bug#426 for conversion issues).|
Here were my design considerations when creating string efuns with multibyte character support. Some of them may be based on faulty assumptions. If so, please correct them and make any required changes to the efuns I've created.
1) "Multibyte" refers to strings encoded in the native multibyte character set specified by the driver host's locale. For different hosts, this could be different values. If a different multibyte encoding from the native one is needed, convert_charset should be used in conjunction with the efun. Sticking to the native charset should make integration with other applications--Perl regular expressions, for example--work transparently. The driver could perhaps allow a configuration option to specify a locale for the driver, but that is not something I'd recommend.
2) I see no reason to create a new "wide character" datatype. A simple array of integers should suffice to represent a wide character string, and provide sufficient values well into the foreseeable future. Direct manipulation of wide characters from within the mudlib should be possible, but limited.
3) Old efuns should only be changed when there would be no significant difference in behavior on old systems--otherwise, new efuns should be created. For example, strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string. capitalize() can be changed because its behavior outside US-ASCII is undefined, and UTF-8 is a superset of US-ASCII. [Is this a safe assumption? We could also make a wcapitalize()...]
4) Multibyte string efuns should allow strings that contain \0 characters, like existing string efuns (this makes the code much more complicated, but more compatible).
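To make considerations 2 to 4 concrete, here is a minimal sketch in Python (standing in for LPC, and assuming UTF-8 as the multibyte charset). The variable names are illustrative only and are not part of any proposed efun.

```python
# Illustrative only: Python bytes stand in for an LPC multibyte string.
text = "Gr\u00fc\u00dfe"            # 5 characters, two of them outside US-ASCII
encoded = text.encode("utf-8")      # the multibyte (UTF-8) representation

# Consideration 3: a byte-based strlen() and a character count disagree.
byte_count = len(encoded)           # number of bytes
char_count = len(text)              # number of characters

# Consideration 2: a "wide string" as a plain array of integers.
wide = [ord(ch) for ch in text]

# Consideration 4: an embedded \0 is ordinary data and must survive a round trip.
with_nul = "a\0\u00fc".encode("utf-8")
round_trip = with_nul.decode("utf-8")
```

The two counts differ because each non-ASCII character here takes two bytes in UTF-8, which is exactly why changing strlen() silently would be risky.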
The efuns I've created are:
wcslen: Returns the number of multibyte characters in the given multibyte string or array of wide characters (in the case of an array, it's simply the size of the array).
wcswidth: Returns the number of screen columns the given multibyte string or array of wide characters will occupy. The behavior of this function is a little unusual because of the quirks of POSIX wcwidth(). I've modified it to be slightly more useful for our purposes. Characters normally reporting a negative column width are assigned zero width. Tab characters, which normally report a width of zero (because it's variable), are assigned a width of 8.
wcstombs: Takes a single wide character or array of wide characters and returns a multibyte string. If strings are mixed into the array of wide characters, they will be inserted into the resulting string at those positions, unmodified.
mbstowcs: Takes a multibyte string and returns an array of wide characters.
substr: Returns a substring of the given multibyte string. The second argument is the start and the third argument is the number of characters. Example: For US-ASCII strings, string[2..3] and substr(string,2,2) would be equivalent. Only the latter is safe to use on multibyte strings.
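The behavior described for the five new efuns can be approximated as follows. This is a rough Python sketch, not the submitted C implementation: it assumes UTF-8 as the native multibyte charset, represents multibyte strings as bytes, and estimates column widths from the unicodedata module rather than calling POSIX wcwidth(), so widths may differ from the real efun for some characters.

```python
import unicodedata

def mbstowcs(mb):
    """Multibyte string -> array of wide characters (integer code points)."""
    return [ord(ch) for ch in mb.decode("utf-8")]

def wcstombs(wide):
    """Wide character, or array of wide characters and strings -> multibyte string."""
    if isinstance(wide, int):
        wide = [wide]
    out = bytearray()
    for item in wide:
        if isinstance(item, bytes):
            out += item                       # strings are inserted unmodified
        else:
            out += chr(item).encode("utf-8")
    return bytes(out)

def wcslen(value):
    """Character count of a multibyte string, or the size of a wide array."""
    if isinstance(value, bytes):
        return len(value.decode("utf-8"))
    return len(value)

def _columns(cp):
    """Approximate column width of one code point (assumption, not wcwidth())."""
    ch = chr(cp)
    if ch == "\t":
        return 8                              # tab gets a fixed width of 8
    if unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
        return 0                              # control/combining: zero width
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2                              # wide and fullwidth characters
    return 1

def wcswidth(value):
    """Screen columns occupied by a multibyte string or wide array."""
    wide = mbstowcs(value) if isinstance(value, bytes) else value
    return sum(_columns(cp) for cp in wide)

def substr(mb, start, length):
    """Substring counted in characters, not bytes."""
    return mb.decode("utf-8")[start:start + length].encode("utf-8")
```

Note that embedded \0 bytes pass through all of these unchanged, since Python bytes objects are length-counted rather than NUL-terminated.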
The efuns I've changed are: capitalize, lower_case, and upper_case.
Please keep in mind: I am not a programmer. If my ideas are flawed, or if my implementations are bad, blame my lack of schooling. Because of this, it's not a bad idea to check my code for simple mistakes any freshman CS student would make.
|Tags||No tags attached.|
|parent of||0000440||closed||Gnomi||LDMud 3.5||Modified efun: upper_case()|
|parent of||0000439||closed||Gnomi||LDMud 3.5||Modified efun: lower_case()|
|parent of||0000438||closed||Gnomi||LDMud 3.5||Modified efun: capitalize()|
|parent of||0000436||closed||Gnomi||LDMud 3.5||New efun: mbstowcs()|
|parent of||0000435||closed||Gnomi||LDMud 3.5||New efun: wcstombs()|
|parent of||0000434||closed||Gnomi||LDMud 3.5||New efun: wcswidth()|
|parent of||0000433||closed||Gnomi||LDMud 3.5||New efun: wcslen()|
|parent of||0000437||closed||Gnomi||LDMud 3.5||New efun: substr()|
|parent of||0000540||closed||Gnomi||LDMud 3.6||lower_case, upper_case and capitalize errors|
|related to||0000066||closed||Gnomi||LDMud 3.5||socket level charset conversion (unicode support)|
Added bugs 0000433-440 for each function. There are more string functions to potentially convert, but these should cover the basics.
What about adding a 'multibyte' flag to the mstring structure to specify whether the string is a multibyte string?
E.g. if a string is supplied as input to strlen and that flag is set, strlen will act as your wcslen() does, otherwise it will retain its current behaviour.
The only new efun needed then would be some 'toggleWCS', which sets or unsets this flag.
If people want the number of bytes in a string, they should use sizeof(), not strlen.
> If people want the number of bytes in a string, they should use sizeof(), not strlen.
This may break things, because currently sizeof is a synonym for strlen (in regard to strings) and is used as such. (I prefer the use of sizeof, because I somehow expect strlen to be declared as obsoleted by sizeof sometime in the future.)
But I like the idea of a multibyte flag instead of additional efuns, because then you can introduce multibyte strings throughout a whole MUD without major changes in the mudlib.
The only benefit I see from using the multibyte flag is maybe being able to safely merge wcslen and strlen. We would still need four new efuns: wcswidth, wcstombs, mbstowcs, and substr. We could probably safely modify the existing capitalize, lower_case, and upper_case efuns regardless of whether we implement the multibyte flag or not.
Having existing code "just work" is nice, but in many cases it simply cannot be done. For example, as far as I can tell, all mudlibs assume a string with a length of one character takes up one column on the screen. Having wcslen and wcswidth as different efuns would break this old coding habit, but would not break old strlen-based mudlibs.
Also, I know it is very common to have code like string1[0..strlen(string2)]--which requires that strlen return the number of bytes, not characters, in a string. This is not necessarily a bad coding habit--in C, strlen also merely returns the bytes in a string, and people are used to that. If you want characters in C, you use wcslen.
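The concern about the string1[0..strlen(string2)] idiom comes down to the length and the index using the same unit. A small Python sketch (bytes standing in for LPC strings, UTF-8 assumed, function names hypothetical) shows a prefix test that works when both are in bytes and fails when a character count is applied to a byte range:

```python
def starts_with_bytes(s1, s2):
    """Byte length applied to a byte range: the units agree."""
    return s1[:len(s2)] == s2

def starts_with_mixed(s1, s2):
    """Character count applied to a byte range: the units disagree."""
    nchars = len(s2.decode("utf-8"))
    return s1[:nchars] == s2

s1 = "\u00fcber alles".encode("utf-8")
s2 = "\u00fcber".encode("utf-8")      # 4 characters, 5 bytes

consistent = starts_with_bytes(s1, s2)
mixed = starts_with_mixed(s1, s2)
```

Here the consistent version reports a match, while the mixed version cuts the range one byte short and misses it, which is the kind of breakage a silently changed strlen() would cause unless the range operator changes with it.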
I'm not saying LPC should always emulate the way C does things, but I think there are valid reasons for separate efuns. I too don't want to break existing mudlibs.
If we do implement a multibyte flag, we would need a quick-and-easy way for people to make all strings multibyte by default if their mudlib can handle it.
Why should [..] applied to a multibyte string work on a byte basis and not with characters? Then str[..strlen(str2)] would work fine.
Fair enough, I suppose we could change [..] and [..<2], etc to use characters rather than bytes. This would get rid of the need for a new substr function!
So, we're talking about:
- Making a "multibyte" flag for the mstring structure, and a way to turn the flag on or off by default globally.
- Changing the following efuns to handle both "normal" and multibyte strings: strlen, capitalize, lower_case, and upper_case
- Creating the following new efuns: wcswidth, wcstombs, mbstowcs
- Changing [..]-type operators to function like the sample code originally proposed for substr()
So that's only three new efuns, plus whatever we need to manipulate the multibyte flag. Not bad!
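As a rough illustration of the flag-based plan summarized above, the following Python sketch models a string carrying a multibyte flag, with length and range operations that switch between bytes and characters. The class and method names are hypothetical stand-ins, not the driver's actual mstring API, and UTF-8 is assumed.

```python
class MString:
    """Hypothetical stand-in for the driver's mstring with a multibyte flag."""

    def __init__(self, data, multibyte=False):
        self.data = data              # raw bytes of the string
        self.multibyte = multibyte    # the proposed flag

    def strlen(self):
        """Characters if the flag is set, bytes otherwise."""
        if self.multibyte:
            return len(self.data.decode("utf-8"))
        return len(self.data)

    def range(self, first, last):
        """Inclusive range, like LPC's str[first..last]."""
        if self.multibyte:
            piece = self.data.decode("utf-8")[first:last + 1].encode("utf-8")
        else:
            piece = self.data[first:last + 1]
        return MString(piece, self.multibyte)   # the result keeps the flag


raw = "na\u00efve".encode("utf-8")    # 5 characters, 6 bytes
flagged = MString(raw, multibyte=True)
plain = MString(raw)
```

With the flag set, strlen() and the range operator agree on characters; without it, both keep their current byte-based behavior, so old code is unaffected until a string is flagged.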
I suggest planning multibyte char support for 3.5 as well, and no longer for 3.3.
Is it really necessary to have a separate set of efuns for multibyte?
Quote from the original bug "strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string."
I see many more cases where strlen actually refers to the number of characters, and not the number of bytes, e.g. the maximum length of a character's name or similar player input, other string length limits, the width of a string (in characters) for string formatting, etc. Plus, code that works with strings as bytes is likely to be broken anyway, as that code is unlikely to handle multibyte strings correctly today.
I'd hate having to replace strlen/sizeof with some new function everywhere where you work with strings as text and not bytes (and in a MUD I think that's practically everywhere).
The only EFuns I can think of that work with bytes instead of characters, are read_bytes, write_bytes, and file_size. And out of these only read_bytes would be prone to returning only half a character.
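The "half a character" risk with byte-oriented reads can be sketched in Python (not LPC; UTF-8 assumed, helper name hypothetical). A chunk boundary that falls inside a multibyte character makes the first chunk undecodable on its own, while an incremental decoder buffers the incomplete bytes until the rest arrives:

```python
import codecs

def decode_chunks(chunks):
    """Decode byte chunks that may split a multibyte character between them."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for chunk in chunks:
        parts.append(decoder.decode(chunk))      # incomplete tail is buffered
    parts.append(decoder.decode(b"", final=True))  # raises if bytes are left over
    return "".join(parts)

data = "caf\u00e9".encode("utf-8")   # 4 characters, 5 bytes
first, second = data[:4], data[4:]   # the split lands inside the last character

try:
    first.decode("utf-8")
    first_alone_ok = True
except UnicodeDecodeError:
    first_alone_ok = False

joined = decode_chunks([first, second])
```

Whether the driver should buffer like this or simply leave read_bytes as a raw byte interface is a design choice this sketch does not settle.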
Personally I'd like to see multibyte strings as the default for everything, with special functions added only for those who need to do something very unusual for a MUD environment, like working with single bytes or even binary data. The move to multibyte is likely to break something anyway.
See the third comment, I'm in favor of a multibyte flag so that just the current efuns and operators behave differently.
||Note Added: 0000472|
|2006-03-02 08:37||fippo||Note Added: 0000482|
|2006-03-02 08:56||Gnomi||Note Added: 0000483|
||Note Added: 0000484|
|2006-03-02 17:04||Gnomi||Note Added: 0000485|
||Note Added: 0000490|
|2008-07-01 11:20||zesstra||Note Added: 0000647|
|2008-07-02 03:13||Gnomi||Project||LDMud => LDMud 3.5|
|2008-07-02 03:15||Gnomi||Relationship added||parent of 0000440|
|2008-07-02 03:16||Gnomi||Relationship added||parent of 0000439|
|2008-07-02 03:16||Gnomi||Relationship added||parent of 0000438|
|2008-07-02 03:16||Gnomi||Relationship added||child of 0000437|
|2008-07-02 03:17||Gnomi||Relationship added||parent of 0000436|
|2008-07-02 03:17||Gnomi||Relationship added||parent of 0000435|
|2008-07-02 03:17||Gnomi||Relationship added||parent of 0000434|
|2008-07-02 03:18||Gnomi||Relationship added||parent of 0000433|
|2008-07-02 03:18||Gnomi||Relationship deleted||child of 0000437|
|2008-07-02 03:18||Gnomi||Relationship added||parent of 0000437|
|2008-07-02 06:06||menaures||Note Added: 0000658|
|2008-07-02 06:09||Gnomi||Note Added: 0000659|
|2008-07-02 06:09||Gnomi||Note Edited: 0000659|
|2009-01-08 07:41||Gnomi||Relationship added||related to 0000066|
|2016-10-21 12:30||Gnomi||Assigned To||=> Gnomi|
|2016-10-21 12:30||Gnomi||Status||new => assigned|
|2018-01-29 23:44||zesstra||Project||LDMud 3.5 => LDMud 3.6|
|2018-01-29 23:44||zesstra||Category||Efuns => General|
|2018-01-30 19:37||Gnomi||Relationship added||parent of 0000540|
|2019-09-24 10:27||Gnomi||Status||assigned => resolved|
|2019-09-24 10:27||Gnomi||Resolution||open => fixed|
|2019-09-24 10:27||Gnomi||Fixed in Version||=> 3.6.0|