0000432: Add string efuns with multibyte character support

ID	Project	Category	View Status	Date Submitted	Last Update

0000432	LDMud 3.6	General	public	2006-01-06 19:32	2019-09-24 08:27

Reporter	~~iago3~~	Assigned To	Gnomi
Priority	normal	Severity	feature	Reproducibility	N/A
Status	resolved	Resolution	fixed
Product Version	3.4.0
Fixed in Version	3.6.0

Summary	0000432: Add string efuns with multibyte character support
Description	This is a "parent" bug for multibyte support in string efuns in general. The specific efuns and implementations are filed as separate bugs. "Big picture" discussions can go here.

My goal is that all text handled and stored by the mudlib should be in one multibyte character set (typically UTF-8), which can then be converted on the fly as it is displayed (see bug#426 for conversion issues).

Here are my design considerations for the string efuns with multibyte character support. Some of them may be based on faulty assumptions. If so, please correct them and make any required changes to the efuns I've created.

1) "Multibyte" refers to strings encoded in the native multibyte character set specified by the driver host's locale. This can differ from host to host. If a multibyte encoding other than the native one is needed, convert_charset should be used in conjunction with the efun. Sticking to the native charset should make integration with other applications (Perl regular expressions, for example) work transparently. The driver could perhaps offer a configuration option to specify a locale for the driver, but that is not something I'd recommend.

2) I see no reason to create a new "wide character" datatype. A simple array of integers should suffice to represent a wide character string, and provide sufficient values well into the foreseeable future. Direct manipulation of wide characters from within the mudlib should be possible, but limited.

3) Old efuns should only be changed when there would be no significant difference in behavior on old systems; otherwise, new efuns should be created. For example, strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string. capitalize() can be changed because its behavior outside US-ASCII is undefined, and UTF-8 is a superset of US-ASCII. [Is this a safe assumption? We could also make a wcapitalize()...]

4) Multibyte string efuns should allow strings that contain \0 characters, like existing string efuns (this makes the code much more complicated, but more compatible).

The efuns I've created are:

int wcslen(string|int*)
Returns the number of multibyte characters in the given multibyte string or array of wide characters (in the case of an array, it's simply the size of the array).

int wcswidth(string|int*)
Returns the number of screen columns the given multibyte string or array of wide characters will occupy. The behavior of this function is a little odd because of the behavior of POSIX wcwidth(). I've modified it to be slightly more useful for our purposes. Characters normally reporting a negative column width are assigned zero width. Tab characters, which normally report a width of zero (because it's variable), are assigned a width of 8.

string wcstombs(int|int*)
Takes a single wide character or array of wide characters and returns a multibyte string. If strings are mixed into the array of wide characters, they will be inserted into the resulting string at those positions, unmodified.

int* mbstowcs(string)
Takes a multibyte string and returns an array of wide characters.

string substr(string,int,int)
Returns a substring of the given multibyte string. The second argument is the start and the third argument is the number of characters. Example: For US-ASCII strings, string[2..3] and substr(string,2,2) would be equivalent. Only the latter is safe to use on multibyte strings.

The efuns I've changed are: capitalize, lower_case, and upper_case.

Please keep in mind: I am not a programmer. If my ideas are flawed, or if my implementations are bad, blame my lack of schooling. Because of this, it's a good idea to check my code for the simple mistakes any freshman CS student would make.
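
For readers who want to see the proposed semantics side by side, here is a rough sketch in Python (not LPC, and not the code attached to the child bugs). It assumes the native multibyte charset is UTF-8, models a multibyte string as `bytes` and a wide character string as a list of integers, and approximates column widths with Python's `unicodedata` tables rather than POSIX wcwidth(). The function names mirror the efuns described above; everything else is an illustrative assumption.

```python
import unicodedata


def mbstowcs(mb):
    """Decode a UTF-8 byte string into a list of code points."""
    return [ord(ch) for ch in mb.decode("utf-8")]


def wcstombs(wcs):
    """Encode one code point, or a list mixing code points and byte strings."""
    if isinstance(wcs, int):
        wcs = [wcs]
    out = bytearray()
    for item in wcs:
        if isinstance(item, bytes):
            out += item  # embedded strings are copied through unmodified
        else:
            out += chr(item).encode("utf-8")
    return bytes(out)


def wcslen(s):
    """Number of characters; for an array it is simply the array size."""
    return len(s) if isinstance(s, list) else len(mbstowcs(s))


def substr(mb, start, count):
    """Substring counted in characters, not bytes."""
    return mb.decode("utf-8")[start:start + count].encode("utf-8")


def wcswidth(s):
    """Approximate screen columns: tab counts as 8, non-printing as 0."""
    code_points = s if isinstance(s, list) else mbstowcs(s)
    width = 0
    for cp in code_points:
        ch = chr(cp)
        if ch == "\t":
            width += 8
        elif unicodedata.category(ch) in ("Cc", "Cf", "Mn", "Me"):
            width += 0  # control, format and combining characters
        elif unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters
        else:
            width += 1
    return width


# A string with an accented letter, an embedded \0 and two CJK characters.
text = "na\u00efve\x00\u65e5\u672c".encode("utf-8")
print(len(text), wcslen(text), wcswidth(text))
```

The three numbers printed for the sample string differ from one another, which is the reason the description keeps byte length, character count and column width as separate notions.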
Tags	No tags attached.

parent of	0000440	closed	Gnomi	LDMud 3.5	Modified efun: upper_case()
parent of	0000439	closed	Gnomi	LDMud 3.5	Modified efun: lower_case()
parent of	0000438	closed	Gnomi	LDMud 3.5	Modified efun: capitalize()
parent of	0000436	closed	Gnomi	LDMud 3.5	New efun: mbstowcs()
parent of	0000435	closed	Gnomi	LDMud 3.5	New efun: wcstombs()
parent of	0000434	closed	Gnomi	LDMud 3.5	New efun: wcswidth()
parent of	0000433	closed	Gnomi	LDMud 3.5	New efun: wcslen()
parent of	0000437	closed	Gnomi	LDMud 3.5	New efun: substr()
parent of	0000540	closed	Gnomi	LDMud 3.6	lower_case, upper_case and capitalize errors
related to	0000066	closed	Gnomi	LDMud 3.5	socket level charset conversion (unicode support)

~~iago3~~ 2006-01-06 19:53 reporter ~0000472	Added bugs 0000433-440 for each function. There are more string functions to potentially convert, but these should cover the basics.

fippo 2006-03-02 07:37 reporter ~0000482	What about adding a 'multibyte' flag to the mstring structure to specify whether the string is a multibyte string? E.g. if a string is supplied as input to strlen and that flag is set, strlen will act as your wcslen() does; otherwise it will retain its current behaviour. The only new efun needed would then be something like 'toggleWCS', which sets or unsets this flag. Note: If people want the number of bytes in a string, they should use sizeof(), not strlen.
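
A small sketch of how such a flag could change the result of strlen, written in Python for illustration only. The `MString` class, its `multibyte` field and the `toggle_wcs` helper are hypothetical stand-ins for the driver's mstring structure and the suggested 'toggleWCS' efun, and UTF-8 is assumed as the multibyte encoding.

```python
from dataclasses import dataclass


@dataclass
class MString:
    """Hypothetical stand-in for the driver's mstring structure."""
    data: bytes
    multibyte: bool = False  # the proposed flag


def strlen(s):
    """Count characters when the flag is set, bytes otherwise."""
    if s.multibyte:
        return len(s.data.decode("utf-8"))
    return len(s.data)


def toggle_wcs(s, on):
    """Hypothetical 'toggleWCS': return the same bytes with the flag changed."""
    return MString(s.data, on)


city = MString("Z\u00fcrich".encode("utf-8"))
print(strlen(city))                    # byte count, flag unset
print(strlen(toggle_wcs(city, True)))  # character count, flag set
```

The stored bytes never change; only the interpretation applied by the efun does, which is what would let an unmodified mudlib keep its current behaviour while the flag is off.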

Gnomi 2006-03-02 07:56 manager ~0000483	> If people want the number of bytes in a string, they should use sizeof(), not strlen. This may break things, because sizeof is currently a synonym for strlen (with regard to strings) and is used as such. (I prefer sizeof, because I somehow expect strlen to be declared obsolete in favor of sizeof sometime in the future.) But I like the idea of a multibyte flag instead of additional efuns, because then you can introduce multibyte strings in a whole MUD without major changes in the mudlib.

~~iago3~~ 2006-03-02 15:04 reporter ~0000484	The only benefit I see from using the multibyte flag is maybe being able to safely merge wcslen and strlen. We would still need four new efuns: wcswidth, wcstombs, mbstowcs, and substr. We could probably safely modify the existing capitalize, lower_case, and upper_case efuns regardless of whether we implement the multibyte flag or not.

Having existing code "just work" is nice, but in many cases it simply cannot be done. For example, as far as I can tell, all mudlibs assume a string with a length of one character takes up one column on the screen. Having wcslen and wcswidth as different efuns would break this old coding habit, but would not break old strlen-based mudlibs.

Also, I know it is very common to have code like string1[0..strlen(string2)]--which requires that strlen return the number of bytes, not characters, in a string. This is not necessarily a bad coding habit--in C, strlen also merely returns the bytes in a string, and people are used to that. If you want characters in C, you use wcslen. I'm not saying LPC should always emulate the way C does things, but I think there are valid reasons for separate efuns.

I too don't want to break existing mudlibs. If we do implement a multibyte flag, we would need a quick-and-easy way for people to make all strings multibyte by default if their mudlib can handle it.

Gnomi 2006-03-02 16:04 manager ~0000485	Why should [..] applied to a multibyte string work on byte basis and not with characters? Then str[..strlen(str2)] would work fine.

~~iago3~~ 2006-03-03 19:32 reporter ~0000490	Fair enough, I suppose we could change [..] and [..<2], etc. to use characters rather than bytes. This would get rid of the need for a new substr function! So, we're talking about:
- Making a "multibyte" flag for the mstring structure, and a way to set the flag on or off by default globally.
- Changing the following efuns to handle both "normal" and multibyte strings: strlen, capitalize, lower_case, and upper_case
- Creating the following new efuns: wcswidth, wcstombs, mbstowcs
- Changing [..]-type operators to function like the sample code originally proposed for substr()
So that's only three new efuns, plus whatever we need to manipulate the multibyte flag. Not bad!
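
To make the difference between byte-based and character-based range operators concrete, here is a hedged Python sketch. The helper names are invented for illustration, UTF-8 is assumed, and the inclusive bounds imitate LPC's str[first..last] and str[..<n]. This is not how the driver implements its operators.

```python
def byte_range(mb, first, last):
    """str[first..last] counted in bytes (inclusive bounds)."""
    return mb[first:last + 1]


def char_range(mb, first, last):
    """str[first..last] counted in characters (inclusive bounds)."""
    return mb.decode("utf-8")[first:last + 1].encode("utf-8")


def char_range_to_end_offset(mb, n):
    """str[..<n]: everything up to the n-th character from the end."""
    text = mb.decode("utf-8")
    return text[:len(text) - n + 1].encode("utf-8")


def is_valid_utf8(mb):
    """True if the bytes decode cleanly as UTF-8."""
    try:
        mb.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


greeting = "h\u00e9llo".encode("utf-8")

# The byte-based range cuts the two-byte character in half.
print(is_valid_utf8(byte_range(greeting, 0, 1)))
# The character-based range keeps it whole.
print(is_valid_utf8(char_range(greeting, 0, 1)))
```

With character-based ranges, an expression such as str[..strlen(str2)] stays consistent as long as strlen counts in the same unit as the range operator.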

zesstra 2008-07-01 09:20 administrator ~0000647	I suggest planning multibyte character support for 3.5 rather than 3.3.

menaures 2008-07-02 04:06 reporter ~0000658	Is it really necessary to have a separate set of efuns for multibyte? Quote from the original bug: "strlen() should not be changed to count the number of multibyte characters in a string because people may be relying on it to return the number of bytes (not characters) in a string."

I see many more cases where strlen actually refers to the number of characters, and not the number of bytes, e.g. the maximum length of a character's name or similar player input, other string length limits, the width of a string (in characters) for string formatting, etc. Besides, code that works with strings as bytes is likely to be broken anyway, as that code is unlikely to handle multibyte strings correctly today. I'd hate having to replace strlen/sizeof with some new function everywhere you work with strings as text and not bytes (and in a MUD I think that's practically everywhere).

The only efuns I can think of that work with bytes instead of characters are read_bytes, write_bytes, and file_size. And out of these only read_bytes would be prone to returning only half a character.

Personally I'd like to see multibyte strings as the default for everything, with special functions added only for those who need to do something very unusual for a MUD environment, like working with single bytes or even binary data. The move to multibyte is likely to break something anyway.
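
The "half a character" risk with byte-oriented reads can be illustrated with a short Python sketch. It assumes UTF-8 data arriving in arbitrary byte chunks and uses Python's standard incremental decoder to carry an incomplete character over to the next chunk. The `decode_chunks` helper is hypothetical and only stands in for whatever a mudlib would do around read_bytes.

```python
import codecs


def decode_chunks(chunks):
    """Decode UTF-8 chunks whose boundaries may split a character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for chunk in chunks:
        # Bytes of an incomplete trailing character are buffered internally.
        parts.append(decoder.decode(chunk))
    # final=True raises if the data ends in the middle of a character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


data = "Gr\u00fc\u00dfe".encode("utf-8")
first, second = data[:3], data[3:]  # the split lands inside a character

print(decode_chunks([first, second]) == data.decode("utf-8"))
```

Decoding `first` on its own would fail, because its last byte is only the first half of a two-byte character.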

Gnomi 2008-07-02 04:09 manager ~0000659 Last edited: 2008-07-02 04:09	See the third comment, I'm in favor of a multibyte flag so that just the current efuns and operators behave differently.

Date Modified	Username	Field	Change
2006-01-06 19:32	~~iago3~~	New Issue
2006-01-06 19:53	~~iago3~~	Note Added: 0000472
2006-03-02 07:37	fippo	Note Added: 0000482
2006-03-02 07:56	Gnomi	Note Added: 0000483
2006-03-02 15:04	~~iago3~~	Note Added: 0000484
2006-03-02 16:04	Gnomi	Note Added: 0000485
2006-03-03 19:32	~~iago3~~	Note Added: 0000490
2008-07-01 09:20	zesstra	Note Added: 0000647
2008-07-02 01:13	Gnomi	Project	LDMud => LDMud 3.5
2008-07-02 01:15	Gnomi	Relationship added	parent of 0000440
2008-07-02 01:16	Gnomi	Relationship added	parent of 0000439
2008-07-02 01:16	Gnomi	Relationship added	parent of 0000438
2008-07-02 01:16	Gnomi	Relationship added	child of 0000437
2008-07-02 01:17	Gnomi	Relationship added	parent of 0000436
2008-07-02 01:17	Gnomi	Relationship added	parent of 0000435
2008-07-02 01:17	Gnomi	Relationship added	parent of 0000434
2008-07-02 01:18	Gnomi	Relationship added	parent of 0000433
2008-07-02 01:18	Gnomi	Relationship deleted	child of 0000437
2008-07-02 01:18	Gnomi	Relationship added	parent of 0000437
2008-07-02 04:06	menaures	Note Added: 0000658
2008-07-02 04:09	Gnomi	Note Added: 0000659
2008-07-02 04:09	Gnomi	Note Edited: 0000659
2009-01-08 06:41	Gnomi	Relationship added	related to 0000066
2016-10-21 10:30	Gnomi	Assigned To	=> Gnomi
2016-10-21 10:30	Gnomi	Status	new => assigned
2018-01-29 22:44	zesstra	Project	LDMud 3.5 => LDMud 3.6
2018-01-29 22:44	zesstra	Category	Efuns => General
2018-01-30 18:37	Gnomi	Relationship added	parent of 0000540
2019-09-24 08:27	Gnomi	Status	assigned => resolved
2019-09-24 08:27	Gnomi	Resolution	open => fixed
2019-09-24 08:27	Gnomi	Fixed in Version	=> 3.6.0
