Gambas and UTF-8 (Was Re: [Gambas-user] a bit of confusion)

Benoit Minisini gambas at ...1...
Thu Dec 23 10:37:32 CET 2004


On Thursday 23 December 2004 00:00, Stefan Lamprecht wrote:
> i spend days finding out how to operate correctly with utf-8 strings.
> the information regarding this was close to zero (0) and more misleading in
> some cases.
> i suggest some kind of 'elegant' programming contest to gather some more
> and complete samples.
>
> feed the gambas
>

Gambas uses ASCII internally, but uses UTF-8 for strings that should be 
displayed, and for file names inside components.

If you don't know what UTF-8 and Unicode are, have a look here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

Qt uses Unicode internally, so every string sent to Qt is converted from
UTF-8, and every string received from Qt is converted to UTF-8. Note that
when you type a string in the IDE, you use Qt indirectly, so everything you
type is UTF-8. Create a text file with the IDE, enter non-ASCII characters,
save it, and open it with KWrite, and you will see what I mean.

When I say that Gambas uses ASCII internally, I mean that EVERY native
Gambas string function deals with ASCII, that is, one byte per character:
Left$(), Mid$(), Right$(), Instr(), Len(), ...

When you want to deal with UTF-8 strings, you have a class named "String" with 
many static methods that process UTF-8 strings instead of ASCII ones: for 
example, Len("é") = 2 and String.Len("é") = 1. Read the wiki for more 
information about each method.
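
As a minimal sketch of that difference (assuming the source file is saved as
UTF-8, which is what the IDE does; the variable name sText is only
illustrative):

```gambas
PUBLIC SUB Main()

  DIM sText AS String

  sText = "é"               ' two bytes in UTF-8: &HC3 &HA9

  PRINT Len(sText)          ' native function, counts bytes: 2
  PRINT String.Len(sText)   ' String class, counts UTF-8 characters: 1

END
```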

Gambas has a conversion function named... Conv$() that can convert a string
from almost any charset to almost any other. For example,
Conv$("àéî", "UTF-8", "ISO8859-1") converts the string from UTF-8 to
ISO8859-1.

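A small sketch of such a conversion, assuming the literal was typed in the
IDE and is therefore UTF-8 (the variable names are only illustrative). The
byte counts show that the data really changes:

```gambas
PUBLIC SUB Main()

  DIM sUtf8 AS String
  DIM sLatin1 AS String

  sUtf8 = "àéî"                                 ' typed in the IDE, so UTF-8
  sLatin1 = Conv$(sUtf8, "UTF-8", "ISO8859-1")  ' same text, other charset

  PRINT Len(sUtf8)     ' 6: each accented letter takes two bytes in UTF-8
  PRINT Len(sLatin1)   ' 3: each one takes a single byte in ISO8859-1

END
```
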
While the Qt component returns UTF-8 strings, the output of shell commands
uses the charset of the system. To deal with that problem, you must use
Conv$() with the following two class properties: System.Charset, which
returns the charset used by the system, and Desktop.Charset, which returns
the charset used by the GUI component.
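
One possible way to wrap this up, as a sketch only: ToDesktop is a
hypothetical helper name, and sRaw is assumed to hold text read from a shell
command. Desktop.Charset needs the GUI component to be loaded.

```gambas
' Hypothetical helper: converts text produced by a shell command
' (system charset) into the charset expected by the GUI component.
PUBLIC FUNCTION ToDesktop(sRaw AS String) AS String

  RETURN Conv$(sRaw, System.Charset, Desktop.Charset)

END
```

The result can then be assigned to a control, for example a label's Text
property. For the opposite direction (text typed in the GUI and passed to a
shell command), swap the two charset arguments.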

On Fedora, Desktop.Charset = System.Charset = "UTF-8", but not on Mandrake,
where System.Charset depends on your system language.

When all Linux systems become UTF-8 based, things will be simpler.

I hope things are clearer. Do not hesitate to ask questions about that.

Regards,

-- 
Benoit Minisini
mailto:gambas at ...1...



