Internationalization


Two different issues were initially conflated on this page:
  1. Internationalization: translating kernel messages and actions into different languages
  2. Multilanguage support (HandlingUTF8): making Wikka compatible with different charsets
Discussion of the latter has been moved to HandlingUTF8.
 



From Wikipedia:
Internationalization and localization both are means of adapting products such as publications or software for non-native environments, especially other nations and cultures.

"Internationalization" is often abbreviated as I18N (or i18n or I18n), where the number 18 refers to the number of letters omitted (conveniently, in either spelling). "Localization" is often abbreviated L10N (etc.) in the same manner.



There have been requests for Wikka to handle language translations. Now the question is, what is the best way to achieve this?

DotMG has made a proposal below. The method proposed is common in PHP coding and should work okay. However, before we move forward, it would be nice to have more feedback. Are there any pointers or suggestions for alternative methods? Searching the web turned up pointers to gettext, but it's not clear how portable this would be across web server environments.

Any other suggestions?

Regardless of what we decide, I think we should use the ISO 639-2 alpha-3 code as a standard for language abbreviations. Check the LanguageCodes page for a table with all the codes.



DotMG's proposal


To make Wikka available in more languages, we have to rewrite pages (especially actions/*.* and handlers/page/*.*) and replace the English text with something like:
echo sprintf($this->lang['some_thing'], $this->Format('somethingelse'), 'othertext');
and use a file such as langus.inc.php, whose content will be:
$this->lang = array(
'some_thing' => "In english, the text is '%1\$s' and '%2\$s'!"

);
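
The lookup described above can be sketched as a small helper. This is only an illustration, not DotMG's actual code: the function name translate() and the French sample string are assumptions. It shows why numbered placeholders such as %1\$s are useful: a translation can reorder the arguments without any change to the calling code.

```php
<?php
// Hypothetical content of an English language file.
$lang = array(
    'some_thing' => "In English, the text is '%1\$s' and '%2\$s'!",
);

// Hypothetical French file: the arguments appear in a different order.
$lang_fr = array(
    'some_thing' => "En français, '%2\$s' vient avant '%1\$s' !",
);

/**
 * Look up a message key and fill in its placeholders.
 * Falls back to the key itself when no translation exists.
 */
function translate(array $lang, $key, array $args = array())
{
    $template = isset($lang[$key]) ? $lang[$key] : $key;
    return vsprintf($template, $args);
}

echo translate($lang, 'some_thing', array('first', 'second')), "\n";
echo translate($lang_fr, 'some_thing', array('first', 'second')), "\n";
```

Falling back to the key keeps a page readable while a translation is still incomplete.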

I made a lot of modifications, and these are now available at http://wikka.dotmg.net
But they are not documented and need more testing.
To install, you just have to overwrite all existing files and reload the homepage.

If you want another language, just add a renamed copy of language/english.php to the language directory.
Known bug in this dev version:
handlers/page/edit.php
With $this->lang['edit_preview'] = 'Aper&ccedil;u'; in the French language file, the preview cannot be shown, because $_POST['submit'] == 'Aperçu' while $this->lang['edit_preview'] holds its HTML entity form (see above). To correct the problem, you can apply htmlentities() to $_POST['submit'], but the error will occur again if the language file contains a literal character such as ç.
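
One possible way around this mismatch is to decode both sides to plain text before comparing them. The sketch below is not part of DotMG's code: the helper name is_button() is hypothetical, and it assumes the form is submitted as UTF-8 (for an ISO-8859-1 page, pass that charset name instead).

```php
<?php
// Hypothetical language entry, stored with an HTML entity.
$lang = array('edit_preview' => 'Aper&ccedil;u');

/**
 * Return true when the submitted button label matches the translated label.
 * Both values are reduced to plain UTF-8 text first, so it does not matter
 * whether the language file uses entities or literal characters.
 */
function is_button($submitted, $label)
{
    $plain_label     = html_entity_decode($label, ENT_QUOTES, 'UTF-8');
    $plain_submitted = html_entity_decode($submitted, ENT_QUOTES, 'UTF-8');
    return $plain_submitted === $plain_label;
}

// "\xC3\xA7" is the UTF-8 byte sequence for the letter c with cedilla.
$posted = "Aper\xC3\xA7u";
var_dump(is_button($posted, $lang['edit_preview']));
```

A sturdier design might avoid comparing labels at all, for example by giving each submit button its own name attribute and testing which name is present in $_POST.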

Please let me know by mail if you find any bugs.
info at dotmg dot net


AndreaRossato's Approach


As far as I can see, multilanguage support is not only a matter of UI translation. You also need to work on character encoding to provide a fully multilanguage application.
The best encoding for this goal is UTF-8. The problem is that PHP has only limited support for it and, moreover, MySQL stores data as ISO-8859-1. To get an idea of what I mean, check the WikkaMultilanguageTestPage, where I inserted sentences in different languages.
Characters are translated into Unicode entities. But if you try to edit the page, the Unicode entities are not translated back into the original characters, and this makes editing the page impossible.

The only way to go is to use a set of functions that take care of character encodings. My approach (you can test it here) is to store data in the database as ISO-8859-1 plus Unicode entities, present the data in forms as UTF-8, and print it as ASCII plus Unicode entities.
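
The round trip between UTF-8 and ASCII plus numeric entities can be sketched as follows. This is not AndreaRossato's implementation: the function names are hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
/**
 * Conversion map for the mbstring entity functions:
 * array(first code point, last code point, offset, mask).
 * It covers everything above plain ASCII.
 */
function non_ascii_map()
{
    return array(0x80, 0x10FFFF, 0, 0x1FFFFF);
}

/** UTF-8 text from a form -> ASCII plus numeric entities (storage, output). */
function utf8_to_entities($text)
{
    return mb_encode_numericentity($text, non_ascii_map(), 'UTF-8');
}

/** ASCII plus numeric entities -> UTF-8 text (to refill the edit form). */
function entities_to_utf8($text)
{
    return mb_decode_numericentity($text, non_ascii_map(), 'UTF-8');
}

// "\xC3\xA7" is the UTF-8 byte sequence for the letter c with cedilla.
$stored = utf8_to_entities("Aper\xC3\xA7u");
echo $stored, "\n";
echo entities_to_utf8($stored), "\n";
```

One caveat of any scheme like this: a numeric entity that a user typed literally into a page cannot be told apart from a converted character once it has been stored.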

Here is some useful information.
-- AndreaRossato

A small clarification.

There is a small difference between internationalization and multilanguage.
With multilanguage, you can have many different character encodings on the same page, like AndreaRossato's WikkaMultiLanguageTestPage. For this, I think the only way is to use UTF-8 encoding.
But in my opinion, i18n means a wiki that has a base language (and a charset) other than English, i.e. all in Greek, or all in French... [in other words, <edit page> or <page history> translated into one base language other than English]. A first thing to do is to change the charset in <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" /> in actions/header.php (the value iso-8859-1 should be set in config.inc.php).
There is also a problem with functions like htmlentities(), as mentioned above, and we should take care of it.
--DotMG
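
As a rough illustration of the suggestion above, the charset could be read from the configuration and used both for the meta tag and for escaping. The configuration key 'charset' and both function names are assumptions made for this sketch; they are not existing Wikka settings.

```php
<?php
// Hypothetical configuration entry, as it might appear in config.inc.php.
$config = array('charset' => 'iso-8859-1');

/** Build the Content-Type meta tag for the configured charset. */
function content_type_meta(array $config)
{
    $charset = isset($config['charset']) ? $config['charset'] : 'iso-8859-1';
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . htmlspecialchars($charset, ENT_QUOTES) . '" />';
}

/** Escape text for output with the same charset the page declares. */
function escape_for_page($text, array $config)
{
    return htmlspecialchars($text, ENT_QUOTES, $config['charset']);
}

echo content_type_meta($config), "\n";
echo escape_for_page('<edit page>', $config), "\n";
```

Passing the charset explicitly to the escaping function keeps the declared encoding and the escaped output in agreement, which is the kind of mismatch behind the htmlentities() problem mentioned above.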

DotMG, you are totally right: i18n and multilanguage are two different concepts. Still, since providing i18n is not going to be easy, I would suggest having both i18n and multilanguage support. That would mean not having a new configuration option for character encoding.
Moreover, I think that changing the charset in the meta tags is not going to be as simple as one might think. The main issue is data storage in MySQL. That is to say, you should create a database with the appropriate charset setting. But AFAIK not everyone has access to this option. With my ISP, indeed, I do not have it.
--AndreaRossato

These are two completely different problems! Let's try to handle things one at a time. Translating the kernel messages should cover almost four continents, and solving that problem won't help with charset conversion. My suggestion is to give the charset topic its own page, perhaps HandlingUTF8, and to keep the tasks separate and concise.

Here are some ideas for implementing translated kernel/action messages: DartarI18N
-- DarTar


Gettext Approach

Here are some ideas to add gettext support: WikkaGettext.



Gregor Sysiphos work


A (hopefully) growing list of all phrases in Wikka: PhraseList



GiorgosKontopoulos's simple solution to Internationalization (at least the interface)


A solution to the internationalization problem (at least for making the interface and editing multilingual) is to change the Content-Type charset from ISO-8859-1 to UTF-8 in actions/header.php, on line 13:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

This probably does not work with older versions of MySQL and/or PHP (my versions are PHP 4.4.1 and MySQL 4.1.14-standard). Search does work, but the characters are unrecognizable when browsing the data with phpMyAdmin (I tried both UTF-8 and ISO-8859-1 text encoding in the browser).

The solution may be trivial to some of you, but I am putting it out there just in case. Test the solution in my SandBox.

See UTF8DatabaseCollation for a discussion of how to partially solve the problem by changing the collation of the wikka.page.body field (avoid this otherwise, since it has side effects).



One page by language


Another nice feature in a wiki is the possibility of writing the same page in different languages, as in Anwiki.


CategoryDevelopmentI18n
Comments
Comment by NilsLindenberg
2005-01-26 18:25:07
http://wiki.splitbrain.org/wiki:discussion:utf8 perhaps useful
Comment by JavaWoman
2005-01-27 05:56:14
Thanks, Nils - good info in there!
Comment by PivWan
2005-03-07 17:13:59
I'm using blog software which handles l10n pretty well. I'll try to work out how it does it and write a note here to describe it (in short: one PHP object and one lang file per language and encoding).
Comment by LocK
2005-03-10 05:44:39
I've used DokuWiki, which Nils linked to above. It can handle Chinese characters in page content without any problems, and so can Wikka 1.1.6. But our problem is more serious, because we use WikiNames as links inside every page: the parser searches each page for WikiNames and turns them into links. So besides UTF-8 in content and in search text, we want UTF-8 in link names too. DokuWiki's solution can be simpler because it uses forced links even for regular wiki links, but we don't. I tried adding a link with a UTF-8 name in DokuWiki. It works on the page where I added it, but it gets weird on other pages such as the index. It still has directions. However, what about ours? Any clues?
Comment by GeorgePetsagourakis
2005-03-19 08:49:31
A good way to do it would be like how the e107 cms is doing it ... I really enjoyed using it in this aspect recently ...
Comment by JavaWoman
2005-03-20 17:48:41
Many approaches that use a single "language file" to do "translations" fall short on one important aspect of languages: how to handle plurals.
Think of (for example) a search that tells you the number of hits found before listing them:
"13 occurrences of 'blah' found"
Now imagine that it might also be one:
"1 occurrence of 'blah' found"
or zero:
"no occurrences of 'blah' found" (or "0 occurrences...")
So ... do you create three strings? Or one string with three replaceable parameters? Both could work /for English/. But while English has one singular form and one plural (with zero as a special kind of plural), there are other languages that have different systems. For instance there might be a special form for two. Or for anything up to five but larger than one. And we have to think of nouns, adjectives, articles and verbs.

A good internationalization system *must* take care of such differences by itself since your program code should be oblivious of what language is being used and its properties: you can hardly have a switch statement with a clause for each possible number.

Most systems I've seen cannot do this; they have obviously been designed without much knowledge of the variations that exist in real life between languages and their grammars. They were designed with (Indo-)European languages in mind (which generally have a similar structure), and can easily allow for Chinese (which doesn't do plurals at all as far as I know). But there are other language groups with quite different structures.
One system I've seen can handle language-dependent plurals, and that is gettext.

If any other system you know of or have mentioned here has such a capability, please let us know.
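
To make the requirement concrete, here is a minimal sketch of language-dependent plural selection in plain PHP, in the spirit of gettext's plural rules but without the extension. The data layout and the function name plural_message() are assumptions for illustration only. The Polish rule is the commonly published one, with three forms.

```php
<?php
// Each language carries its own plural rule: a function that maps a count
// to the index of the form to use. The calling code never sees the rule.
$languages = array(
    'eng' => array(
        'plural' => function ($n) {
            return $n == 1 ? 0 : 1;
        },
        'hits' => array(
            "%d occurrence of '%s' found",
            "%d occurrences of '%s' found",
        ),
    ),
    'pol' => array(
        'plural' => function ($n) {
            if ($n == 1) {
                return 0;
            }
            $few = $n % 10 >= 2 && $n % 10 <= 4
                && ($n % 100 < 10 || $n % 100 >= 20);
            return $few ? 1 : 2;
        },
        'hits' => array(
            "Znaleziono %d wynik dla '%s'",
            "Znaleziono %d wyniki dla '%s'",
            "Znaleziono %d wyników dla '%s'",
        ),
    ),
);

/** Pick the right plural form for $n and fill in the count and search term. */
function plural_message(array $language, $key, $n, $term)
{
    $index = call_user_func($language['plural'], $n);
    return sprintf($language[$key][$index], $n, $term);
}

foreach (array(1, 5, 22) as $count) {
    echo plural_message($languages['pol'], 'hits', $count, 'blah'), "\n";
}
```

Because the rule lives with the language data, adding a language with a dual or some other plural system means adding one entry, not touching the code that reports search hits.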

(I looked at the e107 site but I cannot find a link to 'features' nor a search facility... all I can find is a link to "language files" - not how internationalization is implemented - and I was unable to download a file.... Looking at DokuWiki, this has precisely the problems outlined here: nothing but static strings.)
Comment by GeorgePetsagourakis
2005-03-20 23:27:27
Seeing it as a problem doesn't come naturally to me, at least. Although I know just three spoken languages, I see no need for three strings; two are enough IMO. You've spent more time on this than I have, but I just wanted to say that it was a pleasure for me to translate the system (and later find out that someone else had already done it for me), as far as I got.
Comment by PivWan
2005-03-26 20:02:46
Concerning gettext, there is one major problem: lots of hosting services don't provide it. It's a heavy extension which can cause CPU overloads and other nice errors. I don't know whether PEAR provides a class that handles po/mo files without the extension. I think we should look in this direction.
Comment by GiorgosKontopoulos
2006-02-13 05:54:07
Today, as I edited this page, whenever I hit the store button the browser would send the request, but instead of returning to the "view" of the page I stayed in edit mode, with nothing to indicate that anything had actually been done. The page was nevertheless saved. I did this about 4 different times (4 different edits) and it only gave me the expected behaviour once.

It did not happen, for example, when I edited the SandBox. Strange! Has anyone else had this? I am using FF1.0.7 WinXP2, if it has anything to do with it.
Comment by DarTar
2006-02-13 06:28:27
Giorgos, I can confirm this behaviour which is likely to be related to the server problems we are having since we migrated the whole website to the new host (http://wikkawiki.org/ServerProblems). I hope that the new website (which will run on a clean install) will solve this annoying issue. Thanks for your understanding.
Comment by ForCen
2006-06-27 07:39:52
Another aspect of localization: standard WikiNames, for example CategoryCategory. These are strings that relate to filenames AND, at the same time, affect code (the "category" prefix).

What are other people's thoughts about this? Should we keep a sort of translation table for this kind of filename and suffix, or translate the filenames and, thus, modify the code?


On a more administrative side of the discussion:
When Wikka goes fully into i18n, the installer should first ask for the desired language for the basic installation and populate the MySQL tables with the appropriate content. And filenames...

OTOH, I can volunteer to help with Spanish translations. Who's in charge of the i18n admin effort?
Comment by ForCen
2006-06-27 11:20:35
I wrote filenames instead of pagenames. Sorry.
Comment by ForCen
2006-06-28 04:26:00
More about localization: date formats.

example in RecentChanges (2 changes, 1 addition):

change:

if(!defined('REVISION_DATE_FORMAT')) define('REVISION_DATE_FORMAT', 'D, d M Y');

to:

if(!defined('REVISION_DATE_LOCALE')) define('REVISION_DATE_LOCALE', 'es_ES'); *
if(!defined('REVISION_DATE_FORMAT')) define('REVISION_DATE_FORMAT', '%a, %d %b %y');


before

foreach ($pages as $i => $page)

add

setlocale(LC_ALL, REVISION_DATE_LOCALE);


change

$dateformatted = date(REVISION_DATE_FORMAT, strtotime($day));

to

$dateformatted = strftime(REVISION_DATE_FORMAT, strtotime($day));


* This is the locale for Spain on my Apache config. You must use the locale string that suits your config.


Hope this helps as a basis for changing other locale related data.
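
The changes above depend on the requested locale being installed on the server: setlocale() returns false when it is not, and strftime() is deprecated as of PHP 8.1. As a fallback that does not depend on server locales, the day and month names could live with the other translated strings. This is a different technique from the one proposed above, sketched under assumptions: the array layout, the 'spa' key, and the function name localized_date() are all hypothetical.

```php
<?php
// Hypothetical per-language date names, kept next to the translated strings.
$dateNames = array(
    'spa' => array(
        // Indexed like date('w'): 0 = Sunday ... 6 = Saturday.
        'days'   => array('dom', 'lun', 'mar', 'mié', 'jue', 'vie', 'sáb'),
        // Indexed like date('n'): 1 = January ... 12 = December.
        'months' => array(1 => 'ene', 'feb', 'mar', 'abr', 'may', 'jun',
                          'jul', 'ago', 'sep', 'oct', 'nov', 'dic'),
    ),
);

/** Format a UNIX timestamp like 'D, d M Y' with translated names (UTC). */
function localized_date(array $names, $timestamp)
{
    $day   = $names['days'][(int) gmdate('w', $timestamp)];
    $month = $names['months'][(int) gmdate('n', $timestamp)];
    return sprintf('%s, %s %s %s', $day, gmdate('d', $timestamp),
                   $month, gmdate('Y', $timestamp));
}

echo localized_date($dateNames['spa'], gmmktime(0, 0, 0, 6, 27, 2006)), "\n";
```

The sketch uses gmdate() so the result does not depend on the server's time zone; a real RecentChanges action would probably want the wiki's configured zone instead.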

ff
Comment by NilsLindenberg
2006-06-28 20:37:52
I have opened a ticket for the time settings (thank you for the reminder, ForCen). As for the rest, I'll try to answer it in the next few days.
Comment by GeorgePetsagourakis
2010-06-22 23:29:27
I am intrigued to ask how the development team came to implement i18n in such a horrible way as can be seen in the trunk (as of this point in time).

Using global keys to indicate which translation strings to use is a maintenance nightmare, goes against standards, and fills the global space with rubbish...

Use gettext, or [[http://framework.zend.com/manual/en/zend.translate.html Zend_Translate]]. These two are more or less the standard way to translate PHP web applications. Let me remind you that WordPress uses gettext for i18n, and Zend Framework is written by the people behind the PHP language.