
Website Localizations and Multi-language Support


by Void | in PHP | posted October 25, 2006


Techniques for using custom language "packs" and language markers in the database to serve translated versions of a website automatically.

Having your website content in many languages may sound complex at first. However, PHP provides mechanisms that can automate much of this task and make it straightforward for users, programmers and translators. This tutorial covers some techniques for approaching the problem.

Since this tutorial is for PHP, it is assumed that you use PHP to manage the content of your website. In such an environment, managing languages involves two kinds of content:

  • Database-driven content
  • PHP-driven content


It is important to understand the difference between the two types of content listed above. Database-driven content is all the content of a website that is pulled from a database, for example news items, blog entries, articles, and even "static" pages that were created through some interface of the website. Database-driven content need not be created through a website interface. It can also be "hardcoded" in the database during development, for example page navigation or layout elements.

PHP-driven content is all the content presented on the website that is coded inside PHP files that manage the website. This content comes from PHP variables or echoed strings that were created during development and cannot be (with some exceptions) changed by any website interface. It is "fixed" inside PHP files. The exceptions are self-modifying PHP files, but those are rare and usually limited only to initial installation setup or similar.

We differentiate these two types of content because they require different mechanisms to handle translations.


Managing Multiple Languages

Before we dive into specific techniques for managing translations, we must first explain how to track and manage multiple languages. Managing multiple languages means maintaining a list of languages for which translations exist, and managing the "current language", that is, preparing the environment so that the right translations are presented.

For the purpose of this tutorial we will present an open-ended technique without any strict list of available languages. Instead, it uses language "markers" to pull out specific translations, if they exist. Effectively, the only "list" of languages is whatever selector you show to users, for example a row of little flags somewhere on the page which they can click to select the current language.

The language "marker" that defines current language can come in several "flavors":

  • As an additional variable in the URL, for instance http://example.com/index.php?lang=en
  • As part of the session data
  • Inside a dedicated language cookie


Using a cookie to hold the current language marker means a returning visitor sees the last language they selected. However, search engines that ignore cookies when indexing will only ever see the default language. The same applies to tracking the language in a session, with the added drawback that the visitor's language is forgotten when the session expires. Tracking the language in the URL may look "ugly", but it makes it possible to bookmark a specific page in a specific language and lets search engines index each language separately, so it is probably the preferred approach.
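
The URL and cookie approaches can also be combined, with the URL taking priority and the cookie acting as a fallback for returning visitors. The following is only a sketch under stated assumptions: resolve_language() is a hypothetical helper name, the lookup order is one possible choice, and every candidate is accepted only if it is exactly two lowercase letters.

```php
<?php

// Hypothetical helper (not part of PHP): picks the language marker from the
// URL first, then from a cookie, and otherwise returns the default.
function resolve_language(array $get, array $cookie, $default = 'en')
{
    foreach (array($get, $cookie) as $source) {
        if (isset($source['lang'])
            && is_string($source['lang'])
            // The D modifier stops "$" from matching before a trailing newline
            && preg_match('/^[a-z]{2}$/D', $source['lang'])) {
            return $source['lang'];
        }
    }
    return $default;
}

$lang = resolve_language($_GET, $_COOKIE);
```

To remember the choice, the resolved marker could then be stored with setcookie() before any output is sent.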

A language marker is a piece of information that identifies the "current language". When tracking through the URL, the marker can be a two-letter language code as in the example above. Two-letter codes look like a natural standard because country TLDs are also two letters long, but language codes are standardized separately (ISO 639-1), and a language code does not always match a country code: Japanese is ja while Japan's TLD is .jp. In any case, the marker can be anything the programmer chooses.

The most important thing, then, is to capture this marker at the beginning of each script of your website. When tracking it through the URL, this is a matter of reading $_GET['lang'] into $lang. If there is no language marker in the URL, a default can be set, such as 'en' for English or whatever the default language for your site is.

This marker then becomes the ID of the language, both for PHP-driven and database-driven translations, as we will see next in this tutorial. Because the marker is the ID, it is not necessary to keep a list of available languages, apart from that row of flags (or whatever other way you choose to tell your users that more languages are available). PHP-driven translation will simply load the "$lang.lang.ini" language pack, for instance en.lang.ini for English, es.lang.ini for Spanish, fr.lang.ini for French, and so on. Database-driven translations can use this marker as part of the key of the translation of a particular piece of content, as we will see later.

Before we continue, though, it is necessary to mention security. The PHP-driven translation technique used in this tutorial inserts the marker directly into the language pack filename, which can become a threat to your filesystem if the marker is not checked. To secure this, pass the marker through preg_match() and ensure it contains only two letters a-z. Here's the code for fetching the marker securely as described in this chapter:

PHP:


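<?php

// Set default to English
$lang = 'en';

// If the language marker is set (and is a plain string, not an array)
if (isset($_GET['lang']) && is_string($_GET['lang'])) {
  // Accept it only if it is exactly two letters, a-z
  if (preg_match('/^[a-z]{2}$/i', $_GET['lang'])) {
    // Get it from the 'lang' variable in the URL,
    // lowercased so that it matches the language pack filenames
    $lang = strtolower($_GET['lang']);
  }
}

?>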




PHP-driven Content Translations

As previously mentioned, PHP-driven content is all the website content that is coded inside PHP variables and presented on the website, for example fixed page navigation elements ("top of the page", "previous", "next", "more", "Login", "Username", "Password", ...).

Such content sits in variables, function calls or echoed strings of your PHP content management system, and in order to provide translations it is necessary to "extract" it into a language pack. Such extraction is relatively simple, and the technique presented uses arrays to contain translated words or phrases.

For example, consider the Search link. Instead of "Search", output the array element $language['search']:

PHP:


<?php
// Inner double quotes must be escaped inside a double-quoted string
echo "<a href=\"search.php\" title=\"Search through website\">{$language['search']}</a>";
?>



Since your page will load a specific language pack depending on the requested language marker, the array element $language['search'] will then contain the language-specific word. For example, en.language.php will contain $language['search']= 'Search'; and fr.language.php will contain, for example, $language['search']= 'Recherche'; or whatever French phrase is more appropriate for searching a website.

The above example uses PHP files filled with arrays for translatable page elements. However, a better technique is to use ini files parsed with parse_ini_file(). The reason is UTF-8 character encoding. Many editors save UTF-8 text files with a byte order mark (BOM), three bytes at the very beginning of the file. In a PHP file these bytes sit before the opening <?php tag and are sent to the browser as output, which can break later header() or session calls, while parse_ini_file() skips over them and parses the ini file properly.

This brings us to character encoding. UTF-8 is not required for multi-language websites, though. The proper character set for each language can be declared inside the ini file (or the PHP file if PHP arrays are used):
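
If a pack is ever read by some route other than parse_ini_file(), for example loaded as a string first, the leading BOM bytes can be stripped explicitly before parsing. This is a minimal sketch: strip_utf8_bom() is a hypothetical helper name, and the inline $raw string merely stands in for the contents of a language pack.

```php
<?php

// Hypothetical helper: removes a UTF-8 byte order mark from the start of a string.
function strip_utf8_bom($text)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        // Cast keeps the result a string even when nothing follows the BOM
        return (string) substr($text, 3);
    }
    return $text;
}

// Stand-in for file contents that were saved with a BOM
$raw = "\xEF\xBB\xBF" . "search = \"Recherche\"\n";

$language = parse_ini_string(strip_utf8_bom($raw));
```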

en.lang.ini
language= "English"
encoding= "ISO-8859-1"
search= "Search"
...

fr.lang.ini
language= "Français"
encoding= "ISO-8859-15"
search= "Recherche"
...


Assuming you parse the ini file into the $language array, you can use $language['encoding'] and output it in the page header meta tag that describes the content type:

PHP:


<?php 
echo "<meta http-equiv=\"Content-Type\" content=\"text/html; charset={$language['encoding']}\" />";
?>



However, UTF-8 has benefits. It is a single encoding that covers the characters of practically every language. If you ever need more than one language on the same page, UTF-8 ensures that all their characters are displayed properly, which a single-language encoding such as ISO-8859-1 cannot guarantee. In addition, UTF-8 and Unicode are becoming the standard for encoding on the internet, and keeping a website in line with current standards can only be a benefit.

Since the language pack should be loaded at the beginning of your CMS scripts, we extend the first PHP example given above:

PHP:


<?php

// Set default to English
$lang = 'en';

// If the language marker is set (and is a plain string, not an array)
if (isset($_GET['lang']) && is_string($_GET['lang'])) {
  // Accept it only if it is exactly two letters, a-z
  if (preg_match('/^[a-z]{2}$/i', $_GET['lang'])) {
    // Get it from the 'lang' variable in the URL,
    // lowercased so that it matches the language pack filenames
    $lang = strtolower($_GET['lang']);
  }
}

// Check if the required language pack exists and load it, or load the default pack
if (file_exists("/path/to/language/packs/$lang.lang.ini"))
  $language = parse_ini_file("/path/to/language/packs/$lang.lang.ini");
else
  $language = parse_ini_file("/path/to/language/packs/en.lang.ini");

?>
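
One possible refinement, not part of the technique as described above, is to load the default pack first and lay the requested pack over it, so that a phrase missing from a translation falls back to the default language instead of being undefined. The sketch below assumes a hypothetical helper merge_language_packs() and uses parse_ini_string() with inline text only to stay self-contained; on a real site the two arrays would come from parse_ini_file().

```php
<?php

// Hypothetical helper: entries of the selected pack override the default pack,
// and keys missing from the selected pack keep their default value.
function merge_language_packs(array $default, array $selected)
{
    // For string keys, array_merge() lets later arrays win
    return array_merge($default, $selected);
}

// Inline stand-ins for en.lang.ini and an incomplete fr.lang.ini
$en = parse_ini_string("search = \"Search\"\nlogin = \"Login\"\n");
$fr = parse_ini_string("search = \"Recherche\"\n");

$language = merge_language_packs($en, $fr);
// $language['search'] comes from the French pack, $language['login'] from the default
```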



Naturally, the file names and variable names used in the above examples are purely arbitrary. Language packs are also not limited to translated website elements. You can put all kinds of language-dependent or country-dependent values in your ini files, for instance currency rates, affiliate or distributor addresses, or other information (if the website is an online shop or a product presentation site).


Database-driven Content Translations

Database-driven content translations are translations of content that resides in a database, from which it is pulled to be presented on the page. The exact implementation of this kind of language support depends largely on the structure of the database and the data presented. What all such implementations have in common is that text is separated from the other content descriptors.

For instance, let's take news. A news item can be described by a title, a date, an author, perhaps a category, and the news body. Usually all this data would go into a single news table, but here we need to separate the text from the other fields. In this case we would use two tables, one with news meta-info (primary key, date and other non-textual information), and one with the textual content such as title and body. The category, in turn, is another "atomic" piece of information that resides in two tables of its own: one for meta-data such as the primary key, and one for the text data where the actual category titles are stored.

table_news_meta
  • id - primary key
  • author_id - key in authors table
  • category_id - key in categories table
  • time_stamp - int (for unix timestamp), or DATETIME type


table_news_text
  • id - key in news_meta
  • lang - language marker
  • title - a varchar with translated title
  • body - a text field with translated news body


When a new news item is created, the author id, category id and time stamp are first inserted into table_news_meta. This gives us the primary key of the news item, which is its unique identifier. From there we can supply the news in different languages. Each language-specific version gets its own row in table_news_text, with id set to the id of the item in table_news_meta, lang set to the language marker ('en' for English, 'fr' for French, etc.), and the translated title and news body stored in their respective fields. You can have one textarea per language in a single form, or you can have a separate form for open-ended language support where you supply the language marker manually.

An example of a MySQL query for fetching a language-dependent news item. Let's say that the URL is news.php?id=123&lang=de
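
The insert sequence described above can be sketched as follows. This is only an illustration under several assumptions: it uses PDO with an in-memory SQLite database purely so that the example is self-contained (the tutorial itself targets MySQL), create_news_item() is a hypothetical helper name, and the composite key on (id, lang) is one possible way to keep a single row per language.

```php
<?php

// Assumption: in-memory SQLite via PDO, used here only to keep the sketch runnable
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE TABLE table_news_meta (
    id INTEGER PRIMARY KEY,
    author_id INTEGER,
    category_id INTEGER,
    time_stamp INTEGER
)');
$db->exec('CREATE TABLE table_news_text (
    id INTEGER,
    lang TEXT,
    title TEXT,
    body TEXT,
    PRIMARY KEY (id, lang)
)');

// Hypothetical helper: inserts the meta row first, then one text row per language
function create_news_item(PDO $db, $authorId, $categoryId, array $translations)
{
    $meta = $db->prepare(
        'INSERT INTO table_news_meta (author_id, category_id, time_stamp) VALUES (?, ?, ?)'
    );
    $meta->execute(array($authorId, $categoryId, time()));

    // The primary key of the meta row identifies the news item
    $id = (int) $db->lastInsertId();

    $text = $db->prepare(
        'INSERT INTO table_news_text (id, lang, title, body) VALUES (?, ?, ?, ?)'
    );
    foreach ($translations as $lang => $translation) {
        $text->execute(array($id, $lang, $translation['title'], $translation['body']));
    }

    return $id;
}

$newsId = create_news_item($db, 1, 2, array(
    'en' => array('title' => 'Hello', 'body' => 'First post'),
    'fr' => array('title' => 'Bonjour', 'body' => 'Premier message'),
));
```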

PHP:


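<?php

// $id contains the ID of the news item, $lang is the validated language marker,
// both taken from the URL
$id = (int) $id; // force an integer before it goes into the query

// Note: the mysql_* functions were removed in PHP 7; use mysqli or PDO there
$res = mysql_query("SELECT t1.*, t2.*
        FROM table_news_meta AS t1
        LEFT JOIN table_news_text AS t2 ON t2.id=t1.id
        WHERE t1.id=$id AND t2.lang='$lang'
        LIMIT 1"
);
?>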
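
On current PHP versions the same lookup is usually written with prepared statements. The sketch below is an assumption-laden illustration rather than the original method: it uses PDO with an in-memory SQLite database so that it runs on its own, fetch_news() is a hypothetical helper name, and the fallback to a default language is an added convenience.

```php
<?php

// Hypothetical helper: tries the requested language first, then the default one.
// Returns the row as an associative array, or null if neither translation exists.
function fetch_news(PDO $db, $id, $lang, $default = 'en')
{
    $stmt = $db->prepare(
        'SELECT m.id, m.time_stamp, t.lang, t.title, t.body
         FROM table_news_meta AS m
         JOIN table_news_text AS t ON t.id = m.id
         WHERE m.id = ? AND t.lang = ?
         LIMIT 1'
    );
    foreach (array($lang, $default) as $candidate) {
        $stmt->execute(array((int) $id, $candidate));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        $stmt->closeCursor();
        if ($row !== false) {
            return $row;
        }
    }
    return null;
}

// Assumption: in-memory SQLite with made-up sample rows, only for demonstration
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE table_news_meta (id INTEGER PRIMARY KEY, author_id INTEGER, category_id INTEGER, time_stamp INTEGER)');
$db->exec('CREATE TABLE table_news_text (id INTEGER, lang TEXT, title TEXT, body TEXT)');
$db->exec("INSERT INTO table_news_meta VALUES (123, 1, 1, 1161734400)");
$db->exec("INSERT INTO table_news_text VALUES (123, 'en', 'Hello', 'First post')");
$db->exec("INSERT INTO table_news_text VALUES (123, 'de', 'Hallo', 'Erster Beitrag')");

$german   = fetch_news($db, 123, 'de'); // German translation exists
$fallback = fetch_news($db, 123, 'fr'); // no French row, falls back to English
```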



Tracking the Language Through URLs

As mentioned in the introduction, tracking the language marker through the URL, as in http://www.example.com/index.php?lang=en, is possibly the best choice. One question remains, however: how to carry this language marker through all the URLs of your content management system.

One approach to this problem is adding &lang={$lang} to each and every internal URL of your content management system:

PHP:


<?php 
echo "<a href=\"index.php?action=someaction&lang={$lang}\">Some CMS function</a>";
?>
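
Rather than typing the marker into every link by hand, the links could be built by a small helper. This is a sketch only: lang_url() is a hypothetical function name, and it assumes $lang has already been validated.

```php
<?php

// Hypothetical helper: builds an internal URL that always carries the language marker.
function lang_url($script, array $params, $lang)
{
    $params['lang'] = $lang;
    // http_build_query() URL-encodes the values; "&" is passed explicitly so the
    // result does not depend on the arg_separator.output ini setting
    $url = $script . '?' . http_build_query($params, '', '&');
    // Escape for use inside an HTML attribute ("&" becomes "&amp;")
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

$lang = 'en';
echo '<a href="' . lang_url('index.php', array('action' => 'someaction'), $lang)
    . '">Some CMS function</a>';
```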




Open-ended vs. Strict Language Tracking

In the examples given above we've used so-called open-ended language support. This means that we do not limit or track the languages that are available. The marker in the lang variable in the URL is used as an ID in the database to fetch the appropriate translation, and to parse the appropriate language pack for PHP-driven translations. The latter is done simply by inserting the marker as part of the filename. If the file (language pack) exists it is parsed, and if not, the default language pack is parsed. Similar default behaviour can be implemented for database-driven translations: if an entry exists for the given marker it is loaded, and if not, the default is loaded. With this approach there are no restrictions, and languages can be added simply by supplying the appropriate marker in the database and uploading the appropriate pack to the CMS file system.

One drawback of this approach is that it may quickly become unclear which languages are supported and which are not, because nothing is strictly enforced. However, any confusion can easily be avoided by having a script that scans all available posts and language packs and presents a report clearly labeling any discrepancies, for instance a language in the database that is not present in the language packs, or a mismatch in the languages represented across, for example, news posts.

Open-ended techniques allow new languages to be added without any changes to the CMS code or database structure. However, if strict language tracking is desired, a table of allowed languages can easily be implemented, and for each new piece of database-driven content all language variants can be created automatically, awaiting proper translation.
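
The language pack half of such a report could look roughly like this. It is a sketch with assumed names: missing_pack_keys() is a hypothetical helper, and the packs are passed in as already-parsed arrays (for instance the results of parse_ini_file()) keyed by language marker.

```php
<?php

// Hypothetical helper: for each language, lists the phrase keys that exist in
// at least one other pack but are missing from this one.
function missing_pack_keys(array $packs)
{
    // Collect every key used in any pack
    $allKeys = array();
    foreach ($packs as $pack) {
        $allKeys = array_merge($allKeys, array_keys($pack));
    }
    $allKeys = array_unique($allKeys);

    $report = array();
    foreach ($packs as $lang => $pack) {
        $missing = array_values(array_diff($allKeys, array_keys($pack)));
        if (count($missing) > 0) {
            $report[$lang] = $missing;
        }
    }
    return $report;
}

$report = missing_pack_keys(array(
    'en' => array('search' => 'Search', 'login' => 'Login'),
    'fr' => array('search' => 'Recherche'),
));
// $report lists 'login' as missing from the 'fr' pack
```

A similar comparison could be run over the lang values stored in the database tables.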
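
A strict check can be as small as a whitelist lookup. In this sketch the hard-coded array stands in for the table of allowed languages, and pick_allowed_language() is a hypothetical helper name.

```php
<?php

// Assumption: this array stands in for a database table of allowed languages
$allowedLanguages = array('en', 'fr', 'de');

// Hypothetical helper: returns the requested marker only if it is on the list
function pick_allowed_language($requested, array $allowed, $default = 'en')
{
    if (is_string($requested) && in_array($requested, $allowed, true)) {
        return $requested;
    }
    return $default;
}

$lang = pick_allowed_language(
    isset($_GET['lang']) ? $_GET['lang'] : null,
    $allowedLanguages
);
```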

 
