Multilingual support on websites
Multilingual support is increasingly important in contemporary websites. In this article, I discuss the issues in supporting multilingual sites, together with the rationale for the solution I selected to implement in the new version of irqnine.
Introduction
After analyzing requirements, perhaps the most important task is to analyze the risks of implementing features. By coming up with educated guesses about the costs and results of a feature before implementing it, we can avoid a lot of problems down the road.
One of the most complex features being considered in the new version of irqnine (my personal website) is multilingual support. Although irqnine may never need multiple languages, supporting them is good practice and increases the value of the system. Issues to consider when implementing multilingual support are:
- Globalization support. Globalization is the activity of making an application localizable. Things to consider include identifying elements that are not universal, such as the labels in the link menu or the content of an article.
- Localization support. Localization is the activity of adapting an application to a locale. It breaks down into two categories: localization of interface and localization of content.
  - Localization of interface. Localization of an interface is the activity of making sure a resource is translated into another language with high fidelity. Examples of resources that require this are button labels, article text, article author names, etc.
  - Localization of content. Localization of content is the ability to publish content that is highly relevant to one locale but not to others. For example, a news site in Vietnam will publish news that may not interest anyone outside Vietnam. Strictly speaking, this is not a multilingual issue but a consequence of making an application multilingual.
- Translation fidelity. However skilled the translators, some content simply cannot be translated. Poems may be translated literally, but the context in which they are presented is almost always lost. v6 should take this into account and allow content creators to specify whether a translation is accurate and, if not, the language in which the work was originally created. Multilingual users who understand the original language may then decide to view the resource in that language.
- Search engine compatibility. Most search engine crawlers do not support cookies. Therefore, to expose search engines to all the languages available for a particular resource, each language must be reachable through its own plain-text URL, rather than being chosen from the user's personalization settings. Including session IDs in URLs conflicts with v6's requirement of readable and shareable URLs.
- Target audience. Selection of the solution depends on the target audience of the system. A site that attracts people from English-speaking countries will most likely start out with an English system. Later, the operators may realize that the site is attracting a significant number of users from China. Management then decides to support Chinese as well, but may roll out the support incrementally. For example, all user interface elements are localized first. Resources that are frequently accessed by users in China are then localized in order of popularity.
- Performance. Performance is one of the constraints of designing a system. A killer feature may not be practically useful if it responds slowly or becomes too expensive to run.
- Scalability. Scalability in multilingual support generally refers to the work required to support additional languages. This has implications for other requirements such as performance, maintainability, and extensibility.
- Ease of implementation. Since I'm a one-developer software house, it is important that I do not have to spend too many resources developing this feature.
- Others. Other issues come standard with any feature of any system: availability, security, maintainability, accessibility, extensibility, and reliability.
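The translation fidelity item above could be modelled roughly as follows. This is only a sketch in Python, not irqnine's actual design: the `Translation` record, its `is_faithful` flag, and the `choose_rendition` helper are hypothetical names introduced for illustration, and the helper applies the "view the original instead" choice automatically, whereas the text leaves that decision to the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    language: str             # language code of this rendition, e.g. "en"
    text: str
    original_language: str    # language the work was created in
    is_faithful: bool = True  # content creator's judgement of fidelity


def choose_rendition(renditions, requested, understood):
    """Pick a rendition for a user who requested one language.

    If the requested translation is flagged as unfaithful and the user
    understands the original language, prefer the original when it exists.
    Returns None if the requested language is not available at all.
    """
    by_language = {r.language: r for r in renditions}
    chosen = by_language.get(requested)
    if chosen is None:
        return None
    if not chosen.is_faithful and chosen.original_language in understood:
        return by_language.get(chosen.original_language, chosen)
    return chosen
```

A poem written in Vietnamese and translated loosely into English would carry `is_faithful=False` on the English record, so a reader who understands both languages is steered back to the original.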
Multi-Site Approach
This is probably the most widely used approach. An application domain (whether independent application domains running in the same process in the .NET sense, in separate processes, or in separate locations altogether) has to be created for each supported language. A user has to decide which language to use before actually entering the system.
Separating application domains and the data store means that the system can support localized content as well. A news site in Vietnam will be able to hold content relevant only to the residents of Vietnam. With a unique URL for each application domain as a requirement (such as vi.news.com), search engines can easily index content in different languages.
This system is simple and has good performance, but does not easily mix languages for resources. For example, consider a user who speaks German natively but also reads English, and who prefers German content when it is available. The resource he is viewing is only available in English. In this approach, his only option is to view the entire site in English, so every resource is presented in English, including user interface elements and hyperlinked titles of works that were originally written in German and are available in German.
While this is not particularly problematic, a better system would display the content in English and the rest (user interface, hyperlinked titles) in German, if available. To support this in the multi-site approach, the application must not only access the German database, but also determine whether the resource is available in German in the first place. Further, when the user is transferred to the German site, the application cannot transfer the user's session state without complicated remoting solutions [1].
Advantages:
- Simple structure, easy to implement. The system only needs to be globalization aware.
- High performance. The language to display to the user is determined before the resource is accessed.
- Allows full localization: localization of languages and content, and even of the physical location of the system.
- Naturally supports distributed deployment, but with no collaboration between sites.
- Full search engine compatibility.
Disadvantages:
- Languages are supported in separate application domains. Sharing user session state between languages is complex and incurs performance overhead.
- High maintenance. Changes (configuration, new versions of code) in one application domain must be replicated in the other application domains.
- Not scalable. Support for n languages requires n application domains.
- Users are not able to find the matching resource in another language.
- Need to keep track of different content in different languages.
Suitable for:
- Sites with static content, or sites whose translation needs are met before launch rather than on a frequent and ongoing basis. For example, government sites in Canada supporting English and French.
- Sites in which the number of supported languages is stable. For example, government sites in Canada support French and English, and that is generally sufficient.
Context-aware Multi-site Approach
Wikipedia [2] has excellent multilingual support. Each article is linked to the equivalent articles in all available languages. This is actually an extension of the multi-site style.
However, Wikipedia's localization efforts are more than just linguistic translations; they are cultural as well. Various factors determine the tone of an article, mostly due to cultural biases. For example, an English article on Imperial China may include a discussion of China proper, but such a sensitive term is not mentioned at all in the Chinese version.
Furthermore, Wikipedia's style does not have a default language. Each language site grows independently, and the sites are linked to each other (mostly) as an afterthought. This is ideal for Wikipedia's model, since the growth of information depends on the audience's perception of importance, which will occasionally create culturally incompatible concepts. The direction irqnine is taking is high-fidelity, with direct translations sought after.
Advantages:
- Users are able to find different language versions of a resource.
- Suited for sites with independent growth.
Disadvantages:
- Not every resource is translated. Relationships between resources in multiple languages are ad hoc.
Suitable for:
- Wikipedia! Sites where localizing content is more than simply translating between resources.
Contextual Multi-site Approach
This approach solves the loose coupling problem above by enforcing strong relationships between the original resource and translations of the same resource. Each resource has a unique identifier across all localized data stores, enabling users and content creators to find exactly matching translations.
Advantages:
- Translations of the same resource are strongly linked.
Disadvantages:
- Implementation is probably complex.
- Performance overhead is probably high.
- Making every resource available in all languages means that either all resources must be localized or there cannot be localized content.
- The guarantee that an application domain displays the localized language, and only that language, means that the site operator must have enough manpower to localize each new resource into all available languages.
Suitable for:
- Sites where the content is highly moderated.
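The idea of one identifier shared across all localized data stores can be illustrated with a small sketch. It assumes, purely for illustration, that each language's data store can be treated as a mapping from resource ID to content; the `stores` dictionary, the ID `"article-42"`, and the `find_translations` helper are hypothetical, and in a real multi-site deployment each store would be a separate database.

```python
# Hypothetical per-language data stores, each keyed by the same resource ID.
stores = {
    "en": {"article-42": "Hello"},
    "de": {"article-42": "Hallo"},
    "vi": {},
}


def find_translations(resource_id, stores):
    """Return {language: content} for every store that holds resource_id."""
    return {
        language: store[resource_id]
        for language, store in stores.items()
        if resource_id in store
    }
```

Because the identifier is the same everywhere, finding the matching translation is a direct lookup in each store rather than a search by title or content.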
Mirror Approach
Instead of separating each language into different application domains, all language sites are combined into one application domain. This instantly solves the session state sharing problem. Users' language preferences are stored on the server, allowing the system to determine the language to present to each user.
If the preferred language is not available, the system informs the user that the language is not available, and possibly presents alternative languages.
The need to check all the available languages against the preferred language incurs a performance hit of O(n), where n is the number of available languages, against O(1) for the multi-site approach. Data is stored in a relational database, with translations kept in a separate normalized table in which each language occupies one row. Thus, the table holding all languages of a keyword will have n rows, where n is the number of languages available. However, space use is roughly equivalent to that of multi-site solutions, which require n data stores for all available languages.
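The storage and lookup described above can be sketched with an in-memory SQLite database. This is an illustration under assumptions, not irqnine's schema: the `translation` table, its columns, and the `open_store` and `fetch` helpers are hypothetical names.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one row per (resource, language)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translation ("
        " resource_id TEXT NOT NULL,"
        " language TEXT NOT NULL,"
        " content TEXT NOT NULL,"
        " PRIMARY KEY (resource_id, language))"
    )
    conn.executemany(
        "INSERT INTO translation VALUES (?, ?, ?)",
        [("greeting", "en", "Hello"), ("greeting", "de", "Hallo")],
    )
    return conn


def fetch(conn, resource_id, preferred):
    """Return (content, alternatives) for a resource.

    content is None when the preferred language is unavailable; in that
    case alternatives lists the languages that are available instead.
    """
    rows = conn.execute(
        "SELECT language, content FROM translation"
        " WHERE resource_id = ? ORDER BY language",
        (resource_id,),
    ).fetchall()
    available = dict(rows)  # one entry per language row, n in total
    if preferred in available:
        return available[preferred], []
    return None, list(available)
```

Fetching every language row for the resource mirrors the O(n) check in the text; when the preferred language is missing, the same rows supply the list of alternatives to offer the user.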
The site operator must also identify elements which need to be translated, such as the link menu. An English message reporting an error in creating the site's link menu is of little use to any user. To avoid showing such an error to the user, the system can automate the tasks required to add support for a language by copying the resources from a stable language set and notifying the site operator which resources require localization.
Advantages:
- One application domain supports all languages. Avoids the problems of multi-site approaches.
- Allows localized content.
- Strong relationships between original content and translations.
- Allows incremental updates.
Disadvantages:
- Centralized data store, which is not scalable.
- The performance overhead of selecting a language for the user and determining whether that language is available may be undesirable.
- Degraded user experience; the user is presented with an exception, with no automatic redirection to a more friendly resource.
- Incremental updates require identification of the resources that must be translated.
Defaulting Mirror
By specifying a default language for a particular resource, the user is presented with the resource in the default language if it is not available in the preferred language.
This works assuming users who visit the site understand the default language. Websites that started out in one language and expanded to support other languages are prime candidates for this approach. Support for the other languages can be partial.
Like the contextual multi-site approach, the guarantee that the base language is available implies that either content creators must be well-versed in the base language or there are translators to translate their content into the base language. Further, localized content (for example, an advertisement for a car sale in Bangalore) must be translated into the base language, even though no one outside India will be interested.
Advantages:
- Allows the automatic "trickle down" model. A base language is available for all resources, with other languages supported incrementally.
- Improved user experience; the user is presented with relevant content regardless.
Disadvantages:
- Assumes the user understands the default language.
- No localized content, or content becomes universal.
Adaptive Mirror
The defaulting mirror can be expanded to support multilingual users. For example, a user speaks German (DE) natively and also knows French (FR), but knows very little English (EN). He sets his language preference as {DE, FR, EN}. The site's default language is EN and the resource is available in EN, but the content is available in DE as well. The system presents DE to the user in this case, improving the user experience.
If the same user requests a resource with availability {VI, EN}, the system will present the resource in EN since it is one of the languages in his preferences, knowing that he will be able to read the content in EN but not at all in VI.
This does degrade performance, as each resource call must go through a decision algorithm of worst-case complexity O(mn), where m is the number of preferred languages and n the number of available languages.
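The decision algorithm can be sketched as a nested scan, which is where the O(mn) worst case comes from. The function below is a minimal illustration rather than irqnine's code; `select_adaptive` and its parameter names are hypothetical, and it assumes the default language is available for every resource, as in the defaulting mirror.

```python
def select_adaptive(preferences, available, default):
    """Return the first preferred language in which the resource exists.

    preferences: ordered list of language codes, most preferred first (m items).
    available:   list of language codes the resource exists in (n items).
    default:     site default, assumed to be available for every resource.
    """
    for language in preferences:    # up to m iterations
        if language in available:   # list membership: up to n comparisons
            return language
    return default


user = ["DE", "FR", "EN"]
first = select_adaptive(user, ["EN", "DE"], "EN")   # DE is preferred and available
second = select_adaptive(user, ["VI", "EN"], "EN")  # only EN matches a preference
```

Holding the available languages in a hash set instead of a list would make each membership test constant time on average, so the scan would cost roughly O(m) at the price of building the set.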
Advantages:
- The resource is presented to the user in the most preferred language available.
- The default language is still presented to the user if none of the preferred languages are available.
Disadvantages:
- Greater performance overhead of O(mn). However, when weighed against feature desirability and overall system cost, this may be negligible.
Distributed Mirror
Some site operators may further require mirrors to be placed at an ideal location, such as locating the server and data store in Japan to support the Japanese version. This is the most complex approach, requiring strong knowledge of distributed computing. Since irqnine will not be doing that anytime soon (if ever), this approach will not be considered.
Hybrid
To summarize, multi-site solutions are simple to build and deploy, but it is difficult to publish content on them, and mixing languages can become overly complex. The mirror approach is harder to build, but improves user experience by mixing languages according to the user's preference. The ideal solution is to allow the system to adapt to the environment it is deployed into, balancing the advantages and disadvantages of both.
The requirement for irqnine is to support language mixing without incurring too much performance overhead. Further, for sites that do not require support for multiple languages, the system should be able to perform as well as one that does not support multiple languages.
The most straightforward solution to this is to find the approach that supports the requirement minimally, and optimize for performance by enabling other features. The language determination algorithm can be reduced to O(n) by removing the adaptive feature:
But in real terms, the number of users who require more than two preferred languages (one of them being the default) is very small. Support for the adaptive mirror is desirable in this case because we can maximize user flexibility without incurring a huge performance hit.
For sites that do not require multilingual support, the feature can be disabled altogether:
By designing language selection algorithms as classes and using the Strategy Design Pattern, site operators can disable multilingual support without recompiling any code or incurring much performance overhead.
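A rough sketch of how the Strategy pattern could cover the configurations discussed in this section follows. It is written in Python for brevity, and every name in it (`LanguageSelection`, `Disabled`, `Defaulting`, `Adaptive`, `build_strategy`, and the `mode` and `default` configuration keys) is a hypothetical choice for illustration, not irqnine's actual design.

```python
from abc import ABC, abstractmethod


class LanguageSelection(ABC):
    """Strategy interface: decide which language to serve for a resource."""

    @abstractmethod
    def select(self, preferences, available):
        """Return a language code, or None if nothing suitable exists."""


class Disabled(LanguageSelection):
    """Single-language site: always serve the one configured language."""

    def __init__(self, default):
        self.default = default

    def select(self, preferences, available):
        return self.default


class Defaulting(LanguageSelection):
    """Serve the top preference if available, otherwise the default."""

    def __init__(self, default):
        self.default = default

    def select(self, preferences, available):
        if preferences and preferences[0] in available:
            return preferences[0]
        return self.default


class Adaptive(LanguageSelection):
    """Walk the whole preference list; fall back to default, which may be None."""

    def __init__(self, default=None):
        self.default = default

    def select(self, preferences, available):
        for language in preferences:
            if language in available:
                return language
        return self.default


STRATEGIES = {"disabled": Disabled, "defaulting": Defaulting, "adaptive": Adaptive}


def build_strategy(config):
    """Build a strategy from a plain configuration mapping."""
    return STRATEGIES[config["mode"]](config.get("default"))
```

Because the strategy is chosen from configuration data, an operator could switch modes by editing a setting rather than changing code. Leaving out `default` in adaptive mode gives a non-defaulting configuration: `select` returns `None`, and the caller can then show a language selection screen.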
If localization of content is important and the cost of maintaining two application domains is acceptable, the operator can create as many sites with disabled multilingual support as needed:
This unfortunately suffers from problems inherent in multi-site solutions, namely non-shareable user session state, no linking support, etc.
Finally, for sites where multilingual support and ad hoc relations between resources are required, the site operator can disable defaulting.
Switching a site from a defaulting to a non-defaulting configuration may hurt the user experience, as users are presented with a language selection screen instead of the previous behaviour, which was to present the default language. One way to solve this problem is to add the default language to the bottom of every user's language preference list, but users who do not understand the default language would not be able to continue from there. Therefore, it is better to let users opt in rather than opt out, entering the desired language into the list themselves if necessary.
Final Thoughts
Globalization and localization are complex issues. The .NET Framework automates many of the tasks in a globalized application, such as the resource repository and automatic selection of the user's language. However, the backend support for localized content still falls on the developer's shoulders, and as such requires a very thorough understanding of the subject before choosing a solution and implementing it.
Footnotes
[1] ASP.NET supports out-of-process session state through State Server and the SQL Server state service. This is designed to support Web farm scenarios, not multilingual sites, but it is possible to use it for this purpose.
[2] Wikipedia may ultimately be using the solutions discussed in "Mirror Approach"; it is used here just as an example. However, a tell-tale sign is that signing in to en.wikipedia.org and then going to zh.wikipedia.org does not automatically sign the user in. This shows that user session state is not shared across the two sites.