PHP Classes

Understanding Language Detection - PHP Language Detection Library package blog



Package: PHP Language Detection Library

There are several ways to detect which language a piece of text is written in, and a web service makes the task even simpler. But how do these services actually do it? This article shows how to use one such web service and then looks at the techniques behind it.





Introduction

Using the PHP Language Detection Library

Methods to Detect Languages

Conclusion


Introduction

The PHP Language Detection Library uses a web service provided by languageLayer. Send it some text and it returns a list of candidate languages, along with a main match that is the most likely language of the text.

In addition to using this simple package, I will also take a brief look at how the detection process works.

Using the PHP Language Detection Library

Using the package to detect the language of any text is very simple. Provide the text and read the result returned from the languageLayer API.

You will first need to set up your own subscription account at https://languagelayer.com/product where you will receive your unique access key. You add this key in the langlayer.class.php file, replacing YOUR_API_KEY_HERE with your unique access key.

private $apiKey = 'YOUR_API_KEY_HERE';

You can now instantiate the class and send any text to the API to be checked.

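// Load the package class (the file where you set your access key)
require_once 'langlayer.class.php';
$lang = new languageLayer();
$text = 'Ich bin mir sicher, dass dies die Sprache Deutsch';
// Send the text to the API; the result is stored in $lang->response
$lang->getResponse($text);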

The response is stored in the $lang->response object, which we can inspect by dumping it to the screen:

var_dump($lang->response);

Many languages share the same alphabet, Latin for example, so the results can contain more than one candidate language. Each detected language contains:

language_code = the two-letter language code, such as de for German

language_name = the full English name of the language

probability = a weighted numerical score; the higher the number, the more likely the text is in this specific language

percentage = the API's confidence, expressed as a percentage between 0% and 100%

reliable_result = true or false, depending on whether the API is confident in the main match

The more text you provide, the greater the probability that the language will be identified accurately.
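To make the fields above concrete, here is a minimal sketch of picking the main match out of a list of results. The JSON string is a made-up sample with invented values, and its shape (a results array of entries carrying the fields listed above) is an assumption about what the decoded response looks like. The bestMatch() helper is hypothetical and not part of the package.

```php
<?php
// Made-up sample for illustration only; the values are invented and the
// structure is an assumption, not a captured API response.
$sampleJson = '{"results":[
  {"language_code":"de","language_name":"German","probability":25.1,"percentage":100,"reliable_result":true},
  {"language_code":"nl","language_name":"Dutch","probability":3.2,"percentage":12.7,"reliable_result":false}
]}';

// Hypothetical helper: return the entry with the highest probability,
// or null when the list is empty.
function bestMatch(array $results): ?array
{
    $best = null;
    foreach ($results as $result) {
        if ($best === null || $result['probability'] > $best['probability']) {
            $best = $result;
        }
    }
    return $best;
}

$decoded = json_decode($sampleJson, true);
$best = bestMatch($decoded['results']);

printf(
    "%s (%s), reliable: %s\n",
    $best['language_name'],
    $best['language_code'],
    $best['reliable_result'] ? 'yes' : 'no'
);
```

With real data you would pass in whatever list of results the package exposes instead of the sample string, and check reliable_result before trusting the main match.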

Methods to Detect Languages

There are several ways to evaluate text to determine the language it is written in. The simplest is to look at the character set: the Latin and Cyrillic scripts, for example, contain different characters. Using this method, we can differentiate between English and Russian. However, it will not be easy to tell the difference between English and Spanish, which are both written in the Latin script.
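The character set check can be sketched in a few lines of PHP. This is only an illustration of the idea, not how languageLayer implements it. The dominantScript() function and the short list of scripts are hypothetical, and the sketch relies on PCRE's Unicode script properties, which require the /u modifier.

```php
<?php
// Hypothetical helper: count the characters belonging to each script and
// return the name of the script that occurs most often.
function dominantScript(string $text): string
{
    // A small, arbitrary selection of scripts for illustration.
    $scripts = ['Latin', 'Cyrillic', 'Greek', 'Arabic', 'Han'];

    $counts = [];
    foreach ($scripts as $script) {
        // \p{Latin}, \p{Cyrillic}, ... match one character of that script.
        $counts[$script] = (int) preg_match_all('/\p{' . $script . '}/u', $text);
    }

    arsort($counts);                 // highest count first
    $top = array_key_first($counts); // requires PHP 7.3 or later

    return $counts[$top] > 0 ? $top : 'Unknown';
}

echo dominantScript('Hello, world'), "\n"; // Latin
echo dominantScript('Привет, мир'), "\n";  // Cyrillic
```

As the text notes, this separates English from Russian, but English and Spanish both come back as Latin, so a finer-grained method is needed for them.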

Another method is to look for specific character combinations known as digraphs and trigraphs. A digraph is 2 characters side by side and a trigraph is a set of 3 sequential characters.

Certain character groups appear more often in one language than in another, which allows an algorithm to estimate the likelihood that the text belongs to a specific language. This method provides a better way to distinguish languages that share the same character set.
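A rough sketch of the trigraph approach is shown below. The two profiles are tiny, hand-picked lists used purely for illustration; a real detector would derive much larger, weighted profiles from a large body of text. The function names are hypothetical, and the code assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: count every sequence of 3 consecutive characters.
function trigramCounts(string $text): array
{
    $text = mb_strtolower($text, 'UTF-8');
    // Replace runs of non-letters with a single space so that word
    // boundaries become part of the trigraphs.
    $text = trim(preg_replace('/[^\p{L}]+/u', ' ', $text));

    $counts = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i + 3 <= $length; $i++) {
        $gram = mb_substr($text, $i, 3, 'UTF-8');
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    return $counts;
}

// Hypothetical helper: add up how often the profile's trigraphs occur.
function scoreAgainstProfile(array $counts, array $profile): int
{
    $score = 0;
    foreach ($profile as $gram) {
        $score += $counts[$gram] ?? 0;
    }
    return $score;
}

// Illustrative profiles only, not real frequency data.
$profiles = [
    'English' => ['the', 'and', 'ing', ' th', 'he '],
    'Spanish' => ['que', ' de', 'de ', 'ión', 'los'],
];

$counts = trigramCounts('The cat and the dog are running in the garden');
foreach ($profiles as $language => $profile) {
    echo $language, ': ', scoreAgainstProfile($counts, $profile), "\n";
}
```

The language whose profile collects the highest score is the most likely candidate. The same loop works for digraphs by changing the window size from 3 to 2.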

The more languages that are supported, the more likely it is that some of them will have similar digraph and trigraph sets. To further separate these similar languages, we need to look for specific words that are more common in one language than in the others. As these words are found, our confidence grows that we have identified the correct language.
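The word-based step could look something like the following sketch, which tries to separate Spanish from Portuguese, two languages with many character groups in common. The word lists are short, hand-picked examples and the countMarkerWords() function is hypothetical; as before, the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: count how many words of the text appear in a
// list of marker words for one language.
function countMarkerWords(string $text, array $markers): int
{
    // Split on anything that is not a letter and drop empty pieces.
    $words = preg_split(
        '/[^\p{L}]+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // array_intersect() keeps every occurrence from $words that is
    // present in $markers, so repeated words are counted each time.
    return count(array_intersect($words, $markers));
}

// Illustrative marker words only, not a complete or weighted list.
$markers = [
    'Spanish'    => ['el', 'y', 'los', 'con', 'una'],
    'Portuguese' => ['o', 'e', 'os', 'com', 'uma'],
];

$text = 'El perro y el gato juegan con una pelota';
foreach ($markers as $language => $words) {
    echo $language, ': ', countMarkerWords($text, $words), "\n";
}
```

Each marker word that turns up adds to the score for its language, which mirrors the growing confidence described above.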

Conclusion

Since languageLayer supports over 170 languages, it has to draw on all the methods described in this article. The formula is simple: by comparing character sets, digraphs, trigraphs and unique words, any text can be evaluated to determine the language it is written in.

The hard part in writing your own application is developing accurate digraphs, trigraphs and unique word sets. These 'secret' sets are the power behind accurately detecting a language.

Fortunately for us, we just need to query the web service provided by languageLayer and let them do the heavy lifting in the background.




