Determine the Language of a Text
Make a function that determines the language in which a text is written.
code
(* Import the text of four Wikipedia main pages as training samples *)
french = Import["http://fr.wikipedia.org/wiki/Main_Page"];
english = Import["http://en.wikipedia.org/wiki/Main_Page"];
german = Import["http://de.wikipedia.org/wiki/Main_Page"];
spanish = Import["http://es.wikipedia.org/wiki/Main_Page"];
(* Train a classifier that maps a text to its language label *)
language =
 Classify[{french -> "French", english -> "English",
   german -> "German", spanish -> "Spanish"}]
how it works
Text samples in many languages are abundant on the web. Import French, English, German, and Spanish texts to train a classifier:
french = Import["http://fr.wikipedia.org/wiki/Main_Page"];
english = Import["http://en.wikipedia.org/wiki/Main_Page"];
german = Import["http://de.wikipedia.org/wiki/Main_Page"];
spanish = Import["http://es.wikipedia.org/wiki/Main_Page"];
This is what the beginning of the French text looks like:
StringTake[french, 150]
Make a classifier function using the training texts:
language =
 Classify[{french -> "French", english -> "English",
   german -> "German", spanish -> "Spanish"}]
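The classifier above sees only one training example per language. As a possible variation, each page can be split into sentences so that every language contributes many shorter examples. This is a sketch only: it assumes a Wolfram Language version that provides TextSentences and the association form of Classify, and the name sentenceLanguage is hypothetical, not part of the original example:

```wolfram
(* Assumes french, english, german and spanish were imported as above. *)
(* sentenceLanguage is a hypothetical name chosen for this sketch. *)
sentenceLanguage = Classify[<|
   "French" -> TextSentences[french],
   "English" -> TextSentences[english],
   "German" -> TextSentences[german],
   "Spanish" -> TextSentences[spanish]|>]
```

Whether this variant classifies better than the single-example version depends on the imported pages, so compare both on the same test texts before preferring one.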
Test the classifier on texts that were not in the training set:
language[ExampleData[{"Text", #}]] & /@
 {"AliceInWonderland", "LesFleursDuMal",
  "DonQuixoteISpanish", "UNHumanRightsGerman"}
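The same test texts can be paired with their expected labels to get a single accuracy figure. This is a minimal sketch using ClassifierMeasurements. The name testSet is hypothetical, and the labels are assumed from the titles of the example texts:

```wolfram
(* testSet is a hypothetical name; labels are assumed from the text titles. *)
testSet = {
   ExampleData[{"Text", "AliceInWonderland"}] -> "English",
   ExampleData[{"Text", "LesFleursDuMal"}] -> "French",
   ExampleData[{"Text", "DonQuixoteISpanish"}] -> "Spanish",
   ExampleData[{"Text", "UNHumanRightsGerman"}] -> "German"};
(* Fraction of test texts assigned the expected language *)
ClassifierMeasurements[language, testSet, "Accuracy"]
```

With only four test texts the figure is coarse, so treat it as a sanity check rather than a benchmark.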
Make a table of classified phrases:
{#, language[#]} & /@
  {"the house is blue", "la maison est bleue",
   "la casa es azul", "das Haus ist blau"} // TableForm
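Short phrases give the classifier little evidence, so it can help to see how confident each decision is. The sketch below assumes the "Probabilities" property of a classifier function, which returns the probability assigned to every class:

```wolfram
(* Show the probability of each language for a few short phrases. *)
(* Assumes the classifier language defined above. *)
{#, language[#, "Probabilities"]} & /@
  {"the house is blue", "la casa es azul"} // TableForm
```

A phrase whose top probability is only slightly above the others is a sign that the training texts do not cover it well.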