
zemberek - Turkish NLP library and Turkish Spell Checking

Posted by turbogeek on March 6, 2005 at 3:45 PM PST

Zemberek is a very interesting project built around the Turkish language, covering Natural Language Processing and spell checking for Turkish. This open source project is something we really like seeing here within the Global Education and Learning Community (GELC) because of its international base. I talked to the owners about themselves and their project.

Tell us about yourself

Ahmet Akin: I am an Electronics and Communications engineer. I have worked in very different areas; in brief, and in historical order: modem manufacturing and test (boring), embedded system design with microcontrollers and early smart cards, CAD software development in C++, technology newspaper editor, long-time embedded hardware and software development with C, cryptology, web-based design with PHP :) and at last, my long-time dream, Java development. My current employer is Softek Inc ( www.softekpr.com ), where I work as a researcher. I like Java programming, photography, chess and table tennis. The other owner of the project is my twin brother; he graduated from the same university, but from the Computer Engineering department. We worked in the same place for almost 5 years, but his job is mostly software related. He is currently working at the National Research Institute of Electronics and Cryptology. I graduated from Yildiz Technical University, Electronics and Communication Engineering, in Turkey. I finished my master's degree in the Communications department of the same university. Software was always my weak point, so I started my doctoral study at Istanbul Technical University, Computer Engineering, with the purpose of working on Natural Language Processing.

Co-owner Mehmet Dundar Akin has a BS degree from Yildiz Technical University and an MS degree in Computer Engineering from Istanbul Technical University. He currently works as a senior researcher at the National Research Institute of Electronics and Cryptology in Turkey.

What local Java user group are you associated with if any?

I moved to Puerto Rico one and a half years ago, and sadly Java is not so popular here in the business environment (MS - Visual Basic Island). But I consider myself lucky because I use Java at work, I have met Java programmers, and my supervisor is a real Java guru (Victor Salaman) from whom I have learnt a lot. I still consider myself a Java apprentice (my code says the same :)

Why did you start this project and what is it about?

Well, almost five years ago, I was interested in the Mozilla project and I thought it would be cool to implement a real-time spell checker for the Turkish language in it. Then I started to think about how it would work and noticed that making a spell checker for Turkish is extremely hard. The deeper I went into the subject, the more interested in Natural Language Processing I became. I started a C++ project for the spell checker, and my first prototype miraculously worked. After 3 years and several changes in my life, with the help of my brother (who is a good Computer Engineer with very good Java knowledge), I decided to bring the project back to life. But this time, we decided to rewrite the whole project in Java. It was a real breeze after C++. Seriously, the difference in ease of development and deployment is huge, without sacrificing performance.

We started a project on Java.net with the name Tspell (also the original name of the C++ project). Our scope was broader: we wanted to build a base for all kinds of Turkish-related computing and NLP problems. After almost one year, the project could do Turkish spell checking, morphological extraction of the roots and affixes of words, suggestions for misspelled words, and deasciifying of texts written without Turkish-specific characters. Then we changed the project name to Zemberek (which means the mainspring of a clock) because "TSpell" was not Turkish and users did not like that. Now we also provide the first open source Turkish spell checker for the Open Office project, and it works successfully.
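To make the "deasciifying" idea concrete, here is a minimal sketch of one naive way to restore Turkish-specific characters. It is not Zemberek's API or algorithm, which the interview does not describe. The class name `DeasciifySketch`, its methods, and the four-word lexicon are all hypothetical, chosen only for illustration. The sketch tries every combination of ASCII letters and their Turkish counterparts and keeps the variants found in the word list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Zemberek.
class DeasciifySketch {
    // ASCII letter -> Turkish-specific counterpart (lowercase only, for brevity).
    // Unicode escapes keep the source independent of file encoding.
    private static final Map<Character, Character> TURKISH = Map.of(
            'c', '\u00e7',   // c-cedilla
            'g', '\u011f',   // soft g
            'i', '\u0131',   // dotless i
            'o', '\u00f6',   // o-umlaut
            's', '\u015f',   // s-cedilla
            'u', '\u00fc');  // u-umlaut

    // Toy word list standing in for a real lexicon (an assumption):
    // "cicek" (flower), "gunes" (sun), "kitap" (book), "isik" (light),
    // each stored with its proper Turkish characters.
    private static final Set<String> LEXICON = Set.of(
            "\u00e7i\u00e7ek", "g\u00fcne\u015f", "kitap", "\u0131\u015f\u0131k");

    // Builds every spelling obtained by keeping or replacing each ambiguous letter.
    static List<String> variants(String word) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char ch : word.toCharArray()) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                next.add(prefix + ch);
                Character alt = TURKISH.get(ch);
                if (alt != null) {
                    next.add(prefix + alt);
                }
            }
            results = next;
        }
        return results;
    }

    // Keeps only the variants that appear in the lexicon.
    static List<String> deasciify(String asciiWord) {
        List<String> hits = new ArrayList<>();
        for (String candidate : variants(asciiWord)) {
            if (LEXICON.contains(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(deasciify("cicek"));
        System.out.println(deasciify("gunes"));
    }
}
```

The number of variants doubles with each ambiguous letter, so this brute-force approach only suits short words. A real system would more plausibly use morphological analysis of roots and affixes, as mentioned above, to prune candidates.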

Zemberek is the only open source project in its area and we are proud of it. It became a part of the first product of a Turkish national Linux project: Pardus (http://www.uludag.org.tr/ ). We will also give a presentation about the project at a very important event, the Open Source Days at Istanbul Bilgi University. (http://open.bilgi.edu.tr/freedays/program.php?lang=en )

What is the status and further plans for this project?

Although I still see the project as being in its infancy, it is very active and almost usable for real-life applications; the Open Office plug-in is the proof of that. We have also started developing a server project based on the core library. The server will hopefully provide language-related services to other applications, such as Mozilla and KDE. However, there is a lot of work for us to do. Honestly, right now Zemberek is still not doing serious "NLP" jobs. I can say it has a relatively simple structure, and the parsing mechanism is not really difficult. But after stabilizing the spell checker we will hopefully move on to more complicated and interesting subjects, such as creating an open source wordnet for Turkish, sentence analysis, grammar checking, statistical analysis, maybe voice applications (TTS and recognition, with the help of the FreeTTS and Sphinx4 libraries), translation, SQL with natural language, shell commands with natural language, etc. Subjects in NLP are endless, and when it comes to Turkish there is very limited work available (we know that several universities in Turkey have advanced work on the subject, but not many implementations are available, especially in Java).

What kind of help are you looking for?

Of course, like all other projects, we are looking for developers. Currently only two people are actively developing, and that is really not enough. Unfortunately we cannot receive much help from international Java developers because of the nature of the project, so we are hoping that more help will come from Turkish Java developers. Knowledge-related information is also crucial, and the other project members are helping with it. We also need linguists and experts in the Turkish language and in language in general. NLP expertise is another need. Turkish Linux communities helped a lot when we introduced the Open Office plug-in.

Where are you located?

Ahmet lives in Puerto Rico, in a rural area near the city of Canovanas. Mehmet Dundar Akin lives in Turkey.

This is for us to do a better job: What do you think about the GELC and the java.net community, any suggestions?

The GELC and java.net are great. I really wish java.net had started earlier. The services have improved nicely and the projects in GELC are interesting. I know some NLP projects exist, but since our main interest is Turkish I couldn't examine them in detail. As for suggestions, you should make yourselves more visible in educational environments. In schools MS is trying hard to lure the students; I think java.net, and Sun in general, should be doing this too, because Java's potential is much better. Also, maybe contest-like events could be created.
