Project Spotlight: Zemberek - Turkish NLP and Turkish OpenOffice Spellchecker
Project Spotlight Interview with Ahmet Akin of project Zemberek.
Project Name: Zemberek
Summary: Turkish NLP library and Open Office Turkish Spellchecker Plugin
Owner Names: Ahmet Akin & Mehmet Dundar AKIN
City: Hato Rey, San Juan
Country: Puerto Rico
Tell us a little about yourself. I am 31 years old and originally an Electronics and Communications engineer. I have worked in very different areas, from embedded system design to technology newspaper editor. I completed both my undergraduate and master's degrees in the Electronics and Communications department of Yildiz Technical University in Turkey. I always had a soft spot for software, so I changed my focus to higher-level software two years ago. Currently I am involved in Java development. My current employer is Softek Inc., where I work as a developer.
What schools/universities did you attend? Yildiz Technical University in Turkey and Istanbul Technical University in Turkey
Are you a member of any Java user groups? I moved to Puerto Rico two years ago. Java is not as popular here as I would like, and there is no local JUG. But I consider myself lucky because I'm involved with Java at work. I meet really great developers, and my supervisor, Victor Salaman, is a real Java guru; I have learnt a lot from him. I still consider myself a Java apprentice.
Tell us a little about the project and why you started it. Well, about 5-6 years ago I was interested in the Mozilla project, and I thought it would be cool to implement a real-time spell checker for the Turkish language in it. Then I started to think about how it would work and noticed that making a spell checker for Turkish is extremely hard. The truth is that nothing like that was available in the open source area.
After researching the subject, I became more interested in Natural Language Processing. I started a C++ project for the spell checker, and the prototypes worked well. After 3 years and several changes in my life, with the help of my brother I decided to bring the project back to life. But this time we decided to rewrite the whole project in Java. It was a real breeze after C++. Seriously, the difference in ease of development and deployment is huge, without sacrificing performance. We started a project on Java.net under the name TSpell (also the original name of the C++ project).
Our scope was broader: we wanted to build a base for all kinds of Turkish-related computing and NLP problems. After almost one year, the project was able to do Turkish spell checking, morphological extraction of the roots and affixes of words, word suggestion for misspelled words, and deasciifying of texts written without Turkish-specific characters. Then we changed the project name to Zemberek (which means the mainspring of a clock) because "TSpell" was not Turkish and users did not like that. Now we also provide the first open source Turkish spell checker for the Open Office project, and it works successfully. Zemberek is the only open source project in its area and we are proud of it. It became a part of the national Linux project, Pardus ( http://uludag.org.tr/projeler/masaustu/zemberek-pardus/index.html ).
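To make the deasciifying feature mentioned above more concrete, here is a minimal Java sketch of one possible approach. It is not taken from Zemberek and does not use its API. The class name `DeasciifySketch`, the method `deasciify`, and the five-word lexicon are all hypothetical. The idea is to try replacing each ASCII letter with its Turkish-specific counterpart and to keep only the candidates that appear in a word list. The Turkish letters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: a brute-force deasciifier over a toy lexicon.
// It is NOT Zemberek's implementation; all names here are hypothetical.
class DeasciifySketch {

    // ASCII letters that may stand in for a Turkish-specific letter.
    static final Map<Character, Character> TURKISH = Map.of(
        'c', '\u00E7',  // c -> c with cedilla
        'g', '\u011F',  // g -> soft g (g with breve)
        'i', '\u0131',  // i -> dotless i
        'o', '\u00F6',  // o -> o with diaeresis
        's', '\u015F',  // s -> s with cedilla
        'u', '\u00FC'); // u -> u with diaeresis

    // Toy word list standing in for a real lexicon (assumed data):
    // "tree", "eye", "sugar", "book", "light".
    static final Set<String> LEXICON = Set.of(
        "a\u011Fa\u00E7", "g\u00F6z", "\u015Feker", "kitap", "\u0131\u015F\u0131k");

    // Returns every lexicon word that the ASCII-only input could stand for.
    static List<String> deasciify(String ascii) {
        List<String> found = new ArrayList<>();
        expand(ascii.toCharArray(), 0, found);
        return found;
    }

    // Tries both the ASCII letter and its Turkish counterpart at each position.
    // This is exponential in the number of ambiguous letters, so it only
    // suits short words; a real system would prune much earlier.
    private static void expand(char[] word, int pos, List<String> found) {
        if (pos == word.length) {
            String candidate = new String(word);
            if (LEXICON.contains(candidate)) {
                found.add(candidate);
            }
            return;
        }
        char original = word[pos];
        expand(word, pos + 1, found);        // keep the ASCII letter
        Character alternative = TURKISH.get(original);
        if (alternative != null) {
            word[pos] = alternative;         // try the Turkish letter
            expand(word, pos + 1, found);
            word[pos] = original;            // restore before backtracking
        }
    }

    public static void main(String[] args) {
        System.out.println(deasciify("agac"));
        System.out.println(deasciify("isik"));
    }
}
```

With this toy lexicon, the input "agac" should come back as "ağaç" and "isik" as "ışık", while a word that is already correct, such as "kitap", is returned unchanged. A real deasciifier for an agglutinative language like Turkish would presumably need morphological analysis instead of a flat word list, since most surface forms carry suffixes.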
What is the project's current status and plans for the future? Although I still think that the project is in its infancy, it is very active and usable for real-life applications; the Open Office plug-in is the proof of that. We have also started developing a server project based on the core library. The server will hopefully provide language-related services to other applications written in different languages, such as Mozilla and KDE. However, for us, there is a lot of work to do. Honestly, right now Zemberek is still not doing serious "NLP" jobs. I can say it has a relatively simple structure, and the parsing mechanism is not really complicated.
After stabilizing the spell checker we will hopefully move on to more complicated and interesting subjects, such as creating an open source wordnet for Turkish, sentence analysis, grammar checking, statistical analysis, maybe voice applications (TTS and recognition, with the help of the FreeTTS and Sphinx4 libraries), translation, SQL with natural language, shell commands with natural language, etc. Subjects in NLP are endless, and when it comes to Turkish there is very limited work available (we know that advanced work on the subject exists in several universities in Turkey, but not many implementations are available, especially in Java).
What kind of help are you looking for on this project? Of course, like all other projects, we are looking for developers. Currently three people are actively developing, and that is really not enough. Unfortunately we cannot receive much help from international Java developers because of the nature of the project.
We are hoping that more help will come from Turkish Java developers. Domain knowledge is also crucial, and other project members are helping with that. Turkish Linux communities helped a lot when we introduced the Open Office plug-in. We also need linguists and experts in the Turkish language and NLP.
Suggestions for GELC or Java.net? It is great. I mean, I really wish java.net had started earlier. The services have improved nicely and the projects in GELC are interesting. I know some NLP projects exist, but since our main interest is Turkish I couldn't examine them in detail.
As for suggestions, you should make yourselves more visible in educational environments. In schools MS is trying hard to lure the students; I think java.net, and Sun in general, should be doing this too, because Java's potential is greater.
If you have a project on Java.net and could deal with a little extra press, please contact me for a spotlight interview - Daniel Brookshier