Asst.Prof.Dr. Gülşen Cebiroğlu Eryiğit


Turkish Natural Language Processing Pipeline

This page introduces a Turkish NLP pipeline that produces the morphological and syntactic analysis of raw Turkish sentences.

Last Update 07.01.2014

Please visit tools.nlp.itu.edu.tr to use the ITU NLP tools (e.g. normalizer, morphological analyzer, disambiguator, named entity recognizer, dependency parser for Turkish, and many more).

Important: the tools provided below are no longer supported; they are all available as a service (SaaS) from tools.nlp.itu.edu.tr.

 

 

Copyright: Turkish NER Tagger tool by Gülşen ERYİĞİT is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Please see http://creativecommons.org/licenses/by-nc-sa/3.0/ for details.
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work

Under the following conditions:
Attribution
— You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Noncommercial
— You may not use this work for commercial purposes.
Share Alike
— If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

Attribution Info:
Please cite the following papers if you make use of this resource in your research:

Third-party tools used within the Turkish NLP pipeline (please refer to the individual copyrights before using them):

- Two-level morphological analyzer by Prof.Dr. Kemal Oflazer (http://www.andrew.cmu.edu/user/ko/)

- Morphological disambiguator by Haşim Sak, Murat Saraçlar and Tunga Güngör (http://www.cmpe.boun.edu.tr/~hasim/)

- Xerox FST tools (http://www.stanford.edu/~laurik/.book2software/)

- MaltParser (http://www.maltparser.org/)

 

 

The pipeline consists mainly of the following three modules and the conversion code that connects them (input/output format conversions, tag conversions, and an unknown word processor):

1. A two-level morphological analyzer (Oflazer, 1994)
2. A perceptron based morphological disambiguator (Sak et al., 2008)
3. A data-driven dependency parser (Eryiğit et al., 2008; Nivre et al., 2007)

The pretrained dependency parsing model for Turkish, "modelturkishoriginal.mco", is included with the pipeline.

The figure below shows a simple flow of the pipeline (the collocation finder is not included).

 

Download

Instructions

  1. Download the .rar file and extract it into a directory on your local machine. Create a Java project (in Eclipse) from the pipeline code.
  2. Obtain the Xerox FST tools from http://www.stanford.edu/~laurik/.book2software/ and put the files under the following directory: TurkishNLPPipeline_V0.1/Ofl2011. After installation, the directory should contain the files listed below.

  3. Obtain the perceptron-based morphological disambiguator from http://www.cmpe.boun.edu.tr/~hasim/download/MD-Release.zip and unzip the file under the following directory: TurkishNLPPipeline_V0.1/Hasim. After installation, the directory should contain the files listed below.

  4. Download MaltParser v1.4 (up to v1.5.3) and follow the instructions at http://www.maltparser.org to install the program. Add the following jar files to your project's build path.

  5. Run pipeline.java. Hopefully it will work :)

For now, the pipeline has been tested only on Windows, but it should work on Linux as well with some modifications (we have tested our previous versions there). I'll try to put up the Linux version as soon as possible.

The input file should be encoded as UTF-8 and use the following format: one word/token per line, with sentences separated by a line containing five asterisks ("*****"). A sample input file, "input.txt", is included in the downloaded archive.
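The input format described above can be produced with a few lines of Java. The following is only a minimal sketch and not part of the pipeline code: the class name PipelineInputFormat, its helper methods, the file name my_input.txt and the example sentences are all hypothetical. It assumes that the separator goes between consecutive sentences and that tokenization has already been done; please check the bundled "input.txt" to see whether a separator is also expected after the last sentence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the pipeline distribution.
class PipelineInputFormat {
    // Sentence separator described on this page: five asterisks on their own line.
    static final String SEPARATOR = "*****";

    // One token per line, with a separator line between consecutive sentences
    // (assumption: no separator after the last sentence).
    static List<String> toLines(List<List<String>> sentences) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            if (i > 0) {
                lines.add(SEPARATOR);
            }
            lines.addAll(sentences.get(i));
        }
        return lines;
    }

    // Inverse direction: group lines back into sentences, skipping blank lines.
    static List<List<String>> fromLines(List<String> lines) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : lines) {
            String token = line.trim();
            if (token.equals(SEPARATOR)) {
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            } else if (!token.isEmpty()) {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        // Example sentences, already split into tokens by hand.
        List<List<String>> sentences = Arrays.asList(
                Arrays.asList("Bugün", "hava", "çok", "güzel", "."),
                Arrays.asList("Okula", "gidiyorum", "."));

        // Write the file as UTF-8, as the pipeline requires.
        Path out = Paths.get("my_input.txt");
        Files.write(out, toLines(sentences), StandardCharsets.UTF_8);
    }
}
```

Running main writes nine lines to my_input.txt: the five tokens of the first sentence, a "*****" line, and the three tokens of the second sentence. Passing the charset explicitly avoids depending on the platform default encoding, which on Windows is often not UTF-8.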

Please send an email to gulsenc@itu.edu.tr for further requests.

 

 

 
