Harvesting big text data for under-resourced languages

Dept 58 - International Relations Department

Published

Updated 23-12-2014
  • region changed

Project
Region: National coverage
Title of the Programme: Czech-Norwegian Research Programme
Title of the Project: Harvesting big text data for under-resourced languages
Number of the Project: 7F14047
Project Promoter

Masaryk University, Brno

www.muni.cz

Name of Norwegian Partner(s)

Norges teknisk-naturvitenskapelige universitet

Objective of the Project

The main goal of the project is to harvest large-scale textual data from the Web for under-resourced languages (Norwegian, partly Czech, and the four major languages of Ethiopia: Amharic, Afaan Oromo, Tigrinya and Somali) and to build shallow processing applications for them. The data will be annotated and parsed to make it usable in various language processing applications, such as information extraction and retrieval, machine translation, etc. The project results will also be used in external cooperation with the University of Oslo and two Ethiopian universities, in a Norad-funded project supporting linguistic resource building in Ethiopia. The developed applications will serve for investigating and separating multiple senses of words in the corpora, for word sense induction, and for creating multi-sense vector spaces and parallel multilingual vector spaces for word translation disambiguation.

Approved grant

923 321 EUR

Project Duration: Start date: 15th July 2014, End date: 30th April 2017