Maarten Janssen


Current and Previous Research


Corpus Linguistics

Currently, I am working at IULA in Barcelona, Spain, on a number of projects related to corpus linguistics, with a focus on specialized languages and neologisms.


Open Source Lexical Information Network

Between May 2004 and May 2008 I worked at ILTEC in Lisbon, Portugal, on a number of related tools which together are intended to form OSLIN: an Open Source Lexical Information Network. The heart of this network is MorDebe: a morphological database system, developed to be language-independent and currently filled with over 125,000 Portuguese lemmas and around 1.5 million word forms. MorDebe is used together with NeoTrack for the semi-automatic detection of neologisms in online newspapers.
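
The detection step can be pictured as a simple lexicon lookup: any word form in a newspaper text that is absent from the database becomes a candidate neologism for manual review. The Python sketch below illustrates just this idea; the lexicon and sentence are toy data, and the actual MorDebe/NeoTrack pipeline is of course considerably more elaborate.

    import re

    # Toy lexicon of known Portuguese word forms (the real MorDebe holds
    # around 1.5 million of them).
    known_forms = {"o", "jornal", "publicou", "uma", "noticia", "ontem"}

    def neologism_candidates(text):
        """Return word forms not found in the lexicon (case-folded)."""
        tokens = re.findall(r"\w+", text.lower())
        return sorted({t for t in tokens if t not in known_forms})

    print(neologism_candidates("O jornal publicou uma ciberexperiencia ontem"))
    # ['ciberexperiencia'] -- flagged as a candidate neologism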



Automatic extraction of semantic relations from corpora using linguistic markers

In the first two semesters of 2003 I worked at ERSS in Toulouse, France, in the field of computational terminology, on the automatic extraction of semantic relations from corpora. The methodology used was first described by Hearst (1992): use patterns of text (often called linguistic markers) to find implicit and explicit mentions of semantic relations in a text corpus. The basic idea is best made clear with a simple example: if a text contains the sentence 'This rod is best for greylings and other trouts', it implicitly claims that greylings are a kind of trout. By finding all such implicitly expressed relations in a corpus, one can build a (partial) ontology.
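
To make the pattern-matching idea concrete, here is a minimal Python sketch of Hearst-style extraction. The two patterns and the matching over raw strings are deliberate simplifications (a real system matches over POS-tagged text and full noun phrases), and nothing here reflects the actual ERSS tools.

    import re

    # Two classic Hearst-style markers; each maps a match to a
    # (hyponym, hypernym) pair.
    PATTERNS = [
        # "X and other Y"  =>  X is-a Y
        (re.compile(r"(\w+) and other (\w+)"), lambda m: (m.group(1), m.group(2))),
        # "Y such as X"    =>  X is-a Y
        (re.compile(r"(\w+) such as (\w+)"), lambda m: (m.group(2), m.group(1))),
    ]

    def extract_hyponyms(text):
        """Yield (hyponym, hypernym) pairs found by the lexical markers."""
        for pattern, to_pair in PATTERNS:
            for match in pattern.finditer(text):
                yield to_pair(match)

    print(list(extract_hyponyms("This rod is best for greylings and other trouts.")))
    # [('greylings', 'trouts')] -- i.e. greyling is-a trout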

As part of this project, I wrote a multilingual concordancer for annotated corpora (YakwaSI), based on the Yakwa concordancer by Ludovic Tanguy. YakwaSI can search aligned corpora for strings of words, lemmata, and syntactic categories, based on a POS-tagged corpus, currently using either Cordial or TreeTagger.
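
At its core, such a search amounts to matching a sequence of word/lemma/POS constraints against a tagged token stream. The Python sketch below is purely illustrative: the data format and query representation are invented here and do not reflect YakwaSI's actual implementation.

    # One (form, lemma, pos) triple per token of the tagged corpus.
    corpus = [
        ("The", "the", "DET"), ("rods", "rod", "NOUN"),
        ("are", "be", "VERB"), ("best", "good", "ADJ"),
    ]

    def token_matches(token, constraint):
        """A token matches if every specified field (None = wildcard) agrees."""
        return all(value is None or actual == value
                   for actual, value in zip(token, constraint))

    def search(corpus, pattern):
        """Return the start indices where the constraint sequence matches."""
        n = len(pattern)
        return [i for i in range(len(corpus) - n + 1)
                if all(token_matches(corpus[i + j], pattern[j]) for j in range(n))]

    # Find a NOUN with lemma "rod" followed by any VERB.
    print(search(corpus, [(None, "rod", "NOUN"), (None, None, "VERB")]))
    # [1]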



The Application of Formal Concept Analysis to a Multilingual Lexical Database

The topic of my thesis was the application of FCA to Multilingual Lexical Databases (see the SIMuLLDA home page). A brief description of its research question:

There are a lot of different bilingual dictionaries available in the world. Still, there is not a bilingual dictionary for every pair of languages. If you consider two 'minor' languages like Malay/Indonesian and Hungarian, there is a very slim chance that you will find a dictionary translating between these two languages. This is not surprising: there are several thousand different languages, so full coverage would require millions of bilingual dictionaries.

The way to also have bilingual dictionaries between every pair of 'minor' languages, say Malay/Indonesian and Hungarian, is not to create them by hand (since that would take far too much time), but to construct a Multilingual Lexical Database (MLLD), which contains many languages and which can be used to generate a bilingual dictionary between any pair of them. If all languages used the same notions, merely expressed by different words, such an MLLD would be hardly problematic. However, different languages often use different notions, for instance because one language makes more subtle distinctions than another: whereas Hungarian has only one word for RICE (rizs), Indonesian has four of them: padi for rice as it grows in the field, gabah for rice that has been harvested but not processed, beras for rice that has been husked and hulled, and nasi for cooked rice.

In order to get the MLLD to do what you want (automatically create bilingual dictionaries), you need a system that is powerful enough to deal with all these subtle differences. The purpose of this thesis is to test whether a logical framework called Formal Concept Analysis is powerful enough to function as the structural core of such an MLLD.
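
To see how FCA handles the RICE example, the Python sketch below builds a toy formal context (objects are the Indonesian words; the attribute names are invented for illustration) and naively enumerates its formal concepts. In the resulting lattice, Hungarian rizs would attach to the general concept with intent {rice}, while each Indonesian word picks out a more specific concept, which is exactly the kind of structure an MLLD needs in order to generate correct translations.

    from itertools import combinations

    # Toy formal context for the RICE example.
    context = {
        "padi":  frozenset({"rice", "unharvested"}),
        "gabah": frozenset({"rice", "harvested"}),
        "beras": frozenset({"rice", "harvested", "husked"}),
        "nasi":  frozenset({"rice", "harvested", "husked", "cooked"}),
    }
    objects = frozenset(context)
    attributes = frozenset().union(*context.values())

    def intent(objs):
        """Attributes shared by every object in objs."""
        if not objs:
            return attributes
        return frozenset.intersection(*(context[o] for o in objs))

    def extent(attrs):
        """Objects that have every attribute in attrs."""
        return frozenset(o for o in objects if attrs <= context[o])

    def concepts():
        """Naively enumerate all formal concepts (extent, intent)."""
        found = set()
        for r in range(len(objects) + 1):
            for objs in combinations(sorted(objects), r):
                b = intent(frozenset(objs))  # close the object set
                a = extent(b)                # (a, b) is a formal concept
                found.add((a, b))
        return sorted(found, key=lambda c: (len(c[0]), sorted(c[0])))

    for ext, itn in concepts():
        print(sorted(ext), "<->", sorted(itn))
    # ({padi, gabah, beras, nasi}, {rice}) is where Hungarian 'rizs' attaches;
    # 'beras' picks out the more specific ({beras, nasi}, {rice, harvested, husked}).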