YakwaSI Annotated Aligned Corpus Tool
YakwaSI Help

Query Definition

The fundamental feature of the YakwaSI system is the ability to define a query for searching elements in a corpus. The query language is relatively simple: every line of the query defines a component, and all available components are shown in the tree next to the query box. For a more complete overview, here is the entire syntax:

Every line in the query box represents a component. The entire query is the sequence of all these components ordered from left to right. If you click on one of the categories in the tree, a new line will be created with the corresponding component. Every component consists of a component tag and the component content; a component tag always ends with a colon.

Every word in the corpus is part-of-speech (POS) tagged, so every word consists of three parts: orthography, base form, and part-of-speech category. Any of these can be used in a search query. The component tag indicates which of the three parts is used: Word:, Lemma:, or Cat:. As an example, the query component Word: attribute searches the corpus for all occurrences of words having attribute as their orthography, and Cat: Noun searches the corpus for all nouns (i.e. for all words that are POS-tagged as nouns). Putting these two components on consecutive lines searches the corpus for all occurrences of attribute followed by a noun.
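
To make the three-part model concrete, here is a minimal Python sketch of how a query of consecutive components could be matched against tagged tokens. It is an illustration only, not YakwaSI's actual implementation: the names Token, parse_component, and find_matches are hypothetical, and exact, case-sensitive comparison is an assumption.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # orthography
    lemma: str  # base form
    cat: str    # part-of-speech category


# Hypothetical mapping from component tags to token parts.
FIELD = {"Word": "word", "Lemma": "lemma", "Cat": "cat"}


def parse_component(line):
    """Split 'Tag: content' at the first colon."""
    tag, _, content = line.partition(":")
    return tag.strip(), content.strip()


def find_matches(query_lines, tokens):
    """Return start positions where consecutive components match consecutive tokens."""
    components = [parse_component(line) for line in query_lines]
    hits = []
    for start in range(len(tokens) - len(components) + 1):
        window = tokens[start:start + len(components)]
        if all(getattr(tok, FIELD[tag]) == content
               for (tag, content), tok in zip(components, window)):
            hits.append(start)
    return hits


tokens = [
    Token("the", "the", "Determiner"),
    Token("attribute", "attribute", "Noun"),
    Token("values", "value", "Noun"),
]
# 'attribute' followed by a noun is found at position 1.
print(find_matches(["Word: attribute", "Cat: Noun"], tokens))
```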

For words and lemmata, one can use wildcards. There are two kinds of wildcards: one-letter wildcards (?) and multi-letter wildcards (*). Two examples: Word: ma?e matches male, make, made, etc. Lemma: ma*e matches all words whose base form begins with ma and ends with e: all of the above, but also males, maize, and managing. POS tags are only represented as full descriptive names in the interface, but are translated to the tagset-specific codes behind the scenes. Therefore, no wildcards should be used in category descriptions.
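
The wildcard behaviour can be sketched by translating a pattern into a regular expression. This is only an illustration of the semantics described above, not the tool's own code: the function names are hypothetical, and it is an assumption that * may also match zero letters.

```python
import re


def wildcard_to_regex(pattern):
    """Translate ? (exactly one letter) and * (any run of letters) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")  # assumption: * may also match nothing
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def wildcard_match(pattern, text):
    """True if the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


# Word: ma?e is compared against the orthography.
print(wildcard_match("ma?e", "male"))    # True
print(wildcard_match("ma?e", "maize"))   # False: ? stands for exactly one letter
# Lemma: ma*e is compared against the base form, so 'managing' is found
# because its base form is 'manage'.
print(wildcard_match("ma*e", "manage"))  # True
```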

Apart from wildcards within a word/lemma, it is also possible to have wildcard components. These are indicated by the component tag Jokers:. A wildcard component searches for (a maximum of) n arbitrary words. This component only makes sense when placed between two other components. So the sequence Word:when followed by Jokers: 4 and Cat:Adverb searches for all occurrences where the word when is followed within a window of 5 words (4 + the word itself) by an adverb. It is also possible to constrain the wildcards to a specific category: the component Jokers: Noun: 2 matches up to two nouns.
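
The following Python sketch shows one way to read a wildcard component: at most max_gap tokens may stand between the two surrounding components, optionally all of one category. The function find_with_jokers and the (word, category) token layout are hypothetical, and reporting every matching pair rather than only the nearest one is an assumption.

```python
def find_with_jokers(tokens, first, last, max_gap, gap_cat=None):
    """Find (i, j) pairs where first(tokens[i]) and last(tokens[j]) hold and at
    most max_gap tokens lie in between. If gap_cat is given, every token in
    the gap must have that category. Tokens are (word, category) pairs."""
    hits = []
    for i, tok in enumerate(tokens):
        if not first(tok):
            continue
        last_j = min(i + max_gap + 1, len(tokens) - 1)
        for j in range(i + 1, last_j + 1):
            gap = tokens[i + 1:j]
            if gap_cat is not None and any(cat != gap_cat for _, cat in gap):
                break  # a longer gap would contain the same offending token
            if last(tokens[j]):
                hits.append((i, j))
    return hits


tokens = [
    ("when", "Conjunction"),
    ("the", "Determiner"),
    ("rain", "Noun"),
    ("finally", "Adverb"),
    ("stopped", "Verb"),
]


def is_when(tok):
    return tok[0] == "when"


def is_adverb(tok):
    return tok[1] == "Adverb"


# Word:when, Jokers: 4, Cat:Adverb
print(find_with_jokers(tokens, is_when, is_adverb, max_gap=4))
```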

For all three parts (words, lemmata, and categories), it is possible to specify alternatives by listing them separated by commas. As an example: the component Lemma: river, stream, tributary searches for all inflections of either river, stream, or tributary. It is not possible to have alternatives across different parts, so there is no way to specify in a query that a component should either be an adverb or the word maybe.
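
As a small illustration, alternatives within one part can be modelled as a list of values, any one of which may match. The helper names below are hypothetical and the sketch ignores wildcards for brevity.

```python
def parse_alternatives(content):
    """Split the content of a component on commas into alternatives."""
    return [alt.strip() for alt in content.split(",") if alt.strip()]


def matches_any(content, value):
    """True if value equals one of the comma-separated alternatives."""
    return value in parse_alternatives(content)


# Lemma: river, stream, tributary
content = "river, stream, tributary"
print(parse_alternatives(content))
print(matches_any(content, "stream"))  # True: 'streams' has the base form 'stream'
print(matches_any(content, "brook"))   # False
```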

Apart from alternatives, it is also possible to combine restrictions in a single component, by placing a + between them. This is best made clear by examples: the component Word: hammer + Cat: Noun matches only occurrences of hammer as a noun, and not the present tense of the verb hammer, and Cat: Verb + Lemma: hammer matches all inflections of the verb hammer and not the noun. It is not possible to specify more than one restriction per part, so trying to give Word:re* + Word:*ion to find words starting with re- and ending with -ion does not work (the first restriction will be ignored). Also, combining anything with a wildcard component does not work. One can of course combine alternatives: Cat:Noun, Adjective + Word:assur* will give all nouns and adjectives starting with assur-.
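
Combined restrictions can be sketched in Python as a mapping from part to alternatives, all of which must be satisfied by the same token. This is a hypothetical model, not YakwaSI's code: the function names and the (word, lemma, category) token layout are assumptions. Storing the restrictions in a dictionary happens to reproduce the documented behaviour that a second restriction on the same part replaces the first.

```python
import re

# Assumed token layout: (word, lemma, category).
FIELD_INDEX = {"Word": 0, "Lemma": 1, "Cat": 2}


def parse_combined(component):
    """Parse 'Tag: a, b + Tag: c' into {tag: [alternatives]}.
    A later restriction on the same part overwrites an earlier one."""
    restrictions = {}
    for part in component.split("+"):
        tag, _, content = part.partition(":")
        restrictions[tag.strip()] = [alt.strip() for alt in content.split(",")]
    return restrictions


def wildcard_match(pattern, text):
    """Match text against a pattern with ? (one letter) and * (any run)."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text) is not None


def token_matches(component, token):
    """True if the token satisfies every restriction in the component."""
    for tag, alternatives in parse_combined(component).items():
        value = token[FIELD_INDEX[tag]]
        if tag == "Cat":
            ok = value in alternatives  # no wildcards in categories
        else:
            ok = any(wildcard_match(alt, value) for alt in alternatives)
        if not ok:
            return False
    return True


print(token_matches("Word: hammer + Cat: Noun", ("hammer", "hammer", "Noun")))
print(token_matches("Word: hammer + Cat: Noun", ("hammer", "hammer", "Verb")))
print(parse_combined("Word:re* + Word:*ion"))  # only the second restriction survives
```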