Master Thesis

Features for Automatic Discourse Analysis of Paragraphs
Finding features to detect rhetorical relations between sentences within paragraphs


Although discourse analysis is considered useful for many applications in the field of language technology, automatic discourse parsing is still problematic. A widely accepted model for discourse analysis is Rhetorical Structure Theory (RST), developed by Mann and Thompson (1988). Soricut and Marcu (2003) have developed the discourse parser SPADE, which automatically detects RST relations between Elementary Discourse Units (EDUs) within a sentence. An automatic discourse parser that is able to find rhetorical relations at higher levels in the text is not yet available.

This thesis focusses on the rhetorical relations between (Multi-)Sentential Discourse Units ((M-)SDUs) - text spans consisting of one or more sentences - within the same paragraph of an English text. The goal of the research is to establish what information is useful in detecting these relations. To achieve this, potentially relevant features have been derived from literature on existing systems for discourse analysis and from a short study of a subset of the RST Discourse Treebank (Carlson et al. 2003). Next, most features have been defined concretely enough to be extracted automatically. This was not possible for some features, e.g. ellipsis, newspaper style and world knowledge. We developed a metric for syntactic similarity, and introduced the feature NP simplification.

After all feature values were automatically extracted, they were offered to various machine learning algorithms. We have simplified the task of discourse parsing to a decision problem in which we decide whether an (M-)SDU is rhetorically related to either an immediately preceding or following (M-)SDU. This task was presented to machine learning algorithms together with the syntactic, lexical, referential, discourse and surface features in order to determine which features are most useful. Since developing an algorithm ourselves was beyond the scope of this thesis, we decided to experiment with existing implementations, and therefore without full knowledge of their suitability for the task and data.
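The decision problem described above can be pictured with a minimal Python sketch. It is an illustration only, not the thesis implementation (the thesis scripts are written in Perl): the `Instance` class, the feature name `word_overlap` and the toy values are hypothetical. The sketch also shows a simple reference point for such a task, namely always predicting the most frequent direction.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Instance:
    """One (M-)SDU with its feature values and attachment direction.

    Hypothetical structure for illustration; not taken from the thesis.
    """
    features: dict   # feature name -> value
    direction: str   # "left" (preceding) or "right" (following)


def majority_baseline(instances):
    """Return the most frequent direction and the accuracy of always predicting it."""
    counts = Counter(inst.direction for inst in instances)
    label, count = counts.most_common(1)[0]
    return label, count / len(instances)


# Invented toy data, not results from the thesis.
toy = [
    Instance({"word_overlap": 0.40}, "right"),
    Instance({"word_overlap": 0.10}, "left"),
    Instance({"word_overlap": 0.25}, "right"),
]

baseline_label, baseline_accuracy = majority_baseline(toy)
```

Any classifier trained on such instances would need to exceed `baseline_accuracy` to be of interest.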

The performance of the classification algorithms was disappointing: only the models of Naive Bayes (Demsar et al. 2004) and Maximum Entropy (Zhang 2004) achieved a significant improvement over the baseline of selecting the most common direction (right). Possible causes are the small data set, the large number of features, the parameter settings, the artificiality of the task and/or the extent to which the algorithms are able to deal with the type of information provided. Assuming that the two algorithms are able to sift the information with some success and that the sifting is expressed in the model parameters, we have developed methods to rank the features according to their relevance on the basis of these parameters. Rankings were also derived from the feature selection algorithms Relief (Kononenko 1994) and CSS (van Halteren, personal communication). From the four rankings based on the separate algorithms, a final ranked list was created. An in-depth study of the suitability of the algorithms and our methods is not included in this thesis, so we advise other researchers to treat the results described below with caution.
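How the four rankings were merged is not specified here. One common way to combine several rankings is to order features by their mean rank position, which the sketch below illustrates. This is an assumed technique chosen for illustration, not necessarily the method used in the thesis; the function name, the feature names and the example rankings are all hypothetical.

```python
def combine_rankings(rankings):
    """Merge several rankings into one by mean rank position.

    Each ranking is a list of feature names, most relevant first, and all
    rankings are assumed to contain the same features. Ties are broken
    alphabetically. Mean-rank aggregation is an assumption made for this
    sketch, not the thesis method.
    """
    features = rankings[0]
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings)
        for f in features
    }
    return sorted(features, key=lambda f: (mean_rank[f], f))


# Invented toy rankings, one per algorithm; not results from the thesis.
naive_bayes = ["anaphora", "word_overlap", "syntactic_similarity", "direct_speech"]
max_entropy = ["word_overlap", "anaphora", "direct_speech", "syntactic_similarity"]
relief = ["anaphora", "syntactic_similarity", "word_overlap", "direct_speech"]
css = ["anaphora", "word_overlap", "direct_speech", "syntactic_similarity"]

final_ranking = combine_rankings([naive_bayes, max_entropy, relief, css])
```

Mean rank treats every algorithm as equally trustworthy; a weighted variant would be needed if some rankings are considered more reliable than others.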

As mentioned above, we have included five different feature types: surface, syntactic, lexical, referential and discourse. The most relevant surface features concern text characteristics that have also been covered by the other (more sophisticated) feature types. Syntax appears to be useful for the detection of rhetorical relations: the higher the syntactic similarity, the more likely it is that the (M-)SDUs in question are rhetorically related. Lexical cues, which have been used by all researchers in the field of automatic RST annotation, are also beneficial in our task. A high word overlap or word similarity often means there is a rhetorical relation. As expected, referential features are also very useful: anaphora, personal pronouns, demonstrative pronouns, reference words and missing modifiers are all cues that a rhetorical relation is present. Discourse structure also helps in finding rhetorical relations, perhaps due to the rather common newspaper style (with right-skewedness within paragraphs). The presence of direct speech (indicated by quotation marks) also predicts the presence of a rhetorical relation. Our data and method indicate that, in future work, feature values should be based on both the full (M-)SDU and its Nucleus sentence.
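As an illustration of the word overlap cue mentioned above, the following Python sketch computes a simple overlap score between two text spans. The exact definition used in the thesis is not given here, so the Jaccard measure, the tokenization and the function name are assumptions.

```python
import re


def word_overlap(span_a, span_b):
    """Jaccard overlap between the word types of two text spans.

    Assumed definition for illustration: lowercase the text, keep
    alphanumeric tokens, and divide shared types by all types.
    """
    tokens_a = set(re.findall(r"[a-z0-9']+", span_a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9']+", span_b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Invented example sentences: the spans share "the" and "loss",
# out of seven distinct word types in total.
score = word_overlap("The company reported a loss.", "The loss surprised analysts.")
```

A practical variant would probably ignore function words such as "the", since they inflate the score without indicating a relation.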

My thesis, the feature values, the Perl scripts and the feature relevance scores can be downloaded below. Since the source data (the RST Discourse Treebank) is licensed, it cannot be included on the website.

Read my thesis (pdf; 706kB)

Download the data