Features for automatic discourse analysis of paragraphs

In this paper, we investigate which information is useful for detecting rhetorical (RST) relations between (Multi-) Sentential Discourse Units ((M-)SDUs), i.e. text spans consisting of one or more sentences, within the same paragraph. To do so, we simplified the task of discourse parsing to a decision problem: is a given (M-)SDU rhetorically related to the preceding or to the following (M-)SDU? Employing the RST Treebank (Carlson et al. 2003), we offered this choice to machine learning algorithms together with syntactic, lexical, referential, discourse and surface features. Next, we determined which features were most useful for predicting the direction of the relation by ranking them on the basis of three different metrics. Highly ranked features that predict the presence of a rhetorical relation are syntactic similarity, word overlap, word similarity, continuous punctuation and many reference features. Other highly ranked features predict the absence of a relation (i.e. they are used to introduce new topics or arguments): time references, proper nouns, definite articles, the word "further" and the verb "bring".
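To make the decision problem concrete, the following minimal sketch computes one lexical feature, word overlap, and uses it alone to choose a direction. It is an illustration under stated assumptions, not the method of the paper: the Jaccard definition of overlap, the simple tokenizer, and the names `tokens`, `word_overlap` and `attach_direction` are all hypothetical, and the study itself fed many features to machine learning algorithms rather than applying a single hand-written rule.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens (deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def word_overlap(span_a, span_b):
    """Jaccard overlap between the word sets of two text spans.

    Assumption: the paper's exact definition of word overlap is not given
    in the abstract; Jaccard is used here only as a stand-in.
    """
    words_a, words_b = tokens(span_a), tokens(span_b)
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def attach_direction(preceding, current, following):
    """Toy rule: relate `current` to the neighbour it shares more words with.

    Ties are resolved in favour of the preceding span (arbitrary choice).
    """
    left = word_overlap(preceding, current)
    right = word_overlap(current, following)
    return "preceding" if left >= right else "following"


if __name__ == "__main__":
    print(attach_direction(
        "The company reported losses.",
        "The losses were large.",
        "Further, analysts met on Monday.",
    ))
```

In a learning setup, a value such as `word_overlap` would be one column in a feature vector next to the syntactic, referential, discourse and surface features, with the classifier rather than a fixed rule deciding the direction.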


Reference: Daphne Theijssen, Hans van Halteren, Suzan Verberne and Lou Boves (2008). Features for automatic discourse analysis of paragraphs. In: Suzan Verberne, Hans van Halteren and Peter-Arno Coppen (eds.), Computational Linguistics in the Netherlands 2007, pp. 53-68.

