Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in online news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of this kind of interchange is so difficult.
In this work we propose one type of reader comment summary – the conversation overview summary – that aims to capture the key argumentative content of a reader comment conversation.
Based on a method we have developed to support humans in authoring conversation overview summaries, we have created a publicly available corpus – the first of its kind – of news articles plus comment sets, each multiply annotated, according to our method, with conversation overview summaries.
The SENSEI Annotated Corpus comprises annotations of 18 articles and reader comments from The Guardian; 15 articles were double annotated and 3 articles were triple annotated.
This results in a total of 39 sets of complete annotations. The evaluation guideline (PDF, 575KB) given to the annotators is available for download.
The method used to create this corpus is described in Barker et al. (2016). Please cite this paper if you use the SENSEI Annotated Corpus in your work. Full details are given in the reference publication section below.
The corpus contains one file, "listOfFiles.txt", and three folders: "Articles", "Comments" and "Annotations". We describe the format of each below.
First, "listOfFiles.txt" lists, in a tab-separated format, the article and comments files that correspond to each annotation. An excerpt of this file is shown below:
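Assuming the tab-separated layout described above – one annotation file per line, paired with its article and comments files – a minimal sketch for reading "listOfFiles.txt" might look as follows (the filenames and column order here are illustrative assumptions, not taken from the corpus):

```python
# Sketch: parse listOfFiles.txt, which maps each annotation file to its
# source article and comments files. Tab-separated; the column order
# used here (annotation, article, comments) is an assumption.
import csv
import io

sample = (
    "annotation_01.txt\tarticle_01.txt\tcomments_01.txt\n"
    "annotation_02.txt\tarticle_01.txt\tcomments_01.txt\n"
)

def parse_list_of_files(text):
    """Return a list of (annotation, article, comments) triples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [tuple(row) for row in reader if row]

rows = parse_list_of_files(sample)
```

Since articles may be multiply annotated, the same article and comments files can appear on more than one line, as in the sample above.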
The corpus contains annotations of 18 articles from The Guardian. Our articles page contains a list of the original URLs for all articles (and their corresponding reader comments). The "Articles" folder contains the plain-text version of these 18 articles.
An excerpt is shown below:
The "Comments" folder contains 18 files, each listing the reader comments for one article together with their metadata (comment Id, reply-to comment Id where available, number of recommendations, author, and time of writing) in a tab-separated format.
An excerpt is shown below:
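Given the tab-separated metadata fields listed above, one comment per line, a comments file could be read along these lines (the field order and sample values are assumptions for illustration):

```python
# Sketch: read one file from the "Comments" folder. Each line is assumed
# to hold one comment with its metadata, tab-separated; the field order
# below is an assumption, not the documented corpus layout.
import csv
import io

FIELDS = ["comment_id", "reply_to", "recommendations", "author", "time", "text"]

sample = (
    "c1\t\t12\talice\t2015-06-01 10:00\tFirst comment text\n"
    "c2\tc1\t3\tbob\t2015-06-01 10:05\tA reply to the first comment\n"
)

def read_comments(text):
    """Return one dict per comment, keyed by the assumed field names."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader if row]

comments = read_comments(sample)
# The reply-to Id, when present, lets us recover the reply structure.
replies = [c for c in comments if c["reply_to"]]
```

The reply-to field makes it possible to reconstruct the threaded structure of a conversation, which the annotation stages below operate over.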
The "Annotations" folder contains 39 files, i.e. 39 sets of annotations for the 18 sets of reader comments. Each annotation contains information from each of the four summary-writing stages:
Stage 1: Comment labeling
In this stage, annotators were asked to write a 'label' for each comment: a short, free-text annotation capturing its essential content. A label should provide a summary of the main "points, arguments or propositions" expressed in a comment. An excerpt of the comment label information is shown below:
Stage 2: Label grouping
In Stage 2, annotators were asked to group together labels (from Stage 1) that were similar or related. Annotators then provided a "Group Label" describing the common theme of the group in terms of, e.g., topic, propositions, contradicting viewpoints, or humour. Annotators may also split the labels in a group into "Sub-Groups" and assign a "SubGroup Label".
Alternatively, annotators may use sub-groups when organising similar comments, as in the example below:
Stage 3: Summary generation
In Stage 3, annotators wrote summaries based on their label-grouping analysis. Annotators first wrote an 'unconstrained summary', with no word-length requirement, and then wrote a 'constrained-length summary' of 150–250 words.
We show excerpts of both summaries below:
Stage 4: Back-linking
In this stage, annotators linked each sentence in the constrained-length summary back to the groups (or sub-groups) that informed its creation. Such a link implies that at least some of the labels in the group (or sub-group) played a part in supporting the sentence.
This information is shown below:
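The back-links amount to a mapping from each constrained-summary sentence to the groups or sub-groups that support it. Purely as an illustration (the on-disk annotation format is not reproduced here, and the group Ids and sentence indices below are hypothetical), this mapping could be held in memory as:

```python
# Illustrative sketch only: one way to represent Stage 4 back-links.
# Keys are summary-sentence indices; values are the (sub-)group Ids
# that informed that sentence. All Ids here are hypothetical.
back_links = {
    1: ["G1"],         # sentence 1 supported by group G1
    2: ["G2", "G3a"],  # sentence 2 supported by group G2 and sub-group G3a
}

def groups_for_sentence(links, sentence_idx):
    """Return the (sub-)groups that informed a given summary sentence."""
    return links.get(sentence_idx, [])
```

Traversing this mapping in reverse (group to sentences, then group to labels, then labels to comments) recovers which comments ultimately fed each summary sentence.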
SENSEI annotated corpus (zip folder)
By accessing this folder you agree not to redistribute any of the content within.
Please cite the following publication if you are using the SENSEI Annotated Corpus:
Barker, E., Paramita, M., Aker, A., Kurtic, E., Hepple, M., and Gaizauskas, R. 2016. The SENSEI annotated corpus: human summaries of reader comment conversations in on-line news. In: Proceedings of the 17th annual SIGdial meeting on discourse and dialogue (SIGdial 2016), Los Angeles, USA.
We would like to thank The Guardian for giving us permission to use their data and the European Commission for supporting this work, carried out as part of the FP7 SENSEI project, grant reference: FP7-ICT-610916. We would also like to thank all the annotators without whose work the SENSEI corpus would not have been created.
If you have any questions about the corpus, please contact Emma Barker (e.barker@sheffield.ac.uk) or Monica Paramita (m.paramita@sheffield.ac.uk).