Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in online news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of this kind of interchange is so difficult.
In this work we propose one type of reader comment summary – the conversation overview summary – that aims to capture the key argumentative content of a reader comment conversation.
Based on a method we have developed to support humans in authoring conversation overview summaries, we have created a publicly available corpus – the first of its kind – of news articles plus comment sets, each multiply annotated, according to our method, with conversation overview summaries.
The SENSEI Annotated Corpus comprises annotations of 18 articles and reader comments from The Guardian; 15 articles were double annotated and 3 articles were triple annotated.
This results in a total of 39 sets of complete annotations. The evaluation guideline (PDF, 575KB) given to the annotators is available for download.
The method used to create this corpus is described in Barker et al. (2016). Please cite this paper if you use the SENSEI Annotated Corpus in your work. Full details are given in the reference publication section below.
The corpus contains one file, "listOfFiles.txt", and three folders: "Articles", "Comments" and "Annotations". We describe the format of each below.
First, "listOfFiles.txt" lists, in a tab-separated format, the article and comments files that correspond to each annotation. An excerpt of this file is shown below:
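Assuming the tab-separated layout described above – one annotation file per line, paired with its article and comments files – a minimal sketch for reading "listOfFiles.txt" might look as follows (the filenames and column order here are illustrative assumptions, not taken from the corpus):

```python
# Sketch: parse listOfFiles.txt, which maps each annotation file to its
# source article and comments files. Tab-separated; the column order
# used here (annotation, article, comments) is an assumption.
import csv
import io

sample = (
    "annotation_01.txt\tarticle_01.txt\tcomments_01.txt\n"
    "annotation_02.txt\tarticle_01.txt\tcomments_01.txt\n"
)

def parse_list_of_files(text):
    """Return a list of (annotation, article, comments) triples."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [tuple(row) for row in reader if row]

rows = parse_list_of_files(sample)
```

Since articles may be multiply annotated, the same article and comments files can appear on more than one line, as in the sample above.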
The corpus contains annotations of 18 articles from The Guardian. Our articles page contains a list of the original URLs for all articles (and their corresponding reader comments). The "Articles" folder contains the plain-text version of these 18 articles.
An excerpt is shown below:
The "Comments" folder contains 18 files, each listing the reader comments for one article together with their metadata (comment Id, reply-to comment Id where available, number of recommendations, author, and time of writing) in a tab-separated format.
An excerpt is shown below:
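Given the tab-separated metadata fields listed above, one comment per line, a comments file could be read along these lines (the field order and sample values are assumptions for illustration):

```python
# Sketch: read one file from the "Comments" folder. Each line is assumed
# to hold one comment with its metadata, tab-separated; the field order
# below is an assumption, not the documented corpus layout.
import csv
import io

FIELDS = ["comment_id", "reply_to", "recommendations", "author", "time", "text"]

sample = (
    "c1\t\t12\talice\t2015-06-01 10:00\tFirst comment text\n"
    "c2\tc1\t3\tbob\t2015-06-01 10:05\tA reply to the first comment\n"
)

def read_comments(text):
    """Return one dict per comment, keyed by the assumed field names."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(FIELDS, row)) for row in reader if row]

comments = read_comments(sample)
# The reply-to Id, when present, lets us recover the reply structure.
replies = [c for c in comments if c["reply_to"]]
```

The reply-to field makes it possible to reconstruct the threaded structure of a conversation, which the annotation stages below operate over.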
The "Annotations" folder contains 39 files, i.e. 39 sets of annotations for the 18 sets of reader comments. Each annotation contains information from each of the four summary-writing stages:
Stage 1: Comment labeling
In this stage, annotators were asked to write a 'label' for each comment: a short, free-text annotation capturing its essential content. A label should provide a summary of the main "points, arguments or propositions" expressed in a comment. An excerpt of the comment label information is shown below:
Stage 2: Label grouping
In Stage 2, annotators were asked to group together labels (from Stage 1) that were similar or related. Annotators then provided a "Group Label" describing the common theme of the group in terms of, e.g., topic, propositions, contradicting viewpoints, or humour. Annotators may also split the labels in a group into "Sub-Groups" and assign a "SubGroup Label".
Alternatively, annotators may use sub-groups when organising similar comments, as in the example below:
Stage 3: Summary generation
In Stage 3, annotators wrote summaries based on their label-grouping analysis. Annotators first wrote an 'unconstrained summary', with no word-length requirement, and then wrote a 'constrained-length summary' of 150–250 words.
We show excerpts of both summaries below:
Stage 4: Back-linking
In this stage, annotators linked each sentence in the constrained-length summary back to the groups (or sub-groups) that informed its creation. Such a link implies that at least some of the labels in the group (or sub-group) played a part in supporting the sentence.
This information is shown below:
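The back-links amount to a mapping from each constrained-summary sentence to the groups or sub-groups that support it. Purely as an illustration (the on-disk annotation format is not reproduced here, and the group Ids and sentence indices below are hypothetical), this mapping could be held in memory as:

```python
# Illustrative sketch only: one way to represent Stage 4 back-links.
# Keys are summary-sentence indices; values are the (sub-)group Ids
# that informed that sentence. All Ids here are hypothetical.
back_links = {
    1: ["G1"],         # sentence 1 supported by group G1
    2: ["G2", "G3a"],  # sentence 2 supported by group G2 and sub-group G3a
}

def groups_for_sentence(links, sentence_idx):
    """Return the (sub-)groups that informed a given summary sentence."""
    return links.get(sentence_idx, [])
```

Traversing this mapping in reverse (group to sentences, then group to labels, then labels to comments) recovers which comments ultimately fed each summary sentence.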
SENSEI annotated corpus (zip folder)
By accessing this folder you agree not to redistribute any of the content within.
Please cite the following publication if you are using the SENSEI Annotated Corpus:
Barker, E., Paramita, M., Aker, A., Kurtic, E., Hepple, M., and Gaizauskas, R. 2016. The SENSEI annotated corpus: human summaries of reader comment conversations in on-line news. In: Proceedings of the 17th annual SIGdial meeting on discourse and dialogue (SIGdial 2016), Los Angeles, USA.
We would like to thank The Guardian for giving us permission to use their data and the European Commission for supporting this work, carried out as part of the FP7 SENSEI project, grant reference: FP7-ICT-610916. We would also like to thank all the annotators without whose work the SENSEI corpus would not have been created.
If you have any questions about the corpus, please contact Emma Barker (e.barker@sheffield.ac.uk) or Monica Paramita (m.paramita@sheffield.ac.uk).