Warning: this page may contain offensive language samples in Turkish
This page describes the annotation of a Turkish offensive language corpus. The corpus consists of randomly sampled tweets, annotated in a similar way to OffensEval and GermEval.
For more details, see the paper cited below.
The data
We distribute the data in a few alternative formats.
- troff-v1.0.tsv.gz contains the full data set as described in the paper. The file is a TSV file with four fields: id, timestamp, text, and label. Note that this file uses quoting and preserves newlines in the original tweets. Here the labeling is “flat”:
  - non: not offensive
  - prof: profanity, or non-targeted offense
  - grp: offense towards a group
  - indv: offense towards an individual
  - oth: offense towards another (non-human) entity, often an event or organization
- offenseval2020-turkish.zip contains the data as used in the OffensEval 2020 shared task. Please see the enclosed README file and the official OffensEval web page for further information.
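Because troff-v1.0.tsv.gz uses quoting and keeps newlines inside tweets, splitting the file on line breaks and tabs will break some records. The following is a minimal sketch of one way to read it with Python's standard library. It rests on assumptions not stated on this page: that the quoting is CSV-style double quotes, that the encoding is UTF-8, and that a header row, if present, repeats the four field names. The names read_troff and label_counts are hypothetical helpers, not part of the distribution.

```python
import csv
import gzip
from collections import Counter

# Field order as listed on this page.
FIELDS = ("id", "timestamp", "text", "label")


def read_troff(path):
    """Yield one dict per tweet from the gzipped TSV file.

    Assumptions (not confirmed by the page): double-quote quoting,
    UTF-8 encoding, and an optional header row naming the fields.
    """
    # newline="" leaves line endings untouched so that the csv module
    # can keep newlines that occur inside quoted tweet text.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quotechar='"')
        for row in reader:
            if tuple(row) == FIELDS:
                continue  # skip a header row, if the file has one
            if len(row) != len(FIELDS):
                raise ValueError(f"expected {len(FIELDS)} fields, got {len(row)}")
            yield dict(zip(FIELDS, row))


def label_counts(path):
    """Count tweets per flat label (non, prof, grp, indv, oth)."""
    return Counter(record["label"] for record in read_troff(path))
```

Calling label_counts("troff-v1.0.tsv.gz") on a local copy of the file would give the number of tweets per label. If the file turns out to use a different quoting convention, the quotechar argument is the place to adjust.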
The annotations are distributed under the terms of the Creative Commons Attribution License (CC-BY). Please cite the following paper if you use this resource.
@inproceedings{coltekin2020lrec,
author = {\c{C}\"{o}ltekin, \c{C}a\u{g}r{\i}},
year = {2020},
title = {A Corpus of Turkish Offensive Language on Social Media},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
pages = {6174--6184},
address = {Marseille, France},
url = {https://www.aclweb.org/anthology/2020.lrec-1.758},
}