civil_comments

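Description:

This version of the CivilComments dataset provides access to the seven primary labels annotated by crowd workers. The toxicity label and the other tags are values between 0 and 1 that give the fraction of annotators who assigned that attribute to the comment text.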
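Because each label is a fraction of annotators rather than a hard class, a classifier that needs 0/1 targets has to pick a cutoff. The following is a minimal sketch, assuming a cutoff of 0.5 (at least half of the annotators agreed); the cutoff and the helper name `to_binary_label` are illustrative choices, not part of the dataset.

```python
def to_binary_label(fraction: float, threshold: float = 0.5) -> int:
    """Map an annotator fraction in [0, 1] to a hard 0/1 label.

    The 0.5 default is an assumption (at least half of the annotators agreed).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"expected a fraction in [0, 1], got {fraction}")
    return int(fraction >= threshold)


# Hypothetical example that reuses three of the dataset's label keys.
example = {"toxicity": 0.7, "insult": 0.6, "threat": 0.0}
binary = {name: to_binary_label(value) for name, value in example.items()}
print(binary)  # {'toxicity': 1, 'insult': 1, 'threat': 0}
```

Keeping the raw fractions available alongside the binarized labels may be useful, since they carry information about annotator disagreement.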

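The other tags are available for only a fraction of the input examples and are currently ignored in the main dataset. The CivilCommentsIdentities set includes those labels, but it consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.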

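The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created between 2015 and 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, it chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset with additional labels for toxicity, identity mentions, and covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. It is released under CC0, as is the underlying comment text.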

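For comments whose parent_id also appears in the Civil Comments data, the text of the previous comment is provided as the "parent_text" feature. Note that the splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when making the labels.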
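One way to gauge the leakage described above is to count evaluation comments whose parent comment sits in the training split. The sketch below works on plain dictionaries and assumes that `id` (a string) and `parent_id` (an integer) share one id space once converted to strings; that assumption, the helper name, and the toy records are all hypothetical.

```python
def parents_seen_in_train(train_examples, eval_examples):
    """Return ids of eval examples whose parent comment is in the training split.

    Assumes `id` is a string and `parent_id` an integer, and that both refer
    to the same id space after conversion to strings.
    """
    train_ids = {ex["id"] for ex in train_examples}
    return [ex["id"] for ex in eval_examples if str(ex["parent_id"]) in train_ids]


# Hypothetical toy records; real ones would come from the dataset splits.
train = [{"id": "101", "parent_id": 0}, {"id": "102", "parent_id": 101}]
test = [{"id": "201", "parent_id": 102}, {"id": "202", "parent_id": 999}]
print(parents_seen_in_train(train, test))  # ['201']
```

Evaluation examples flagged this way could be reported separately or have `parent_text` withheld, depending on how strict the evaluation needs to be.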

Homepage: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Source code: tfds.text.CivilComments
Versions:
- 1.0.0: Initial full release.
- 1.0.1: Added a unique ID for each comment.
- 1.1.0: Added the CivilCommentsCovert config.
- 1.1.1: Added the CivilCommentsCovert config with the correct checksum.
- 1.1.2: Added a separate citation for the CivilCommentsCovert dataset.
- 1.1.3: Corrected the id type from float to string.
- 1.2.0: Added toxic spans, context, and parent comment text features.
- 1.2.1: Fixed incorrect formatting in the context splits.
- 1.2.2: Updated to reflect that the context config has only a train split.
- 1.2.3: Added a warning to CivilCommentsCovert while a data issue is being fixed.
- 1.2.4 (default): Added publication IDs and comment timestamps.
Download size: 427.41 MiB
Figure (tfds.show_examples): Not supported.
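The configs and versions listed on this page can be selected through the usual TFDS naming scheme, `name/config:version`. The sketch below assumes the `tensorflow_datasets` package is installed; the helper names `dataset_name` and `load_split` are illustrative, and the import is deferred because loading triggers the download noted above.

```python
from typing import Optional


def dataset_name(config: str = "CivilComments", version: Optional[str] = None) -> str:
    """Build a TFDS name of the form 'civil_comments/<config>[:<version>]'."""
    name = f"civil_comments/{config}"
    return f"{name}:{version}" if version else name


def load_split(config: str = "CivilComments", split: str = "train",
               version: Optional[str] = None):
    """Load one split; downloads and prepares the data on first use."""
    import tensorflow_datasets as tfds  # imported lazily: heavy dependency

    return tfds.load(dataset_name(config, version), split=split)


print(dataset_name("CivilCommentsToxicSpans", "1.2.4"))
# civil_comments/CivilCommentsToxicSpans:1.2.4
```

Calling `load_split("CivilCommentsInContext")` would return a `tf.data.Dataset` of feature dictionaries. Passing `as_supervised=True` to `tfds.load` instead yields tuples built from the supervised keys listed for each config.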

civil_comments/CivilComments (default config)

Config description: The CivilComments set here includes all the data, but only the basic seven labels (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit).
Dataset size: 1.54 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	97,320
`'train'`	1,804,874
`'validation'`	97,320

Feature structure:

FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})

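Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
created_date	Tensor	string
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
obscene	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32

Supervised keys (see the as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):

Citation: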

@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

civil_comments/CivilCommentsIdentities

Config description: The CivilCommentsIdentities set here includes an extended set of identity labels in addition to the basic seven labels. However, it includes only the subset (roughly a quarter) of the data that has all these features.
Dataset size: 654.97 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	21,577
`'train'`	405,130
`'validation'`	21,293

Feature structure:

FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'created_date': string,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})
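
The identity columns in the structure above make it possible to look at how labels behave on comments that mention a given group. The sketch below computes the share of toxic comments within one identity subgroup, using plain dictionaries in place of dataset records. It assumes a 0.5 cutoff for both the identity column and `toxicity`; the cutoff, the helper name, and the toy records are hypothetical.

```python
def subgroup_toxic_rate(examples, identity, threshold=0.5):
    """Share of comments mentioning `identity` that are labeled toxic.

    Both the identity column and `toxicity` are thresholded at `threshold`,
    which is an assumed cutoff. Returns None when no comment mentions the
    identity.
    """
    subgroup = [ex for ex in examples if ex[identity] >= threshold]
    if not subgroup:
        return None
    toxic = sum(ex["toxicity"] >= threshold for ex in subgroup)
    return toxic / len(subgroup)


# Hypothetical records using two of the identity columns listed above.
examples = [
    {"female": 1.0, "muslim": 0.0, "toxicity": 0.8},
    {"female": 0.6, "muslim": 0.0, "toxicity": 0.1},
    {"female": 0.0, "muslim": 0.0, "toxicity": 0.9},
]
print(subgroup_toxic_rate(examples, "female"))  # 0.5
print(subgroup_toxic_rate(examples, "muslim"))  # None
```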

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
asian	Tensor	float32
atheist	Tensor	float32
bisexual	Tensor	float32
black	Tensor	float32
buddhist	Tensor	float32
christian	Tensor	float32
created_date	Tensor	string
female	Tensor	float32
heterosexual	Tensor	float32
hindu	Tensor	float32
homosexual_gay_or_lesbian	Tensor	float32
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
intellectual_or_learning_disability	Tensor	float32
jewish	Tensor	float32
latino	Tensor	float32
male	Tensor	float32
muslim	Tensor	float32
obscene	Tensor	float32
other_disability	Tensor	float32
other_gender	Tensor	float32
other_race_or_ethnicity	Tensor	float32
other_religion	Tensor	float32
other_sexual_orientation	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
physical_disability	Tensor	float32
psychiatric_or_mental_illness	Tensor	float32
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32
transgender	Tensor	float32
white	Tensor	float32

Supervised keys (see the as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):

Citation:

@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

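civil_comments/CivilCommentsCovert

Config description: WARNING: CivilCommentsCovert has a potential data quality issue that is actively being fixed (06/28/22). The underlying data may change.

The CivilCommentsCovert set is a subset of CivilCommentsIdentities in which about 20% of the train and test splits are further annotated for covert offensiveness, in addition to the toxicity and identity labels. Raters were asked to categorize each comment as explicitly offensive, implicitly offensive, not offensive, or not sure, and to indicate whether it contained different types of covert offensiveness. The full annotation procedure is detailed in a paper listed at https://sites.google.com/corp/view/hciandnlp/accepted-papers.

Dataset size: 97.83 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	2,455
`'train'`	48,074

Feature structure: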

FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'covert_emoticons_emojis': float32,
    'covert_humor': float32,
    'covert_masked_harm': float32,
    'covert_microaggression': float32,
    'covert_obfuscation': float32,
    'covert_political': float32,
    'covert_sarcasm': float32,
    'created_date': string,
    'explicitly_offensive': float32,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'implicitly_offensive': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'not_offensive': float32,
    'not_sure_offensive': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
asian	Tensor	float32
atheist	Tensor	float32
bisexual	Tensor	float32
black	Tensor	float32
buddhist	Tensor	float32
christian	Tensor	float32
covert_emoticons_emojis	Tensor	float32
covert_humor	Tensor	float32
covert_masked_harm	Tensor	float32
covert_microaggression	Tensor	float32
covert_obfuscation	Tensor	float32
covert_political	Tensor	float32
covert_sarcasm	Tensor	float32
created_date	Tensor	string
explicitly_offensive	Tensor	float32
female	Tensor	float32
heterosexual	Tensor	float32
hindu	Tensor	float32
homosexual_gay_or_lesbian	Tensor	float32
id	Tensor	string
identity_attack	Tensor	float32
implicitly_offensive	Tensor	float32
insult	Tensor	float32
intellectual_or_learning_disability	Tensor	float32
jewish	Tensor	float32
latino	Tensor	float32
male	Tensor	float32
muslim	Tensor	float32
not_offensive	Tensor	float32
not_sure_offensive	Tensor	float32
obscene	Tensor	float32
other_disability	Tensor	float32
other_gender	Tensor	float32
other_race_or_ethnicity	Tensor	float32
other_religion	Tensor	float32
other_sexual_orientation	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
physical_disability	Tensor	float32
psychiatric_or_mental_illness	Tensor	float32
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32
transgender	Tensor	float32
white	Tensor	float32

Supervised keys (see the as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):

Citation:

@inproceedings{lees-etal-2021-capturing,
    title = "Capturing Covertly Toxic Speech via Crowdsourcing",
    author = "Lees, Alyssa  and
      Borkan, Daniel  and
      Kivlichan, Ian  and
      Nario, Jorge  and
      Goyal, Tesh",
    booktitle = "Proceedings of the First Workshop on Bridging Human{--}Computer Interaction and Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.hcinlp-1.3",
    pages = "14--20"
}

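civil_comments/CivilCommentsToxicSpans

Config description: CivilComments Toxic Spans is a subset of CivilComments that is labeled at the span level. The indices of all character (Unicode code point) boundaries that were tagged as toxic by a majority of the annotators are returned in the 'spans' feature.
Dataset size: 5.81 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	2,000
`'train'`	7,939
`'validation'`	682

Feature structure: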

FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'spans': Tensor(shape=(None,), dtype=int32),
    'text': Text(shape=(), dtype=string),
})
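
To inspect what the `spans` feature marks, the offsets can be grouped into contiguous runs and used to slice the comment text. The sketch below assumes that each entry in `spans` is the 0-based offset of one toxic character and that `text` has already been decoded to a Python `str`, so that indexing counts Unicode code points. The helper name and the sample sentence are hypothetical.

```python
def spans_to_substrings(text, spans):
    """Group character offsets into contiguous runs and slice `text`."""
    substrings = []
    start = prev = None
    for offset in sorted(spans):
        if start is None:
            start = prev = offset          # first offset opens a run
        elif offset == prev + 1:
            prev = offset                  # offset extends the current run
        else:
            substrings.append(text[start:prev + 1])
            start = prev = offset          # gap found: open a new run
    if start is not None:
        substrings.append(text[start:prev + 1])
    return substrings


text = "what a dumb idea"
print(spans_to_substrings(text, [7, 8, 9, 10]))  # ['dumb']
```

Text read through TFDS arrives as UTF-8 bytes, so it would need to be decoded first (for example with `.decode("utf-8")`) for the offsets to line up under this assumption.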

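Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
article_id	Tensor		int32
created_date	Tensor		string
id	Tensor		string
parent_id	Tensor		int32
parent_text	Text		string
publication_id	Tensor		string
spans	Tensor	(None,)	int32
text	Text		string

Supervised keys (see the as_supervised doc): ('text', 'spans')
Examples (tfds.as_dataframe):

Citation: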

@inproceedings{pavlopoulos-etal-2021-semeval,
    title = "{S}em{E}val-2021 Task 5: Toxic Spans Detection",
    author = "Pavlopoulos, John  and Sorensen, Jeffrey  and Laugier, L{\'e}o and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.semeval-1.6",
    doi = "10.18653/v1/2021.semeval-1.6",
    pages = "59--69",
}

civil_comments/CivilCommentsInContext

Config description: CivilComments in Context is a subset of CivilComments that was labeled with the parent_text made available to the labelers. It includes a contextual_toxicity feature.
Dataset size: 9.63 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	9,969

Feature structure:

FeaturesDict({
    'article_id': int32,
    'contextual_toxicity': float32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})
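
Since this config carries both `toxicity` and `contextual_toxicity`, the two scores can be compared to find comments whose rating changed once annotators saw the parent comment. The sketch below uses plain dictionaries in place of dataset records; the helper names, the 0.5 minimum shift, and the toy records are hypothetical.

```python
def context_shift(example):
    """contextual_toxicity minus toxicity; positive means context raised the score."""
    return example["contextual_toxicity"] - example["toxicity"]


def context_sensitive(examples, min_shift=0.5):
    """Ids of examples whose score moved by at least `min_shift` (assumed cutoff)."""
    return [ex["id"] for ex in examples if abs(context_shift(ex)) >= min_shift]


# Hypothetical records with the two score columns from the structure above.
examples = [
    {"id": "a", "toxicity": 0.0, "contextual_toxicity": 0.75},
    {"id": "b", "toxicity": 0.5, "contextual_toxicity": 0.5},
]
print(context_sensitive(examples))  # ['a']
```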

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
contextual_toxicity	Tensor	float32
created_date	Tensor	string
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
obscene	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32

Supervised keys (see the as_supervised doc): ('text', 'toxicity')
Examples (tfds.as_dataframe):

Citation:

@misc{pavlopoulos2020toxicity,
    title={Toxicity Detection: Does Context Really Matter?},
    author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
    year={2020}, eprint={2006.00998}, archivePrefix={arXiv}, primaryClass={cs.CL}
}