civil_comments

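Description:

This version of the CivilComments dataset provides access to the seven primary labels that were annotated by crowd workers. The toxicity and other tags are values between 0 and 1 indicating the fraction of annotators who assigned these attributes to the comment text.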
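
Because each label is a fraction of annotators rather than a hard class, a binary target needs a cutoff. The sketch below is a minimal illustration, assuming a threshold of 0.5; `binarize_labels` is a hypothetical helper (not part of TFDS) and the example values are made up.

```python
# The seven primary label names of the default config.
PRIMARY_LABELS = (
    "toxicity", "severe_toxicity", "obscene", "threat",
    "insult", "identity_attack", "sexual_explicit",
)


def binarize_labels(example, threshold=0.5):
    """Map each fractional label to 1 if it reaches `threshold`, else 0."""
    return {name: int(example[name] >= threshold) for name in PRIMARY_LABELS}


# Made-up annotator fractions, for illustration only.
example = {
    "toxicity": 0.8, "severe_toxicity": 0.1, "obscene": 0.5, "threat": 0.0,
    "insult": 0.7, "identity_attack": 0.2, "sexual_explicit": 0.0,
}
binary = binarize_labels(example)
```

The threshold is a modeling choice: a lower value treats a comment as positive when only a minority of annotators flagged it.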

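The other tags are only available for a fraction of the input examples. They are currently ignored in the main dataset. The CivilCommentsIdentities set includes those labels, but consists only of the subset of the data that has them. The other attributes that were part of the original CivilComments release are included only in the raw data. See the Kaggle documentation for more details about the available features.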

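The comments in this dataset come from an archive of the Civil Comments platform, a commenting plugin for independent news sites. These public comments were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. When Civil Comments shut down in 2017, they chose to make the public comments available in a lasting open archive to enable future research. The original data, published on figshare, includes the public comment text and some associated metadata such as article IDs, publication IDs, timestamps, and commenter-generated "civility" labels, but does not include user IDs. Jigsaw extended this dataset by adding labels for toxicity, identity mentions, and covert offensiveness. This dataset is an exact replica of the data released for the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. It is released under CC0, as is the underlying comment text.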

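For comments whose parent_id also appears in the civil comments data, the text of the previous comment is provided as the "parent_text" feature. The splits were made without regard to this information, so using previous comments may leak some information. The annotators did not have access to the parent text when assigning the labels.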
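
One way to gauge the leakage risk described above is to count evaluation comments whose parent sits in the training split. The sketch below is only an illustration: `find_parent_leaks` is a hypothetical helper, the records are made up, and it assumes that `parent_id` (int32) refers to the `id` (string) of another comment once both are compared as strings.

```python
def find_parent_leaks(train_examples, eval_examples):
    """Return ids of evaluation comments whose parent comment is in the training split."""
    # Assumption: parent_id matches another comment's id after string conversion.
    train_ids = {str(ex["id"]) for ex in train_examples}
    return [ex["id"] for ex in eval_examples if str(ex["parent_id"]) in train_ids]


# Made-up records, for illustration only.
train = [{"id": "10", "parent_id": 0}, {"id": "11", "parent_id": 10}]
test = [{"id": "12", "parent_id": 11}, {"id": "13", "parent_id": 99}]
leaks = find_parent_leaks(train, test)
```

Comments flagged this way could be evaluated without `parent_text`, or reported separately.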

Homepage: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data
Source code: tfds.text.CivilComments
Versions:
- 1.0.0: Initial full release.
- 1.0.1: Added a unique id for each comment.
- 1.1.0: Added the CivilCommentsCovert config.
- 1.1.1: Added the CivilCommentsCovert config with the correct checksum.
- 1.1.2: Added a separate citation for the CivilCommentsCovert dataset.
- 1.1.3: Corrected the id type from float to string.
- 1.2.0: Added toxic spans, context, and parent comment text features.
- 1.2.1: Fixed incorrect formatting in the context splits.
- 1.2.2: Updated to reflect that context has only a train split.
- 1.2.3: Added a warning to CivilCommentsCovert while a data issue is being fixed.
- 1.2.4 (default): Added publication IDs and comment timestamps.
Download size: 427.41 MiB
Figure (tfds.show_examples): Not supported.

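civil_comments/CivilComments (default config)

Config description: The CivilComments set here includes all the data, but only the seven basic labels (toxicity, severe_toxicity, obscene, threat, insult, identity_attack, and sexual_explicit).
Dataset size: 1.54 GiB
Auto-cached (documentation): No
Splits:

Split	Examples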
`'test'`	97,320
`'train'`	1,804,874
`'validation'`	97,320

Feature structure:

FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})

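Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
created_date	Tensor	string
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
obscene	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32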

Supervised keys (see the as_supervised documentation): ('text', 'toxicity')

Citation:

@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
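
The supervised keys ('text', 'toxicity') of the default config mean that loading with `as_supervised=True` yields (text, toxicity) pairs. The following is a minimal sketch, not the only way to load the data: `dataset_name` and `load_supervised` are hypothetical helpers, and the `tensorflow_datasets` import is deferred so that building the name string works without the library installed.

```python
def dataset_name(config="CivilComments", version=None):
    """Build a TFDS name of the form 'civil_comments/<config>[:<version>]'."""
    name = f"civil_comments/{config}"
    return f"{name}:{version}" if version else name


def load_supervised(split="train", config="CivilComments", version=None):
    """Load (text, toxicity) pairs; downloads and prepares the data on first use."""
    import tensorflow_datasets as tfds  # lazy import: only needed when loading

    return tfds.load(dataset_name(config, version), split=split, as_supervised=True)


# Pin the default version listed on this page.
name = dataset_name(version="1.2.4")
```

Calling `load_supervised("validation")` would return a `tf.data.Dataset` over the 97,320 validation examples; the other configs can be selected through the `config` argument.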

civil_comments/CivilCommentsIdentities

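Config description: The CivilCommentsIdentities set here includes an extended set of identity labels in addition to the seven basic labels. However, it only includes the subset (roughly a quarter) of the data that has all of these features.
Dataset size: 654.97 MiB
Auto-cached (documentation): No
Splits:

Split	Examples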
`'test'`	21,577
`'train'`	405,130
`'validation'`	21,293

Feature structure:

FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'created_date': string,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})

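Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
asian	Tensor	float32
atheist	Tensor	float32
bisexual	Tensor	float32
black	Tensor	float32
buddhist	Tensor	float32
christian	Tensor	float32
created_date	Tensor	string
female	Tensor	float32
heterosexual	Tensor	float32
hindu	Tensor	float32
homosexual_gay_or_lesbian	Tensor	float32
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
intellectual_or_learning_disability	Tensor	float32
jewish	Tensor	float32
latino	Tensor	float32
male	Tensor	float32
muslim	Tensor	float32
obscene	Tensor	float32
other_disability	Tensor	float32
other_gender	Tensor	float32
other_race_or_ethnicity	Tensor	float32
other_religion	Tensor	float32
other_sexual_orientation	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
physical_disability	Tensor	float32
psychiatric_or_mental_illness	Tensor	float32
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32
transgender	Tensor	float32
white	Tensor	float32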

Supervised keys (see the as_supervised documentation): ('text', 'toxicity')

Citation:

@article{DBLP:journals/corr/abs-1903-04561,
  author    = {Daniel Borkan and
               Lucas Dixon and
               Jeffrey Sorensen and
               Nithum Thain and
               Lucy Vasserman},
  title     = {Nuanced Metrics for Measuring Unintended Bias with Real Data for Text
               Classification},
  journal   = {CoRR},
  volume    = {abs/1903.04561},
  year      = {2019},
  url       = {http://arxiv.org/abs/1903.04561},
  archivePrefix = {arXiv},
  eprint    = {1903.04561},
  timestamp = {Sun, 31 Mar 2019 19:01:24 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/abs-1903-04561},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
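
The identity labels in CivilCommentsIdentities make it possible to look at toxicity within a subgroup of comments. The sketch below is a simple illustration, not one of the metrics from the cited paper: `subgroup_toxic_rate` is a hypothetical helper, both 0.5 thresholds are assumptions, and the records are made up.

```python
def subgroup_toxic_rate(examples, identity, identity_threshold=0.5, toxic_threshold=0.5):
    """Fraction of comments mentioning `identity` that count as toxic, or None if none do."""
    subgroup = [ex for ex in examples if ex[identity] >= identity_threshold]
    if not subgroup:
        return None
    toxic = sum(ex["toxicity"] >= toxic_threshold for ex in subgroup)
    return toxic / len(subgroup)


# Made-up records, for illustration only.
examples = [
    {"female": 1.0, "toxicity": 0.9},
    {"female": 0.6, "toxicity": 0.1},
    {"female": 0.0, "toxicity": 0.8},
]
rate = subgroup_toxic_rate(examples, "female")
```

Comparing this rate across identity columns gives a first, coarse view of how labels are distributed before any model is trained.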

civil_comments/CivilCommentsCovert

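Config description: WARNING: there is a potential data quality issue with CivilCommentsCovert that is actively being fixed (06/28/22). The underlying data may change!

The CivilCommentsCovert set is a subset of CivilCommentsIdentities in which ~20% of the train and test splits are further annotated for covert offensiveness, in addition to the toxicity and identity labels. Raters were asked to categorize each comment as explicitly offensive, implicitly offensive, not offensive, or not sure if offensive, and to indicate whether it contained different types of covert offensiveness. The full annotation procedure is detailed in a forthcoming paper (see https://sites.google.com/corp/view/hciandnlp/accepted-papers).

Dataset size: 97.83 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples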
`'test'`	2,455
`'train'`	48,074

Feature structure:

FeaturesDict({
    'article_id': int32,
    'asian': float32,
    'atheist': float32,
    'bisexual': float32,
    'black': float32,
    'buddhist': float32,
    'christian': float32,
    'covert_emoticons_emojis': float32,
    'covert_humor': float32,
    'covert_masked_harm': float32,
    'covert_microaggression': float32,
    'covert_obfuscation': float32,
    'covert_political': float32,
    'covert_sarcasm': float32,
    'created_date': string,
    'explicitly_offensive': float32,
    'female': float32,
    'heterosexual': float32,
    'hindu': float32,
    'homosexual_gay_or_lesbian': float32,
    'id': string,
    'identity_attack': float32,
    'implicitly_offensive': float32,
    'insult': float32,
    'intellectual_or_learning_disability': float32,
    'jewish': float32,
    'latino': float32,
    'male': float32,
    'muslim': float32,
    'not_offensive': float32,
    'not_sure_offensive': float32,
    'obscene': float32,
    'other_disability': float32,
    'other_gender': float32,
    'other_race_or_ethnicity': float32,
    'other_religion': float32,
    'other_sexual_orientation': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'physical_disability': float32,
    'psychiatric_or_mental_illness': float32,
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
    'transgender': float32,
    'white': float32,
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
asian	Tensor	float32
atheist	Tensor	float32
bisexual	Tensor	float32
black	Tensor	float32
buddhist	Tensor	float32
christian	Tensor	float32
covert_emoticons_emojis	Tensor	float32
covert_humor	Tensor	float32
covert_masked_harm	Tensor	float32
covert_microaggression	Tensor	float32
covert_obfuscation	Tensor	float32
covert_political	Tensor	float32
covert_sarcasm	Tensor	float32
created_date	Tensor	string
explicitly_offensive	Tensor	float32
female	Tensor	float32
heterosexual	Tensor	float32
hindu	Tensor	float32
homosexual_gay_or_lesbian	Tensor	float32
id	Tensor	string
identity_attack	Tensor	float32
implicitly_offensive	Tensor	float32
insult	Tensor	float32
intellectual_or_learning_disability	Tensor	float32
jewish	Tensor	float32
latino	Tensor	float32
male	Tensor	float32
muslim	Tensor	float32
not_offensive	Tensor	float32
not_sure_offensive	Tensor	float32
obscene	Tensor	float32
other_disability	Tensor	float32
other_gender	Tensor	float32
other_race_or_ethnicity	Tensor	float32
other_religion	Tensor	float32
other_sexual_orientation	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
physical_disability	Tensor	float32
psychiatric_or_mental_illness	Tensor	float32
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32
transgender	Tensor	float32
white	Tensor	float32

Supervised keys (see the as_supervised documentation): ('text', 'toxicity')

Citation:

@inproceedings{lees-etal-2021-capturing,
    title = "Capturing Covertly Toxic Speech via Crowdsourcing",
    author = "Lees, Alyssa  and
      Borkan, Daniel  and
      Kivlichan, Ian  and
      Nario, Jorge  and
      Goyal, Tesh",
    booktitle = "Proceedings of the First Workshop on Bridging Human{--}Computer Interaction and Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.hcinlp-1.3",
    pages = "14--20"
}
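
The four offensiveness categories in CivilCommentsCovert are stored as separate float32 features. A minimal sketch of picking the dominant category for a comment follows; it assumes each field holds the share of raters who chose that category, `dominant_category` is a hypothetical helper, and the values are made up.

```python
# Feature names of the four rater categories in the CivilCommentsCovert config.
OFFENSIVENESS_FIELDS = (
    "explicitly_offensive",
    "implicitly_offensive",
    "not_offensive",
    "not_sure_offensive",
)


def dominant_category(example):
    """Return the category with the highest value; ties go to the earliest field."""
    return max(OFFENSIVENESS_FIELDS, key=lambda name: example[name])


# Made-up rater shares, for illustration only.
example = {
    "explicitly_offensive": 0.2,
    "implicitly_offensive": 0.6,
    "not_offensive": 0.1,
    "not_sure_offensive": 0.1,
}
category = dominant_category(example)
```

Given the data quality warning on this config, results derived this way should be treated as provisional.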

civil_comments/CivilCommentsToxicSpans

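Config description: CivilComments Toxic Spans is a subset of CivilComments that is labeled at the span level. The indices of all character (Unicode code point) positions that a majority of the annotators tagged as toxic are returned in the 'spans' feature.
Dataset size: 5.81 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples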
`'test'`	2,000
`'train'`	7,939
`'validation'`	682

Feature structure:

FeaturesDict({
    'article_id': int32,
    'created_date': string,
    'id': string,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'spans': Tensor(shape=(None,), dtype=int32),
    'text': Text(shape=(), dtype=string),
})

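Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
article_id	Tensor		int32
created_date	Tensor		string
id	Tensor		string
parent_id	Tensor		int32
parent_text	Text		string
publication_id	Tensor		string
spans	Tensor	(None,)	int32
text	Text		string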

Supervised keys (see the as_supervised documentation): ('text', 'spans')

Citation:

@inproceedings{pavlopoulos-etal-2021-semeval,
    title = "{S}em{E}val-2021 Task 5: Toxic Spans Detection",
    author = "Pavlopoulos, John  and Sorensen, Jeffrey  and Laugier, L{\'e}o and Androutsopoulos, Ion",
    booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.semeval-1.6",
    doi = "10.18653/v1/2021.semeval-1.6",
    pages = "59--69",
}
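
Since the 'spans' feature of CivilCommentsToxicSpans holds individual character indices, recovering the toxic substrings means grouping consecutive indices into runs. The sketch below shows one way to do that; `extract_spans` is a hypothetical helper and the sentence is made up. Python `str` indexing counts Unicode code points, which matches the description of the feature.

```python
def extract_spans(text, offsets):
    """Group consecutive character offsets into runs and return the matching substrings."""
    pieces = []
    start = prev = None
    for pos in sorted(set(offsets)):
        if start is None:
            start = prev = pos
        elif pos == prev + 1:
            prev = pos  # still inside the current run
        else:
            pieces.append(text[start:prev + 1])  # close the finished run
            start = prev = pos
    if start is not None:
        pieces.append(text[start:prev + 1])
    return pieces


# Made-up comment and offsets, for illustration only.
text = "You are a complete idiot and a fool"
offsets = list(range(19, 24)) + list(range(31, 35))
toxic_parts = extract_spans(text, offsets)
```

If the text arrives as UTF-8 bytes (for example after converting a dataset to NumPy), it should be decoded to `str` first so that the indices refer to code points rather than bytes.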

civil_comments/CivilCommentsInContext

Config description: CivilComments in Context is a subset of CivilComments that was labeled with the parent_text made available to the labelers. It includes a contextual_toxicity feature.
Dataset size: 9.63 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'train'`	9,969

Feature structure:

FeaturesDict({
    'article_id': int32,
    'contextual_toxicity': float32,
    'created_date': string,
    'id': string,
    'identity_attack': float32,
    'insult': float32,
    'obscene': float32,
    'parent_id': int32,
    'parent_text': Text(shape=(), dtype=string),
    'publication_id': string,
    'severe_toxicity': float32,
    'sexual_explicit': float32,
    'text': Text(shape=(), dtype=string),
    'threat': float32,
    'toxicity': float32,
})

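Feature documentation:

Feature	Class	Dtype
	FeaturesDict
article_id	Tensor	int32
contextual_toxicity	Tensor	float32
created_date	Tensor	string
id	Tensor	string
identity_attack	Tensor	float32
insult	Tensor	float32
obscene	Tensor	float32
parent_id	Tensor	int32
parent_text	Text	string
publication_id	Tensor	string
severe_toxicity	Tensor	float32
sexual_explicit	Tensor	float32
text	Text	string
threat	Tensor	float32
toxicity	Tensor	float32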

Supervised keys (see the as_supervised documentation): ('text', 'toxicity')

Citation:

@misc{pavlopoulos2020toxicity,
    title={Toxicity Detection: Does Context Really Matter?},
    author={John Pavlopoulos and Jeffrey Sorensen and Lucas Dixon and Nithum Thain and Ion Androutsopoulos},
    year={2020}, eprint={2006.00998}, archivePrefix={arXiv}, primaryClass={cs.CL}
}
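
CivilCommentsInContext carries both `toxicity` (labeled without the parent text) and `contextual_toxicity` (labeled with it), so the two can be compared per comment. The sketch below lists comments whose binary label changes once context is shown; `flips_with_context` is a hypothetical helper, the 0.5 threshold is an assumption, and the records are made up.

```python
def flips_with_context(example, threshold=0.5):
    """True if the thresholded label differs between the two annotation settings."""
    without_context = example["toxicity"] >= threshold
    with_context = example["contextual_toxicity"] >= threshold
    return without_context != with_context


# Made-up records, for illustration only.
examples = [
    {"id": "a", "toxicity": 0.25, "contextual_toxicity": 0.75},
    {"id": "b", "toxicity": 0.9, "contextual_toxicity": 0.8},
]
flipped = [ex["id"] for ex in examples if flips_with_context(ex)]
```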