wiki_auto

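Description:

WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource for training sentence simplification systems. The authors first crowd-sourced a set of manual alignments between sentences in a subset of Simple English Wikipedia and their corresponding versions in English Wikipedia (the manual config), then trained a neural CRF system to predict these alignments. The trained model was then applied to the other articles in Simple English Wikipedia to create a larger corpus of aligned sentences (the auto, auto_acl, auto_full_no_split, and auto_full_with_split configs).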

Homepage: https://github.com/chaojiang06/wiki-auto
Source code: tfds.text_simplification.wiki_auto.WikiAuto
Versions:
- 1.0.0 (default): Initial release.
Supervised keys (see as_supervised doc): None
Figure (tfds.show_examples): Not supported.
Citation:

@inproceedings{acl/JiangMLZX20,
  author    = {Chao Jiang and
               Mounica Maddela and
               Wuwei Lan and
               Yang Zhong and
               Wei Xu},
  editor    = {Dan Jurafsky and
               Joyce Chai and
               Natalie Schluter and
               Joel R. Tetreault},
  title     = {Neural {CRF} Model for Sentence Alignment in Text Simplification},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational
               Linguistics, {ACL} 2020, Online, July 5-10, 2020},
  pages     = {7943--7960},
  publisher = {Association for Computational Linguistics},
  year      = {2020},
  url       = {https://www.aclweb.org/anthology/2020.acl-main.709/}
}
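
The configs named in the description are selected by appending the config name to the dataset name. The following is a minimal sketch, assuming `tensorflow_datasets` is installed; `SPLITS_BY_CONFIG`, `resolve`, and `load_wiki_auto` are hypothetical helper names introduced here for illustration, and the split names are copied from the split tables on this page.

```python
# Config names and their split names, as listed on this page.
SPLITS_BY_CONFIG = {
    "manual": ("dev", "test"),
    "auto_acl": ("full",),
    "auto_full_no_split": ("full",),
    "auto_full_with_split": ("full",),
    "auto": ("part_1", "part_2"),
}


def resolve(config="manual", split=None):
    """Return the (dataset name, split) pair to pass to tfds.load.

    Hypothetical helper, not part of TFDS.
    """
    if config not in SPLITS_BY_CONFIG:
        raise ValueError(f"unknown config: {config!r}")
    splits = SPLITS_BY_CONFIG[config]
    if split is None:
        split = splits[0]  # default to the first split listed for the config
    elif split not in splits:
        raise ValueError(f"config {config!r} has no split {split!r}")
    return f"wiki_auto/{config}", split


def load_wiki_auto(config="manual", split=None):
    """Load one split of one config as a tf.data.Dataset of feature dicts."""
    # Imported lazily so that resolve() works without TFDS installed.
    import tensorflow_datasets as tfds

    name, split = resolve(config, split)
    return tfds.load(name, split=split)
```

For example, `load_wiki_auto("auto_acl")` would request `wiki_auto/auto_acl` with split `full`. The first call downloads and prepares the data, so the download sizes listed for each config apply.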

wiki_auto/manual (default config)

Config description: A set of 10K Wikipedia sentence pairs aligned by crowd workers.
Download size: 53.47 MiB
Dataset size: 76.87 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'dev'`	73,249
`'test'`	118,074

Feature structure:

FeaturesDict({
    'GLEU-score': float64,
    'alignment_label': ClassLabel(shape=(), dtype=int64, num_classes=3),
    'normal_sentence': Text(shape=(), dtype=string),
    'normal_sentence_id': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
    'simple_sentence_id': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
GLEU-score	Tensor	float64
alignment_label	ClassLabel	int64
normal_sentence	Text	string
normal_sentence_id	Text	string
simple_sentence	Text	string
simple_sentence_id	Text	string
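
Because `alignment_label` is a `ClassLabel` with three classes, each example stores it as an integer. This page does not list the class names, so the sketch below takes them as an argument; in TFDS they can be read from `info.features["alignment_label"].names`. `select_by_label` and `_text` are hypothetical helpers, and the sketch assumes examples arrive as plain dicts, for instance from `tfds.as_numpy`, where strings are `bytes`.

```python
def _text(value):
    """Decode bytes (as produced by tfds.as_numpy) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def select_by_label(examples, label_names, wanted):
    """Yield (normal, simple) pairs whose alignment label name is in `wanted`.

    examples:    iterable of feature dicts from the manual config
    label_names: class names, indexed by the integer label
    wanted:      label names to keep
    """
    wanted = set(wanted)
    for example in examples:
        name = label_names[int(example["alignment_label"])]
        if name in wanted:
            yield (
                _text(example["normal_sentence"]),
                _text(example["simple_sentence"]),
            )
```

A typical call would load the split with `with_info=True`, inspect the label names, and then pass the names to keep as `wanted`.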

Examples (tfds.as_dataframe):

wiki_auto/auto_acl

Config description: Sentence pairs aligned to train the ACL2020 system.
Download size: 112.60 MiB
Dataset size: 138.83 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	488,332

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string
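
Since the dataset declares no supervised keys, each example is a feature dict rather than a ready-made (input, target) tuple. One way to get tuples for a simplification model is to map the two text features yourself. This is a sketch only: `pair_from_example` and `to_supervised` are hypothetical names, and treating the normal sentence as the source and the simple sentence as the target is an assumption about the intended task direction.

```python
def pair_from_example(example):
    """Return (source, target) = (normal_sentence, simple_sentence)."""
    return example["normal_sentence"], example["simple_sentence"]


def to_supervised(dataset):
    """Map a dataset of feature dicts to (source, target) tuples.

    `dataset` is expected to be a tf.data.Dataset, such as the object
    returned by tfds.load for this config; only its .map() method is used.
    """
    return dataset.map(pair_from_example)
```

The same mapping applies to auto_full_no_split and auto_full_with_split, which share this feature structure.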

Examples (tfds.as_dataframe):

wiki_auto/auto_full_no_split

Config description: All automatically aligned sentence pairs, without sentence splitting.
Download size: 135.02 MiB
Dataset size: 166.78 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	591,994

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string

Examples (tfds.as_dataframe):

wiki_auto/auto_full_with_split

Config description: All automatically aligned sentence pairs, with sentence splitting.
Download size: 115.09 MiB
Dataset size: 141.20 MiB
Auto-cached (documentation): Only when shuffle_files=False (full)
Splits:

Split	Examples
`'full'`	483,801

Feature structure:

FeaturesDict({
    'normal_sentence': Text(shape=(), dtype=string),
    'simple_sentence': Text(shape=(), dtype=string),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
normal_sentence	Text	string
simple_sentence	Text	string

Examples (tfds.as_dataframe):

wiki_auto/auto

Config description: A large set of automatically aligned sentence pairs.
Download size: 2.01 GiB
Dataset size: 1.76 GiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'part_1'`	125,059
`'part_2'`	13,036

Feature structure:

FeaturesDict({
    'example_id': Text(shape=(), dtype=string),
    'normal': FeaturesDict({
        'normal_article_content': Sequence({
            'normal_sentence': Text(shape=(), dtype=string),
            'normal_sentence_id': Text(shape=(), dtype=string),
        }),
        'normal_article_id': int32,
        'normal_article_title': Text(shape=(), dtype=string),
        'normal_article_url': Text(shape=(), dtype=string),
    }),
    'paragraph_alignment': Sequence({
        'normal_paragraph_id': Text(shape=(), dtype=string),
        'simple_paragraph_id': Text(shape=(), dtype=string),
    }),
    'sentence_alignment': Sequence({
        'normal_sentence_id': Text(shape=(), dtype=string),
        'simple_sentence_id': Text(shape=(), dtype=string),
    }),
    'simple': FeaturesDict({
        'simple_article_content': Sequence({
            'simple_sentence': Text(shape=(), dtype=string),
            'simple_sentence_id': Text(shape=(), dtype=string),
        }),
        'simple_article_id': int32,
        'simple_article_title': Text(shape=(), dtype=string),
        'simple_article_url': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
example_id	Text	string
normal	FeaturesDict
normal/normal_article_content	Sequence
normal/normal_article_content/normal_sentence	Text	string
normal/normal_article_content/normal_sentence_id	Text	string
normal/normal_article_id	Tensor	int32
normal/normal_article_title	Text	string
normal/normal_article_url	Text	string
paragraph_alignment	Sequence
paragraph_alignment/normal_paragraph_id	Text	string
paragraph_alignment/simple_paragraph_id	Text	string
sentence_alignment	Sequence
sentence_alignment/normal_sentence_id	Text	string
sentence_alignment/simple_sentence_id	Text	string
simple	FeaturesDict
simple/simple_article_content	Sequence
simple/simple_article_content/simple_sentence	Text	string
simple/simple_article_content/simple_sentence_id	Text	string
simple/simple_article_id	Tensor	int32
simple/simple_article_title	Text	string
simple/simple_article_url	Text	string
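
In this config each example is an article pair, and `sentence_alignment` holds only sentence IDs, so recovering aligned sentence text requires a lookup into the two article bodies. The sketch below shows one way to do that. It assumes each `Sequence` of a dict is exposed as a dict of parallel lists (the usual TFDS representation) and that the IDs in `sentence_alignment` match those in the article content; `aligned_sentence_pairs` and `_text` are hypothetical helpers.

```python
def _text(value):
    """Decode bytes (as produced by tfds.as_numpy) to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def aligned_sentence_pairs(example):
    """Return [(normal_sentence, simple_sentence), ...] for one article pair."""
    normal = example["normal"]["normal_article_content"]
    simple = example["simple"]["simple_article_content"]

    # Index each article's sentences by sentence ID.
    normal_by_id = {
        _text(i): _text(s)
        for i, s in zip(normal["normal_sentence_id"], normal["normal_sentence"])
    }
    simple_by_id = {
        _text(i): _text(s)
        for i, s in zip(simple["simple_sentence_id"], simple["simple_sentence"])
    }

    alignment = example["sentence_alignment"]
    pairs = []
    for n_id, s_id in zip(
        alignment["normal_sentence_id"], alignment["simple_sentence_id"]
    ):
        n_id, s_id = _text(n_id), _text(s_id)
        # Skip alignments whose IDs are missing from either article body.
        if n_id in normal_by_id and s_id in simple_by_id:
            pairs.append((normal_by_id[n_id], simple_by_id[s_id]))
    return pairs
```

Applied to every example of `part_1` or `part_2`, this yields flat sentence pairs in the same two-field shape as the auto_acl and auto_full configs, though not necessarily the same set of pairs.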

Examples (tfds.as_dataframe):