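scientific_papers

Description:

The scientific papers dataset contains two sets of long, structured documents, obtained from the ArXiv and PubMed OpenAccess repositories.

Both "arxiv" and "pubmed" have three features:

article: the body of the document, with paragraphs separated by "\n".
abstract: the abstract of the document, with paragraphs separated by "\n".
section_names: the section titles, separated by "\n".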
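
Because each feature is a single string, a consumer typically has to split it back into its parts. The following is a minimal sketch, assuming the separator is the newline character; `split_field` and the sample record are hypothetical and made up for illustration, not taken from the dataset.

```python
def split_field(value):
    """Split a newline-joined feature into its non-empty parts.

    Accepts str or bytes, since text features may arrive as byte strings
    depending on how the dataset is read (an assumption, not a guarantee).
    """
    text = value.decode("utf-8") if isinstance(value, bytes) else value
    return [part.strip() for part in text.split("\n") if part.strip()]


# Hypothetical record with the three documented features.
example = {
    "article": "first paragraph .\nsecond paragraph .",
    "abstract": b"one-sentence summary .",
    "section_names": "introduction\nmethods",
}

parsed = {key: split_field(value) for key, value in example.items()}
```

Here `parsed["article"]` holds one entry per paragraph and `parsed["section_names"]` one entry per section title.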
Additional documentation: Explore on Papers With Code
Homepage: https://github.com/armancohan/long-summarization
Source code: tfds.datasets.scientific_papers.Builder
Versions:
- 1.1.0: No release notes.
- 1.1.1 (default): No release notes.
Download size: 4.20 GiB
Auto-cached (documentation): No
Feature structure:

FeaturesDict({
    'abstract': Text(shape=(), dtype=string),
    'article': Text(shape=(), dtype=string),
    'section_names': Text(shape=(), dtype=string),
})
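
Records with this structure can be requested through `tfds.load`. The sketch below is one way to do it, assuming `tensorflow_datasets` is installed; the helper names `dataset_name` and `load_split` are hypothetical. The import is deferred so that building the name does not require the library, and the first call to `load_split` triggers the full download.

```python
def dataset_name(config="arxiv", version="1.1.1"):
    """Build a 'dataset/config:version' name for tfds.load."""
    if config not in ("arxiv", "pubmed"):
        raise ValueError(f"unknown config: {config!r}")
    return f"scientific_papers/{config}:{version}"


def load_split(config="arxiv", split="train"):
    """Load one split as a tf.data.Dataset of feature dictionaries."""
    # Imported lazily: the import is heavy and the data must be downloaded.
    import tensorflow_datasets as tfds

    return tfds.load(dataset_name(config), split=split)
```

Calling `load_split("pubmed", "validation")` would return a dataset whose elements are dictionaries with the keys `abstract`, `article`, and `section_names`.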

Feature documentation:

Feature	Class	Dtype
	FeaturesDict
abstract	Text	string
article	Text	string
section_names	Text	string

Supervised keys (see the as_supervised documentation): ('article', 'abstract')
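
With `as_supervised=True`, `tfds.load` yields `(input, label)` pairs built from the supervised keys, so here each element is an `(article, abstract)` tuple. The sketch below mirrors that mapping in plain Python; `to_supervised` and `load_supervised` are hypothetical helper names, and the sample record is made up for illustration.

```python
SUPERVISED_KEYS = ("article", "abstract")


def to_supervised(example, keys=SUPERVISED_KEYS):
    """Turn a feature dictionary into an (input, label) tuple."""
    return tuple(example[key] for key in keys)


def load_supervised(config="arxiv", split="validation"):
    """Load a split as (article, abstract) pairs via as_supervised=True."""
    # Imported lazily: requires tensorflow_datasets and the downloaded data.
    import tensorflow_datasets as tfds

    return tfds.load(
        f"scientific_papers/{config}", split=split, as_supervised=True
    )


# Hypothetical record, not taken from the dataset.
record = {
    "article": "body text .",
    "abstract": "summary text .",
    "section_names": "introduction",
}
pair = to_supervised(record)
```

Note that `section_names` is dropped in the supervised form, so load without `as_supervised` when section titles are needed.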
Figure (tfds.show_examples): Not supported.
Citation:

@article{Cohan_2018,
   title={A Discourse-Aware Attention Model for Abstractive Summarization of
            Long Documents},
   url={http://dx.doi.org/10.18653/v1/n18-2097},
   DOI={10.18653/v1/n18-2097},
   journal={Proceedings of the 2018 Conference of the North American Chapter of
          the Association for Computational Linguistics: Human Language
          Technologies, Volume 2 (Short Papers)},
   publisher={Association for Computational Linguistics},
   author={Cohan, Arman and Dernoncourt, Franck and Kim, Doo Soon and Bui, Trung and Kim, Seokhwan and Chang, Walter and Goharian, Nazli},
   year={2018}
}

scientific_papers/arxiv (default config)

Config description: Documents from the ArXiv repository.
Dataset size: 7.07 GiB
Splits:

Split	Examples
`'test'`	6,440
`'train'`	203,037
`'validation'`	6,436


scientific_papers/pubmed

Config description: Documents from the PubMed repository.
Dataset size: 2.34 GiB
Splits:

Split	Examples
`'test'`	6,658
`'train'`	119,924
`'validation'`	6,633
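
The split tables for the two configs can be summarized with a little arithmetic. This sketch copies the example counts listed on this page and computes per-config totals and split fractions; `SPLITS` and `split_fractions` are illustrative names.

```python
# Example counts copied from the split tables of both configs.
SPLITS = {
    "arxiv": {"test": 6440, "train": 203037, "validation": 6436},
    "pubmed": {"test": 6658, "train": 119924, "validation": 6633},
}


def split_fractions(counts):
    """Return each split's share of the config's total examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


totals = {config: sum(counts.values()) for config, counts in SPLITS.items()}
fractions = {config: split_fractions(counts) for config, counts in SPLITS.items()}

for config in SPLITS:
    shares = ", ".join(f"{k}={v:.1%}" for k, v in fractions[config].items())
    print(f"{config}: {totals[config]:,} examples ({shares})")
```

The totals come to 215,913 examples for arxiv and 133,215 for pubmed, with the train split holding at least 90% of each config.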

