gem

Description :

GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.

GEM aims to: (1) measure NLG progress across 13 datasets spanning many NLG tasks and languages. (2) provide an in-depth analysis of data and models presented via data statements and challenge sets. (3) develop standards for evaluation of generated text using both automated and human metrics.

More information can be found at https://gem-benchmark.com .

Additional Documentation : Explore on Papers With Code
Homepage : https://gem-benchmark.com
Source code : tfds.text.gem.Gem
Versions :
- 1.0.0 : Initial version
- 1.0.1 : Update bad links filter for MLSum
- 1.1.0 (default): Release of the Challenge Sets
Supervised keys (See as_supervised doc ): None
Figure ( tfds.show_examples ): Not supported.
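
As a minimal sketch of how the metadata above maps onto a TFDS call, the helper below builds a `gem/<config>:<version>` name and passes it to `tfds.load`. The helper names `gem_name` and `load_gem` are illustrative and not part of TFDS. The import is deferred so that the name-building part runs without the library installed.

```python
DEFAULT_VERSION = "1.1.0"  # default version listed above


def gem_name(config: str = "common_gen", version: str = DEFAULT_VERSION) -> str:
    """Build a TFDS name of the form 'dataset/config:version'."""
    return f"gem/{config}:{version}"


def load_gem(config: str = "common_gen", split: str = "validation"):
    """Load one split of a GEM config (downloads the data on first use)."""
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    return tfds.load(gem_name(config), split=split)


print(gem_name("e2e_nlg"))
```

Pinning the version in the name keeps results comparable across runs, since the challenge-set splits only exist from 1.1.0 onward.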

gem/common_gen (default config)

Config description : CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
Download size : 1.84 MiB
Dataset size : 16.84 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	1,497
`'train'`	67,389
`'validation'`	993

Feature structure :

FeaturesDict({
    'concept_set_id': int32,
    'concepts': Sequence(string),
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
})
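
To make the task concrete, here is a rough sketch of a concept-coverage check on a record shaped like the structure above. The record is invented for illustration, and prefix matching is only a crude stand-in for proper lemmatization; it is not how GEM scores outputs.

```python
import re

# Hypothetical record shaped like the FeaturesDict above (not real data).
example = {
    "concept_set_id": 0,
    "concepts": ["dog", "frisbee", "catch", "throw"],
    "gem_id": "hypothetical-id-0",
    "gem_parent_id": "hypothetical-id-0",
    "references": [],
    "target": "A dog catches a frisbee thrown by its owner.",
}


def missing_concepts(concepts, sentence):
    """Return the concepts that have no matching word in the sentence.

    A word matches when it starts with the concept, so "catches"
    matches "catch" and "thrown" matches "throw".
    """
    words = re.findall(r"[a-z]+", sentence.lower())
    return [c for c in concepts if not any(w.startswith(c.lower()) for w in words)]


print(missing_concepts(example["concepts"], example["target"]))
```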

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
concept_set_id	Tensor		int32
concepts	Sequence(Tensor)	(None,)	string
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{lin2020commongen,
  title = "CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
  author = "Lin, Bill Yuchen  and
    Zhou, Wangchunshu  and
    Shen, Ming  and
    Zhou, Pei  and
    Bhagavatula, Chandra  and
    Choi, Yejin  and
    Ren, Xiang",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
  pages = "1823--1840",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/cs_restaurants

Config description : The task is generating responses in the context of a (hypothetical) dialogue system that provides information about restaurants. The input is a basic intent/dialogue act type and a list of slots (attributes) and their values. The output is a natural language sentence.
Download size : 1.46 MiB
Dataset size : 2.71 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	842
`'train'`	3,569
`'validation'`	781

Feature structure :

FeaturesDict({
    'dialog_act': string,
    'dialog_act_delexicalized': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'target_delexicalized': string,
})
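
The `*_delexicalized` fields pair each input and output with a version in which concrete slot values are replaced by placeholders. The sketch below illustrates the idea only: the `X-<slot>` placeholder convention and the sample sentence are assumptions, not taken from the dataset, and the real records are in Czech (English is used here for readability).

```python
def delexicalize(text: str, slot_values: dict) -> str:
    """Replace each slot value in the text with a placeholder such as X-name."""
    # Replace longer values first so a short value cannot split a longer one.
    for slot, value in sorted(slot_values.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(value, f"X-{slot}")
    return text


def relexicalize(text: str, slot_values: dict) -> str:
    """Inverse step: put the concrete values back into a delexicalized text."""
    # Longer slot names first, so X-price_range is handled before X-price.
    for slot in sorted(slot_values, key=len, reverse=True):
        text = text.replace(f"X-{slot}", slot_values[slot])
    return text


# Hypothetical slots and sentence, for illustration only.
slots = {"name": "Café Savoy", "area": "Smíchov"}
sentence = "Café Savoy is a nice place in Smíchov."

template = delexicalize(sentence, slots)
print(template)
print(relexicalize(template, slots))
```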

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
dialog_act	Tensor		string
dialog_act_delexicalized	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
target_delexicalized	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{cs_restaurants,
  address = {Tokyo, Japan},
  title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
  shorttitle = {Neural {Generation} for {Czech}},
  url = {https://www.aclweb.org/anthology/W19-8670/},
  urldate = {2019-10-18},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Jurčíček, Filip},
  month = oct,
  year = {2019},
  pages = {563--574}
}

gem/dart

Config description : DART is a large, open-domain structured DAta Record to Text generation corpus with high-quality sentence annotations, where each input is a set of entity-relation triples following a tree-structured ontology.
Download size : 28.01 MiB
Dataset size : 33.78 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'test'`	6,959
`'train'`	62,659
`'validation'`	2,768

Feature structure :

FeaturesDict({
    'dart_id': int32,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'subtree_was_extended': bool,
    'target': string,
    'target_sources': Sequence(string),
    'tripleset': Sequence(string),
})
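
A common way to feed a triple set to a text generator is to linearize it into one string. The sketch below assumes each `tripleset` element is a single `subject | predicate | object` string; the page only says the field is a sequence of strings, so that encoding, the `<S>/<P>/<O>` markers, and the sample triples are all assumptions.

```python
def parse_triple(triple: str):
    """Split 'subject | predicate | object' into its three stripped parts."""
    subject, predicate, obj = (part.strip() for part in triple.split("|", 2))
    return subject, predicate, obj


def linearize_tripleset(tripleset):
    """Join all triples into one marked-up input string."""
    return " ".join(
        f"<S> {s} <P> {p} <O> {o}" for s, p, o in map(parse_triple, tripleset)
    )


# Hypothetical triples, for illustration only.
triples = ["Alice | OCCUPATION | engineer", "Alice | BIRTH_PLACE | Lyon"]
print(linearize_tripleset(triples))
```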

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
dart_id	Tensor		int32
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
subtree_was_extended	Tensor		bool
target	Tensor		string
target_sources	Sequence(Tensor)	(None,)	string
tripleset	Sequence(Tensor)	(None,)	string

Examples ( tfds.as_dataframe ):

Citation :

@article{radev2020dart,
  title={Dart: Open-domain structured data record to text generation},
  author={Radev, Dragomir and Zhang, Rui and Rau, Amrit and Sivaprasad, Abhinand and Hsieh, Chiachun and Rajani, Nazneen Fatema and Tang, Xiangru and Vyas, Aadit and Verma, Neha and Krishna, Pranav and others},
  journal={arXiv preprint arXiv:2007.02871},
  year={2020}
}

gem/e2e_nlg

Config description : The E2E dataset is designed for a limited-domain data-to-text task: generating restaurant descriptions/recommendations based on up to 8 different attributes (name, area, price range, etc.).
Download size : 13.99 MiB
Dataset size : 16.92 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	4,693
`'train'`	33,525
`'validation'`	4,299

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'meaning_representation': string,
    'references': Sequence(string),
    'target': string,
})
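
The `meaning_representation` string encodes the attributes as `slot[value]` pairs separated by commas. Below is a small parsing sketch under that assumption; the sample string is illustrative rather than copied from the dataset, and `parse_mr` is a hypothetical helper name.

```python
import re

# One 'slot[value]' pair; slots contain no brackets or commas.
MR_PAIR = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_mr(mr: str) -> dict:
    """Parse 'name[The Eagle], area[riverside]' into a slot -> value dict."""
    return {slot: value for slot, value in MR_PAIR.findall(mr)}


mr = "name[The Eagle], eatType[coffee shop], area[riverside]"
print(parse_mr(mr))
```

Keeping the result as a dict makes it easy to check afterwards that a generated sentence mentions every slot value.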

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
meaning_representation	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{e2e_cleaned,
  address = {Tokyo, Japan},
  title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
  url = {https://www.aclweb.org/anthology/W19-8652/},
  booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
  author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
  year = {2019},
  pages = {421--426},
}

gem/mlsum_de

Config description : MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this config covers German.
Download size : 345.98 MiB
Dataset size : 963.60 MiB
Auto-cached ( documentation ): No
Splits :

Split	Examples
`'challenge_test_covid'`	5,058
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	10,695
`'train'`	220,748
`'validation'`	11,392

Feature structure :

FeaturesDict({
    'date': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'text': string,
    'title': string,
    'topic': string,
    'url': string,
})
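
For a summarization record, a quick sanity statistic is how much shorter the `target` summary is than the `text` it summarizes. The sketch below computes a word-level compression ratio on a placeholder record; the record contents are invented, and whitespace tokenization is a simplifying assumption.

```python
def compression_ratio(text: str, summary: str) -> float:
    """Number of words in the article divided by words in the summary."""
    n_summary = len(summary.split())
    if n_summary == 0:
        raise ValueError("summary is empty")
    return len(text.split()) / n_summary


# Placeholder record with the 'text' and 'target' fields from above.
record = {
    "text": " ".join(["wort"] * 12),   # stands in for a 12-word article
    "target": " ".join(["wort"] * 4),  # stands in for a 4-word summary
}
print(compression_ratio(record["text"], record["target"]))
```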

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
date	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
text	Tensor		string
title	Tensor		string
topic	Tensor		string
url	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}

gem/mlsum_es

Config description : MLSum is a large-scale multilingual summarization dataset. It is built from online news outlets; this config covers Spanish.
Download size : 501.27 MiB
Dataset size : 1.29 GiB
Auto-cached ( documentation ): No
Splits :

Split	Examples
`'challenge_test_covid'`	1,938
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	13,366
`'train'`	259,888
`'validation'`	9,977

Feature structure :

FeaturesDict({
    'date': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'text': string,
    'title': string,
    'topic': string,
    'url': string,
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
date	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
text	Tensor		string
title	Tensor		string
topic	Tensor		string
url	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{scialom-etal-2020-mlsum,
    title = "{MLSUM}: The Multilingual Summarization Corpus",
    author = {Scialom, Thomas  and Dray, Paul-Alexis  and Lamprier, Sylvain  and Piwowarski, Benjamin  and Staiano, Jacopo},
    booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year = {2020}
}

gem/schema_guided_dialog

Config description : The Schema-Guided Dialogue (SGD) dataset contains 18K multi-domain task-oriented dialogues between a human and a virtual assistant, covering 17 domains ranging from banks and events to media, calendar, travel, and weather.
Download size : 17.00 MiB
Dataset size : 201.19 MiB
Auto-cached ( documentation ): Yes (challenge_test_backtranslation, challenge_test_bfp02, challenge_test_bfp05, challenge_test_nopunc, challenge_test_scramble, challenge_train_sample, challenge_validation_sample, test, validation), Only when shuffle_files=False (train)
Splits :

Split	Examples
`'challenge_test_backtranslation'`	500
`'challenge_test_bfp02'`	500
`'challenge_test_bfp05'`	500
`'challenge_test_nopunc'`	500
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	10,000
`'train'`	164,982
`'validation'`	10,000

Feature structure :

FeaturesDict({
    'context': Sequence(string),
    'dialog_acts': Sequence({
        'act': ClassLabel(shape=(), dtype=int64, num_classes=18),
        'slot': string,
        'values': Sequence(string),
    }),
    'dialog_id': string,
    'gem_id': string,
    'gem_parent_id': string,
    'prompt': string,
    'references': Sequence(string),
    'service': string,
    'target': string,
    'turn_id': int32,
})
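
A `Sequence` of dicts such as `dialog_acts` is typically exposed by TFDS as a dict of aligned lists rather than a list of dicts. The sketch below, on plain Python lists, regroups such a structure per act and renders it as a string. The sample values and the output format are assumptions for illustration; `act` holds integer class ids, which the dataset's ClassLabel metadata can map to names.

```python
def unpack_dialog_acts(dialog_acts: dict) -> list:
    """Turn {'act': [...], 'slot': [...], 'values': [[...], ...]} into per-act dicts."""
    keys = ("act", "slot", "values")
    return [dict(zip(keys, row)) for row in zip(*(dialog_acts[k] for k in keys))]


def linearize(acts: list) -> str:
    """Render acts as 'act(slot=v1|v2)' strings joined by ' ; '."""
    parts = []
    for act in acts:
        values = "|".join(act["values"])
        parts.append(f"{act['act']}({act['slot']}={values})")
    return " ; ".join(parts)


# Hypothetical dict-of-lists, shaped like the 'dialog_acts' feature above.
dialog_acts = {
    "act": [2, 5],
    "slot": ["city", "date"],
    "values": [["Paris"], ["March 1st", "March 2nd"]],
}
acts = unpack_dialog_acts(dialog_acts)
print(linearize(acts))
```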

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
context	Sequence(Tensor)	(None,)	string
dialog_acts	Sequence
dialog_acts/act	ClassLabel		int64
dialog_acts/slot	Tensor		string
dialog_acts/values	Sequence(Tensor)	(None,)	string
dialog_id	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
prompt	Tensor		string
references	Sequence(Tensor)	(None,)	string
service	Tensor		string
target	Tensor		string
turn_id	Tensor		int32

Examples ( tfds.as_dataframe ):

Citation :

@article{rastogi2019towards,
  title={Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset},
  author={Rastogi, Abhinav and Zang, Xiaoxue and Sunkara, Srinivas and Gupta, Raghav and Khaitan, Pranav},
  journal={arXiv preprint arXiv:1909.05855},
  year={2019}
}

gem/totto

Config description : ToTTo is a Table-to-Text NLG task. The task is as follows: given a Wikipedia table with row names, column names, and table cells, with a subset of cells highlighted, generate a natural language description of the highlighted part of the table.
Download size : 180.75 MiB
Dataset size : 645.86 MiB
Auto-cached ( documentation ): No
Splits :

Split	Examples
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	7,700
`'train'`	121,153
`'validation'`	7,700

Feature structure :

FeaturesDict({
    'example_id': string,
    'gem_id': string,
    'gem_parent_id': string,
    'highlighted_cells': Sequence(Sequence(int32)),
    'overlap_subset': string,
    'references': Sequence(string),
    'sentence_annotations': Sequence({
        'final_sentence': string,
        'original_sentence': string,
        'sentence_after_ambiguity': string,
        'sentence_after_deletion': string,
    }),
    'table': Sequence(Sequence({
        'column_span': int32,
        'is_header': bool,
        'row_span': int32,
        'value': string,
    })),
    'table_page_title': string,
    'table_section_text': string,
    'table_section_title': string,
    'table_webpage_url': string,
    'target': string,
    'totto_id': int32,
})
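
The `highlighted_cells` field selects cells of `table` by position. A minimal sketch of that lookup follows, assuming the table is held as a list of rows of cell dicts and each highlighted entry is a `[row_index, column_index]` pair into it. The tiny table is invented, and the nested layout TFDS actually returns may need regrouping first.

```python
def highlighted_values(table, highlighted_cells):
    """Return the 'value' of every highlighted cell, in the order listed."""
    return [table[row][col]["value"] for row, col in highlighted_cells]


def cell(value, is_header=False):
    """Build a cell dict with the fields from the structure above."""
    return {"column_span": 1, "is_header": is_header, "row_span": 1, "value": value}


# Hypothetical 2x2 table: one header row and one data row.
table = [
    [cell("Year", is_header=True), cell("Title", is_header=True)],
    [cell("2010"), cell("Example Film")],
]
highlighted = [[1, 0], [1, 1]]
print(highlighted_values(table, highlighted))
```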

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
example_id	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
highlighted_cells	Sequence(Sequence(Tensor))	(None, None)	int32
overlap_subset	Tensor		string
references	Sequence(Tensor)	(None,)	string
sentence_annotations	Sequence
sentence_annotations/final_sentence	Tensor		string
sentence_annotations/original_sentence	Tensor		string
sentence_annotations/sentence_after_ambiguity	Tensor		string
sentence_annotations/sentence_after_deletion	Tensor		string
table	Sequence
table/column_span	Tensor		int32
table/is_header	Tensor		bool
table/row_span	Tensor		int32
table/value	Tensor		string
table_page_title	Tensor		string
table_section_text	Tensor		string
table_section_title	Tensor		string
table_webpage_url	Tensor		string
target	Tensor		string
totto_id	Tensor		int32

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{parikh2020totto,
  title={ToTTo: A Controlled Table-To-Text Generation Dataset},
  author={Parikh, Ankur and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={1173--1186},
  year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/web_nlg_en

Config description : WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.
Download size : 12.57 MiB
Dataset size : 19.91 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_numbers'`	500
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	502
`'challenge_validation_sample'`	499
`'test'`	1,779
`'train'`	35,426
`'validation'`	1,667

Feature structure :

FeaturesDict({
    'category': string,
    'gem_id': string,
    'gem_parent_id': string,
    'input': Sequence(string),
    'references': Sequence(string),
    'target': string,
    'webnlg_id': string,
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
category	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
input	Sequence(Tensor)	(None,)	string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
webnlg_id	Tensor		string
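
The `input` feature is a sequence of strings, one per DBpedia triple. The sketch below shows one way to turn those strings into structured triples. It assumes each string has the form `subject | property | object`; that separator is an assumption to check against a few real records. `Triple`, `parse_triple` and `parse_input` are hypothetical names, and the sample string is hand-written.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def parse_triple(text: str, sep: str = "|") -> Triple:
    """Split one input string into a (subject, predicate, object) triple.

    Assumption: the three parts are separated by `sep`. Only the first two
    separators are used, so the object may itself contain `sep`.
    """
    parts = [part.strip() for part in text.split(sep, 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 parts separated by {sep!r}, got: {text!r}")
    return Triple(*parts)


def parse_input(inputs: List[str]) -> List[Triple]:
    """Parse every string of one record's `input` feature."""
    return [parse_triple(text) for text in inputs]


# Hand-written toy input (not copied from the dataset).
toy_input = ["Example_Airport | cityServed | Example_City"]
print(parse_input(toy_input))
```

Raising on malformed strings, rather than silently skipping them, makes it obvious when the separator assumption does not hold for a given record.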

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/web_nlg_ru

Config description : WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning.
Download size : 7.49 MiB
Dataset size : 11.30 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_scramble'`	500
`'challenge_train_sample'`	501
`'challenge_validation_sample'`	500
`'test'`	1,102
`'train'`	14,630
`'validation'`	790

Feature structure :

FeaturesDict({
    'category': string,
    'gem_id': string,
    'gem_parent_id': string,
    'input': Sequence(string),
    'references': Sequence(string),
    'target': string,
    'webnlg_id': string,
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
category	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
input	Sequence(Tensor)	(None,)	string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
webnlg_id	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{gardent2017creating,
  author = "Gardent, Claire
    and Shimorina, Anastasia
    and Narayan, Shashi
    and Perez-Beltrachini, Laura",
  title = "Creating Training Corpora for NLG Micro-Planners",
  booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  year = "2017",
  publisher = "Association for Computational Linguistics",
  pages = "179--188",
  location = "Vancouver, Canada",
  doi = "10.18653/v1/P17-1017",
  url = "http://www.aclweb.org/anthology/P17-1017"
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_auto_asset_turk

Config description : WikiAuto provides a set of aligned sentences from English Wikipedia and Simple English Wikipedia as a resource to train sentence simplification systems. ASSET and TURK are high-quality simplification datasets used for testing.
Download size : 121.01 MiB
Dataset size : 202.40 MiB
Auto-cached ( documentation ): Yes (challenge_test_asset_backtranslation, challenge_test_asset_bfp02, challenge_test_asset_bfp05, challenge_test_asset_nopunc, challenge_test_turk_backtranslation, challenge_test_turk_bfp02, challenge_test_turk_bfp05, challenge_test_turk_nopunc, challenge_train_sample, challenge_validation_sample, test_asset, test_turk, validation), Only when shuffle_files=False (train)
Splits :

Split	Examples
`'challenge_test_asset_backtranslation'`	359
`'challenge_test_asset_bfp02'`	359
`'challenge_test_asset_bfp05'`	359
`'challenge_test_asset_nopunc'`	359
`'challenge_test_turk_backtranslation'`	359
`'challenge_test_turk_bfp02'`	359
`'challenge_test_turk_bfp05'`	359
`'challenge_test_turk_nopunc'`	359
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test_asset'`	359
`'test_turk'`	359
`'train'`	483,801
`'validation'`	20,000
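
With this many splits, it can help to select the challenge test sets by name instead of listing them by hand. The sketch below copies the split names and sizes from the table above into a plain dictionary and filters on the `challenge_test_` prefix. The helper names `challenge_test_splits` and `load_split` are hypothetical. `load_split` wraps `tfds.load`, needs the `tensorflow_datasets` package, and may download data, so nothing here calls it.

```python
from typing import Dict, List

# Split sizes copied from the split table of gem/wiki_auto_asset_turk.
WIKI_AUTO_SPLITS: Dict[str, int] = {
    "challenge_test_asset_backtranslation": 359,
    "challenge_test_asset_bfp02": 359,
    "challenge_test_asset_bfp05": 359,
    "challenge_test_asset_nopunc": 359,
    "challenge_test_turk_backtranslation": 359,
    "challenge_test_turk_bfp02": 359,
    "challenge_test_turk_bfp05": 359,
    "challenge_test_turk_nopunc": 359,
    "challenge_train_sample": 500,
    "challenge_validation_sample": 500,
    "test_asset": 359,
    "test_turk": 359,
    "train": 483801,
    "validation": 20000,
}


def challenge_test_splits(splits: Dict[str, int]) -> List[str]:
    """Return the names of all challenge test splits, sorted alphabetically."""
    return sorted(name for name in splits if name.startswith("challenge_test_"))


def load_split(split: str, config: str = "wiki_auto_asset_turk"):
    """Load one split with TFDS (hypothetical wrapper; may download data)."""
    # Imported lazily so the helpers above work without extra packages.
    import tensorflow_datasets as tfds

    return tfds.load(f"gem/{config}", split=split)


names = challenge_test_splits(WIKI_AUTO_SPLITS)
print(len(names), sum(WIKI_AUTO_SPLITS[name] for name in names))
```

A typical use would be to loop over `names` and call `load_split(name)` for each one, evaluating a model per challenge set.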

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'target': string,
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
target	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{jiang-etal-2020-neural,
    title = "Neural {CRF} Model for Sentence Alignment in Text Simplification",
    author = "Jiang, Chao  and
      Maddela, Mounica  and
      Lan, Wuwei  and
      Zhong, Yang  and
      Xu, Wei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.709",
    doi = "10.18653/v1/2020.acl-main.709",
    pages = "7943--7960",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/xsum

Config description : A dataset for abstractive summarization in its extreme form: summarizing a document in a single sentence.
Download size : 246.31 MiB
Dataset size : 78.89 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'challenge_test_backtranslation'`	500
`'challenge_test_bfp_02'`	500
`'challenge_test_bfp_05'`	500
`'challenge_test_covid'`	401
`'challenge_test_nopunc'`	500
`'challenge_train_sample'`	500
`'challenge_validation_sample'`	500
`'test'`	1,166
`'train'`	23,206
`'validation'`	1,117

Feature structure :

FeaturesDict({
    'document': string,
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'target': string,
    'xsum_id': string,
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
document	Tensor		string
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
target	Tensor		string
xsum_id	Tensor		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{Narayan2018dont,
  author = "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title = "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = "2018",
  address = "Brussels, Belgium",
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_arabic_ar

Config description : Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size : 56.25 MiB
Dataset size : 291.42 MiB
Auto-cached ( documentation ): No
Splits :

Split	Examples
`'test'`	5,841
`'train'`	20,441
`'validation'`	2,919

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'ar': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'ar': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/ar	Text		string
source_aligned/en	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/ar	Text		string
target_aligned/en	Text		string
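
String features such as `source`, `references`, and the per-language entries of `source_aligned` are commonly returned as UTF-8 encoded bytes when records are read as NumPy values (for example through `tfds.as_numpy`). The sketch below decodes a whole record recursively. `decode_strings` is a hypothetical helper, and the toy record is hand-written to mirror the feature structure, not taken from the dataset.

```python
from typing import Any


def decode_strings(value: Any) -> Any:
    """Recursively convert UTF-8 bytes inside a record to Python strings."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    if isinstance(value, dict):
        return {key: decode_strings(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [decode_strings(item) for item in value]
    if hasattr(value, "tolist"):
        # NumPy arrays and scalars are converted to plain Python objects first.
        return decode_strings(value.tolist())
    return value


# Hand-written toy record with the same keys as the feature structure.
toy_record = {
    "gem_id": b"toy-0",
    "source": "مرحبا".encode("utf-8"),
    "source_aligned": {"ar": "مرحبا".encode("utf-8"), "en": b"hello"},
    "references": [b"hello"],
}

decoded = decode_strings(toy_record)
print(decoded["source_aligned"]["en"], decoded["references"])
```

Decoding once, right after reading a record, keeps the rest of a pipeline free of byte/string checks.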

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_chinese_zh

Config description : Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size : 31.38 MiB
Dataset size : 122.06 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'test'`	3,775
`'train'`	13,211
`'validation'`	1,886

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'zh': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'zh': Text(shape=(), dtype=string),
    }),
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/zh	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/zh	Text		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_czech_cs

Config description : Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size : 13.84 MiB
Dataset size : 58.05 MiB
Auto-cached ( documentation ): Yes
Splits :

Split	Examples
`'test'`	1,438
`'train'`	5,033
`'validation'`	718

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'cs': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'cs': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/cs	Text		string
source_aligned/en	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/cs	Text		string
target_aligned/en	Text		string

Examples ( tfds.as_dataframe ):

Citation :

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_dutch_nl

Config description : Wikilingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size : 53.88 MiB
Dataset size : 237.97 MiB
Auto-cached ( documentation ): Yes (test, validation), Only when shuffle_files=False (train)
Splits :

Split	Examples
`'test'`	6,248
`'train'`	21,866
`'validation'`	3,123

Feature structure :

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'nl': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'nl': Text(shape=(), dtype=string),
    }),
})

Feature documentation :

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/nl	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/nl	Text		string

Ví dụ ( tfds.as_dataframe ):

trích dẫn :

@inproceedings{ladhak-wiki-2020,
title=WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization,
author={Faisal Ladhak, Esin Durmus, Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_english_en

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 112.56 MiB
Dataset size: 657.51 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	28,614
`'train'`	99,020
`'validation'`	13,823

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
    }),
})
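
The English config aligns only the `en` key, whereas the other configs also carry a second language. Code shared across configs may therefore need to tolerate a missing language key. The sketch below shows one way to do that on a nested, NumPy-style example. The helper `aligned_text` and the sample record are hypothetical and only mirror the feature structure above.

```python
from typing import Optional


def aligned_text(example: dict, field: str, lang: str) -> Optional[str]:
    """Return example[field][lang] as a string, or None if that language is absent."""
    value = example.get(field, {}).get(lang)
    if value is None:
        return None
    # String features are bytes when read through tfds.as_numpy.
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


# Hypothetical record shaped like a gem/wiki_lingua_english_en example.
english_example = {
    "source_aligned": {"en": b"How to boil an egg."},
    "target_aligned": {"en": b"Boil water, then add the egg."},
}

en_source = aligned_text(english_example, "source_aligned", "en")
nl_source = aligned_text(english_example, "source_aligned", "nl")  # None: no 'nl' key here
```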

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_french_fr

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 113.26 MiB
Dataset size: 522.28 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	12,731
`'train'`	44,556
`'validation'`	6,364

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'fr': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'fr': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/fr	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/fr	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_german_de

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 102.65 MiB
Dataset size: 452.46 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	11,669
`'train'`	40,839
`'validation'`	5,833

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'de': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'de': Text(shape=(), dtype=string),
        'en': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/de	Text		string
source_aligned/en	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/de	Text		string
target_aligned/en	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_hindi_hi

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 20.07 MiB
Dataset size: 138.06 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	1,984
`'train'`	6,942
`'validation'`	991

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'hi': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'hi': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/hi	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/hi	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_indonesian_id

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 80.08 MiB
Dataset size: 370.63 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	9,497
`'train'`	33,237
`'validation'`	4,747

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'id': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/id	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/id	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_italian_it

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 84.80 MiB
Dataset size: 374.40 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	10,189
`'train'`	35,661
`'validation'`	5,093

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'it': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'it': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/it	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/it	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_japanese_ja

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 21.75 MiB
Dataset size: 103.19 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	2,530
`'train'`	8,853
`'validation'`	1,264

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ja': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ja': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/ja	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/ja	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_korean_ko

Config description: WikiLingua is a large-scale, multilingual dataset for evaluating cross-lingual abstractive summarization systems.
Download size: 22.26 MiB
Dataset size: 102.35 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	2,436
`'train'`	8,524
`'validation'`	1,216

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ko': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ko': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/ko	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/ko	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_portuguese_pt

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 131.17 MiB
Dataset size: 570.46 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	16,331
`'train'`	57,159
`'validation'`	8,165
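
The split counts above work out to roughly a 70/10/20 train/validation/test partition. A minimal sketch that checks this from the documented numbers follows; `split_fractions` is a hypothetical helper name used for illustration, not part of TFDS.

```python
# Example counts copied from the split table for gem/wiki_lingua_portuguese_pt.
SPLIT_COUNTS = {"test": 16_331, "train": 57_159, "validation": 8_165}


def split_fractions(counts: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


fractions = split_fractions(SPLIT_COUNTS)
for name, share in sorted(fractions.items()):
    # Print one line per split, formatted as a percentage.
    print(f"{name:<10} {share:.1%}")
```

The same helper can be pointed at the counts of any other config on this page to compare how the splits are proportioned.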

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'pt': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'pt': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/pt	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/pt	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_russian_ru

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 101.36 MiB
Dataset size: 564.69 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	10,580
`'train'`	37,028
`'validation'`	5,288

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ru': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'ru': Text(shape=(), dtype=string),
    }),
})
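
The two `Translation` features in this structure are nested dictionaries keyed by language code. The sketch below shows one way to read them; it assumes the examples arrive as nested dicts with `bytes` string values (as `tfds.as_numpy` typically yields them), and the helper names `to_text`, `aligned_pair`, and `iter_pairs` are hypothetical.

```python
from typing import Any, Dict, Iterator, Mapping


def to_text(value: Any) -> str:
    """Decode a bytes value into str; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def aligned_pair(example: Mapping[str, Any], lang: str = "ru") -> Dict[str, str]:
    """Collect the English and `lang` sides of both aligned features."""
    return {
        "source_en": to_text(example["source_aligned"]["en"]),
        f"source_{lang}": to_text(example["source_aligned"][lang]),
        "target_en": to_text(example["target_aligned"]["en"]),
        f"target_{lang}": to_text(example["target_aligned"][lang]),
    }


def iter_pairs(split: str = "validation") -> Iterator[Dict[str, str]]:
    """Yield aligned pairs from the real dataset (downloads data on first use)."""
    import tensorflow_datasets as tfds  # imported lazily; requires TFDS installed

    ds = tfds.load("gem/wiki_lingua_russian_ru", split=split)
    for example in tfds.as_numpy(ds):
        yield aligned_pair(example, lang="ru")
```

Only `iter_pairs` touches the network; `aligned_pair` can be exercised on any dict that follows the structure above.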

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/ru	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/ru	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_spanish_es

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 189.06 MiB
Dataset size: 849.75 MiB
Auto-cached (documentation): No
Splits:

Split	Examples
`'test'`	22,632
`'train'`	79,212
`'validation'`	11,316

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'es': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'es': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/es	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/es	Text		string
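
The feature table above can double as a lightweight schema check. The following sketch validates an already-decoded example (plain Python strings and lists) against the documented keys; the constant names and the `schema_problems` helper are assumptions made for illustration, not part of TFDS.

```python
from collections.abc import Mapping
from typing import Any, List

# Keys mirror the FeaturesDict documented for gem/wiki_lingua_spanish_es.
SCALAR_KEYS = ("gem_id", "gem_parent_id", "source", "target")
ALIGNED_KEYS = ("source_aligned", "target_aligned")
LANGUAGES = ("en", "es")


def schema_problems(example: Mapping) -> List[str]:
    """Return human-readable mismatches; an empty list means the example fits."""
    problems: List[str] = []
    for key in SCALAR_KEYS:
        if not isinstance(example.get(key), str):
            problems.append(f"{key}: expected a string")
    refs: Any = example.get("references")
    if not isinstance(refs, list) or not all(isinstance(r, str) for r in refs):
        problems.append("references: expected a list of strings")
    for key in ALIGNED_KEYS:
        sides = example.get(key)
        if not isinstance(sides, Mapping):
            problems.append(f"{key}: expected a mapping of language codes")
            continue
        for lang in LANGUAGES:
            if not isinstance(sides.get(lang), str):
                problems.append(f"{key}/{lang}: expected a string")
    return problems
```

Swapping `"es"` in `LANGUAGES` for another language code adapts the check to the other WikiLingua configs, which share the same layout.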

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_thai_th

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 28.60 MiB
Dataset size: 193.77 MiB
Auto-cached (documentation): Yes (test, validation), Only when shuffle_files=False (train)
Splits:

Split	Examples
`'test'`	2,950
`'train'`	10,325
`'validation'`	1,475
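
The auto-caching entry for this config treats the train split differently from test and validation. That is consistent with the rule in the TFDS performance guide, under which a dataset is auto-cached when its total size is known and below a threshold and either `shuffle_files` is disabled or only a single shard is read. The sketch below encodes that rule as an approximation; the 250 MiB threshold and the shard counts in the usage lines are assumptions, not values stated on this page.

```python
from typing import Optional

# Assumed threshold, based on the auto-caching rule in the TFDS performance guide.
AUTO_CACHE_LIMIT_MIB = 250.0


def is_auto_cached(
    dataset_size_mib: Optional[float], num_shards: int, shuffle_files: bool
) -> bool:
    """Approximate whether TFDS would auto-cache a split read."""
    if dataset_size_mib is None or dataset_size_mib >= AUTO_CACHE_LIMIT_MIB:
        return False
    # File shuffling only blocks caching when more than one shard is read.
    return (not shuffle_files) or num_shards == 1


THAI_SIZE_MIB = 193.77  # dataset size listed for gem/wiki_lingua_thai_th

# Hypothetical shard counts: one shard for test, two for train.
print(is_auto_cached(THAI_SIZE_MIB, num_shards=1, shuffle_files=True))
print(is_auto_cached(THAI_SIZE_MIB, num_shards=2, shuffle_files=True))
print(is_auto_cached(THAI_SIZE_MIB, num_shards=2, shuffle_files=False))
```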

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'th': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'th': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/th	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/th	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_turkish_tr

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 6.73 MiB
Dataset size: 30.75 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	900
`'train'`	3,148
`'validation'`	449

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'tr': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'tr': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/tr	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/tr	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.

gem/wiki_lingua_vietnamese_vi

Config description: WikiLingua is a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
Download size: 36.27 MiB
Dataset size: 179.77 MiB
Auto-cached (documentation): Yes
Splits:

Split	Examples
`'test'`	3,917
`'train'`	13,707
`'validation'`	1,957
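
The WikiLingua configs on this page differ widely in prepared size, which matters when choosing one for a quick experiment. As a rough sketch, the snippet below filters configs by the dataset sizes listed in their sections; `configs_within` is a hypothetical helper, and the sizes are copied from this page rather than queried from TFDS.

```python
# Dataset sizes in MiB as listed for each config on this page.
DATASET_SIZE_MIB = {
    "gem/wiki_lingua_portuguese_pt": 570.46,
    "gem/wiki_lingua_russian_ru": 564.69,
    "gem/wiki_lingua_spanish_es": 849.75,
    "gem/wiki_lingua_thai_th": 193.77,
    "gem/wiki_lingua_turkish_tr": 30.75,
    "gem/wiki_lingua_vietnamese_vi": 179.77,
}


def configs_within(budget_mib: float) -> list[str]:
    """Return config names whose dataset size fits the budget, smallest first."""
    fitting = [
        (size, name) for name, size in DATASET_SIZE_MIB.items() if size <= budget_mib
    ]
    return [name for _, name in sorted(fitting)]


print(configs_within(200.0))
```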

Feature structure:

FeaturesDict({
    'gem_id': string,
    'gem_parent_id': string,
    'references': Sequence(string),
    'source': string,
    'source_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'vi': Text(shape=(), dtype=string),
    }),
    'target': string,
    'target_aligned': Translation({
        'en': Text(shape=(), dtype=string),
        'vi': Text(shape=(), dtype=string),
    }),
})

Feature documentation:

Feature	Class	Shape	Dtype
	FeaturesDict
gem_id	Tensor		string
gem_parent_id	Tensor		string
references	Sequence(Tensor)	(None,)	string
source	Tensor		string
source_aligned	Translation
source_aligned/en	Text		string
source_aligned/vi	Text		string
target	Tensor		string
target_aligned	Translation
target_aligned/en	Text		string
target_aligned/vi	Text		string

Examples (tfds.as_dataframe):

Citation:

@inproceedings{ladhak-wiki-2020,
title={WikiLingua: A New Benchmark Dataset for Multilingual Abstractive Summarization},
author={Faisal Ladhak and Esin Durmus and Claire Cardie and Kathleen McKeown},
booktitle={Findings of EMNLP, 2020},
year={2020}
}
@article{gehrmann2021gem,
  author    = {Sebastian Gehrmann and
               Tosin P. Adewumi and
               Karmanya Aggarwal and
               Pawan Sasanka Ammanamanchi and
               Aremu Anuoluwapo and
               Antoine Bosselut and
               Khyathi Raghavi Chandu and
               Miruna{-}Adriana Clinciu and
               Dipanjan Das and
               Kaustubh D. Dhole and
               Wanyu Du and
               Esin Durmus and
               Ondrej Dusek and
               Chris Emezue and
               Varun Gangal and
               Cristina Garbacea and
               Tatsunori Hashimoto and
               Yufang Hou and
               Yacine Jernite and
               Harsh Jhamtani and
               Yangfeng Ji and
               Shailza Jolly and
               Dhruv Kumar and
               Faisal Ladhak and
               Aman Madaan and
               Mounica Maddela and
               Khyati Mahajan and
               Saad Mahamood and
               Bodhisattwa Prasad Majumder and
               Pedro Henrique Martins and
               Angelina McMillan{-}Major and
               Simon Mille and
               Emiel van Miltenburg and
               Moin Nadeem and
               Shashi Narayan and
               Vitaly Nikolaev and
               Rubungo Andre Niyongabo and
               Salomey Osei and
               Ankur P. Parikh and
               Laura Perez{-}Beltrachini and
               Niranjan Ramesh Rao and
               Vikas Raunak and
               Juan Diego Rodriguez and
               Sashank Santhanam and
               Jo{\~{a}}o Sedoc and
               Thibault Sellam and
               Samira Shaikh and
               Anastasia Shimorina and
               Marco Antonio Sobrevilla Cabezudo and
               Hendrik Strobelt and
               Nishant Subramani and
               Wei Xu and
               Diyi Yang and
               Akhila Yerukola and
               Jiawei Zhou},
  title     = {The {GEM} Benchmark: Natural Language Generation, its Evaluation and
               Metrics},
  journal   = {CoRR},
  volume    = {abs/2102.01672},
  year      = {2021},
  url       = {https://arxiv.org/abs/2102.01672},
  archivePrefix = {arXiv},
  eprint    = {2102.01672}
}

Note that each GEM dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
