ms_marco

مراجع:

نسخه 1.1

برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:

ds = tfds.load('huggingface:ms_marco/v1.1')

توضیحات :

Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.

The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. 
Since then we released a 1,000,000 question dataset, a natural langauge generation dataset, a passage ranking dataset, 
keyphrase extraction dataset, crawling dataset, and a conversational search.

There have been 277 submissions. 20 KeyPhrase Extraction submissions, 87 passage ranking submissions, 0 document ranking 
submissions, 73 QnA V2 submissions, 82 NLGEN submisions, and 15 QnA V1 submissions

This data comes in three tasks/forms: Original QnA dataset(v1.1), Question Answering(v2.1), Natural Language Generation(v2.1). 

The original question answering datset featured 100,000 examples and was released in 2016. Leaderboard is now closed but data is availible below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and 
is much like the original QnA dataset but bigger and with higher quality. The Natural Language Generation dataset features 180,000 examples and 
builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.


version v1.1

مجوز : مجوز شناخته شده ای وجود ندارد
نسخه : 1.1.0
تقسیم ها :

تقسیم کنید	نمونه ها
`'test'`	9650
`'train'`	82326
`'validation'`	10047

ویژگی ها :

{
    "answers": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "passages": {
        "feature": {
            "is_selected": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "passage_text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            },
            "url": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "query": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "query_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "query_type": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "wellFormedAnswers": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}

نسخه 2.1

برای بارگذاری این مجموعه داده در TFDS از دستور زیر استفاده کنید:

ds = tfds.load('huggingface:ms_marco/v2.1')

توضیحات :

Starting with a paper released at NIPS 2016, MS MARCO is a collection of datasets focused on deep learning in search.

The first dataset was a question answering dataset featuring 100,000 real Bing questions and a human generated answer. 
Since then we released a 1,000,000 question dataset, a natural langauge generation dataset, a passage ranking dataset, 
keyphrase extraction dataset, crawling dataset, and a conversational search.

There have been 277 submissions. 20 KeyPhrase Extraction submissions, 87 passage ranking submissions, 0 document ranking 
submissions, 73 QnA V2 submissions, 82 NLGEN submisions, and 15 QnA V1 submissions

This data comes in three tasks/forms: Original QnA dataset(v1.1), Question Answering(v2.1), Natural Language Generation(v2.1). 

The original question answering datset featured 100,000 examples and was released in 2016. Leaderboard is now closed but data is availible below.

The current competitive tasks are Question Answering and Natural Language Generation. Question Answering features over 1,000,000 queries and 
is much like the original QnA dataset but bigger and with higher quality. The Natural Language Generation dataset features 180,000 examples and 
builds upon the QnA dataset to deliver answers that could be spoken by a smart speaker.


version v2.1

مجوز : مجوز شناخته شده ای وجود ندارد
نسخه : 2.1.0
تقسیم ها :

تقسیم کنید	نمونه ها
`'test'`	101092
`'train'`	808731
`'validation'`	101093

ویژگی ها :

{
    "answers": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "passages": {
        "feature": {
            "is_selected": {
                "dtype": "int32",
                "id": null,
                "_type": "Value"
            },
            "passage_text": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            },
            "url": {
                "dtype": "string",
                "id": null,
                "_type": "Value"
            }
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    },
    "query": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "query_id": {
        "dtype": "int32",
        "id": null,
        "_type": "Value"
    },
    "query_type": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "wellFormedAnswers": {
        "feature": {
            "dtype": "string",
            "id": null,
            "_type": "Value"
        },
        "length": -1,
        "id": null,
        "_type": "Sequence"
    }
}