c4

References:

en

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:c4/en')
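The one-line command above assumes `tensorflow_datasets` has already been imported as `tfds`. A minimal sketch of a fuller call follows. The helper names `c4_dataset_name` and `load_c4` are hypothetical, and the `validation` split is chosen only because it has far fewer examples than `train`. Whether the call succeeds depends on the installed TFDS version supporting the `huggingface:` prefix.

```python
def c4_dataset_name(config: str) -> str:
    """Build the TFDS name for a C4 config, e.g. 'en' or 'realnewslike'."""
    return f"huggingface:c4/{config}"


def load_c4(config: str = "en", split: str = "validation"):
    """Load one split of a C4 config through TFDS (hypothetical helper)."""
    # Imported lazily so c4_dataset_name works without TFDS installed.
    import tensorflow_datasets as tfds

    # tfds.load accepts the dataset name plus an optional split name.
    return tfds.load(c4_dataset_name(config), split=split)
```

Under these assumptions, `load_c4("en.noblocklist")` would request the `validation` split of the `en.noblocklist` config.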

Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on Common Crawl dataset: "https://commoncrawl.org".

This is the processed version of Google's C4 dataset by AllenAI.

License: No known license
Version: 0.0.0
Splits:

Split	Examples
`'train'`	364868892
`'validation'`	364608

Features:

{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "timestamp": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "url": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
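The feature listing above says every record carries three string fields. As a rough illustration, the sketch below checks that a record has that shape. It assumes the record has already been decoded into a plain Python dict of `str` values (TFDS itself would typically yield tensors). The name `is_valid_c4_record` and the sample values are made up for this example.

```python
from typing import Any, Mapping

# Field names taken from the feature listing; all three are string values.
C4_FIELDS = ("text", "timestamp", "url")


def is_valid_c4_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the three string fields."""
    return set(record) == set(C4_FIELDS) and all(
        isinstance(record[name], str) for name in C4_FIELDS
    )


# Made-up example values, not taken from the dataset.
sample = {
    "text": "Example page text.",
    "timestamp": "2019-04-25T12:57:54Z",
    "url": "https://example.com/page",
}
print(is_valid_c4_record(sample))
```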

en.noblocklist

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:c4/en.noblocklist')

Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on Common Crawl dataset: "https://commoncrawl.org".

This is the processed version of Google's C4 dataset by AllenAI.

License: No known license
Version: 0.0.0
Splits:

Split	Examples
`'train'`	393391519
`'validation'`	393226

Features:

{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "timestamp": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "url": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

realnewslike

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:c4/realnewslike')

Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on Common Crawl dataset: "https://commoncrawl.org".

This is the processed version of Google's C4 dataset by AllenAI.

License: No known license
Version: 0.0.0
Splits:

Split	Examples
`'train'`	13799838
`'validation'`	13863

Features:

{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "timestamp": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "url": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}

en.noclean

Use the following command to load this dataset in TFDS:

ds = tfds.load('huggingface:c4/en.noclean')

Description:

A colossal, cleaned version of Common Crawl's web crawl corpus.

Based on Common Crawl dataset: "https://commoncrawl.org".

This is the processed version of Google's C4 dataset by AllenAI.

License: No known license
Version: 0.0.0
Splits:

Split	Examples
`'train'`	1063805381
`'validation'`	1065029

Features:

{
    "text": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "timestamp": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    },
    "url": {
        "dtype": "string",
        "id": null,
        "_type": "Value"
    }
}
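The four configs differ mainly in how many examples they contain. The sketch below gathers the split counts listed for each config and compares the `train` sizes. Using `en.noclean` as the baseline is an assumption made for illustration, and the helper name `train_fraction_of` is hypothetical. The ratio compares example counts only and makes no claim about how the configs overlap.

```python
# (train, validation) example counts copied from the split tables above.
SPLIT_COUNTS = {
    "en": (364868892, 364608),
    "en.noblocklist": (393391519, 393226),
    "realnewslike": (13799838, 13863),
    "en.noclean": (1063805381, 1065029),
}


def train_fraction_of(config: str, baseline: str = "en.noclean") -> float:
    """Ratio of a config's train examples to the baseline's train examples."""
    return SPLIT_COUNTS[config][0] / SPLIT_COUNTS[baseline][0]


for name, (train, validation) in SPLIT_COUNTS.items():
    print(
        f"{name}: train={train:,} validation={validation:,} "
        f"ratio_to_noclean={train_fraction_of(name):.3f}"
    )
```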