References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
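The configurations on this page are all loaded with the same name pattern, which differs only in the trailing language code. A minimal sketch of that pattern follows; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the check that a code is purely alphanumeric is an assumption based on the codes listed here.

```python
def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name for an unshuffled, deduplicated OSCAR config."""
    lang = lang.strip().lower()
    # Assumption: language codes are short alphanumeric strings such as "af" or "als".
    if not lang.isalnum():
        raise ValueError(f"unexpected language code: {lang!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar(lang: str, split: str = "train"):
    """Load one config; TFDS is imported lazily so the name helper has no dependencies."""
    import tensorflow_datasets as tfds  # requires tensorflow-datasets to be installed
    return tfds.load(oscar_tfds_name(lang), split=split)


print(oscar_tfds_name("af"))
```

Calling `load_oscar("af")` would then be equivalent to the `tfds.load` command shown for this configuration, restricted to the `train` split.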
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
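Each configuration lists the same two features, an `int64` field `id` and a `string` field `text`. The sketch below checks a plain Python dict against that feature listing. The `PY_TYPES` mapping and the `matches_schema` helper are hypothetical, and records returned by TFDS itself may arrive as tensors or bytes that need decoding first.

```python
import json

# Feature listing copied from the section above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the dtype strings above to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        expected = PY_TYPES[spec["dtype"]]
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_schema({"id": 0, "text": "example"}, features))
print(matches_schema({"id": "0", "text": "example"}, features))
```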
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under the following licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
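Every load name in this catalog follows the same pattern, so it can be built from the language code. The sketch below is illustrative only: `oscar_tfds_name` and `matches_features` are hypothetical helpers, not part of TFDS, and the record check simply mirrors the `id`/`text` feature listing shown for each dataset.

```python
# Hypothetical helpers; neither function is part of TFDS or the OSCAR tooling.


def oscar_tfds_name(lang: str) -> str:
    """Build the TFDS name used in this catalog for a language code."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def matches_features(record: dict) -> bool:
    """Check a record against the listed features: integer `id`, string `text`."""
    return (
        set(record) == {"id", "text"}
        and isinstance(record["id"], int)
        and not isinstance(record["id"], bool)  # bool is a subclass of int
        and isinstance(record["text"], str)
    )


# The resulting name is what the catalog passes to tfds.load, e.g.
# ds = tfds.load(oscar_tfds_name("ce"))
print(oscar_tfds_name("ce"))
print(matches_features({"id": 0, "text": "example"}))
```

Keeping the name construction in one place avoids typos in the configuration suffix when several languages are loaded in a loop.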
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8203495 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20661 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
Giấy phép : Những dữ liệu này được phát hành theo sơ đồ cấp phép này, chúng tôi không sở hữu bất kỳ văn bản nào mà từ đó dữ liệu này đã được trích xuất. Chúng tôi cấp phép bao bì thực tế của các dữ liệu này theo giấy phép Creative Commons CC0 ("Không có quyền") Quyền lân cận cho Oscar Tác phẩm này được công bố từ: Pháp.
Nếu bạn xem xét rằng dữ liệu của chúng tôi chứa các tài liệu thuộc sở hữu của bạn và do đó không nên được sao chép ở đây, xin vui lòng:
- Xác định rõ ràng chính bạn, với dữ liệu liên hệ chi tiết như địa chỉ, số điện thoại hoặc địa chỉ email mà bạn có thể liên hệ.
- Xác định rõ ràng các công việc có bản quyền được tuyên bố là bị xâm phạm.
- Xác định rõ ràng các tài liệu được tuyên bố là vi phạm và thông tin đủ hợp lý để cho phép chúng tôi xác định vị trí tài liệu.
Chúng tôi sẽ tuân thủ các yêu cầu hợp pháp bằng cách loại bỏ các nguồn bị ảnh hưởng khỏi bản phát hành tiếp theo của Corpus.
Phiên bản : 1.0.0
Chia tách :
Tách ra | Ví dụ |
---|---|
'train' | 12308039 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
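Every configuration in this catalog uses the same naming pattern, so the name passed to `tfds.load` can be derived from the language code. A small sketch, assuming a hypothetical helper `oscar_tfds_name`; the `tfds.load` call itself is left commented out because it needs TensorFlow Datasets installed and network access:

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name of an unshuffled, deduplicated OSCAR configuration."""
    # Language codes in this catalog are short alphanumeric tags such as "cs" or "hsb".
    if not lang_code or not lang_code.replace("_", "").isalnum():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


name = oscar_tfds_name("cs")
# import tensorflow_datasets as tfds
# ds = tfds.load(name, split="train")
```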
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1909387 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6582908 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59448891 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
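The `train` split sizes listed for these configurations differ by several orders of magnitude (11 examples for `ie`, 59,448,891 for `fr`), which matters when choosing a configuration to experiment with. A minimal sketch of filtering configurations by size, using only counts quoted in the split tables for `cs`, `hi`, `hu`, `ie` and `fr`; the dictionary and the helper `configs_at_least` are illustrative names, not part of TFDS:

```python
# Train-split sizes copied from the split tables of this catalog.
TRAIN_EXAMPLES = {
    "cs": 12_308_039,
    "hi": 1_909_387,
    "hu": 6_582_908,
    "ie": 11,
    "fr": 59_448_891,
}


def configs_at_least(counts, minimum):
    """Return language codes with at least `minimum` train examples, largest first."""
    selected = (code for code, n in counts.items() if n >= minimum)
    return sorted(selected, key=lambda code: -counts[code])


large = configs_at_least(TRAIN_EXAMPLES, 1_000_000)
```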
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 169834 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3084 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 529 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 108346 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 29054 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18808 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1374 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 843195 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 166 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 212556 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 58 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2126 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
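Each entry on this page lists only a 'train' split, and some configurations, such as this one with a single example, are too small to divide. The sketch below builds split strings for a hold-out set. It assumes that the TFDS percent-slicing syntax (for example `train[:90%]`) also works for these `huggingface:` configurations, which this page does not confirm. `holdout_splits` is a hypothetical helper, and the size check is only an approximation of how TFDS rounds percent boundaries.

```python
def holdout_splits(num_examples, holdout_percent=10):
    """Return (train_split, holdout_split) strings for use with tfds.load.

    Falls back to the whole 'train' split, with no hold-out, when the
    configuration is too small for the hold-out part to be non-empty.
    """
    if not 0 < holdout_percent < 100:
        raise ValueError("holdout_percent must be between 1 and 99")
    # Approximate hold-out size; TFDS may round boundaries differently.
    holdout_size = num_examples * holdout_percent // 100
    if holdout_size == 0:
        return "train", None
    cut = 100 - holdout_percent
    return f"train[:{cut}%]", f"train[{cut}%:]"


print(holdout_splits(1))      # the single-example configuration above
print(holdout_splits(67921))  # a larger configuration
```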
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 67921 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
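The load commands on this page differ only in the language code at the end of the dataset name. As a minimal sketch, the name can be built from the code instead of being copied by hand. `oscar_config_name` and `load_oscar` are hypothetical helpers introduced here. `tensorflow_datasets` is imported lazily so that the name builder works without it installed, and loading still requires network access and the Hugging Face `datasets` package.

```python
# Common prefix of every dataset name listed on this page.
PREFIX = "huggingface:oscar/unshuffled_deduplicated_"


def oscar_config_name(lang_code):
    """Build the TFDS dataset name for one OSCAR language code."""
    code = lang_code.strip()
    if not code:
        raise ValueError("lang_code must not be empty")
    return PREFIX + code


def load_oscar(lang_code, split="train"):
    """Load one configuration; imports TFDS only when called."""
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang_code), split=split)


print(oscar_config_name("ps"))
```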
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 28522082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 372158 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5044757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3675420 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
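When a TFDS dataset is iterated as NumPy values (for example through `tfds.as_numpy`), string features such as `text` typically arrive as `bytes` rather than `str`. The sketch below normalizes one record under that assumption. `decode_example` is a hypothetical helper, and UTF-8 is assumed as the encoding.

```python
def decode_example(example, encoding="utf-8"):
    """Return a copy of a record with `text` as str and `id` as a plain int."""
    text = example["text"]
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return {"id": int(example["id"]), "text": text}


# A hand-made record standing in for one yielded by the dataset.
raw = {"id": 7, "text": "한국어".encode("utf-8")}
print(decode_example(raw))
```

Records whose `text` is already a `str` pass through unchanged, so the helper is safe to apply regardless of how the dataset was iterated.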
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1381 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 72 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 453904 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 183443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
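Several configurations listed so far contain only a handful of examples. The sketch below flags those that fall under a chosen minimum size, using 'train' counts copied from the split tables above. `configs_below` is a hypothetical helper, and the threshold of 100 is an arbitrary illustration, not a recommendation from the OSCAR authors.

```python
# 'train' example counts copied from the split tables on this page.
TRAIN_EXAMPLES = {
    "pam": 1,
    "ps": 67921,
    "it": 28522082,
    "scn": 17,
    "kw": 68,
    "lez": 1381,
    "lrc": 72,
    "myv": 5,
}


def configs_below(counts, min_examples):
    """Return the language codes with fewer than min_examples, sorted by code."""
    return sorted(code for code, n in counts.items() if n < min_examples)


print(configs_below(TRAIN_EXAMPLES, 100))
# ['kw', 'lrc', 'myv', 'pam', 'scn']
```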
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8714 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 109118 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2559 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2859 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 411 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7121 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2820821 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
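As a minimal sketch of how the load command and the feature schema above fit together, the helpers below build the `huggingface:oscar/...` name from a language code and check a record against the two declared fields. The names `oscar_config_name`, `load_oscar`, and `matches_schema` are hypothetical and not part of TFDS; `load_oscar` assumes `tensorflow_datasets` is installed, and `matches_schema` assumes the record has already been converted to plain Python types.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build the TFDS name this catalog uses for an OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = True, split: str = "train"):
    """Load one OSCAR config; TFDS is imported lazily so the other helpers work without it."""
    import tensorflow_datasets as tfds  # assumption: installed in the environment

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)


def matches_schema(record: dict) -> bool:
    """Check a plain-Python record against the declared features: integer `id`, string `text`."""
    return (
        set(record) == {"id", "text"}
        and isinstance(record["id"], int)
        and not isinstance(record["id"], bool)  # bool is a subclass of int in Python
        and isinstance(record["text"], str)
    )
```

For example, `oscar_config_name("sk")` reproduces the name passed to `tfds.load` in the entry above, and `oscar_config_name("als", deduplicated=False)` gives the name of the corresponding `unshuffled_original` config.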
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17610 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 645747 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 833101 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4694 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 24 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15074 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 677 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2418 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11014487 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56259 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62398034 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11596446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6521169 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7782375 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9897709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 49 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7324 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 158113 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 912330 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
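The config names in this catalog appear to follow a regular pattern: `unshuffled_`, then `original` or `deduplicated`, then a language code, all under the `huggingface:oscar/` prefix. The following is a minimal sketch assuming that pattern holds for every entry. The helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS; only `tfds.load` comes from the library.

```python
def oscar_tfds_name(variant: str, lang: str) -> str:
    """Build a TFDS name such as 'huggingface:oscar/unshuffled_original_az'.

    `variant` is assumed to be 'original' or 'deduplicated', matching the
    config names listed in this catalog.
    """
    if variant not in ("original", "deduplicated"):
        raise ValueError(f"unexpected variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(variant: str, lang: str, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(variant, lang), split=split)


if __name__ == "__main__":
    print(oscar_tfds_name("original", "az"))
```

Every entry listed here exposes only a `train` split, which is why the sketch defaults to it.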
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
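Each entry in this catalog lists the same two features, an `int64` field `id` and a `string` field `text`. As a rough illustration, the sketch below checks a decoded, plain Python record against that spec. The `DTYPE_TO_PYTHON` mapping and the `matches_features` helper are assumptions made for this example and are not part of TFDS or the Hugging Face libraries.

```python
import json

# Feature spec as listed in the entries of this catalog.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the listed dtypes to Python types (illustrative only).
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_features(record, features=FEATURES):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = DTYPE_TO_PYTHON[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    return True


if __name__ == "__main__":
    print(matches_features({"id": 0, "text": "example"}))
```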
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1675515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2143 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4042 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20281 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2093621 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41708901 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2449 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6999 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42551 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5869686 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6046 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4390754 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 103639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56326016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7664010 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21018 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 121168 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
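Every entry on this page uses the same naming pattern for its load command. As a minimal sketch, the name can be built from the variant and language code. The helper names `tfds_name` and `load_train` are hypothetical and not part of TFDS. The sketch assumes `tensorflow_datasets` is installed and that the `train` split listed in the table above is requested.

```python
def tfds_name(variant: str, lang: str) -> str:
    """Build the TFDS name of an OSCAR config listed on this page.

    `variant` is "original" or "deduplicated"; `lang` is the language
    code used in the entry title (for example "eo").
    """
    if variant not in ("original", "deduplicated"):
        raise ValueError(f"unknown variant: {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_train(variant: str, lang: str):
    """Load the 'train' split of one config (hypothetical convenience wrapper)."""
    # Imported lazily so that tfds_name() works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(tfds_name(variant, lang), split="train")
```

For the entry above, `tfds_name("original", "eo")` gives the same string that appears in its load command.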
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5326443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
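The feature specification above is the same for every config: an integer `id` and a string `text`. The sketch below checks a record against that specification. It assumes the record has already been converted to plain Python values (TFDS itself yields tensors, and strings arrive as bytes until decoded). The `PY_TYPES` mapping and the `matches_features` helper are illustrative names, not part of any library.

```python
import json

# The feature specification as printed in the entries on this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the dtype strings above to Python types.
PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PY_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```

A record such as `{"id": 0, "text": "..."}` passes the check, while a record with a missing field or a string `id` does not.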
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46493 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 484 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 321484 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 396093 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1578 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13704702 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 33053 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 106 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3264660 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11197780 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 101 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39496439 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 338073 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1377 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 86561 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 118 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1737411 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 197878 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16383 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 917 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 219334 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3229940 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 87235 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8555 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 120684 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 461598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 24803 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3749826 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 82738 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 428674 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3317 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
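The load commands on this page follow one naming pattern: `huggingface:oscar/unshuffled_`, then `deduplicated` or `original`, then a language code. The sketch below builds that name string from its parts. `oscar_tfds_name`, `load_oscar`, and `VARIANTS` are hypothetical helpers written for illustration, and `VARIANTS` covers only the two variants that appear on this page. `load_oscar` assumes `tensorflow_datasets` is installed and that the dataset can be downloaded.

```python
# The two variants that appear in the load commands on this page.
VARIANTS = ("deduplicated", "original")


def oscar_tfds_name(language, variant="deduplicated"):
    """Build the name string passed to tfds.load for one OSCAR configuration."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language, variant="deduplicated"):
    """Load one configuration; needs tensorflow_datasets and network access."""
    # Imported lazily so the name helper above works without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, variant))


print(oscar_tfds_name("yue"))
# huggingface:oscar/unshuffled_deduplicated_yue
print(oscar_tfds_name("am", variant="original"))
# huggingface:oscar/unshuffled_original_am
```

Building the name in one place makes it harder to mistype a configuration such as `unshuffled_deduplicate_ur`, which would not match the name used by the load command.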
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83663 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14985 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15446 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 586031 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
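The feature specification above describes each record as an int64 `id` plus a string `text`. As a minimal sketch, a plain-Python check of an already-decoded record against that specification could look like the following. The helper name `matches_features` and the sample record are hypothetical and are not part of TFDS or OSCAR; records returned by TFDS would first need converting to Python values.

```python
import json

# Feature specification copied from the listing above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def matches_features(record, features=FEATURES):
    """Return True if `record` has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        if spec["dtype"] == "int64":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                return False
            if not INT64_MIN <= value <= INT64_MAX:
                return False
        elif spec["dtype"] == "string":
            if not isinstance(value, str):
                return False
        else:
            return False  # dtype not covered by this sketch
    return True


# Hypothetical record, for illustration only.
print(matches_features({"id": 0, "text": "example"}))
```

Only the two dtypes that appear in this listing are handled; any other dtype is treated as a mismatch.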
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26795 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56248 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 157698 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 65 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 96742378 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
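The train split sizes in this listing range from 42 examples (unshuffled_original_bxr) to 96742378 (unshuffled_original_fr), so it can be useful to cap how much of a large config is read. The sketch below builds a split string in the TFDS slicing syntax (`train[:N]`). The helper `capped_split` and the limit of 10000 are hypothetical, and it is an assumption that the `huggingface:` builder accepts the standard TFDS split slicing.

```python
def capped_split(num_examples, limit, split="train"):
    """Build a split string that reads at most `limit` examples."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if num_examples <= limit:
        # Small configs are read in full.
        return split
    return f"{split}[:{limit}]"


# Train sizes taken from the split tables in this listing.
TRAIN_SIZES = {
    "unshuffled_original_bxr": 42,
    "unshuffled_original_fr": 96742378,
}

for config, size in TRAIN_SIZES.items():
    name = f"huggingface:oscar/{config}"
    print(name, capped_split(size, limit=10_000))
    # With tensorflow_datasets installed, the result could be passed on as:
    #   ds = tfds.load(name, split=capped_split(size, limit=10_000))
```

The `tfds.load` call is left as a comment so the snippet runs without TensorFlow Datasets installed or any download taking place.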
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5799 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 240691 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7959 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1040 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 832 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 159363 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46535 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 94588 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1401 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1593820 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 220 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 326804 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 61 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
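The load commands on this page all follow one naming pattern: the prefix `huggingface:oscar/` followed by the configuration name. The sketch below captures that pattern in two hypothetical helpers, `oscar_tfds_name` and `load_oscar_train` (both names are assumptions for illustration). It assumes `tensorflow_datasets` is installed together with the Hugging Face `datasets` package; whether a given configuration actually loads depends on the installed versions.

```python
def oscar_tfds_name(config: str) -> str:
    """Build the TFDS dataset name for an OSCAR configuration on this page.

    Hypothetical helper; it only mirrors the naming pattern shown in the
    load commands (e.g. 'huggingface:oscar/unshuffled_original_nah').
    """
    if not config.startswith("unshuffled_"):
        raise ValueError(f"unexpected OSCAR configuration name: {config!r}")
    return f"huggingface:oscar/{config}"


def load_oscar_train(config: str):
    """Load the 'train' split of the given configuration via tfds.load."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")


print(oscar_tfds_name("unshuffled_original_nah"))
```

Calling `load_oscar_train("unshuffled_original_nah")` would then request the same dataset as the command shown for that configuration, restricted to its only listed split, 'train'.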
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4696 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98216 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9387265 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5492194 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1013619 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1263280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6456 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27537 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1001 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3783 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46981781 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 563916 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7345075 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 203 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 88 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17957 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 603937 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 534016 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
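Every config on this page shares the same two-field feature schema (`id` as `int64`, `text` as `string`). The check below is a rough sketch of validating a record against that schema, assuming records have already been converted to plain Python values; the name `is_valid_record` is hypothetical, and NumPy integer scalars would need `int(...)` applied first. Text is accepted as either `str` or `bytes`, since TFDS commonly yields string features as bytes.

```python
# Bounds of a signed 64-bit integer, matching the "int64" dtype of "id".
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def is_valid_record(record) -> bool:
    """Return True if record has exactly the fields 'id' (int64) and 'text' (string)."""
    if not isinstance(record, dict) or set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    if not INT64_MIN <= rec_id <= INT64_MAX:
        return False
    return isinstance(text, (str, bytes))


if __name__ == "__main__":
    print(is_valid_record({"id": 0, "text": "example"}))
```

A very small config such as the 6-example one above is a convenient target for trying a check like this before running it over a larger config.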
unshuffled_original_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18174 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 185884 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5213 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3225 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 452 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14291 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36700 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 156 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17395625 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 89002 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18535253 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 12973467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14898250 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 60137667 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 304230423 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
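Each configuration on this page shares the same two-field schema. As an illustration only, the sketch below checks a plain Python record against that schema; the `matches_schema` helper and the `PYTHON_TYPES` mapping are assumptions made for this example, and records returned by TFDS itself are tensors rather than plain Python values.

```python
import json

# The feature schema as listed on this page.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from schema dtypes to plain Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def matches_schema(record: dict, features: dict = FEATURES) -> bool:
    """Return True if the record has exactly the schema's fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


print(matches_schema({"id": 0, "text": "Hello"}))
```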
unshuffled_deduplicated_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 256513 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 284320 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2375030 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9948521 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 389515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1163 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 251064 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 924 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21735 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32652 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 25 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299457 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 669 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 136639 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 55 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20812149 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 44230 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20682611 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
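Every configuration listed on this page follows the same naming pattern, so the load string can be built programmatically. The following is a minimal sketch rather than part of the catalog: the helper names `oscar_tfds_name` and `load_oscar` are hypothetical, and the loader assumes `tensorflow_datasets` is installed and that the `huggingface:` namespace is available in your version.

```python
def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build the TFDS name for an OSCAR config, following the pattern on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


def load_oscar(language: str, deduplicated: bool = True, split: str = "train"):
    """Hypothetical wrapper: load one OSCAR config through tfds.load."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, deduplicated), split=split)


print(oscar_tfds_name("pl"))
print(oscar_tfds_name("af", deduplicated=False))
```

Calling `load_oscar("pl")` would download the full configuration, which is large for languages such as Polish, so it may be worth trying a small configuration first.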
unshuffled_deduplicated_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26920397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 115954598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 33925 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
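The feature specification above describes two fields per example: an `int64` `id` and a string `text`. As a rough illustration, the sketch below checks a record against that schema once it has been converted to plain Python values. The function name `is_valid_record` is hypothetical and this check is not part of TFDS.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def is_valid_record(record: dict) -> bool:
    """Check a record against the schema: int64 `id` and string `text`."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    return INT64_MIN <= rec_id <= INT64_MAX and isinstance(text, str)


print(is_valid_record({"id": 0, "text": "example"}))
print(is_valid_record({"id": "0", "text": "example"}))
```

Records read directly from a `tf.data.Dataset` typically carry `text` as bytes, so they would need decoding before a check like this applies.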
unshuffled_deduplicated_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 886223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 511 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 312644 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 294132 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15503 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 64 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9161 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32919 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 201117 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
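Comparing the 'train' counts of an original configuration and its deduplicated counterpart gives a rough sense of how much deduplication removed. The snippet below is only an illustrative calculation. It uses the 201117 examples listed for unshuffled_original_af here and the 130640 examples listed for unshuffled_deduplicated_af earlier on this page, and it assumes example counts are a meaningful unit of comparison.

```python
# Example counts copied from the 'train' rows on this page.
original_af = 201117       # unshuffled_original_af
deduplicated_af = 130640   # unshuffled_deduplicated_af

kept = deduplicated_af / original_af      # fraction of examples that survive
removed = original_af - deduplicated_af   # number of examples dropped

print(f"kept {kept:.1%} of examples, removed {removed}")
```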
unshuffled_original_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16365602 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 456 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 336 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 37085 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21001388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 104913504 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 10425596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
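The load commands on this page all follow one naming pattern: `huggingface:oscar/unshuffled_<variant>_<lang>`, where the variant is `original` or `deduplicated`. A small sketch of building that string from a language code is shown below. The helper `oscar_tfds_name` is a hypothetical name introduced here, not a TFDS function, and it does not check whether a given language code actually exists in OSCAR.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS dataset name for an OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


# Loading requires tensorflow_datasets and network access, so it is left commented out:
# import tensorflow_datasets as tfds
# ds = tfds.load(oscar_tfds_name("el"))
```

For example, `oscar_tfds_name("el")` yields the same string as the load command listed for `unshuffled_original_el`.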
unshuffled_original_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 88199221 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8557453 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 640 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 582219 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 659430 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2638 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 62721527 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 524591 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1581 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 146993 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 137 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2977757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3212 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 395605 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1055 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299938 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5546211 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 127467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Sử dụng lệnh sau để tải tập dữ liệu này trong TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Sự miêu tả :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4599 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 22301 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 203082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 672077 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41986 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6064129 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 135923 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 638596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3366 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
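The load commands on this page all follow one naming pattern, `huggingface:oscar/unshuffled_<variant>_<lang>`. The following is a rough sketch, not an official API, of building that name from a language code. The helpers `oscar_tfds_name` and `load_oscar` are hypothetical names introduced here, and `load_oscar` assumes `tensorflow_datasets` is installed and that the config can be downloaded.

```python
def oscar_tfds_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config, following the pattern used on this page."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR config through TFDS (hypothetical convenience wrapper)."""
    # Imported lazily so that building names does not require TensorFlow.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(lang, deduplicated), split=split)
```

Small configs such as `yue` (11 examples) are a convenient first try, since the larger ones (for example `en`, with 455994980 examples) take far longer to download and prepare.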
unshuffled_original_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 455994980 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 506883 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 544388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3808397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16236463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 625673 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1445 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
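The dataset names passed to `tfds.load` in this listing follow a visible pattern: `huggingface:oscar/unshuffled_<variant>_<language>`, where the variant is `original` or `deduplicated`. A small sketch of composing such a name is shown below; `oscar_tfds_name` is a hypothetical helper, not a TFDS function, and it does not check that the language code exists in the catalog.

```python
# Variants that appear in the configuration names of this listing.
VARIANTS = ("original", "deduplicated")


def oscar_tfds_name(language, variant="original"):
    """Build the TFDS name for an OSCAR configuration, e.g. language='jv'."""
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


print(oscar_tfds_name("jv"))

# Loading needs tensorflow_datasets installed plus network access,
# so it is left commented out here:
# import tensorflow_datasets as tfds
# ds = tfds.load(oscar_tfds_name("jv"))
```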
unshuffled_original_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 350363 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1549 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34807 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 52910 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 123 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 437871 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 232329 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 73 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 34682142 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59463 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 35440972 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42114520 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 161836003 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 44280 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1746604 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 805 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 475703 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 458206 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 22255 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 73 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9760 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59364 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
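The load commands for the ug, vec, war and yi configurations differ only in the language code, so the names can be built programmatically. The sketch below does that and lists the configurations by 'train' split size, using the counts from the tables above. The helper `tfds_name` and the `TRAIN_EXAMPLES` dictionary are hypothetical names introduced here; the code only builds strings and does not call TFDS.

```python
# Number of 'train' examples per language code, copied from the tables above.
TRAIN_EXAMPLES = {"ug": 22255, "vec": 73, "war": 9760, "yi": 59364}


def tfds_name(lang):
    """Build the TFDS name for an unshuffled, original OSCAR config."""
    return f"huggingface:oscar/unshuffled_original_{lang}"


# List the configs from smallest to largest 'train' split.
for lang, count in sorted(TRAIN_EXAMPLES.items(), key=lambda item: item[1]):
    print(f"{tfds_name(lang)}: {count} examples")
```

A name produced this way can be passed to `tfds.load` exactly as in the commands shown for each configuration.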