References:
unshuffled_deduplicated_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_af')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 130640 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
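The feature specification above describes each record as an integer `id` and a string `text`. As a rough sketch (not part of TFDS or of the Hugging Face loader), the spec can be parsed with the standard `json` module and used to sanity-check a record. The `DTYPE_TO_PYTHON` mapping and the `matches_spec` helper are hypothetical names introduced here for illustration only.

```python
import json

# Feature spec copied from the listing above.
FEATURE_SPEC = json.loads("""
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Hypothetical mapping from the dtype names in the spec to Python types.
DTYPE_TO_PYTHON = {"int64": int, "string": str}


def matches_spec(record, spec=FEATURE_SPEC):
    """Return True if record has exactly the spec's keys with matching types."""
    if set(record) != set(spec):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PYTHON[feature["dtype"]])
        for name, feature in spec.items()
    )
```

A check like this may be handy when examples are converted to plain Python dictionaries before further processing; it assumes the two `Value` dtypes shown above are the only ones that occur.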
unshuffled_deduplicated_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4518 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 79928 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2025 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5343 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
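The configurations listed so far follow the naming pattern `unshuffled_deduplicated_<code>`, and each load command prefixes that name with `huggingface:oscar/`. The following sketch builds the name string from a language code and picks the smallest of the configurations above, which may be convenient for a quick smoke test. `TRAIN_EXAMPLES` only copies the counts from the split tables above, and `tfds_name` is a hypothetical helper, not a TFDS function.

```python
# 'train' example counts copied from the split tables above.
TRAIN_EXAMPLES = {
    "af": 130640,
    "als": 4518,
    "arz": 79928,
    "an": 2025,
    "ast": 5343,
}


def tfds_name(language_code):
    """Build the name string used in the load commands above (hypothetical helper)."""
    return f"huggingface:oscar/unshuffled_deduplicated_{language_code}"


# Pick the configuration with the fewest 'train' examples.
smallest = min(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
print(tfds_name(smallest))
```

The resulting string can be passed to `tfds.load` in the same way as the literal names shown in each section.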
unshuffled_deduplicated_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 27050 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_am')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 43102 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_as')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9212 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_azb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9985 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_be')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 307405 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 15762 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bxr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 36 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ceb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 26145 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 626796 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 98225 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 37 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1114481 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2984 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 10130 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 80 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing, with information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1172041 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
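Every configuration on this page is loaded with the same `tfds.load` call, differing only in the language suffix. A minimal sketch of that pattern, assuming `tensorflow_datasets` is installed; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical conveniences, not part of TFDS:

```python
def oscar_tfds_name(lang_code):
    """Build the TFDS name used on this page for a language code such as 'et'."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code):
    """Load one configuration; TFDS is imported lazily so the name helper works without it."""
    import tensorflow_datasets as tfds  # assumed to be installed

    return tfds.load(oscar_tfds_name(lang_code))


print(oscar_tfds_name("et"))  # huggingface:oscar/unshuffled_deduplicated_et
```

Calling `load_oscar("et")` would download the data, so it is left out of the example run.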
unshuffled_deduplicated_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3398679 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1770 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2458067 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68210 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9006977 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_av')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 360 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bar')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_bh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 82 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_br')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 14724 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cbk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4771098 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17024 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84752 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 8203495 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20661 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_cs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 12308039 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1909387 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6582908 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ie')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 59448891 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gd')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3883 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 169834 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hsb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
If you believe that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number, or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3084 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ia')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 529 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_io')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jbo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 617 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_km')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 108346 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ku')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 29054 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_la')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 18808 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lmo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1374 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 843195 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_min')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 166 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 212556 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mwl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
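With only 7 examples, a configuration like this one is small enough to inspect by hand. The sketch below previews records as `(id, text)` pairs. It assumes the examples arrive as dictionaries with the `id` and `text` features listed above and that string values may come back as UTF-8 bytes, as they do when iterating with `tfds.as_numpy`. The helper names `preview` and `preview_config` and the sample record are hypothetical.

```python
def preview(examples, width=60):
    """Yield (id, shortened text) pairs from an iterable of feature dicts."""
    for ex in examples:
        text = ex["text"]
        if isinstance(text, bytes):
            # String features converted with tfds.as_numpy arrive as bytes.
            text = text.decode("utf-8", errors="replace")
        yield int(ex["id"]), text[:width]


def preview_config(config="unshuffled_deduplicated_mwl"):
    """Load one OSCAR configuration and preview every record (needs TFDS)."""
    import tensorflow_datasets as tfds  # imported lazily; not needed for preview()

    ds = tfds.load(f"huggingface:oscar/{config}", split="train")
    return list(preview(tfds.as_numpy(ds)))


# Made-up record standing in for real data, so the helper runs without a download.
fake = [{"id": 0, "text": "hypothetical sample".encode("utf-8")}]
print(list(preview(fake)))
```

Calling `preview_config()` downloads the data, so it is left uncalled here.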
unshuffled_deduplicated_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nah')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 58 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_new')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2126 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_oc')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6485 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pam')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ps')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 67921 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_it')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 28522082 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
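The 'train' split of this configuration lists 28522082 examples, so loading it whole can be slow. A minimal sketch, assuming the standard TFDS split-slicing syntax (for example `train[:1000]`) also applies to this community dataset, which has not been verified here, builds a slice string for the first `n` examples. The helper names `head_split` and `load_head` are hypothetical.

```python
def head_split(split: str, n: int) -> str:
    """Build a TFDS slicing string that selects the first `n` examples of `split`."""
    if n <= 0:
        raise ValueError("n must be positive")
    return f"{split}[:{n}]"


def load_head(config: str, n: int):
    """Load the first `n` training examples of an OSCAR configuration (needs TFDS)."""
    import tensorflow_datasets as tfds  # imported lazily; not needed for head_split()

    return tfds.load(f"huggingface:oscar/{config}", split=head_split("train", n))


print(head_split("train", 1000))  # prints: train[:1000]
```

For example, `load_head('unshuffled_deduplicated_it', 1000)` would request only the first thousand records, though TFDS may still need to download and prepare the underlying data first.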
unshuffled_deduplicated_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ka')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 372158 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ro')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5044757 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_scn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 17 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ko')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 3675420 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kw')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 68 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lez')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1381 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
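The load commands on this page all follow the same naming pattern. As a minimal sketch that assumes only that pattern, the hypothetical helpers `oscar_config_name` and `load_oscar_train` below (neither is part of TFDS) build the dataset name from a language code and load its `train` split. The `tensorflow_datasets` import is deferred so the name helper works even where the library is not installed.

```python
def oscar_config_name(lang):
    """Build the TFDS name of a deduplicated OSCAR config (hypothetical helper)."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang}"


def load_oscar_train(lang):
    """Load the 'train' split of one config; requires tensorflow_datasets."""
    import tensorflow_datasets as tfds  # deferred: only needed when loading

    return tfds.load(oscar_config_name(lang), split="train")


print(oscar_config_name("lez"))
```

Calling `load_oscar_train("lez")` would download and prepare the data, so it is left out of the snippet.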
unshuffled_deduplicated_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 72 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13343 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
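Every config on this page lists the same two features. The sketch below shows one way to check a plain Python record against that specification. The `PYTHON_TYPES` mapping and the `matches_features` helper are illustrative assumptions, not part of TFDS or the Hugging Face `datasets` library.

```python
import json

# Feature specification copied from the listing above.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to Python types (illustrative only).
PYTHON_TYPES = {"int64": int, "string": str}


def matches_features(record, features):
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], PYTHON_TYPES[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example"}, features))    # True
print(matches_features({"id": "0", "text": "example"}, features))  # False
```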
unshuffled_deduplicated_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 453904 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 183443 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
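Split sizes on this page range from a handful of examples to hundreds of thousands, so it can help to filter configs by size before loading any of them. The sketch below copies a few `train` counts from the split tables above into a dictionary. The names `TRAIN_EXAMPLES` and `configs_with_at_least` are hypothetical and exist only for this illustration.

```python
# Train-split sizes copied from the split tables on this page (a subset).
TRAIN_EXAMPLES = {
    "lez": 1381,
    "lrc": 72,
    "mg": 13343,
    "ml": 453904,
    "ms": 183443,
    "myv": 5,
}


def configs_with_at_least(min_examples):
    """Return the language codes whose train split has at least min_examples rows."""
    return sorted(
        lang for lang, count in TRAIN_EXAMPLES.items() if count >= min_examples
    )


print(configs_with_at_least(10_000))  # ['mg', 'ml', 'ms']
```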
unshuffled_deduplicated_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8714 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 109118 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2559 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2859 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 411 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7121 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2820821 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17610 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 645747 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ta')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 833101 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tyv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 24 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15074 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 677 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"), http://creativecommons.org/publicdomain/zero/1.0/. To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2418 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
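The config names listed in this catalog follow the pattern `unshuffled_<variant>_<language code>`, where the variant is `deduplicated` or `original`, and the TFDS name prefixes this with `huggingface:oscar/`. A minimal sketch of building that name programmatically is shown below. The function `oscar_tfds_name` is a hypothetical helper, and the pattern is inferred from the entries on this page rather than from any official specification.

```python
def oscar_tfds_name(language: str, deduplicated: bool = True) -> str:
    """Build the TFDS name for an OSCAR config, following the pattern used in this catalog.

    `language` is the language code as it appears in the config name (e.g. "xmf", "sv").
    """
    if not language:
        raise ValueError("language code must be a non-empty string")
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{language}"


if __name__ == "__main__":
    print(oscar_tfds_name("xmf"))
    print(oscar_tfds_name("als", deduplicated=False))
```

The resulting string can be passed to `tfds.load`, for example `tfds.load(oscar_tfds_name("sv"))`. That call is left out of the sketch because it requires `tensorflow_datasets` to be installed and downloads the data.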
unshuffled_deduplicated_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11014487 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56259 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_de')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 62398034 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 11596446 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_el')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6521169 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_uk')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7782375 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 9897709 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_wuu')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 64 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 49 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_als
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_als')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7324 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_arz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_arz')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 158113 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_az
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_az')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 912330 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bcl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bcl')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1675515 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bs')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2143 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ce
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ce')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4042 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 20281 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_diq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_diq')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eml')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 84 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_et
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_et')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License: These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2093621 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_zh')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 41708901 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
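Every config on this page shares the same two-field feature spec: an `int64` `id` and a `string` `text`. The sketch below checks a plain Python record against that spec. `matches_features` is a hypothetical helper, and it assumes the record has already been converted to built-in Python types (not tensors).

```python
import json

# Feature spec as listed on this page.
FEATURES = json.loads("""
{
  "id":   {"dtype": "int64",  "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the listed dtypes to built-in Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(example: dict, features: dict = FEATURES) -> bool:
    """Return True if `example` has exactly the listed fields with matching types."""
    if set(example) != set(features):
        return False
    for name, spec in features.items():
        value = example[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[spec["dtype"]]):
            return False
        if spec["dtype"] == "int64" and not -(2**63) <= value < 2**63:
            return False
    return True
```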
unshuffled_original_an
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_an')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 2449 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
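When a loaded dataset is iterated as NumPy values, TensorFlow string features typically arrive as UTF-8 encoded `bytes` rather than `str`. The sketch below normalizes one record into built-in Python types. `decode_example` is a hypothetical helper, and the UTF-8 assumption is not stated on this page.

```python
def decode_example(raw: dict) -> dict:
    """Convert one raw record into {'id': int, 'text': str} (hypothetical helper)."""
    text = raw["text"]
    if isinstance(text, bytes):
        # Assumption: the text field is UTF-8 encoded.
        text = text.decode("utf-8")
    return {"id": int(raw["id"]), "text": text}


# Possible usage, assuming `ds` was loaded with split="train":
#   import tensorflow_datasets as tfds
#   for raw in tfds.as_numpy(ds.take(3)):
#       print(decode_example(raw))
```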
unshuffled_original_ast
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ast')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6999 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ba
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ba')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 42551 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bg')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5869686 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bpy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bpy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 6046 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ca
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ca')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 4390754 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ckb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ckb')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 103639 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_es')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 56326016 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_da
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_da')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 7664010 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dv')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 21018 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 121168 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_fi')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 5326443 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ga')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 46493 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gom')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hr')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 321484 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_hy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 396093 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ilo')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 1578 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fa')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 13704702 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fy')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version: 1.0.0
Splits:
Split | Examples |
---|---|
'train' | 33053 |
- Features:
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gn')
- Description:
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 106 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
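The load commands on this page all follow the same `huggingface:oscar/<config>` naming pattern. The following is a minimal sketch of that pattern, assuming `tensorflow_datasets` is installed with Hugging Face dataset support. The helper names `oscar_tfds_name` and `load_oscar_config` are hypothetical and not part of TFDS, and the prefix check only reflects the two config families listed on this page.

```python
# Hypothetical helpers, not part of TFDS: they only wrap the naming
# pattern used by the load commands on this page.
OSCAR_PREFIXES = ("unshuffled_original_", "unshuffled_deduplicated_")


def oscar_tfds_name(config: str) -> str:
    """Build the TFDS dataset name for an OSCAR config such as 'unshuffled_original_gn'."""
    if not config.startswith(OSCAR_PREFIXES):
        raise ValueError(f"unexpected OSCAR config name: {config!r}")
    return f"huggingface:oscar/{config}"


def load_oscar_config(config: str, split: str = "train"):
    """Load one config. TFDS is imported lazily so the name helper works without it."""
    import tensorflow_datasets as tfds  # assumption: installed in the environment

    return tfds.load(oscar_tfds_name(config), split=split)
```

Small configs such as `unshuffled_original_gn` (106 examples) are a reasonable first try before loading the larger ones.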
unshuffled_original_hi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3264660 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
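Every config on this page lists the same two features, an `int64` `id` and a `string` `text`. As a rough sketch, a record could be checked against that schema as follows, assuming records have already been converted to plain Python values (TFDS itself yields tensors). The names `FEATURES` and `matches_features` are illustrative only.

```python
# Illustrative mapping of the listed dtypes to Python types:
# "int64" -> int, "string" -> str.
FEATURES = {"id": int, "text": str}


def matches_features(record: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(FEATURES):
        return False
    return all(
        # bool is a subclass of int in Python, so reject it explicitly.
        isinstance(record[name], expected) and not isinstance(record[name], bool)
        for name, expected in FEATURES.items()
    )
```

For example, `matches_features({"id": 0, "text": "some crawled text"})` passes the check, while a record with a missing `text` field does not.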
unshuffled_original_hu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11197780 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ie
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ie')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 101 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39496439 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 338073 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1377 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 86561 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 118 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1737411 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 197878 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16383 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 917 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 219334 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3229940 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 87235 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8555 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 120684 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"): http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 461598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
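The load command and feature schema above can be sketched in Python as follows. This is a minimal illustration, not part of the catalog: the helper names `oscar_tfds_name`, `matches_schema` and `load_train` are hypothetical, and `matches_schema` assumes records have already been converted to plain Python values (TFDS itself yields tensors, with text as bytes).

```python
from typing import Any, Dict

# Prefix used by the load commands on this page.
OSCAR_PREFIX = "huggingface:oscar/"


def oscar_tfds_name(config: str) -> str:
    """Build the TFDS name for an OSCAR config such as 'unshuffled_deduplicated_sq'."""
    return f"{OSCAR_PREFIX}{config}"


def matches_schema(record: Dict[str, Any]) -> bool:
    """Check that a plain-Python record has exactly an integer 'id' and a string 'text'."""
    return (
        set(record) == {"id", "text"}
        and isinstance(record["id"], int)
        and not isinstance(record["id"], bool)  # bool is a subclass of int; reject it
        and isinstance(record["text"], str)
    )


def load_train(config: str):
    """Load the 'train' split; requires tensorflow_datasets and network access."""
    # Imported lazily so the helpers above work without TFDS installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(config), split="train")
```

For example, `load_train("unshuffled_deduplicated_sq")` would issue the same `tfds.load` call shown above, restricted to the `'train'` split.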
unshuffled_deduplicated_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 24803 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3749826 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 82738 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 428674 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3317 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_am
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_am')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83663 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_as
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_as')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14985 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_azb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_azb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15446 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_be
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_be')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 586031 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26795 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bxr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bxr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ceb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ceb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 56248 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 157698 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_dsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_dsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 65 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 96742378 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5799 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 240691 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hsb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hsb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7959 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ia
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ia')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1040 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
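The load commands on this page all follow one naming pattern, `huggingface:oscar/unshuffled_<variant>_<language code>`. The sketch below builds that string with a hypothetical helper, `oscar_tfds_name`, and pairs it with two train-split sizes copied from the listings above. The helper and the `TRAIN_EXAMPLES` dictionary are illustrative assumptions, not part of TFDS; the actual `tfds.load` call is left commented out because it needs `tensorflow_datasets` installed and network access.

```python
def oscar_tfds_name(lang_code, deduplicated=False):
    """Build the TFDS name for an OSCAR config (hypothetical helper)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang_code}"


# Train-split sizes copied from the listings on this page.
TRAIN_EXAMPLES = {"hsb": 7959, "ia": 1040}

for code, count in TRAIN_EXAMPLES.items():
    print(oscar_tfds_name(code), count)

# Loading itself requires tensorflow_datasets and network access:
# import tensorflow_datasets as tfds
# ds = tfds.load(oscar_tfds_name("ia"))
```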
unshuffled_original_io
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_io')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 694 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jbo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jbo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 832 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_km
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_km')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 159363 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ku
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ku')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46535 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_la
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_la')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 94588 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lmo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lmo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1401 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1593820 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_min
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_min')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 220 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 326804 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mwl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mwl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 61 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_new
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_new')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4696 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_oc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_oc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 10709 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pam
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pam')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ps
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ps')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 98216 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ro
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ro')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9387265 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_scn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_scn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5492194 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1013619 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ta
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ta')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved"; http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1263280 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
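The load commands on this page all follow one naming pattern, which can be sketched as a small helper. This is a minimal sketch, not part of the TFDS API: the function names `oscar_config_name` and `load_oscar` are hypothetical, and it assumes `tensorflow_datasets` is installed with Hugging Face community dataset support available in your environment.

```python
def oscar_config_name(lang: str, deduplicated: bool = False) -> str:
    """Build the TFDS name for an OSCAR config listed on this page.

    Hypothetical helper: it only mirrors the naming pattern seen here,
    e.g. 'huggingface:oscar/unshuffled_original_ta'.
    """
    variant = "deduplicated" if deduplicated else "original"
    return f"huggingface:oscar/unshuffled_{variant}_{lang}"


def load_oscar(lang: str, deduplicated: bool = False, split: str = "train"):
    """Load one OSCAR config through TFDS (downloads data on first use)."""
    # Imported lazily so the name helper works without TensorFlow installed.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_config_name(lang, deduplicated), split=split)
```

For example, `load_oscar("ta")` would issue the same `tfds.load` call shown above for `unshuffled_original_ta`. Each record is expected to carry the two features listed in the schema, an `int64` `id` and a string `text`.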
unshuffled_original_tk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6456 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tyv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tyv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uz
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uz')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 27537 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1001 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_xmf
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xmf')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3783 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_it
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_it')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 46981781 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ka
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ka')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 563916 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ko
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ko')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7345075 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 203 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lez
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lez')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1485 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lrc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lrc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 88 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17957 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ml
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ml')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 603937 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 534016 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_myv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_myv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nds
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nds')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18174 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 185884 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_os
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_os')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5213 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pms
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pms')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3225 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_qu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_qu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 452 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14291 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 36700 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_so
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_so')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 156 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 17395625 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tg
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tg')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 89002 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 18535253 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_uk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_uk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 12973467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 14898250 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_wuu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_wuu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
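Every entry in this listing reports its split sizes in the same small pipe table: a quoted split name, a pipe, and an example count. For readers who want those counts programmatically, the following is a rough sketch of a parser for that layout. It uses only the standard library; the name `parse_splits` and the regular expression are illustrative assumptions, not part of TFDS.

```python
import re

# One data row looks like: 'train' | 214 |
# Header and separator rows do not match and are skipped.
ROW = re.compile(r"^'(?P<split>[^']+)'\s*\|\s*(?P<count>\d+)\s*\|?\s*$")


def parse_splits(lines):
    """Map split names to example counts for the rows that match ROW."""
    splits = {}
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            splits[match.group("split")] = int(match.group("count"))
    return splits


# Table rows as they appear in the entry above.
table = [
    "Split | Examples |",
    "---|---|",
    "'train' | 214 |",
]
print(parse_splits(table))
```

Because non-matching rows are ignored rather than rejected, the same function can be run over a whole entry without first isolating the table.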
unshuffled_original_yo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 214 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_zh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_zh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 60137667 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
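The config names in this listing follow one pattern: a variant prefix (`unshuffled_original` or `unshuffled_deduplicated`) followed by a suffix that appears to be a language code, and each load command prepends `huggingface:oscar/`. The sketch below builds the TFDS name from a config name under that assumption. `OscarConfig`, `parse_config`, `tfds_name`, and `load_config` are hypothetical helpers; only `tfds.load` itself comes from TFDS, and it is imported lazily so the string helpers work without the library installed.

```python
from typing import NamedTuple

# Namespace used by the load commands in this listing.
TFDS_PREFIX = "huggingface:oscar/"
# The two variant prefixes that occur in this listing.
VARIANTS = ("unshuffled_original", "unshuffled_deduplicated")


class OscarConfig(NamedTuple):
    variant: str
    language: str  # assumed meaning of the suffix, e.g. "zh" or "frr"


def parse_config(name: str) -> OscarConfig:
    """Split a config name into its variant prefix and suffix."""
    for variant in VARIANTS:
        prefix = variant + "_"
        if name.startswith(prefix) and len(name) > len(prefix):
            return OscarConfig(variant, name[len(prefix):])
    raise ValueError(f"not an OSCAR config name: {name!r}")


def tfds_name(config: OscarConfig) -> str:
    """Build the string passed to tfds.load for this config."""
    return f"{TFDS_PREFIX}{config.variant}_{config.language}"


def load_config(name: str, split: str = "train"):
    """Load one config; requires tensorflow_datasets and downloads data."""
    import tensorflow_datasets as tfds  # lazy import, optional dependency

    return tfds.load(tfds_name(parse_config(name)), split=split)
```

Validating the name before calling `tfds.load` surfaces typos early, which matters for configs such as `unshuffled_original_zh` where a download is large.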
unshuffled_deduplicated_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 304230423 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 256513 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 284320 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2375030 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9948521 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 389515 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1163 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme: We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved", http://creativecommons.org/publicdomain/zero/1.0/). To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 251064 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
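The config names on this page all follow one pattern, so the load command can be built from a language code. A minimal sketch, assuming `tensorflow_datasets` is installed with Hugging Face dataset support; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS:

```python
def oscar_tfds_name(lang_code: str) -> str:
    """Build the TFDS name of an OSCAR deduplicated config (hypothetical helper)."""
    return f"huggingface:oscar/unshuffled_deduplicated_{lang_code}"


def load_oscar(lang_code: str, split: str = "train"):
    """Load one config. This downloads data, so nothing calls it at import time."""
    # Imported lazily because tensorflow_datasets is a heavy optional dependency.
    import tensorflow_datasets as tfds
    return tfds.load(oscar_tfds_name(lang_code), split=split)


print(oscar_tfds_name("kn"))
```

Calling `load_oscar("kn")` would be equivalent to the `tfds.load` command shown for the `kn` config above.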
unshuffled_deduplicated_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 924 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21735 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32652 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 25 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
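Every config on this page lists the same two-field schema (`id` as `int64`, `text` as `string`). The following is a sketch of a plain-Python check against that schema, assuming each record has already been converted to a dict of Python values; the function name `matches_oscar_schema` is hypothetical:

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def matches_oscar_schema(record: dict) -> bool:
    """Return True if the record has exactly an int64 'id' and a string 'text'."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(rec_id, bool) or not isinstance(rec_id, int):
        return False
    if not INT64_MIN <= rec_id <= INT64_MAX:
        return False
    return isinstance(text, str)


print(matches_oscar_schema({"id": 0, "text": "example"}))
```

A check like this is cheap enough to run over a small config in full, such as the 25-example `mai` train split.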
unshuffled_deduplicated_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299457 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 669 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 136639 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 55 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20812149 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 44230 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 20682611 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26920397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 115954598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
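With 115954598 training examples, the `ru` config is large enough that loading only a slice of it may be preferable while experimenting. A minimal sketch using the TFDS split-slicing syntax (`train[:N]`); the helper names `head_split` and `load_ru_head` are hypothetical, and whether slicing avoids the full download for this community dataset is not stated on this page:

```python
def head_split(n_examples: int, split: str = "train") -> str:
    """Build a TFDS slicing string selecting the first n_examples of a split."""
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return f"{split}[:{n_examples}]"


def load_ru_head(n_examples: int = 1000):
    """Load the first n_examples of the Russian config (not called here)."""
    # Imported lazily because tensorflow_datasets is a heavy optional dependency.
    import tensorflow_datasets as tfds
    return tfds.load(
        "huggingface:oscar/unshuffled_deduplicated_ru",
        split=head_split(n_examples),
    )


print(head_split(1000))
```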
unshuffled_deduplicated_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 33925 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 886223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 511 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 312644 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 294132 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 15503 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 64 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9161 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_deduplicated_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_deduplicated_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 32919 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
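The feature description is the same for every config: an `int64` field `id` and a `string` field `text`. As a rough sketch, a record can be checked against that description as follows. The helper `matches_features` and the `DTYPE_TO_PY` mapping are hypothetical, and the check assumes records have already been converted to plain Python values (TFDS itself may yield tensors or bytes).

```python
import json

# The feature description exactly as listed on this page.
FEATURES_JSON = """
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Assumed mapping from the listed dtypes to plain Python types.
DTYPE_TO_PY = {"int64": int, "string": str}


def matches_features(record: dict, features: dict) -> bool:
    """Return True if the record has exactly the listed fields with matching types."""
    if set(record) != set(features):
        return False
    return all(
        isinstance(record[name], DTYPE_TO_PY[spec["dtype"]])
        for name, spec in features.items()
    )


features = json.loads(FEATURES_JSON)
```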
unshuffled_original_af
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_af')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 201117 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16365602 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_av
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_av')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 456 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bar
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bar')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_bh
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_bh')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 336 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_br
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_br')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 37085 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cbk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cbk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_cs
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_cs')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 21001388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_de
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_de')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 104913504 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_el
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_el')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 10425596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_es
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_es')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 88199221 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_fi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_fi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 8557453 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ga
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ga')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 83223 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gom
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gom')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 640 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 582219 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_hy
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_hy')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 659430 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ilo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ilo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2638 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ja
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ja')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 62721527 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 524591 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
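The load commands in this listing appear to share one naming pattern: a `huggingface:oscar/` prefix, a variant such as `unshuffled_original`, and a language code. A minimal sketch of that pattern follows; the helper names `oscar_tfds_name` and `load_oscar` are hypothetical and not part of TFDS, and the loader assumes `tensorflow_datasets` is installed and the data can be downloaded.

```python
def oscar_tfds_name(language: str, variant: str = "unshuffled_original") -> str:
    """Build a TFDS dataset name such as 'huggingface:oscar/unshuffled_original_kk'.

    Hypothetical helper: it only formats the string and does not check
    that the config actually exists.
    """
    if not language:
        raise ValueError("language code must be a non-empty string")
    return f"huggingface:oscar/{variant}_{language}"


def load_oscar(language: str, variant: str = "unshuffled_original", split: str = "train"):
    """Load one OSCAR config through TFDS (assumes tensorflow_datasets is installed)."""
    # Imported lazily so the name helper above works without TensorFlow.
    import tensorflow_datasets as tfds

    return tfds.load(oscar_tfds_name(language, variant), split=split)
```

For example, `oscar_tfds_name("kk")` yields the name used in the load command of the entry above.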
unshuffled_original_krc
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_krc')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1581 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
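Every config in this listing declares the same two features, an `int64` field `id` and a `string` field `text`. As a rough sketch, a record that has already been converted to plain Python values could be checked against that schema as shown here; the function name `matches_features` and the dtype-to-Python mapping are assumptions made for illustration, and records read directly from TFDS may arrive as tensors or bytes and need decoding first.

```python
import json

# Feature schema copied from the listing above.
FEATURES = json.loads("""
{
  "id": {"dtype": "int64", "id": null, "_type": "Value"},
  "text": {"dtype": "string", "id": null, "_type": "Value"}
}
""")

# Assumed mapping from the dtype names in the schema to Python types.
_PY_TYPES = {"int64": int, "string": str}


def matches_features(record: dict) -> bool:
    """Return True if a plain-Python record has exactly the declared fields and types."""
    if set(record) != set(FEATURES):
        return False
    for name, spec in FEATURES.items():
        value = record[name]
        expected = _PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        # Keep integer ids inside the signed 64-bit range.
        if spec["dtype"] == "int64" and not -2**63 <= value < 2**63:
            return False
    return True
```

A call such as `matches_features({"id": 0, "text": "example"})` returns `True`, while a record with a missing field or a string `id` returns `False`.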
unshuffled_original_ky
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ky')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 146993 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_li
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_li')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 137 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 2977757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mhr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mhr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3212 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 395605 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 26598 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mzn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mzn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1055 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ne
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ne')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 299938 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_no
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_no')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 5546211 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pa
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pa')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 127467 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pnb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pnb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 4599 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_rm
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_rm')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sah
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sah')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 22301 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_si
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_si')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 203082 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sq
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sq')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 672077 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sw
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sw')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 41986 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_th
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_th')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 6064129 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 135923 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ur
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ur')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 638596 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3366 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
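The feature spec above can be read as a two-field record: an integer `id` and a string `text`. As a minimal sketch, assuming records have already been converted to plain Python values (TFDS itself yields tensors, so a decoding step would come first), the check below parses the spec as printed and tests a record against it. The `matches_features` helper and the dtype-to-Python mapping are hypothetical and not part of TFDS or OSCAR.

```python
import json

# Feature spec copied from the catalog entry above.
FEATURES_JSON = """
{
    "id": {"dtype": "int64", "id": null, "_type": "Value"},
    "text": {"dtype": "string", "id": null, "_type": "Value"}
}
"""

# Hypothetical mapping from the spec's dtype names to Python types.
_PY_TYPES = {"int64": int, "string": str}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def matches_features(record, features):
    """Return True if `record` has exactly the fields in `features`
    and each value is compatible with the declared dtype."""
    if set(record) != set(features):
        return False
    for name, spec in features.items():
        value = record[name]
        expected = _PY_TYPES[spec["dtype"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
        if spec["dtype"] == "int64" and not (INT64_MIN <= value <= INT64_MAX):
            return False
    return True


features = json.loads(FEATURES_JSON)
print(matches_features({"id": 0, "text": "example line"}, features))
```

Because every configuration on this page lists the same two features, the same check should apply to each of them.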
unshuffled_original_xal
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_xal')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 39 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
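The configuration names on this page follow a visible pattern: `unshuffled_`, then `original` or `deduplicated`, then a short code such as `vo` or `xal`. The sketch below splits a name into those parts and rebuilds the string passed to `tfds.load`. It assumes the trailing component is a language code, which the page suggests but does not state, and both helper names are hypothetical.

```python
from typing import NamedTuple


class OscarConfig(NamedTuple):
    variant: str   # "original" or "deduplicated"
    language: str  # trailing code, assumed to identify the language


def parse_config(name: str) -> OscarConfig:
    """Split a config name such as 'unshuffled_original_vo' into its parts."""
    parts = name.split("_", 2)
    if (
        len(parts) != 3
        or parts[0] != "unshuffled"
        or parts[1] not in ("original", "deduplicated")
        or not parts[2]
    ):
        raise ValueError(f"unrecognized config name: {name!r}")
    return OscarConfig(variant=parts[1], language=parts[2])


def tfds_name(config: str) -> str:
    """Build the dataset name used in the load commands on this page."""
    parse_config(config)  # validate before building the name
    return f"huggingface:oscar/{config}"


print(parse_config("unshuffled_original_xal"))
print(tfds_name("unshuffled_original_xal"))
```

The resulting string is the argument shown in each entry's load command, for example `tfds.load(tfds_name("unshuffled_original_xal"))`.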
unshuffled_original_yue
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yue')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 11 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_en
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_en')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 455994980 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_eu
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_eu')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 506883 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_frr
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_frr')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 7 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_gl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_gl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 544388 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_he
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_he')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 3808397 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ht
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ht')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 13 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_id
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_id')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 16236463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_is
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_is')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 625673 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_jv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_jv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1445 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kn
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kn')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 350363 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_kv
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_kv')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1549 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lb
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lb')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34807 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_lo
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_lo')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 52910 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mai
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mai')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 123 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mk
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mk')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 437871 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_mrj
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_mrj')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 757 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_my
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_my')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 232329 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nap
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nap')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme We do not own any of the text from which these data has been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply to legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 73 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_nl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_nl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 34682142 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
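The feature spec above describes each example as an int64 `id` and a string `text`. The following is a small sketch of checking a decoded record against that schema in plain Python. The function name `check_record` is hypothetical, and it assumes examples arrive as dictionaries whose `text` may surface as either `str` or `bytes`, depending on how the dataset was decoded.

```python
import operator

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def check_record(record: dict) -> bool:
    """Return True if the record has exactly an int64 'id' and a 'text'."""
    if set(record) != {"id", "text"}:
        return False
    rec_id, text = record["id"], record["text"]
    if isinstance(rec_id, bool):
        return False  # bool is an int subclass, but not a valid id here
    try:
        # operator.index accepts Python ints and integer-like scalars
        value = operator.index(rec_id)
    except TypeError:
        return False
    if not INT64_MIN <= value <= INT64_MAX:
        return False
    return isinstance(text, (str, bytes))
```

A check like this is mainly useful when records have passed through custom preprocessing, since data read directly from the prepared dataset should already match the spec.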
unshuffled_original_or
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_or')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59463 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 35440972 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_pt
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_pt')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 42114520 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ru
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ru')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 161836003 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
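With 161,836,003 examples in `train`, it can help to read only part of the split while experimenting. Below is a hedged sketch using the TFDS split slicing syntax (for example `train[0:1000]`). The helper names `slice_spec` and `load_head` are hypothetical, and slicing limits what is read, not necessarily what has to be downloaded and prepared first.

```python
def slice_spec(split: str, stop: int, start: int = 0) -> str:
    """Build a TFDS split slice such as 'train[0:1000]'."""
    if start < 0 or stop <= start:
        raise ValueError("need 0 <= start < stop")
    return f"{split}[{start}:{stop}]"


def load_head(n: int = 1000):
    """Load the first n training examples of the Russian config.

    Needs tensorflow_datasets and network access; imported lazily so
    slice_spec can be used on its own.
    """
    import tensorflow_datasets as tfds

    return tfds.load(
        "huggingface:oscar/unshuffled_original_ru",
        split=slice_spec("train", n),
    )
```

The same `slice_spec` string can be passed as the `split` argument for any other config on this page.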
unshuffled_original_sd
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sd')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 44280 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_sl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_sl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 1746604 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_su
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_su')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 805 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_te
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_te')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 475703 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_tl
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_tl')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 458206 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_ug
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_ug')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 22255 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_vec
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_vec')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 73 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_war
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_war')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 9760 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
unshuffled_original_yi
Use the following command to load this dataset in TFDS:
ds = tfds.load('huggingface:oscar/unshuffled_original_yi')
- Description :
The Open Super-large Crawled ALMAnaCH coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture.
License : These data are released under this licensing scheme. We do not own any of the text from which these data have been extracted. We license the actual packaging of these data under the Creative Commons CC0 license ("no rights reserved") http://creativecommons.org/publicdomain/zero/1.0/ To the extent possible under law, Inria has waived all copyright and related or neighboring rights to OSCAR. This work is published from: France.
Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
- Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
- Clearly identify the copyrighted work claimed to be infringed.
- Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
Version : 1.0.0
Splits :
Split | Examples |
---|---|
'train' | 59364 |
- Features :
{
"id": {
"dtype": "int64",
"id": null,
"_type": "Value"
},
"text": {
"dtype": "string",
"id": null,
"_type": "Value"
}
}
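The `train` sizes listed for the `unshuffled_original_*` configs from `nap` through `yi` span several orders of magnitude. Below is a short sketch that collects the counts quoted in those split tables into a dictionary and picks out configs below a threshold. The names `TRAIN_EXAMPLES` and `configs_below` and the 1,000-example threshold are illustrative choices, not part of the catalog.

```python
# Example counts copied from the 'train' rows of the split tables above,
# keyed by the language code at the end of each config name.
TRAIN_EXAMPLES = {
    "nap": 73, "nl": 34_682_142, "or": 59_463, "pl": 35_440_972,
    "pt": 42_114_520, "ru": 161_836_003, "sd": 44_280, "sl": 1_746_604,
    "su": 805, "te": 475_703, "tl": 458_206, "ug": 22_255,
    "vec": 73, "war": 9_760, "yi": 59_364,
}


def configs_below(threshold: int) -> list:
    """Return language codes whose train split has fewer examples."""
    return sorted(
        lang for lang, count in TRAIN_EXAMPLES.items() if count < threshold
    )


tiny = configs_below(1_000)  # configs likely too small for training alone
largest = max(TRAIN_EXAMPLES, key=TRAIN_EXAMPLES.get)
```

A lookup like this can help decide which configs are worth loading in full and which are better read in slices.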