Datasets

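Baidu NLP has long been committed to building open-source datasets, and since 2017 it has publicly released more than ten datasets for natural language processing. These datasets cover several application areas of NLP, including question answering, translation, dialogue, and information extraction. They share three traits: they are drawn from real applications, they pose challenges oriented toward real applications, and they are large in scale. By open-sourcing data accumulated in real industrial applications, Baidu NLP hopes to help raise the overall capability of the industry and to jointly advance Chinese natural language processing technology.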

Open Datasets
  • BSTC

    BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches in three parts: the audio files, the transcripts, and the translations.
  • DuConv

    Proactive human-machine conversation is a new conversation task that aims to build a human-like conversational agent able to proactively lead the conversation, for example by introducing a new topic or maintaining the current one.
  • DuEE

    DuEE is a large-scale, general-purpose Chinese dataset for event extraction. It consists of 17,000 sentences containing 20,000 events of 65 event types, with the corresponding human-annotated arguments.
  • DuEL

    DuEL is a large-scale corpus of Chinese short texts for entity recognition and linking tasks. It contains 100K annotated short texts, with the corresponding mentions and their links to entities in the Baidu Knowledge Base.
  • DuIE

    DuIE is a large-scale human-annotated dataset with more than 410,000 SPO (subject-predicate-object) triples in over 200,000 real-world Chinese sentences, bounded by a pre-specified schema with 50 predicate types.
  • DuReader

    DuReader version 2.0 contains more than 300K questions, 1.4M evidence documents, and 660K human-generated answers. It can be used to train or evaluate machine reading comprehension (MRC) models and systems.
  • DuRecDial

    Conversational recommendation over multi-type dialogs is a novel task in which the bot can naturally switch conversation scenarios from chitchat or QA to a recommendation dialog. To facilitate the study of this task, we created DuRecDial, a human-to-human, recommendation-oriented, multi-type Chinese dialog dataset.
  • DuSQL

    DuSQL is a large-scale and pragmatic Chinese dataset for the cross-domain text-to-SQL task, containing 200 databases, 813 tables, and 23,797 question/SQL pairs. It can be used to train or evaluate text-to-SQL models and systems.
  • ACL2019-ARNOR

    The ARNOR dataset is a large, manually labeled, sentence-level test set for distant supervision relation classification. It contains 3,192 sentences and 9,051 instances for 11 relation types (including the "None" type), and is carefully annotated to ensure accuracy.
  • BROAD

    BROAD (Baidu Research Open-Access Dataset) is designed to help institutions and individual developers train their models and accelerate research on machine reading comprehension, autonomous cars, visual cognition, and other AI-related fields.
Qianyan Datasets (LUGE)
  • Qianyan Datasets (Language Understanding and Generation Evaluation Benchmarks, LUGE)

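    Qianyan is a collaborative open-source data project for Chinese natural language processing. It was jointly initiated by Baidu, the Technical Committee on Natural Language Processing of the China Computer Federation, and the Evaluation Working Committee of the Chinese Information Processing Society of China, and it is built together with data resource developers from many universities and companies in China. Qianyan aims to cover a rich range of task types and to advance the technology from the perspectives of semantic understanding, knowledge fusion, and multimodal fusion. It also provides datasets for multidimensional, comprehensive evaluation that assess how well-rounded, generalizable, and robust models are.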