Research Results
Open Datasets
  • DuReader
    DuReader version 2.0 contains more than 300K questions, 1.4M evidence documents, and 660K human-generated answers. It can be used to train or evaluate machine reading comprehension (MRC) models and systems.
  • ACL2019-ARNOR
    The ARNOR dataset is a large, manually labeled, sentence-level test set for distant supervision relation classification. It contains 3,192 sentences and 9,051 instances covering 11 relation types (including the "None" type), and is carefully annotated to ensure accuracy.
  • DuConv
    Proactive human-machine conversation is a new conversation task that aims to build a human-like conversational agent with the ability to proactively lead the conversation, for example by introducing a new topic or maintaining the current one.
  • BSTC
    BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches and consists of three parts: the audio files, the transcripts, and the translations.
  • DuEL
    A large-scale corpus of Chinese short texts for entity recognition and linking tasks. It contains 100K annotated short texts, with the corresponding mentions and their links to entities in the Baidu Knowledge Base.
  • DuIE
    The DuIE dataset is a large-scale, human-annotated dataset with more than 410,000 SPO (subject-predicate-object) triples in over 200,000 real-world Chinese sentences, constrained by a pre-specified schema with 50 predicate types.
  • BROAD
    BROAD (Baidu Research Open-Access Dataset) is designed to help institutions and individual developers train their models and accelerate research on machine reading comprehension, autonomous cars, visual cognition, and other AI-related fields.