MADial-Bench

Repo for the paper MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation (NAACL 2025).

We introduce a benchmark for memory-augmented chatbots, proposing a two-stage memory-recall and multi-recall paradigm grounded in cognitive science and psychological theory.

(Figure: the two-stage memory-augmented chatbot)

The dialogue and memory data for testing are in the data directory.

  • "relevant-id" in data/en/MADial-Bench-en-memory.json and data/zh/MADial-Bench-zh-memory.json: indices of the related memories in the memory archive.
  • "test-turn" in data/en/MADial-Bench-en-dialogue.json and data/zh/MADial-Bench-zh-dialogue.json: the turns to be predicted.
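As a minimal sketch of how "relevant-id" indexes into the memory archive, the snippet below resolves those indices to memory strings. The field names beyond "relevant-id" and the sample data are assumptions for illustration, not the actual MADial-Bench schema.

```python
# Hypothetical illustration: resolve "relevant-id" indices against the
# memory archive. Sample entries are made up; only the indexing pattern
# reflects the data description above.
def resolve_memories(dialogue_entry, memory_archive):
    """Map each index in 'relevant-id' to the corresponding memory string."""
    return [memory_archive[i] for i in dialogue_entry["relevant-id"]]

memory_archive = [
    "User adopted a cat named Momo last spring.",
    "User is preparing for a marathon in October.",
    "User dislikes cilantro.",
]
dialogue_entry = {"relevant-id": [0, 2]}

print(resolve_memories(dialogue_entry, memory_archive))
```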

The embeddings of dialogues and memories for the memory-recall task are in the embeddings directory.

The inference results from LLMs on the memory-recognition and response-generation tasks are in the output directory.

Evaluation results are in the results directory, with annotation files in the annotation directory. The criteria and guidelines are also in the annotation directory.

If you have read the paper, you can already make use of the benchmark with the scripts below.

Usage & startup

For the memory-recall task:

  1. Download the embedding models and save them under pretrained_models.
  2. Run Embeddings.py to generate embeddings for the dialogues and memories.
  3. Run embeddings_top_20_new.py to retrieve the top-20 candidate memories.
  4. Run embedding_scores_new.py to compute the retrieval metrics.
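The retrieval steps above can be sketched as cosine-similarity ranking followed by a recall metric. This is an assumption about the approach, not the exact logic of embeddings_top_20_new.py or embedding_scores_new.py; the random vectors stand in for real embeddings.

```python
import numpy as np

def top_k(query_emb, memory_embs, k=20):
    """Rank memories by cosine similarity to the query and keep the top k."""
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k]

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the gold memories found in the top-k ranking."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

rng = np.random.default_rng(0)
memories = rng.normal(size=(100, 64))            # stand-in memory embeddings
query = memories[7] + 0.01 * rng.normal(size=64)  # near-duplicate of memory 7

ranked = top_k(query, memories, k=20)
print(int(ranked[0]))                  # memory 7 ranks first
print(recall_at_k(ranked, [7], k=1))   # 1.0
```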

For the memory-recognition and response-generation tasks:

  1. For the English version, run make_setting_candidates.py to generate dialogues for settings 2 and 3.
  2. Download open-source LLMs and save them under pretrained_models; skip this step if you use an API.
  3. Change the code in infer_setting1/2/3_en/ch.py to load your LLM, then run the inference script.
  4. Copy the output file path into evaluate.py to run automatic evaluation. Automatic metrics are not reliable; we recommend step 5 instead.
  5. Human evaluation: criteria and guidelines are in the annotation directory. You are welcome to try LLM-as-judge, but we found that LLMs (as of 2024-10-25) are unable to do such a careful job.
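Before running inference, the scripts assemble a prompt from the recalled memories and the dialogue history up to the test turn. The sketch below shows one plausible assembly; the actual templates live in infer_setting1/2/3_en/ch.py and will differ.

```python
# Hypothetical prompt assembly for memory-augmented response generation.
# Role names, formatting, and the trailing cue are assumptions.
def build_prompt(dialogue_turns, recalled_memories):
    """Prepend recalled memories to the dialogue history up to the test turn."""
    memory_block = "\n".join(f"- {m}" for m in recalled_memories)
    history = "\n".join(f"{t['role']}: {t['text']}" for t in dialogue_turns)
    return (
        "Relevant memories:\n" + memory_block + "\n\n"
        "Dialogue:\n" + history + "\nassistant:"
    )

prompt = build_prompt(
    [{"role": "user", "text": "I finally ran the race!"}],
    ["User is preparing for a marathon in October."],
)
print(prompt)
```

The resulting string is what you would feed to your loaded LLM (or API) in step 3.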

To build an annotation file from the inference results of different LLMs:

  1. Run make_annotation_candidates.py to group the dialogues and responses together.
  2. Run prepare_anno.py to sample a certain number of dialogues from the test set and write them to an Excel file.
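The two steps above amount to joining each model's responses onto its dialogue and then sampling rows for annotators. The sketch below assumes a simple id-keyed join; field names and file handling are illustrative, not the exact behavior of make_annotation_candidates.py or prepare_anno.py.

```python
import random

def group_responses(dialogues, model_outputs):
    """Attach each model's response to its dialogue, keyed by dialogue id."""
    rows = []
    for d in dialogues:
        row = {"id": d["id"], "dialogue": d["text"]}
        for model, outputs in model_outputs.items():
            row[model] = outputs[d["id"]]
        rows.append(row)
    return rows

def sample_rows(rows, n, seed=0):
    """Draw a fixed-size annotation sample reproducibly."""
    return random.Random(seed).sample(rows, n)

dialogues = [{"id": i, "text": f"dialogue {i}"} for i in range(5)]
model_outputs = {
    "model_a": {i: f"A-{i}" for i in range(5)},
    "model_b": {i: f"B-{i}" for i in range(5)},
}
rows = group_responses(dialogues, model_outputs)
print(len(sample_rows(rows, 3)))  # 3
```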

The scripts would need to be rewritten to provide a one-click start. Sorry for the inconvenience; I will tidy them up as soon as possible.

Please feel free to ask any questions and report issues.

Please Cite

@inproceedings{he-etal-2025-madial,
    title = "{MAD}ial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation",
    author = "He, Junqing  and
      Zhu, Liang  and
      Wang, Rui  and
      Wang, Xi  and
      Haffari, Gholamreza  and
      Zhang, Jiaxing",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.499/",
    pages = "9902--9921",
    ISBN = "979-8-89176-189-6"
}


@misc{he2024madialbenchrealworldevaluationmemoryaugmented,
      title={MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation}, 
      author={Junqing He and Liang Zhu and Rui Wang and Xi Wang and Reza Haffari and Jiaxing Zhang},
      year={2024},
      eprint={2409.15240},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.15240}, 
}
