🤖💡 LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context

Kai Ruan1, Xuan Wang2, Jixiang Hong1, Hao Sun1✉
1 Renmin University of China 2 Zhejiang University
✉ Corresponding author

Abstract

While Large Language Models (LLMs) have demonstrated remarkable capabilities in scientific tasks, existing evaluation frameworks primarily assess their performance using rich contextual inputs, overlooking their ability to generate novel ideas from minimal information. We introduce LiveIdeaBench, a comprehensive benchmark that evaluates LLMs' scientific creativity and divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility. Through extensive experimentation with 20 leading models across 1,180 keywords spanning 18 scientific domains, we reveal that scientific creative ability shows patterns distinct from those of general intelligence metrics. Notably, our results demonstrate that models like QwQ-32b-preview achieve creative performance comparable to top-tier models like o1-preview, despite significant gaps in their general intelligence scores. These findings highlight the importance of specialized evaluation frameworks for scientific creativity and suggest that the development of creative capabilities in LLMs may follow different trajectories than traditional problem-solving abilities.
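For intuition, here is a minimal sketch of the keyword-to-score loop described above. It is **not** the benchmark's actual code: the prompts, model names, scoring scale, and response parsing are assumptions, and fluency/flexibility in Guilford's sense concern the quantity and diversity of ideas across many generations, which a single-idea sketch like this does not capture.

```python
# Illustrative sketch only (not LiveIdeaBench's implementation): one keyword is
# sent to a candidate model, and a panel of judge models scores the result.
from statistics import mean
from openai import OpenAI  # works with any OpenAI-compatible endpoint

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIMENSIONS = ["originality", "feasibility", "fluency", "flexibility"]


def generate_idea(keyword: str, candidate_model: str) -> str:
    """Ask the candidate model for a scientific idea given only one keyword."""
    resp = client.chat.completions.create(
        model=candidate_model,
        messages=[{
            "role": "user",
            "content": f"Propose a novel, concrete scientific idea about: {keyword}",
        }],
    )
    return resp.choices[0].message.content


def judge_idea(idea: str, dimension: str, judge_model: str) -> float:
    """Ask one judge model for a 0-10 score on a single dimension."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": (
                f"Rate the following idea for {dimension} on a 0-10 scale. "
                f"Reply with a single number only.\n\nIdea:\n{idea}"
            ),
        }],
    )
    # A real pipeline would need more robust parsing and rejection handling.
    return float(resp.choices[0].message.content.strip())


def evaluate(keyword: str, candidate_model: str, judges: list[str]) -> dict[str, float]:
    """Score one keyword with a panel of judges, averaging per dimension."""
    idea = generate_idea(keyword, candidate_model)
    return {dim: mean(judge_idea(idea, dim, j) for j in judges) for dim in DIMENSIONS}
```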

🧠✨🎉 News (2025/3/29): Latest Dataset Update (v2) on Hugging Face!

Based on valuable reviewer feedback, we have upgraded the benchmark to version 2. This update introduces a new dimension, Clarity, and improves the prompts and the evaluation process (including a rejection-handling mechanism), making the benchmark more comprehensive and objective.

The v2 benchmark covers 41 state-of-the-art models in total, including recent additions such as claude-3.7-sonnet:thinking, o3-mini-high, gpt-4.5-preview, qwq-32b, deepseek-r1, and gemini-2.0-flash-thinking.

Check it out here: https://huggingface.co/datasets/6cf/liveideabench-v2
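To browse the released results programmatically, the dataset can be pulled with the standard Hugging Face `datasets` API. A minimal sketch follows; the split name and record fields are assumptions, so check the dataset card for the exact schema.

```python
# Minimal sketch: load the v2 results from Hugging Face.
# Assumes `pip install datasets`; the "train" split name is a guess.
from datasets import load_dataset

ds = load_dataset("6cf/liveideabench-v2", split="train")
print(ds)     # schema and row count
print(ds[0])  # one record (e.g. keyword, generated idea, dimension scores)
```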

🧠✨🎉 News (2025/1/27)

We are excited to announce that the latest dataset, including supplementary tests for models like deepseek-R1, deepseek-V3, minimax-01, phi-4, and Opus, has been uploaded to Hugging Face! 🚀

Check it out here: https://huggingface.co/datasets/6cf/liveideabench-DLC-250127

🏆 LiveIdeaBench 💡 Leaderboard

| Model | Organization | Fluency | Feasibility | Originality | Flexibility | Average | Open |
|---|---|---|---|---|---|---|---|
| google/gemini-pro-1.5 🥇 | Google | 8.88 | 6.84 | 7.31 | 7.67 | 7.67 | 🔒 |
| o1-preview 🥈 | OpenAI | 9.07 | 6.58 | 7.34 | 7.67 | 7.66 | 🔒 |
| qwen/qwq-32b-preview 🥉 | Alibaba | 9.12 | 6.94 | 6.73 | 7.33 | 7.53 | |
| anthropic/claude-3.5-sonnet | Anthropic | 8.93 | 5.95 | 7.86 | 7.22 | 7.49 | 🔒 |
| google/gemini-2.0-flash-exp | Google | 8.72 | 6.78 | 7.07 | 7.33 | 7.48 | 🔒 |
| openai/gpt-4o-2024-11-20 | OpenAI | 8.37 | 6.34 | 7.59 | 7.00 | 7.33 | 🔒 |
| mistralai/mistral-large-2411 | Mistral AI | 8.52 | 6.82 | 6.92 | 7.00 | 7.31 | |
| amazon/nova-pro-v1 | Amazon | 8.50 | 7.05 | 6.57 | 7.00 | 7.28 | 🔒 |
| nvidia/llama-3.1-nemotron-70b-instruct | NVIDIA | 8.21 | 6.34 | 7.54 | 6.89 | 7.24 | |
| qwen/qwen-2.5-coder-32b-instruct | Alibaba | 8.43 | 6.65 | 6.90 | 6.78 | 7.19 | |
| meta-llama/llama-3.1-405b-instruct | Meta | 8.28 | 6.31 | 7.04 | 6.67 | 7.07 | |
| sammcj/qwen2.5-dracarys2-72b:Q4_K_M | Abacus.AI | 7.98 | 6.91 | 6.64 | 6.56 | 7.02 | |
| openai/o1-mini | OpenAI | 7.55 | 6.88 | 7.15 | 6.44 | 7.00 | 🔒 |
| qwen/qwen-2.5-72b-instruct | Alibaba | 7.90 | 6.75 | 6.74 | 6.56 | 6.99 | |
| step-2-16k | Other | 7.97 | 6.67 | 6.28 | 6.33 | 6.81 | 🔒 |
| anthropic/claude-3.5-haiku | Anthropic | 7.58 | 5.64 | 7.74 | 6.22 | 6.80 | 🔒 |
| x-ai/grok-2-1212 | xAI | 7.56 | 6.60 | 6.83 | 6.11 | 6.78 | 🔒 |
| openai/gpt-4o-mini | OpenAI | 7.10 | 6.87 | 6.76 | 6.11 | 6.71 | 🔒 |
| deepseek/deepseek-chat | DeepSeek | 7.02 | 6.37 | 7.19 | 6.11 | 6.67 | |
| meta-llama/llama-3.3-70b-instruct | Meta | 7.25 | 6.70 | 6.35 | 6.11 | 6.60 | |

🔒 marks closed (proprietary) models; the Open column is blank for open-weight models.
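A quick check suggests the Average column is simply the unweighted mean of the four dimension scores; this is our reading of the table rather than a formula stated here. For example, for the top two rows:

```python
# Sanity check (assumption): Average = mean of the four dimension scores.
rows = {
    "google/gemini-pro-1.5": (8.88, 6.84, 7.31, 7.67),  # reported Average: 7.67
    "o1-preview":            (9.07, 6.58, 7.34, 7.67),  # reported Average: 7.66
}
for model, scores in rows.items():
    print(f"{model}: {sum(scores) / len(scores):.2f}")  # prints 7.67 and 7.66
```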

BibTeX

```bibtex
@article{ruan2024liveideabench,
  title={LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context},
  author={Ruan, Kai and Wang, Xuan and Hong, Jixiang and Sun, Hao},
  journal={arXiv preprint arXiv:2412.17596},
  year={2024}
}
```