LiveIdeaBench

🤖💡 LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context

Kai Ruan¹, Xuan Wang², Jixiang Hong¹, Peng Wang³, Yang Liu^4,5, Hao Sun^1✉

¹ Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China
² ZJU-UIUC Institute, Zhejiang University, Haining, China
³ Bank of China, Beijing, China
⁴ School of Engineering Science, University of Chinese Academy of Sciences, Beijing, China
⁵ State Key Laboratory of Nonlinear Mechanics, Institute of Mechanics, Chinese Academy of Sciences, Beijing, China
^✉Corresponding author

Abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities in scientific tasks such as literature analysis and experimental design (e.g., accurately extracting key findings from papers or generating coherent experimental procedures), existing evaluation benchmarks primarily assess performance using rich contextual inputs. We introduce LiveIdeaBench, a comprehensive benchmark evaluating LLMs' scientific idea generation by assessing divergent thinking capabilities using single-keyword prompts. Drawing from Guilford's creativity theory, our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Through extensive experimentation with over 40 leading models across 1,180 keywords spanning 22 scientific domains, we reveal that the scientific idea generation capabilities measured by our benchmark, are poorly predicted by standard metrics of general intelligence. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores. These findings highlight the need for specialized evaluation benchmarks for scientific idea generation and suggest that enhancing these idea generation capabilities in LLMs may require different training strategies than those used for improving general problem-solving abilities, potentially enabling a wider range of AI tools tailored for different stages of the scientific process.

🧠✨🎉 News (2025/3/29): Latest Dataset Update (v2) on Hugging Face!

We are pleased to announce that, based on the invaluable feedback from reviewers, we have enhanced our benchmark by upgrading it to version 2. This update introduces a new dimension—Clarity—and improves the prompts, evaluation process (including the rejection handling mechanism), making our benchmark more comprehensive and objective.

This v2 version of the benchmark incorporates the latest models, including: claude-3.7-sonnet:thinking, o3-mini-high, gpt-4.5-preview, qwq-32b, deepseek-r1, gemini-2.0-flash-thinking, and a total of 41 state-of-the-art models.

Check it out here: https://huggingface.co/datasets/6cf/liveideabench-v2

🧠✨🎉 News (2025/1/27)

We are excited to announce that the latest dataset, including supplementary tests for models like deepseek-R1, deepseek-V3, minimax-01, phi-4, and Opus, has been uploaded to Hugging Face! 🚀

Check it out here: https://huggingface.co/datasets/6cf/liveideabench-DLC-250127

🏆LiveIdeaBench💡

Select Date: 2025-03-29

Model	Organization	Fluency	Feasibility	Originality	Flexibility	Average	Open
google/gemini-pro-1.5 🥇	Google	8.88	6.84	7.31	7.67	7.67	🔒
o1-preview 🥈	OpenAI	9.07	6.58	7.34	7.67	7.66	🔒
qwen/qwq-32b-preview 🥉	Alibaba	9.12	6.94	6.73	7.33	7.53	✅
anthropic/claude-3.5-sonnet	Anthropic	8.93	5.95	7.86	7.22	7.49	🔒
google/gemini-2.0-flash-exp	Google	8.72	6.78	7.07	7.33	7.48	🔒
openai/gpt-4o-2024-11-20	OpenAI	8.37	6.34	7.59	7.00	7.33	🔒
mistralai/mistral-large-2411	Mistral AI	8.52	6.82	6.92	7.00	7.31	✅
amazon/nova-pro-v1	Amazon	8.50	7.05	6.57	7.00	7.28	🔒
nvidia/llama-3.1-nemotron-70b-instruct	NVIDIA	8.21	6.34	7.54	6.89	7.24	✅
qwen/qwen-2.5-coder-32b-instruct	Alibaba	8.43	6.65	6.90	6.78	7.19	✅
meta-llama/llama-3.1-405b-instruct	Meta	8.28	6.31	7.04	6.67	7.07	✅
sammcj/qwen2.5-dracarys2-72b:Q4_K_M	Abacus.AI	7.98	6.91	6.64	6.56	7.02	✅
openai/o1-mini	OpenAI	7.55	6.88	7.15	6.44	7.00	🔒
qwen/qwen-2.5-72b-instruct	Alibaba	7.90	6.75	6.74	6.56	6.99	✅
step-2-16k	Other	7.97	6.67	6.28	6.33	6.81	🔒
anthropic/claude-3.5-haiku	Anthropic	7.58	5.64	7.74	6.22	6.80	🔒
x-ai/grok-2-1212	xAI	7.56	6.60	6.83	6.11	6.78	🔒
openai/gpt-4o-mini	OpenAI	7.10	6.87	6.76	6.11	6.71	🔒
deepseek/deepseek-chat	DeepSeek	7.02	6.37	7.19	6.11	6.67	✅
meta-llama/llama-3.3-70b-instruct	Meta	7.25	6.70	6.35	6.11	6.60	✅

Model	Organization	Fluency	Feasibility	Originality	Flexibility	Average	Open
google/gemini-pro-1.5 🥇	Google	8.88	6.84	7.31	7.67	7.67	🔒
o1-preview 🥈	OpenAI	9.07	6.58	7.34	7.67	7.66	🔒
qwen/qwq-32b-preview 🥉	Alibaba	9.12	6.94	6.73	7.33	7.53	✅
anthropic/claude-3.5-sonnet	Anthropic	8.93	5.95	7.86	7.22	7.49	🔒
google/gemini-2.0-flash-exp	Google	8.72	6.78	7.07	7.33	7.48	🔒
openai/gpt-4o-2024-11-20	OpenAI	8.37	6.34	7.59	7.00	7.33	🔒
mistralai/mistral-large-2411	Mistral AI	8.52	6.82	6.92	7.00	7.31	✅
deepseek/deepseek-r1	DeepSeek	8.18	6.53	7.54	7.00	7.31	✅
amazon/nova-pro-v1	Amazon	8.50	7.05	6.57	7.00	7.28	🔒
nvidia/llama-3.1-nemotron-70b-instruct	NVIDIA	8.21	6.34	7.54	6.89	7.24	✅
mistralai/mistral-nemo	Mistral AI	8.98	6.30	6.78	6.89	7.24	✅
qwen/qwen-2.5-coder-32b-instruct	Alibaba	8.43	6.65	6.90	6.78	7.19	✅
meta-llama/llama-3.1-405b-instruct	Meta	8.28	6.31	7.04	6.67	7.07	✅
abacusai/qwen2.5-dracarys2-72b	Abacus.AI	7.98	6.91	6.64	6.56	7.02	✅
openai/o1-mini	OpenAI	7.55	6.88	7.15	6.44	7.00	🔒
qwen/qwen-2.5-72b-instruct	Alibaba	7.90	6.75	6.74	6.56	6.99	✅
microsoft/phi-4	Microsoft	7.79	6.48	6.83	6.56	6.91	✅
step-2-16k	StepFun	7.97	6.67	6.28	6.33	6.81	🔒
anthropic/claude-3.5-haiku	Anthropic	7.58	5.64	7.74	6.22	6.80	🔒
x-ai/grok-2-1212	xAI	7.56	6.60	6.83	6.11	6.78	🔒
openai/gpt-4o-mini	OpenAI	7.10	6.87	6.76	6.11	6.71	🔒
deepseek/deepseek-v3	DeepSeek	7.02	6.37	7.19	6.11	6.67	✅
meta-llama/llama-3.3-70b-instruct	Meta	7.25	6.70	6.35	6.11	6.60	✅
deepseek/deepseek-v2.5	DeepSeek	6.52	6.57	7.04	6.11	6.56	✅
deepseek/deepseek-r1-distill-llama-70b	Meta	6.83	6.62	6.24	5.89	6.39	✅
anthropic/claude-3-opus	Anthropic	5.96	6.56	6.06	5.67	6.06	🔒
minimax/minimax-01	Other	3.80	6.77	6.55	5.11	5.56	✅

Model	Organization	Clarity	Fluency	Feasibility	Originality	Flexibility	Average	Open
anthropic/claude-3.7-sonnet:thinking 🥇	Anthropic	7.81	7.48	5.70	8.06	7.04	7.22	🔒
deepseek/deepseek-r1 🥈	DeepSeek	8.10	6.63	6.52	7.84	6.83	7.18	✅
anthropic/claude-3.7-sonnet 🥉	Anthropic	7.61	7.80	5.46	7.81	6.92	7.12	🔒
google/gemini-2.0-flash-exp	Google	7.84	7.30	6.02	7.37	6.83	7.07	🔒
qwen/qwq-32b	Alibaba	7.98	6.45	6.35	7.77	6.75	7.06	✅
google/gemini-2.0-flash-thinking-exp	Google	7.69	7.38	6.05	7.35	6.83	7.06	🔒
google/gemini-2.0-pro-exp-02-05	Google	7.90	6.84	5.88	7.76	6.75	7.03	🔒
qwen/qwq-32b-preview	Alibaba	7.46	7.49	6.10	6.87	6.71	6.93	✅
anthropic/claude-3.5-sonnet	Anthropic	7.85	6.90	5.42	7.83	6.62	6.92	🔒
deepseek/deepseek-r1-distill-qwen-32b	Alibaba	7.43	7.06	6.08	7.13	6.62	6.86	✅
mistralai/mistral-small	Mistral AI	7.36	7.36	5.97	6.98	6.62	6.86	✅
google/gemini-pro-1.5	Google	7.75	6.68	5.92	7.33	6.58	6.85	🔒
openai/gpt-4.5-preview	OpenAI	7.75	6.49	6.03	7.45	6.54	6.85	🔒
mistralai/mistral-large-2411	Mistral AI	7.69	6.68	6.06	7.01	6.50	6.79	✅
deepseek/deepseek-chat	DeepSeek	7.74	6.15	6.04	7.31	6.38	6.72	✅
microsoft/phi-4	Microsoft	7.57	6.58	5.80	7.24	6.42	6.72	✅
deepseek/deepseek-r1-distill-llama-70b	Meta	7.43	6.66	6.07	6.98	6.41	6.71	✅
openai/gpt-4o-2024-11-20	OpenAI	7.74	6.12	5.58	7.64	6.38	6.69	🔒
openai/o1-mini	OpenAI	7.77	5.89	6.20	7.09	6.33	6.66	🔒
google/gemma-2-27b-it	Google	7.36	7.18	5.50	6.86	6.38	6.65	✅
sammcj/qwen2.5-dracarys2-72b:Q4_K_M	Abacus.AI	7.69	6.22	6.09	6.80	6.33	6.63	✅
meta-llama/llama-3.1-405b-instruct	Meta	7.48	6.57	5.56	7.18	6.33	6.62	✅
qwen/qwen-2.5-72b-instruct	Alibaba	7.72	6.17	5.99	6.91	6.29	6.62	✅
meta-llama/llama-3.1-70b-instruct	Meta	7.34	6.71	5.49	7.16	6.38	6.62	✅
openai/o1	OpenAI	7.42	6.23	5.88	7.23	6.29	6.61	🔒
google/gemini-2.0-flash-lite-001	Google	7.47	6.73	5.61	6.92	6.25	6.60	🔒
qwen/qwen-max	Alibaba	7.67	5.64	5.83	7.51	6.29	6.59	🔒
amazon/nova-pro-v1	Amazon	7.41	6.45	6.19	6.59	6.25	6.58	🔒
qwen/qwen-2.5-coder-32b-instruct	Alibaba	7.31	6.58	5.76	6.90	6.25	6.56	✅
openai/gpt-4-turbo	OpenAI	7.68	6.14	5.62	7.08	6.25	6.56	🔒
step-2-16k-202411	StepFun	7.75	5.70	5.77	7.28	6.21	6.54	🔒
openai/o3-mini	OpenAI	7.43	5.57	5.91	7.45	6.21	6.51	🔒
qwen/qwen-2.5-7b-instruct	Alibaba	7.17	6.66	6.02	6.34	6.17	6.47	✅
x-ai/grok-2-1212	xAI	7.62	5.76	5.82	6.95	6.17	6.47	🔒
openai/o3-mini-high	OpenAI	7.39	5.27	5.86	7.52	6.17	6.44	🔒
mistralai/mistral-small-24b-instruct-2501	Mistral AI	6.61	7.07	5.66	6.61	6.12	6.41	✅
anthropic/claude-3.5-haiku-20241022	Anthropic	7.40	5.61	5.05	7.72	6.08	6.37	🔒
anthropic/claude-3-opus	Anthropic	7.72	5.74	5.66	6.66	6.04	6.36	🔒
openai/gpt-4o-mini	OpenAI	7.45	5.28	5.86	6.67	6.00	6.25	🔒
meta-llama/llama-3.3-70b-instruct	Meta	6.88	5.68	5.69	6.32	5.75	6.06	✅
amazon/nova-lite-v1	Amazon	7.38	4.51	6.06	6.60	5.71	6.05	🔒

@article{ruan2024liveideabench, title={LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context}, author={Ruan, Kai and Wang, Xuan and Hong, Jixiang and Sun, Hao}, journal={arXiv preprint arXiv:2412.17596}, year={2024} }

🤖💡 LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context

Abstract

🧠✨🎉 News (2025/3/29): Latest Dataset Update (v2) on Hugging Face!

🧠✨🎉 News (2025/1/27)

🏆LiveIdeaBench💡

BibTeX

📻 LiveIdeaBench Podcast