Wolfram LLM Benchmarking Project

Using Wolfram Language to benchmark the performance of major LLMs

As heavy users and analysts of large language model (LLM) technology, we continually track the performance of LLMs. This project publishes our ongoing results, starting with one specific, well-characterized code generation task.

The task is to turn English-language specifications into Wolfram Language code. The test cases are exercises from Stephen Wolfram's An Elementary Introduction to the Wolfram Language. These exercises have been done online by millions of people, and we've developed effective tools for determining the functional correctness of code, which we're now applying to LLMs.
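To give a feel for what functional correctness means here, the following is a minimal sketch only. The specification is invented for illustration and is not quoted from the book, the names reference and candidate are hypothetical, and comparing evaluated results with === is an assumption about one simple way such a check could work, not a description of our actual tools.

```wolfram
(* Hypothetical specification, written for illustration only:
   "Make a list of the squares of the first 10 positive integers." *)

(* Assumed reference solution *)
reference = Table[n^2, {n, 10}];

(* Assumed LLM-generated candidate, written in a different style *)
candidate = Range[10]^2;

(* Assumed check: the code counts as functionally correct if its
   evaluated result is identical to the reference result *)
candidate === reference
```

Here the two expressions look different as code but evaluate to the same list, so the comparison gives True. A check of this kind judges what the code does rather than how it is written.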

This table and previous versions are available in computable form in the Wolfram Data Repository.

Find out how Wolfram Language can enhance your LLM results.

For LLM developers: contact us for the dataset and tools or to arrange for your LLM to be included.