New AI benchmarking tools evaluate real-world performance

However, said Agrawal, while it’s relatively easy to evaluate models on math or coding tasks, “assessing models in subjective areas such as reasoning is much more challenging. Reasoning models can be applied across a wide variety of contexts, and models may specialize in particular domains. In such cases, the necessary subjectivity is difficult to capture with any benchmark. Moreover, this approach requires frequent updates and expert input, which may be difficult to maintain and scale.”

Biases, he added, “may also creep into the evaluation, depending on the domain and geographic background of the experts. Overall, xbench is a strong first step, and over time, it may become the foundation for evaluating the practical impact and market readiness of AI agents.”

Hyoun Park, CEO and chief analyst at Amalgam Insights, has some concerns. “The effort to keep AI benchmarks up-to-date and to improve them over time is a welcome one, because dynamic benchmarks are necessary in a market where models are changing on a monthly or even weekly basis,” he said. “But my caveat is that AI benchmarks need to both be updated over time and actually change over time.”
