Cartero

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese

Tech Blogs Arxiv LLM Evaluation by Tiago Teixeira, Ana Carolina Erthal, Juan Belieni, Beatriz Canaverde, Diego Mesquita, Miguel Faria, Eliezer de Souza da Silva, André F. T. Martins 1 day ago

arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comp...

SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding

Tech Blogs Arxiv Large Language Models by Yijun Lin, Jinhao Sheng, Qingyue Cai, Feng Zhou 1 day ago

arXiv:2604.25925v1 Announce Type: new Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited...

Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects

Tech Blogs Arxiv LLM Evaluation by Dumitru Verşebeniuc, Martijn Elands, Sara Falahatkar, Chiara Magrone, Mohammad Falah, Martijn Boussé, Aki Härmä 1 day ago

arXiv:2604.25924v1 Announce Type: new Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressin...

Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing

Tech Blogs Arxiv LLM Evaluation by Ruchira Dhar, Anders Søgaard 1 day ago

arXiv:2604.25923v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP), a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring po...

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

Tech Blogs Arxiv AI Psychosis by Skylar DeTure 1 day ago

arXiv:2604.25922v1 Announce Type: new Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominan...

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Tech Blogs Arxiv Prompt Engineering by Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea 1 day ago

arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually pic...

Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats

Tech Blogs Arxiv Fine-tuning and PEFT 1 day ago

arXiv:2604.25920v1 Announce Type: new Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is ill-suited to the privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis of Biomedical Named Entity Recognition using lightweight LLMs, evaluating the impact of different output formats on model performance. The results r...