Yesterday's Stories
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
arXiv:2604.25926v1 Announce Type: new Abstract: The use of large language models (LLMs) for complex mathematical reasoning is an emergent area of research, with fast progress in methods, models, and benchmark datasets. However, most mathematical reasoning evaluations exhibit a significant linguistic bias, with the vast majority of benchmark datasets being exclusively in English or (at best) translated from English. We address this limitation by introducing Math-PT, a novel dataset comp...
SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
arXiv:2604.25925v1 Announce Type: new Abstract: Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited...
Generative AI-Based Virtual Assistant using Retrieval-Augmented Generation: An evaluation study for bachelor projects
arXiv:2604.25924v1 Announce Type: new Abstract: Large Language Models have been increasingly employed in the creation of Virtual Assistants due to their ability to generate human-like text and handle complex inquiries. While these models hold great promise, challenges such as hallucinations, missing information, and the difficulty of providing accurate and context-specific responses persist, particularly when applied to highly specialized content domains. In this paper, we focus on addressin...
Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing
arXiv:2604.25923v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have prompted a growing body of work that questions the methodology of prevailing evaluation practices. However, many such critiques have already been extensively debated in natural language processing (NLP): a field with a long history of methodological reflection on evaluation. We conduct a scoping review of research on evaluation concerns in NLP and develop a taxonomy, synthesizing recurring po...
Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models
arXiv:2604.25922v1 Announce Type: new Abstract: We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominan...
One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety
arXiv:2604.25921v1 Announce Type: new Abstract: Large Language Models (LLMs) are trained to refuse harmful requests, yet they remain vulnerable to jailbreak attacks that exploit weaknesses in conversational safety mechanisms. We introduce Incremental Completion Decomposition (ICD), a trajectory-based jailbreak strategy that elicits a sequence of single-word continuations related to a malicious request before eliciting the full response. In addition, we propose variants of ICD by manually pic...
Analysing Lightweight Large Language Models for Biomedical Named Entity Recognition on Diverse Output Formats
arXiv:2604.25920v1 Announce Type: new Abstract: Despite their strong linguistic capabilities, Large Language Models (LLMs) are computationally demanding and require substantial resources for fine-tuning, which is ill-suited to the privacy and budget constraints of many healthcare settings. To address this, we present an experimental analysis of Biomedical Named Entity Recognition using lightweight LLMs, evaluating the impact of different output formats on model performance. The results r...
SoftBank is creating a robotics company that builds data centers — and already eyeing a $100B IPO
Open-source briefing packets and citizen-action toolkits
Zulip 12.0 Released
Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs
Biology is a Burrito: A text- and visual-based journey through a living cell
The Zig project's rationale for their firm anti-AI contribution policy
Where the Goblins Came From
Show HN: Qumulator – quantum circuit simulator, 1000 qubits, no GPU
Have You Seen the New Excel?
HERMES.md in commit messages causes requests to route to extra usage billing
Mike: open-source legal AI
A Grounded Conceptual Model for Ownership Types in Rust
Joby Kicks Off NYC Electric Air Taxi Demos with Historic JFK Flight