arxiv:2505.02819

ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

Published on Feb 19

Submitted by Ivan Sedykh on May 6, 2025


Authors:

Dmitriy Shopkhoev, Ammar Ali, Valentin Malykh, Stamatios Lefkimmiatis, Sergey Zagoruyko

Abstract

ReplaceMe is a training-free depth pruning method that replaces transformer blocks with linear operations using calibration data, achieving high compression ratios with minimal performance loss.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
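The calibration step described in the abstract can be pictured as an ordinary least-squares fit. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name `estimate_linear_map` and the synthetic activations are hypothetical, and the objective (Frobenius-norm least squares between activations entering and leaving the pruned span) is chosen here purely for illustration.

```python
import numpy as np

def estimate_linear_map(x_in, x_out):
    """Return T minimizing ||x_in @ T - x_out||_F (ordinary least squares).

    x_in  : (tokens, d) activations entering the pruned span (assumed layout)
    x_out : (tokens, d) activations leaving the pruned span
    """
    t, *_ = np.linalg.lstsq(x_in, x_out, rcond=None)
    return t

# Synthetic stand-in for calibration activations (hypothetical data).
rng = np.random.default_rng(0)
x_in = rng.normal(size=(512, 16))
t_true = rng.normal(size=(16, 16))
x_out = x_in @ t_true + 0.01 * rng.normal(size=(512, 16))

t_hat = estimate_linear_map(x_in, x_out)

# Relative reconstruction error of the fitted map on the calibration data.
rel_err = np.linalg.norm(x_in @ t_hat - x_out) / np.linalg.norm(x_out)
```

In a real model the two activation matrices would be collected by running the calibration dataset through the network; here they are generated so that the fit can be checked against a known map.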


Community


stefan-it

May 6, 2025

I found their repo here: https://github.com/mts-ai/ReplaceMe :)

razzant

May 10, 2025

Nice work! Very close to what was suggested in "Your Transformer is Secretly Linear"
https://huggingface.co/papers/2405.12250

ammarali32

Paper author May 10, 2025

Thank you very much for your comment and for drawing our attention to the work “Your Transformer is Secretly Linear.” We appreciate your suggestion to consider this related literature.

To clarify the distinction as we understand it, “Your Transformer is Secretly Linear” explores replacing individual transformer blocks with linear layers, applies a different metric to select layers, and adopts a unique approach to estimating the linear transformations based on a numerical solution.

By contrast, in our work, we focus on replacing a group of consecutive blocks, selected specifically based on the cosine distance criterion. Notably, we introduce the linear transformation only on the MLP side, ensuring that it can be merged into the original architecture seamlessly. Additionally, our paper offers both analytical and numerical results, accompanied by a detailed analysis of the type of linear transformation employed (please see the supplementary materials for further details).

We are grateful for your suggestion and will certainly include a discussion of “Your Transformer is Secretly Linear” in the revised version of our related work section.

Thank you again for your valuable feedback.
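The two points in the reply above (choosing a span of consecutive blocks by cosine distance, and placing the linear map on the MLP side so it can be merged) can be illustrated with a small sketch. This is an assumption-laden toy, not the authors' code: `mean_cosine_distance`, `pick_block_span`, the rule of taking the span with the smallest mean distance, and all tensors are hypothetical, and a row-vector convention (`activations @ weight`) is assumed.

```python
import numpy as np

def mean_cosine_distance(a, b):
    """Mean over tokens of 1 - cos(a_t, b_t); a and b have shape (tokens, d)."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(1.0 - num / den))

def pick_block_span(hidden_states, n_prune):
    """Pick the start index i whose span [i, i + n_prune) changes activations least.

    hidden_states[i] is assumed to hold the activations entering block i,
    with the final entry holding the output of the last block.
    """
    distances = [
        mean_cosine_distance(hidden_states[i], hidden_states[i + n_prune])
        for i in range(len(hidden_states) - n_prune)
    ]
    return int(np.argmin(distances)), distances

# Toy network of 6 blocks; blocks 3 and 4 barely change the activations.
rng = np.random.default_rng(0)
tokens, d, d_ff = 64, 32, 128
step_scales = [1.0, 1.0, 1.0, 0.01, 0.01, 1.0]
hidden_states = [rng.normal(size=(tokens, d))]
for scale in step_scales:
    hidden_states.append(hidden_states[-1] + scale * rng.normal(size=(tokens, d)))

start, distances = pick_block_span(hidden_states, n_prune=2)

# Folding a linear map T into an MLP down-projection adds no parameters:
# (h @ W_down) @ T equals h @ (W_down @ T) by associativity.
w_down = rng.normal(size=(d_ff, d))
t = rng.normal(size=(d, d))
mlp_hidden = rng.normal(size=(tokens, d_ff))
out_separate = (mlp_hidden @ w_down) @ t
out_merged = mlp_hidden @ (w_down @ t)
```

By construction, the span starting at index 3 is selected in this toy setup, and the merged and unmerged outputs agree up to floating-point error. If the projection also carried a bias, that bias would need to be multiplied by the same map for the equivalence to hold.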