Review on Elastic On-Device LLM Service

Motivation

Different LLM requests demand different levels of service, typically expressed as targets on time-to-first-token (TTFT) and time-per-output-token (TPOT).

For instance, voice assistants require low TTFT to feel responsive, while UI automation needs both low TTFT and an acceptable TPOT to keep interactions smooth.

A common practice today is to deploy a single, powerful LLM as a general task solver for all requests. But such a deployment lacks the elasticity to serve requests with diversified service-level objectives (SLOs) on inference latency.

An alternative is to deploy multiple LLMs of different sizes and capabilities, but this is costly in memory and inefficient on resource-constrained devices.

Design

This method rests on two novel insights:

  • One-shot reordering of permutation-consistent units
  • Dual-head tiny language model for prompt-model orchestration

One-shot reordering of permutation-consistent units
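
The idea, as I understand it, is that certain units inside a transformer (e.g., the intermediate neurons of an FFN layer) can be permuted without changing the model's output, as long as the connected weight matrices are permuted consistently. Reordering them once by importance means any prefix of units forms a valid, importance-ranked sub-model, so the service can shrink or grow the model at runtime with no retraining. Below is a minimal numpy sketch on a toy FFN; the L2-norm importance score is my assumption, not necessarily the paper's metric.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

# Toy FFN: y = relu(x @ W_up) @ W_down
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

def ffn(x, w_up, w_down, keep=None):
    """Run the FFN, optionally keeping only the first `keep` neurons."""
    if keep is not None:
        w_up, w_down = w_up[:, :keep], w_down[:keep, :]
    return np.maximum(x @ w_up, 0.0) @ w_down

# Assumed importance score per intermediate neuron: product of the
# L2 norms of its incoming and outgoing weights.
score = np.linalg.norm(W_up, axis=0) * np.linalg.norm(W_down, axis=1)
order = np.argsort(-score)  # most important neurons first

# One-shot reorder: permute the d_ff axis of both matrices identically.
# The full model's output is unchanged (permutation consistency)...
W_up_r, W_down_r = W_up[:, order], W_down[order, :]

x = rng.normal(size=(1, d_model))
assert np.allclose(ffn(x, W_up, W_down), ffn(x, W_up_r, W_down_r))

# ...and any prefix of neurons is now a smaller elastic sub-model.
y_half = ffn(x, W_up_r, W_down_r, keep=d_ff // 2)
```

The key property is that the reordering happens once, offline; at serving time, choosing a sub-model is just slicing a prefix of the reordered weights.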

Dual-head tiny language model for prompt-model orchestration

This post is licensed under CC BY 4.0 by the author.