Deployment and Performance Evaluation of Dynamic Mixture-of-Experts on Embedded GPUs


This thesis explores the deployment of Dynamic Mixture-of-Experts (MoE) models on embedded GPUs, focusing on how dynamic computation allocation affects model performance. Unlike traditional neural networks with fixed computation paths, a Dynamic MoE model can adjust its computational load at runtime. For instance, simple queries to a language model may activate fewer experts, while complex queries engage more experts, balancing efficiency and performance. Moreover, a runtime scheduler can set dynamic computation budgets for MoE models based on current system conditions, such as latency and memory-bandwidth constraints.
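For illustration only (not part of the thesis description), the sketch below shows how top-k expert routing with a runtime-adjustable budget could look in PyTorch. All names (`DynamicMoE`, `top_k`, layer sizes) are hypothetical assumptions, not an existing codebase or the method to be developed in the thesis.

```python
# Minimal sketch of a Dynamic MoE layer whose per-token expert budget (top_k)
# can be changed at runtime. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # scores each expert per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor, top_k: int) -> torch.Tensor:
        # x: (num_tokens, d_model); top_k is the compute budget set at runtime
        scores = F.softmax(self.router(x), dim=-1)            # (num_tokens, num_experts)
        weights, idx = scores.topk(top_k, dim=-1)             # keep only the top_k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize kept weights
        out = torch.zeros_like(x)
        for slot in range(top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# A runtime scheduler could lower top_k under tight latency or memory conditions:
layer = DynamicMoE(d_model=256, d_hidden=512, num_experts=8)
tokens = torch.randn(16, 256)
fast = layer(tokens, top_k=1)      # strict budget: one expert per token
accurate = layer(tokens, top_k=4)  # relaxed budget: four experts per token
```

Lowering `top_k` shrinks the amount of expert computation per token, which is the lever a scheduler would use when the system imposes a tighter budget.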

This study investigates how varying computation budgets affect both model accuracy and inference latency. A higher computation budget may improve accuracy by enabling more experts to process complex inputs, but it also increases latency and resource usage; conversely, a stricter budget reduces latency and resource usage at the potential cost of lower accuracy. By analyzing these trade-offs, the research aims to identify strategies for balancing efficiency and performance in real-time AI applications.
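As a purely illustrative sketch of how such a trade-off study might be set up, the loop below sweeps the expert budget of the hypothetical `DynamicMoE` layer from the previous example and records per-batch latency; accuracy would additionally be measured on a task-specific benchmark. All names and values are assumptions, not the evaluation protocol of the thesis.

```python
# Hypothetical budget sweep: measure inference latency for different expert budgets.
import time
import torch

layer = DynamicMoE(d_model=256, d_hidden=512, num_experts=8).eval()
tokens = torch.randn(128, 256)

for k in (1, 2, 4, 8):                           # candidate computation budgets
    with torch.no_grad():
        start = time.perf_counter()
        _ = layer(tokens, top_k=k)
        latency_ms = (time.perf_counter() - start) * 1e3
    print(f"top_k={k}: {latency_ms:.2f} ms")     # pair with task accuracy for the trade-off curve
```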

Requirements

Basic knowledge of

- LLMs

- Mixture-of-Experts

- AI model deployment

- embedded systems

A solid programming background is also required.

Please send your application with a CV and transcripts (in English) by email to binqi.sun@tum.de.

(Students from CIT are welcome to apply for this topic as an IDP or Master's thesis)

Thesis Type

Bachelor's thesis | Semester thesis | Master's thesis

Contact

Binqi Sun

Building 5501, Room 2.102a

+49 (89) 289 - 55183

binqi.sun@tum.de