
"Without the MoE forcing, the overall inference time and token generation speed cratered; the model could barely manage an average of 1.5 tokens per second even for simple queries."
"With MoE forcing turned on (with the maximum number of layers supported, 30), token generation speed jumped to anywhere from 5 to 13 tokens per second, depending on the rest of the system's load."
"For faster time-to-first-token results, you can disable thinking, at the possible cost of less robust output."
"The resulting code and explanation were not significantly more advanced or detailed than the non-thinking version."
Gemma 4 uses a mixture-of-experts (MoE) design to improve model performance. An experimental setting in LM Studio lets users offload the MoE expert weights to the CPU, which conserves VRAM and helps when a model exceeds VRAM limits. Without this setting, the model struggled with token generation, averaging only 1.5 tokens per second; with MoE forcing enabled, speeds rose to 5-13 tokens per second. Disabling thinking yields a faster time to first token, but may compromise output quality, as the code-generation queries demonstrated.
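For readers running models outside LM Studio, a comparable effect can be sketched with llama.cpp's tensor-override flag. This is not from the article: the `-ot` (`--override-tensor`) flag and the tensor-name regex are assumptions about a recent llama.cpp build and a GGUF MoE model, and `gemma-moe.gguf` is a hypothetical filename; check `llama-server --help` for your build.

```shell
# Sketch: run all layers on the GPU by default (-ngl 99), but force the
# large MoE expert FFN tensors into CPU RAM, mirroring LM Studio's
# experimental "force expert weights onto CPU" toggle.
llama-server -m gemma-moe.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```

Because only a few experts are activated per token, keeping the expert tensors in system RAM costs far less than it would for dense weights, while attention and shared layers stay on the GPU.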
Read at InfoWorld