
"Without the MoE forcing, the overall inference time and token generation speed cratered; the model could barely manage an average of 1.5 tokens per second even for simple queries."
"With MoE forcing turned on (with the maximum number of layers supported, 30), token generation speed jumped to anywhere from 5 to 13 tokens per second, depending on the rest of the system's load."
"For faster time-to-first-token results, you can disable thinking, at the possible cost of less robust output."
"The resulting code and explanation were not significantly more advanced or detailed than the non-thinking version."
Gemma 4 uses a mixture-of-experts (MoE) design to improve model performance. An experimental setting in LM Studio lets users offload the MoE expert weights to the CPU, which conserves VRAM and helps when a model exceeds VRAM limits. Without this setting, the model struggled with token generation, averaging only 1.5 tokens per second; with MoE forcing enabled, speeds rose to 5-13 tokens per second. Disabling thinking yields a faster time to first token, but may compromise output quality, as the code-generation queries demonstrated.
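For readers running models outside LM Studio, a comparable effect can be sketched with llama.cpp's tensor-override flag. This is not from the article: the `-ot` (`--override-tensor`) flag and the tensor-name regex are assumptions about a recent llama.cpp build and a GGUF MoE model, and `gemma-moe.gguf` is a hypothetical filename; check `llama-server --help` for your build.

```shell
# Sketch: run all layers on the GPU by default (-ngl 99), but force the
# large MoE expert FFN tensors into CPU RAM, mirroring LM Studio's
# experimental "force expert weights onto CPU" toggle.
llama-server -m gemma-moe.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU"
```

Because only a few experts are activated per token, keeping the expert tensors in system RAM costs far less than it would for dense weights, while attention and shared layers stay on the GPU.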
Read at InfoWorld