Context Window Limits When Using a GPU on Cloud Run, Part 2
Cloud Run で GPU を使うときの context window の限界 その2
English follows Japanese.
結論
Cloud Run で Gemma 3 の 4b, 12b の各モデルと NVIDIA L4 GPU を組み合わせたサービスを作成してアクセスした場合、num_ctx がそれぞれ 95934, 28062 までは GPU のみで処理できていると思われる。
llama3-gradient:8b では、ログ出力を元に見ると 21503 から 22528 の間で GPU のみで処理できる範囲を超えると考えられ、測定上の性能限界については「20970 辺りか、その少し手前に性能限界があると言って良さそう」と結論付けていた。ログ出力だけを見れば、モデルが変わったことで、より大きな num_ctx を GPU のみで処理できるようになった可能性がある。より大きな 12b のモデルでも GPU のみで処理できる範囲が広くなっており、Gemma 3 が優位である可能性がある。
はじめに
前回記事 Cloud Run で GPU を使うときの context window の限界 その1 では、モデルとして llama3-gradient:8b を使用し、Cloud Run での GPU 利用に関する調査結果を紹介した。
llama3-gradient:8b では、
- ログ出力を元に見ると、21503 から 22528 の間で GPU のみで処理できる範囲を超える
- 20970 辺りか、その少し手前に性能限界があると言って良さそう
と考えられた。
本記事では、Gemma 3 の gemma3:4b 及び gemma3:12b を使用し、同様の調査を行った結果を紹介する。
詳細
Dockerfile は以下のようにした。
FROM ollama/ollama
ENV OLLAMA_HOST 0.0.0.0:8080
ENV OLLAMA_MODELS /models
ENV OLLAMA_DEBUG false
ENV OLLAMA_KEEP_ALIVE -1
ENV MODEL1 gemma3:4b
ENV MODEL2 gemma3:12b
RUN ollama serve & sleep 5 && ollama pull $MODEL1 && ollama pull $MODEL2
ENTRYPOINT ["ollama", "serve"]こうして作成したイメージに対し、以下の設定で Cloud Run サービスを作成した。
- Startup CPU boost: Enabled
- Concurrency: 1
- CPU limit: 8
- Memory limit: 32 GiB
- GPU: 1 NVIDIA L4 (no zonal redundancy)
- OLLAMA_NUM_PARALLEL: 4
OLLAMA_NUM_PARALLEL については1の方が良かったかもしれないが、前回同様の値とした。Concurrency については、1コンテナ1アクセスにしたかったため、1を指定した。
num_ctx を変える様子や、コードサンプルは前回記事と同様であるため、省略する。
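参考までに、num_ctx を指定してサービスにアクセスする部分の最小限のスケッチを以下に示す。これはあくまで仮のもので、SERVICE_URL やプロンプトは仮置きであり、Cloud Run の認証処理も省略している(実際のコードは前回記事を参照)。
# num_ctx を指定して Ollama の /api/generate を叩く最小スケッチ
# (SERVICE_URL とプロンプトは仮置き。Cloud Run の認証は省略している)
import time
import requests

SERVICE_URL = "https://YOUR-CLOUD-RUN-SERVICE-URL"  # 仮のプレースホルダ

def generate(model: str, num_ctx: int, prompt: str) -> float:
    """指定した num_ctx で 1 リクエスト送り、所要時間(秒)を返す。"""
    start = time.time()
    resp = requests.post(
        f"{SERVICE_URL}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

print(f"{generate('gemma3:4b', 95934, 'こんにちは'):.2f} 秒")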
gemma3:4b
Gemma 3 モデルにおいては、
time=2025-08-11T15:01:46.131Z level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=GPU-093094b0-2571-df4b-e150-74f8cbee7791 parallel=4 available=23603838976 required="7.4 GiB"
time=2025-08-11T15:01:46.213Z level=INFO source=server.go:135 msg="system memory" total="31.3 GiB" free="30.5 GiB" free_swap="0 B"
time=2025-08-11T15:01:46.215Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.4 GiB" memory.required.partial="7.4 GiB" memory.required.kv="1.7 GiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T15:01:46.318Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 62048 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 36847"
time=2025-08-11T15:01:46.319Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-11T15:01:46.336Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-11T15:01:46.338Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:36847"
time=2025-08-11T15:01:46.436Z level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
time=2025-08-11T15:01:46.571Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
time=2025-08-11T15:01:47.630Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"
以上のようなアウトプットがあるが、num_ctx が増えると以下のように変わることがあった。
以下は、gemma3:4b で num_ctx が 220000 のとき(処理に102秒かかった)の例である:
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:365 msg="offloading 22 repeating layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:376 msg="offloaded 22/35 layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.4 GiB"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.2 GiB"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="9.0 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"
- "offloaded 22/35 layers to GPU" の部分が 35/35 にならない
- "offloading output layer to" の後が CPU になる
- model weights の buffer=CPU において size が 525.0 MiB 以外の値をとる
- compute graph の 2回目の backend=CPU buffer_type=CPU において size が 5.0 MiB 以外の値をとる
- compute graph の 2回目の backend=CUDA0 buffer_type=CUDA0 において、size が 1.1 GiB 以外の値をとる
ようになったのだ。ここでは、まず、
- offloading output layer to の後が CPU になる、または GPU へオフロードされる layer が 35/35 にならない
- model weights の buffer=CPU が 525.0 MiB より大きな値をとる
のいずれかに該当する最小の num_ctx を探索することにした(このリストの後に探索のスケッチを示す)。
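探索の考え方を示す仮のスケッチは以下の通りである。uses_cpu は、その num_ctx でアクセスした後に Cloud Run のログを確認し、上記のいずれかの条件に該当するかを返す仮の関数で、num_ctx に対して挙動が単調に変化することも仮定している。実際の手順とは異なる可能性がある。
# 最小の num_ctx を二分探索で絞り込む仮のスケッチ。
# uses_cpu(num_ctx) は、その値でアクセスした後にログを確認し、
# 上記の条件(output layer が CPU、35/35 にならない、
# model weights の buffer=CPU が 525.0 MiB より大きい)のいずれかに
# 該当すれば True を返す仮の関数とする。
def find_first_cpu_num_ctx(lo: int, hi: int, uses_cpu) -> int:
    """lo では GPU のみ、hi では CPU を使うと仮定し、
    CPU を使い始める最小の num_ctx を返す。"""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if uses_cpu(mid):
            hi = mid
        else:
            lo = mid
    return hi

# 例: gemma3:4b では 95934(GPU のみ)と 95936(CPU も使用)の間が境界だった。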
num_ctx 95934: 11.46秒(一部のみ抜粋)
time=2025-08-11T16:09:20.035Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383736 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 41987"
time=2025-08-11T16:09:20.228Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:20.301Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"
num_ctx 95936: 15.71秒(一部のみ抜粋)
time=2025-08-11T16:09:46.928Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383744 --batch-size 512 --n-gpu-layers 34 --threads 4 --parallel 4 --port 44603"
time=2025-08-11T16:09:47.143Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:376 msg="offloaded 34/35 layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.8 GiB"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.8 GiB"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
gemma3:12b
GPU のみの上限
num_ctx 28062: 41.2秒(一部のみ抜粋)
49/49 かつ、output layer も GPU で、model weights CPU は 787.5 MiB
time=2025-08-11T17:03:10.202Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.1 GiB" memory.required.partial="21.1 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[21.1 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:03:10.296Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112248 --batch-size 512 --n-gpu-layers 49 --threads 4 --parallel 4 --port 42727"
time=2025-08-11T17:03:10.513Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:376 msg="offloaded 49/49 layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="787.5 MiB"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="7.6 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
CPU を使い始める状態
num_ctx 28064: 23.5秒(一部のみ抜粋)
48/49 になり、output layer が CPU で、model weights CPU は 2.3 GiB
time=2025-08-11T17:12:17.380Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.0 GiB" memory.required.partial="19.4 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[19.4 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:12:17.477Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112256 --batch-size 512 --n-gpu-layers 48 --threads 4 --parallel 4 --port 38065"
time=2025-08-11T17:12:17.705Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:376 msg="offloaded 48/49 layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.3 GiB"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="6.0 GiB"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
所要時間は逆転してしまっている(CPU を使い始めた 28064 の方が 23.5 秒と短い)が、ログから明らかに、GPU のみの処理から、CPU も使い始める状態に変化していることがわかる。
まとめ
Cloud Run で gemma3:4b, gemma3:12b の各モデルと NVIDIA L4 GPU を組み合わせたサービスを作成してアクセスした場合、num_ctx がそれぞれ 95934, 28062 までは GPU のみで処理ができていそうであることが確認できた。
ただし、細かな原因は不明であるが、CPU を使い始めるからといって、必ずしもすぐに処理時間が長くなるわけではないこともわかった。
参考
- Cloud Run で GPU を使うときの context window の限界 その1
- Release Notes 2025-04-07 Configuring GPU in your Cloud Run service is now generally available (GA). https://cloud.google.com/run/docs/release-notes#April_07_2025
- Cloud Run GPU + Ollama gemma2 のパフォーマンスを図ってみる 佐藤慧太 https://zenn.dev/satohjohn/articles/912b4c718a8d74
- Google が Gemma 3 をリリース!-Cloud Run で動かしてみた- 村松 https://zenn.dev/cloud_ace/articles/cloud-run-gpu-ollama-2
Context Window Limits When Using a GPU on Cloud Run, Part 2
Conclusion
When creating and accessing a service on Cloud Run that combines the Gemma 3 4b and 12b models with an NVIDIA L4 GPU, processing appears to be handled by the GPU alone up to a num_ctx of 95934 and 28062, respectively.
With llama3-gradient:8b, the log output suggested that the GPU-only range was exceeded somewhere between 21503 and 22528, and the measured performance limit was concluded to be "around 20,970 or slightly below it." Judging from the logs alone, switching models may have allowed a larger num_ctx to be processed by the GPU alone. Even the larger 12b model has a wider GPU-only range, so Gemma 3 may have an advantage here.
Introduction
In the previous article Context window limits when using a GPU with Cloud Run Part 1, I introduced the investigation results regarding GPU usage on Cloud Run using llama3-gradient:8b as the model.
With llama3-gradient:8b, it was thought that:
- Based on the log output, it appears that the processing exceeds what the GPU alone can handle somewhere between num_ctx values of 21,503 and 22,528
- It seems safe to say that the performance limit is around 20,970 or slightly below it
In this article, I will introduce the results of a similar investigation using Gemma 3’s gemma3:4b and gemma3:12b.
Details
The Dockerfile was as follows.
FROM ollama/ollama
ENV OLLAMA_HOST 0.0.0.0:8080
ENV OLLAMA_MODELS /models
ENV OLLAMA_DEBUG false
ENV OLLAMA_KEEP_ALIVE -1
ENV MODEL1 gemma3:4b
ENV MODEL2 gemma3:12b
RUN ollama serve & sleep 5 && ollama pull $MODEL1 && ollama pull $MODEL2
ENTRYPOINT ["ollama", "serve"]For the image created this way, a Cloud Run service was created with the following settings.
- Startup CPU boost: Enabled
- Concurrency: 1
- CPU limit: 8
- Memory limit: 32 GiB
- GPU: 1 NVIDIA L4 (no zonal redundancy)
- OLLAMA_NUM_PARALLEL: 4
Regarding OLLAMA_NUM_PARALLEL, 1 might have been better, but I used the same value as last time. For Concurrency, I specified 1 because I wanted one access per container.
The process of changing num_ctx and the code samples are the same as in the previous article, so they are omitted.
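For reference, here is a minimal sketch of the part that sends a request with a given num_ctx. This is only an assumed sketch: SERVICE_URL and the prompt are placeholders, Cloud Run authentication is omitted, and the actual code is in the previous article.
# Minimal sketch of calling Ollama's /api/generate with a given num_ctx.
# SERVICE_URL and the prompt are placeholders; Cloud Run authentication is omitted.
import time
import requests

SERVICE_URL = "https://YOUR-CLOUD-RUN-SERVICE-URL"  # hypothetical placeholder

def generate(model: str, num_ctx: int, prompt: str) -> float:
    """Send one request with the given num_ctx and return the elapsed time in seconds."""
    start = time.time()
    resp = requests.post(
        f"{SERVICE_URL}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_ctx": num_ctx},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.time() - start

print(f"{generate('gemma3:4b', 95934, 'Hello'):.2f} s")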
gemma3:4b
With the Gemma 3 model, the output looks like the following:
time=2025-08-11T15:01:46.131Z level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=GPU-093094b0-2571-df4b-e150-74f8cbee7791 parallel=4 available=23603838976 required="7.4 GiB"
time=2025-08-11T15:01:46.213Z level=INFO source=server.go:135 msg="system memory" total="31.3 GiB" free="30.5 GiB" free_swap="0 B"
time=2025-08-11T15:01:46.215Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.4 GiB" memory.required.partial="7.4 GiB" memory.required.kv="1.7 GiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T15:01:46.318Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 62048 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 36847"
time=2025-08-11T15:01:46.319Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-11T15:01:46.336Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-11T15:01:46.338Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:36847"
time=2025-08-11T15:01:46.436Z level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
time=2025-08-11T15:01:46.571Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
time=2025-08-11T15:01:47.630Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"
However, as num_ctx increases, the output would sometimes change as follows.
The following is an example with gemma3:4b when num_ctx is 220000, which took 102 seconds:
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:365 msg="offloading 22 repeating layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:376 msg="offloaded 22/35 layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.4 GiB"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.2 GiB"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="9.0 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"
- The "offloaded 22/35 layers to GPU" part does not become 35/35
- What follows "offloading output layer to" becomes CPU
- In model weights’ buffer=CPU, the size takes a value other than 525.0 MiB
- In the second compute graph’s backend=CPU buffer_type=CPU, the size takes a value other than 5.0 MiB
- In the second compute graph’s backend=CUDA0 buffer_type=CUDA0, the size takes a value other than 1.1 GiB
This is what started to happen. Here, I first decided to search for the minimum num_ctx at which either of the following applies (a sketch of the search follows this list):
- what follows "offloading output layer to" is CPU, or fewer than 35/35 layers are offloaded to the GPU
- the model weights buffer=CPU takes a value larger than 525.0 MiB
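The following is a hypothetical sketch of the idea behind this search. uses_cpu is a placeholder function: it stands for sending a request with that num_ctx and then checking the Cloud Run logs for any of the conditions above, and monotonic behavior with respect to num_ctx is assumed; the actual procedure may have differed.
# Hypothetical sketch: bisect for the smallest num_ctx that triggers CPU use.
# uses_cpu(num_ctx) is a placeholder that returns True if, after a request with
# that value, the logs show any of the conditions above (output layer on CPU,
# fewer than 35/35 layers offloaded, or model weights buffer=CPU above 525.0 MiB).
def find_first_cpu_num_ctx(lo: int, hi: int, uses_cpu) -> int:
    """Assuming GPU-only at lo and CPU use at hi, return the smallest
    num_ctx at which the CPU starts to be used."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if uses_cpu(mid):
            hi = mid
        else:
            lo = mid
    return hi

# For example, with gemma3:4b the boundary fell between 95934 (GPU only)
# and 95936 (CPU also used).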
num_ctx 95934: 11.46 seconds (excerpt only)
time=2025-08-11T16:09:20.035Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383736 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 41987"
time=2025-08-11T16:09:20.228Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:20.301Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"
num_ctx 95936: 15.71 seconds (excerpt only)
time=2025-08-11T16:09:46.928Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383744 --batch-size 512 --n-gpu-layers 34 --threads 4 --parallel 4 --port 44603"
time=2025-08-11T16:09:47.143Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:376 msg="offloaded 34/35 layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.8 GiB"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.8 GiB"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
gemma3:12b
GPU-only limit
num_ctx 28062: 41.2 seconds (excerpt only)
49/49 layers, the output layer is also on the GPU, and model weights buffer=CPU is 787.5 MiB
time=2025-08-11T17:03:10.202Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.1 GiB" memory.required.partial="21.1 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[21.1 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:03:10.296Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112248 --batch-size 512 --n-gpu-layers 49 --threads 4 --parallel 4 --port 42727"
time=2025-08-11T17:03:10.513Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:376 msg="offloaded 49/49 layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="787.5 MiB"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="7.6 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"
State where CPU starts to be used
num_ctx 28064: 23.5 seconds (excerpt only)
The offload becomes 48/49, the output layer is on the CPU, and model weights buffer=CPU is 2.3 GiB
time=2025-08-11T17:12:17.380Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.0 GiB" memory.required.partial="19.4 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[19.4 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:12:17.477Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112256 --batch-size 512 --n-gpu-layers 48 --threads 4 --parallel 4 --port 38065"
time=2025-08-11T17:12:17.705Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:376 msg="offloaded 48/49 layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.3 GiB"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="6.0 GiB"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
Although the elapsed times are reversed (the 28064 run, which starts using the CPU, was faster at 23.5 seconds), the logs clearly show a change from GPU-only processing to a state that also uses the CPU.
Summary
When creating and accessing a service on Cloud Run that combines the gemma3:4b and gemma3:12b models with an NVIDIA L4 GPU, it was confirmed that processing appears to be possible with the GPU alone for num_ctx values up to 95934 and 28062, respectively.
However, although the detailed cause is unknown, it was also found that starting to use the CPU does not necessarily mean the processing time will immediately become longer.
References
- Context Window Limits When Using a GPU on Cloud Run, Part 1
- Release Notes 2025-04-07 Configuring GPU in your Cloud Run service is now generally available (GA). https://cloud.google.com/run/docs/release-notes#April_07_2025
- Measuring the performance of Cloud Run GPU + Ollama gemma2 by Keita Sato https://zenn.dev/satohjohn/articles/912b4c718a8d74
- Google Releases Gemma 3! -Tried Running It on Cloud Run- by Muramatsu. https://zenn.dev/cloud_ace/articles/cloud-run-gpu-ollama-2