{"id":614,"date":"2025-08-17T17:21:22","date_gmt":"2025-08-17T08:21:22","guid":{"rendered":"https:\/\/tako.nakano.net\/blog\/?p=614"},"modified":"2025-08-17T17:21:22","modified_gmt":"2025-08-17T08:21:22","slug":"cloud-run-gpu-context-part-2","status":"publish","type":"post","link":"https:\/\/tako.nakano.net\/blog\/2025\/08\/cloud-run-gpu-context-part-2\/","title":{"rendered":"Context Window Limits When Using a GPU on Cloud Run, Part 2"},"content":{"rendered":"<h1>Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e2<\/h1>\n<p>English follows Japanese.<\/p>\n<h2>\u7d50\u8ad6<\/h2>\n<p>Cloud Run \u3067 <a href=\"https:\/\/ollama.com\/library\/gemma3\">Gemma 3<\/a> \u306e 4b, 12b \u306e\u5404\u30e2\u30c7\u30eb\u3068\u3001NVIDIA L4 GPU \u3092\u7d44\u307f\u5408\u308f\u305b\u305f\u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u3066\u30a2\u30af\u30bb\u30b9\u3057\u305f\u5834\u5408\u3001<code>num_ctx<\/code> \u304c\u305d\u308c\u305e\u308c 95934, 28064 \u307e\u3067\u306f GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u3066\u3044\u308b\u3068\u601d\u308f\u308c\u308b\u3002<\/p>\n<p><code>llama3-gradient:8b<\/code> \u3067\u306f\u3001\u30ed\u30b0\u51fa\u529b\u3092\u5143\u306b\u898b\u308b\u3068\u300121503 \u304b\u3089 22528 \u306e\u9593\u3067 GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b\u3068\u8003\u3048\u3089\u308c\u305f\u3002\u30ed\u30b0\u51fa\u529b\u3060\u3051\u3092\u898b\u305f\u5834\u5408\u3001\u30e2\u30c7\u30eb\u304c\u5909\u308f\u3063\u305f\u3053\u3068\u3067\u3001\u3088\u308a\u5927\u304d\u306a <code>num_ctx<\/code> \u304c GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u3088\u3046\u306b\u306a\u3063\u305f\u53ef\u80fd\u6027\u304c\u3042\u308b\u3002\u307e\u305f\u3001\u6e2c\u5b9a\u4e0a\u306e\u6027\u80fd\u9650\u754c\u306f\u300c20970 \u8fba\u308a\u304b\u3001\u305d\u306e\u5c11\u3057\u624b\u524d\u306b\u6027\u80fd\u9650\u754c\u304c\u3042\u308b\u3068\u8a00\u3063\u3066\u826f\u3055\u305d\u3046\u300d\u3068\u7d50\u8ad6\u4ed8\u3051\u3066\u3044\u305f\u300212b \u306e\u30e2\u30c7\u30eb\u306f\u3088\u308a\u5927\u304d\u306a\u30e2\u30c7\u30eb\u306b\u3082\u304b\u304b\u308f\u3089\u305a\u3001GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u304c\u5e83\u304f\u306a\u3063\u3066\u304a\u308a\u3001Gemma 3 \u304c\u512a\u4f4d\u3067\u3042\u308b\u53ef\u80fd\u6027\u304c\u3042\u308b\u3002<\/p>\n<h2>\u306f\u3058\u3081\u306b<\/h2>\n<p>\u524d\u56de\u8a18\u4e8b <a href=\"https:\/\/tako.nakano.net\/blog\/2025\/08\/cloud-run-gpu-context-part-1\/\">Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e1<\/a> \u3067\u306f\u3001\u30e2\u30c7\u30eb\u3068\u3057\u3066 <code>llama3-gradient:8b<\/code> \u3092\u4f7f\u7528\u3057\u3001Cloud Run \u3067\u306e GPU \u5229\u7528\u306b\u95a2\u3059\u308b\u8abf\u67fb\u7d50\u679c\u3092\u7d39\u4ecb\u3057\u305f\u3002<\/p>\n<p><code>llama3-gradient:8b<\/code> \u3067\u306f\u3001<\/p>\n<ol>\n<li>\u30ed\u30b0\u51fa\u529b\u3092\u5143\u306b\u898b\u308b\u3068\u300121503 \u304b\u3089 22528 \u306e\u9593\u3067 GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b<\/li>\n<li>20970 \u8fba\u308a\u304b\u3001\u305d\u306e\u5c11\u3057\u624b\u524d\u306b\u6027\u80fd\u9650\u754c\u304c\u3042\u308b\u3068\u8a00\u3063\u3066\u826f\u3055\u305d\u3046<\/li>\n<\/ol>\n<p>\u3068\u8003\u3048\u3089\u308c\u305f\u3002<\/p>\n<p>\u672c\u8a18\u4e8b\u3067\u306f\u3001Gemma 3 \u306e <code>gemma3:4b<\/code> \u53ca\u3073 <code>gemma3:12b<\/code> 
\u3092\u4f7f\u7528\u3057\u3001\u540c\u69d8\u306e\u8abf\u67fb\u3092\u884c\u3063\u305f\u7d50\u679c\u3092\u7d39\u4ecb\u3059\u308b\u3002<\/p>\n<h2>\u8a73\u7d30<\/h2>\n<p><code>Dockerfile<\/code> \u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u3057\u305f\u3002<\/p>\n<pre><code class=\"language-Dockerfile:Dockerfile\">FROM ollama\/ollama\nENV OLLAMA_HOST 0.0.0.0:8080\nENV OLLAMA_MODELS \/models\nENV OLLAMA_DEBUG false\nENV OLLAMA_KEEP_ALIVE -1 \nENV MODEL1 gemma3:4b\nENV MODEL2 gemma3:12b\nRUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL1 &amp;&amp; ollama pull $MODEL2\nENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]<\/code><\/pre>\n<p>\u3053\u3046\u3057\u3066\u4f5c\u6210\u3057\u305f\u30a4\u30e1\u30fc\u30b8\u306b\u5bfe\u3057\u3001\u4ee5\u4e0b\u306e\u8a2d\u5b9a\u3067 Cloud Run \u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u305f\u3002<\/p>\n<ul>\n<li>Startup CPU boost: Enabled<\/li>\n<li>Concurrency: 1<\/li>\n<li>CPU limit: 8<\/li>\n<li>Memory limit: 32 GiB<\/li>\n<li>GPU: 1 NVIDIA L4 (no zonal redundancy)<\/li>\n<li><code>OLLAMA_NUM_PARALLEL<\/code>: 4<\/li>\n<\/ul>\n<p><code>OLLAMA_NUM_PARALLEL<\/code> \u306b\u3064\u3044\u3066\u306f1\u306e\u65b9\u304c\u826f\u304b\u3063\u305f\u304b\u3082\u3057\u308c\u306a\u3044\u304c\u3001\u524d\u56de\u540c\u69d8\u306e\u5024\u3068\u3057\u305f\u3002Concurrency \u306b\u3064\u3044\u3066\u306f\u30011\u30b3\u30f3\u30c6\u30ca1\u30a2\u30af\u30bb\u30b9\u306b\u3057\u305f\u304b\u3063\u305f\u305f\u3081\u30011\u3092\u6307\u5b9a\u3057\u305f\u3002<\/p>\n<p><code>num_ctx<\/code> \u3092\u5909\u3048\u308b\u69d8\u5b50\u3084\u3001\u30b3\u30fc\u30c9\u30b5\u30f3\u30d7\u30eb\u306f\u524d\u56de\u8a18\u4e8b\u3068\u540c\u69d8\u3067\u3042\u308b\u305f\u3081\u3001\u7701\u7565\u3059\u308b\u3002<\/p>\n<h3><code>gemma3:4b<\/code><\/h3>\n<p>Gemma 3 \u30e2\u30c7\u30eb\u306b\u304a\u3044\u3066\u306f\u3001<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T15:01:46.131Z level=INFO source=sched.go:786 msg=&quot;new model will fit in available VRAM in single GPU, loading&quot; model=\/models\/blobs\/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=GPU-093094b0-2571-df4b-e150-74f8cbee7791 parallel=4 available=23603838976 required=&quot;7.4 GiB&quot;\ntime=2025-08-11T15:01:46.213Z level=INFO source=server.go:135 msg=&quot;system memory&quot; total=&quot;31.3 GiB&quot; free=&quot;30.5 GiB&quot; free_swap=&quot;0 B&quot;\ntime=2025-08-11T15:01:46.215Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=35 layers.offload=35 layers.split=&quot;&quot; memory.available=&quot;[22.0 GiB]&quot; memory.gpu_overhead=&quot;0 B&quot; memory.required.full=&quot;7.4 GiB&quot; memory.required.partial=&quot;7.4 GiB&quot; memory.required.kv=&quot;1.7 GiB&quot; memory.required.allocations=&quot;[7.4 GiB]&quot; memory.weights.total=&quot;2.3 GiB&quot; memory.weights.repeating=&quot;1.8 GiB&quot; memory.weights.nonrepeating=&quot;525.0 MiB&quot; memory.graph.full=&quot;1.1 GiB&quot; memory.graph.partial=&quot;1.6 GiB&quot; projector.weights=&quot;795.9 MiB&quot; projector.graph=&quot;1.0 GiB&quot;\ntime=2025-08-11T15:01:46.318Z level=INFO source=server.go:438 msg=&quot;starting llama server&quot; cmd=&quot;\/usr\/bin\/ollama runner --ollama-engine --model \/models\/blobs\/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 62048 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 36847&quot;\ntime=2025-08-11T15:01:46.319Z level=INFO source=sched.go:481 msg=&quot;loaded runners&quot; 
count=1\ntime=2025-08-11T15:01:46.319Z level=INFO source=server.go:598 msg=&quot;waiting for llama runner to start responding&quot;\ntime=2025-08-11T15:01:46.319Z level=INFO source=server.go:632 msg=&quot;waiting for server to become available&quot; status=&quot;llm server not responding&quot;\ntime=2025-08-11T15:01:46.336Z level=INFO source=runner.go:925 msg=&quot;starting ollama engine&quot;\ntime=2025-08-11T15:01:46.338Z level=INFO source=runner.go:983 msg=&quot;Server listening on 127.0.0.1:36847&quot;\ntime=2025-08-11T15:01:46.436Z level=INFO source=ggml.go:92 msg=&quot;&quot; architecture=gemma3 file_type=Q4_K_M name=&quot;&quot; description=&quot;&quot; num_tensors=883 num_key_values=36\ntime=2025-08-11T15:01:46.571Z level=INFO source=server.go:632 msg=&quot;waiting for server to become available&quot; status=&quot;llm server loading model&quot;\nggml_cuda_init: GGML_CUDA_FORCE_MMQ: no\nggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no\nggml_cuda_init: found 1 CUDA devices:\nDevice 0: NVIDIA L4, compute capability 8.9, VMM: yes\nload_backend: loaded CUDA backend from \/usr\/lib\/ollama\/libggml-cuda.so\nload_backend: loaded CPU backend from \/usr\/lib\/ollama\/libggml-cpu-skylakex.so\ntime=2025-08-11T15:01:47.630Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)\ntime=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:365 msg=&quot;offloading 34 repeating layers to GPU&quot;\ntime=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:371 msg=&quot;offloading output layer to GPU&quot;\ntime=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:376 msg=&quot;offloaded 35\/35 layers to GPU&quot;\ntime=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;3.1 GiB&quot;\ntime=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;525.0 MiB&quot;\ntime=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;1.1 GiB&quot;\ntime=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;0 B&quot;\ntime=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;1.1 GiB&quot;\ntime=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;5.0 MiB&quot;<\/code><\/pre>\n<p>\u306e\u3088\u3046\u306a\u30a2\u30a6\u30c8\u30d7\u30c3\u30c8\u304c\u3042\u308b\u304c\u3001<code>num_ctx<\/code> \u304c\u5897\u3048\u308b\u3068\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u5909\u308f\u308b\u3053\u3068\u304c\u3042\u3063\u305f\u3002<\/p>\n<p>\u4ee5\u4e0b\u306f\u3001102\u79d2\u304b\u304b\u3063\u305f\u3001<code>gemma3:4b<\/code> \u3067\u3001<code>num_ctx<\/code> \u304c 220000 \u306e\u3068\u304d\u306e\u4f8b\u3067\u3042\u308b:<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:365 msg=&quot;offloading 22 repeating layers to GPU&quot;\ntime=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:369 msg=&quot;offloading output layer to CPU&quot;\ntime=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:376 msg=&quot;offloaded 22\/35 layers to 
GPU&quot;\ntime=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;2.4 GiB&quot;\ntime=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;1.2 GiB&quot;\ntime=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;0 B&quot;\ntime=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;1.1 GiB&quot;\ntime=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;9.0 GiB&quot;\ntime=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;8.0 GiB&quot;<\/code><\/pre>\n<ol>\n<li>&quot;offloaded 22\/35 layers to GPU&quot; \u306e\u90e8\u5206\u304c 35\/35 \u306b\u306a\u3089\u306a\u3044<\/li>\n<li>&quot;offloading output layer to&quot; \u306e\u5f8c\u304c CPU \u306b\u306a\u308b<\/li>\n<li>model weights \u306e buffer=CPU \u306b\u304a\u3044\u3066 size \u304c 525.0 MiB \u4ee5\u5916\u306e\u5024\u3092\u3068\u308b<\/li>\n<li>compute graph \u306e 2\u56de\u76ee\u306e backend=CPU buffer_type=CPU \u306b\u304a\u3044\u3066 size \u304c 5.0 MiB \u4ee5\u5916\u306e\u5024\u3092\u3068\u308b<\/li>\n<li>compute graph \u306e 2\u56de\u76ee\u306e backend=CUDA0 buffer_type=CUDA0 \u306b\u304a\u3044\u3066\u3001size \u304c 1.1 GiB \u4ee5\u5916\u306e\u5024\u3092\u3068\u308b<\/li>\n<\/ol>\n<p>\u3088\u3046\u306b\u306a\u3063\u305f\u306e\u3060\u3002\u3053\u3053\u3067\u306f\u3001\u307e\u305a\u3001<\/p>\n<ol>\n<li>offloading output layer to \u306e\u5f8c\u304c CPU \u3067\u3042\u308b\u304b\u3001layer \u304c 35\/35 GPU \u3067\u306f\u306a\u3044<\/li>\n<li>model weights \u306e\u65b9\u306b\u3064\u3044\u3066 525.0 MiB \u3088\u308a\u5927\u304d\u306a\u5024\u3092\u3068\u308b<\/li>\n<\/ol>\n<p>\u3046\u3061\u3067\u3001\u6700\u5c0f\u306e <code>num_ctx<\/code> \u3092\u63a2\u7d22\u3059\u308b\u3053\u3068\u306b\u3057\u305f\u3002<\/p>\n<p>95934 11.46\u79d2\uff08\u4e00\u90e8\u306e\u307f\u629c\u7c8b\uff09<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T16:09:20.035Z level=INFO source=server.go:438 msg=&quot;starting llama server&quot; cmd=&quot;\/usr\/bin\/ollama runner --ollama-engine --model \/models\/blobs\/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383736 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 41987&quot;\ntime=2025-08-11T16:09:20.228Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)\ntime=2025-08-11T16:09:20.301Z level=INFO source=ggml.go:365 msg=&quot;offloading 34 repeating layers to GPU&quot;\ntime=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:371 msg=&quot;offloading output layer to GPU&quot;\ntime=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:376 msg=&quot;offloaded 35\/35 layers to GPU&quot;\ntime=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;525.0 MiB&quot;\ntime=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;3.1 GiB&quot;\ntime=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 
msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;1.1 GiB&quot;\ntime=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;0 B&quot;\ntime=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;6.6 GiB&quot;\ntime=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;5.0 MiB&quot;<\/code><\/pre>\n<p>95936 15.71\u79d2\uff08\u4e00\u90e8\u306e\u307f\u629c\u7c8b\uff09<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T16:09:46.928Z level=INFO source=server.go:438 msg=&quot;starting llama server&quot; cmd=&quot;\/usr\/bin\/ollama runner --ollama-engine --model \/models\/blobs\/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383744 --batch-size 512 --n-gpu-layers 34 --threads 4 --parallel 4 --port 44603&quot;\ntime=2025-08-11T16:09:47.143Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)\ntime=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:365 msg=&quot;offloading 34 repeating layers to GPU&quot;\ntime=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:369 msg=&quot;offloading output layer to CPU&quot;\ntime=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:376 msg=&quot;offloaded 34\/35 layers to GPU&quot;\ntime=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;1.8 GiB&quot;\ntime=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;1.8 GiB&quot;\ntime=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;0 B&quot;\ntime=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;1.1 GiB&quot;\ntime=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;6.6 GiB&quot;\ntime=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;1.1 GiB&quot;<\/code><\/pre>\n<h3><code>gemma3:12b<\/code><\/h3>\n<p>GPU \u306e\u307f\u306e\u4e0a\u9650<br \/>\n28062 41.2\u79d2\uff08\u4e00\u90e8\u306e\u307f\u629c\u7c8b\uff09<br \/>\n49\/49 \u304b\u3064\u3001output layer \u3082 GPU \u3067\u3001model weights CPU \u306f 787.5 MiB<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T17:03:10.202Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split=&quot;&quot; memory.available=&quot;[22.0 GiB]&quot; memory.gpu_overhead=&quot;0 B&quot; memory.required.full=&quot;21.1 GiB&quot; memory.required.partial=&quot;21.1 GiB&quot; memory.required.kv=&quot;8.3 GiB&quot; memory.required.allocations=&quot;[21.1 GiB]&quot; memory.weights.total=&quot;6.8 GiB&quot; memory.weights.repeating=&quot;6.0 GiB&quot; memory.weights.nonrepeating=&quot;787.5 MiB&quot; memory.graph.full=&quot;3.7 GiB&quot; memory.graph.partial=&quot;4.5 GiB&quot; projector.weights=&quot;795.9 MiB&quot; projector.graph=&quot;1.0 GiB&quot;\ntime=2025-08-11T17:03:10.296Z 
level=INFO source=server.go:438 msg=&quot;starting llama server&quot; cmd=&quot;\/usr\/bin\/ollama runner --ollama-engine --model \/models\/blobs\/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112248 --batch-size 512 --n-gpu-layers 49 --threads 4 --parallel 4 --port 42727&quot;\ntime=2025-08-11T17:03:10.513Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)\ntime=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:365 msg=&quot;offloading 48 repeating layers to GPU&quot;\ntime=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:371 msg=&quot;offloading output layer to GPU&quot;\ntime=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:376 msg=&quot;offloaded 49\/49 layers to GPU&quot;\ntime=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;787.5 MiB&quot;\ntime=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;7.6 GiB&quot;\ntime=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;1.1 GiB&quot;\ntime=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;0 B&quot;\ntime=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;3.7 GiB&quot;\ntime=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;7.5 MiB&quot;<\/code><\/pre>\n<p>CPU \u3092\u4f7f\u3044\u59cb\u3081\u308b\u72b6\u614b<br \/>\n28064 23.5\u79d2\uff08\u4e00\u90e8\u306e\u307f\u629c\u7c8b\uff09<\/p>\n<p>48\/49 \u306b\u306a\u308a\u3001output layer \u304c CPU \u3067\u3001model weights CPU \u306f 2.3 GiB<\/p>\n<pre><code class=\"language-text\">time=2025-08-11T17:12:17.380Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split=&quot;&quot; memory.available=&quot;[22.0 GiB]&quot; memory.gpu_overhead=&quot;0 B&quot; memory.required.full=&quot;22.0 GiB&quot; memory.required.partial=&quot;19.4 GiB&quot; memory.required.kv=&quot;8.3 GiB&quot; memory.required.allocations=&quot;[19.4 GiB]&quot; memory.weights.total=&quot;6.8 GiB&quot; memory.weights.repeating=&quot;6.0 GiB&quot; memory.weights.nonrepeating=&quot;787.5 MiB&quot; memory.graph.full=&quot;3.7 GiB&quot; memory.graph.partial=&quot;4.5 GiB&quot; projector.weights=&quot;795.9 MiB&quot; projector.graph=&quot;1.0 GiB&quot;\ntime=2025-08-11T17:12:17.477Z level=INFO source=server.go:438 msg=&quot;starting llama server&quot; cmd=&quot;\/usr\/bin\/ollama runner --ollama-engine --model \/models\/blobs\/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112256 --batch-size 512 --n-gpu-layers 48 --threads 4 --parallel 4 --port 38065&quot;\ntime=2025-08-11T17:12:17.705Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)\ntime=2025-08-11T17:12:17.772Z 
level=INFO source=ggml.go:365 msg=&quot;offloading 48 repeating layers to GPU&quot;\ntime=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:369 msg=&quot;offloading output layer to CPU&quot;\ntime=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:376 msg=&quot;offloaded 48\/49 layers to GPU&quot;\ntime=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CPU size=&quot;2.3 GiB&quot;\ntime=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg=&quot;model weights&quot; buffer=CUDA0 size=&quot;6.0 GiB&quot;\ntime=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;0 B&quot;\ntime=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;1.1 GiB&quot;\ntime=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CUDA0 buffer_type=CUDA0 size=&quot;3.7 GiB&quot;\ntime=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg=&quot;compute graph&quot; backend=CPU buffer_type=CPU size=&quot;1.1 GiB&quot;<\/code><\/pre>\n<p>\u6240\u8981\u6642\u9593\u306f\u9006\u8ee2\u3057\u3066\u3057\u307e\u3063\u3066\u3044\u308b\u304c\u3001\u30ed\u30b0\u304b\u3089\u660e\u3089\u304b\u306b\u3001GPU \u306e\u307f\u306e\u51e6\u7406\u304b\u3089\u3001CPU \u3082\u4f7f\u3044\u59cb\u3081\u308b\u72b6\u614b\u306b\u5909\u5316\u3057\u3066\u3044\u308b\u3053\u3068\u304c\u308f\u304b\u308b\u3002<\/p>\n<h2>\u307e\u3068\u3081<\/h2>\n<p>Cloud Run \u3067 <code>gemma3:4b<\/code>, <code>gemma3:12b<\/code> \u306e\u5404\u30e2\u30c7\u30eb\u3068\u3001NVIDIA L4 GPU \u3092\u7d44\u307f\u5408\u308f\u305b\u305f\u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u3066\u30a2\u30af\u30bb\u30b9\u3057\u305f\u5834\u5408\u3001<code>num_ctx<\/code> \u304c\u305d\u308c\u305e\u308c 95934, 28064 \u307e\u3067\u306f GPU \u306e\u307f\u3067\u51e6\u7406\u304c\u3067\u304d\u3066\u3044\u305d\u3046\u3067\u3042\u308b\u3053\u3068\u304c\u78ba\u8a8d\u3067\u304d\u305f\u3002<\/p>\n<p>\u305f\u3060\u3057\u3001\u7d30\u304b\u306a\u539f\u56e0\u306f\u4e0d\u660e\u3067\u3042\u308b\u304c\u3001CPU \u3092\u4f7f\u3044\u59cb\u3081\u308b\u304b\u3089\u3068\u3044\u3063\u3066\u3001\u5fc5\u305a\u3057\u3082\u3059\u3050\u306b\u51e6\u7406\u6642\u9593\u304c\u9577\u304f\u306a\u308b\u308f\u3051\u3067\u306f\u306a\u3044\u3053\u3068\u3082\u308f\u304b\u3063\u305f\u3002<\/p>\n<h2>\u53c2\u8003<\/h2>\n<ul>\n<li><a href=\"https:\/\/tako.nakano.net\/blog\/2025\/08\/cloud-run-gpu-context-part-1\/\">Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e1<\/a><\/li>\n<li><a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">Release Notes 2025-04-07<\/a> Configuring GPU in your Cloud Run service is now generally available (GA). 
<a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Cloud Run GPU + Ollama gemma2 \u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u56f3\u3063\u3066\u307f\u308b<\/a> \u4f50\u85e4\u6167\u592a <a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google \u304c Gemma 3 \u3092\u30ea\u30ea\u30fc\u30b9\uff01-Cloud Run \u3067\u52d5\u304b\u3057\u3066\u307f\u305f-<\/a> \u6751\u677e <a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2<\/a><\/li>\n<\/ul>\n<hr \/>\n<h1>Context Window Limits When Using a GPU on Cloud Run, Part 2<\/h1>\n<h2>Conclusion<\/h2>\n<p>When creating and accessing a service on Cloud Run combining the 4b and 12b models of <a href=\"https:\/\/ollama.com\/library\/gemma3\">Gemma 3<\/a> with an NVIDIA L4 GPU, it is thought that processing can be done only by the GPU up to a <code>num_ctx<\/code> of 95934 and 28064, respectively.<\/p>\n<p>With <code>llama3-gradient:8b<\/code>, based on the log output, it was considered that the range processable only by the GPU is exceeded between 21503 and 22528. Looking only at the log output, it&#8217;s possible that changing the model allowed a larger <code>num_ctx<\/code> to be processed only by the GPU. Also, the measured performance limit was concluded to be &quot;it seems safe to say that the <strong>performance limit is around 20,970<\/strong> or slightly below it.&quot; Despite the 12b model being a larger model, the range that can be processed only by the GPU has become wider, and it is possible that Gemma 3 has an advantage.<\/p>\n<h2>Introduction<\/h2>\n<p>In the previous article <a href=\"https:\/\/tako.nakano.net\/blog\/2025\/08\/cloud-run-gpu-context-part-1\/\">Context window limits when using a GPU with Cloud Run Part 1<\/a>, I introduced the investigation results regarding GPU usage on Cloud Run using <code>llama3-gradient:8b<\/code> as the model.<\/p>\n<p>With <code>llama3-gradient:8b<\/code>, it was thought that:<\/p>\n<ol>\n<li>Based on the log output, it appears that the processing exceeds what the GPU alone can handle somewhere between num_ctx values of 21,503 and 22,528<\/li>\n<li>It seems safe to say that the <strong>performance limit is around 20,970<\/strong> or slightly below it<\/li>\n<\/ol>\n<p>In this article, I will introduce the results of a similar investigation using Gemma 3&#8217;s <code>gemma3:4b<\/code> and <code>gemma3:12b<\/code>.<\/p>\n<h2>Details<\/h2>\n<p>The <code>Dockerfile<\/code> was as follows.<\/p>\n<pre><code class=\"language-Dockerfile:Dockerfile\">FROM ollama\/ollama\nENV OLLAMA_HOST 0.0.0.0:8080\nENV OLLAMA_MODELS \/models\nENV OLLAMA_DEBUG false\nENV OLLAMA_KEEP_ALIVE -1 \nENV MODEL1 gemma3:4b\nENV MODEL2 gemma3:12b\nRUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL1 &amp;&amp; ollama pull $MODEL2\nENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]<\/code><\/pre>\n<p>For the image created this way, a Cloud Run service was created with the following settings.<\/p>\n<ul>\n<li>Startup CPU boost: Enabled<\/li>\n<li>Concurrency: 1<\/li>\n<li>CPU limit: 8<\/li>\n<li>Memory limit: 32 GiB<\/li>\n<li>GPU: 1 NVIDIA L4 (no zonal 
<p>For <code>OLLAMA_NUM_PARALLEL</code>, 1 might have been better, but I kept the same value as last time. Concurrency was set to 1 because I wanted one request per container at a time.</p>
<p>The procedure for varying <code>num_ctx</code> and the code samples are the same as in the previous article, so they are omitted here.</p>
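<p>As a rough reminder of the shape of such a request, here is a minimal sketch (not the actual script from Part 1): it sends one generation request with a given <code>num_ctx</code> to the Ollama API exposed by the service and measures the elapsed time. The service URL, the placeholder prompt, and the omission of Cloud Run request authentication are assumptions for illustration.</p>
<pre><code class="language-python">import os
import time

import requests

# Hypothetical: base URL of the Cloud Run service (request authentication not shown).
SERVICE_URL = os.environ["OLLAMA_SERVICE_URL"]


def timed_generate(num_ctx: int, model: str = "gemma3:4b") -> float:
    """Send one /api/generate request with the given num_ctx and return the seconds taken."""
    payload = {
        "model": model,
        "prompt": "Hello",  # placeholder prompt; Part 1 used its own test input
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    start = time.monotonic()
    resp = requests.post(f"{SERVICE_URL}/api/generate", json=payload, timeout=600)
    resp.raise_for_status()
    return time.monotonic() - start


if __name__ == "__main__":
    for n in (4096, 95934, 95936):
        print(n, f"{timed_generate(n):.2f}s")</code></pre>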
<h3><code>gemma3:4b</code></h3>
<p>With the Gemma 3 models, the startup log normally contains output like this:</p>
<pre><code class="language-text">time=2025-08-11T15:01:46.131Z level=INFO source=sched.go:786 msg="new model will fit in available VRAM in single GPU, loading" model=/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 gpu=GPU-093094b0-2571-df4b-e150-74f8cbee7791 parallel=4 available=23603838976 required="7.4 GiB"
time=2025-08-11T15:01:46.213Z level=INFO source=server.go:135 msg="system memory" total="31.3 GiB" free="30.5 GiB" free_swap="0 B"
time=2025-08-11T15:01:46.215Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=35 layers.offload=35 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.4 GiB" memory.required.partial="7.4 GiB" memory.required.kv="1.7 GiB" memory.required.allocations="[7.4 GiB]" memory.weights.total="2.3 GiB" memory.weights.repeating="1.8 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.6 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T15:01:46.318Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 62048 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 36847"
time=2025-08-11T15:01:46.319Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-08-11T15:01:46.319Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-08-11T15:01:46.336Z level=INFO source=runner.go:925 msg="starting ollama engine"
time=2025-08-11T15:01:46.338Z level=INFO source=runner.go:983 msg="Server listening on 127.0.0.1:36847"
time=2025-08-11T15:01:46.436Z level=INFO source=ggml.go:92 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
time=2025-08-11T15:01:46.571Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /usr/lib/ollama/libggml-cuda.so
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
time=2025-08-11T15:01:47.630Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T15:01:47.745Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.118Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T15:01:48.134Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"</code></pre>
<p>but as <code>num_ctx</code> increases, it would sometimes change as follows.</p>
<p>The following is an example with <code>gemma3:4b</code> and a <code>num_ctx</code> of 220000, which took 102 seconds:</p>
<pre><code class="language-text">time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:365 msg="offloading 22 repeating layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:376 msg="offloaded 22/35 layers to GPU"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.4 GiB"
time=2025-08-11T15:23:37.290Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.2 GiB"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T15:23:37.663Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="9.0 GiB"
time=2025-08-11T15:23:40.687Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="8.0 GiB"</code></pre>
<ol>
<li>The "offloaded 22/35 layers to GPU" part no longer reads 35/35</li>
<li>"offloading output layer to" is followed by CPU instead of GPU</li>
<li>The model weights line with buffer=CPU shows a size other than 525.0 MiB</li>
<li>The second compute graph line with backend=CPU buffer_type=CPU shows a size other than 5.0 MiB</li>
<li>The second compute graph line with backend=CUDA0 buffer_type=CUDA0 shows a size other than 1.1 GiB</li>
</ol>
<p>Those are the changes that started to appear.</p>
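<p>For reference, the signals above can also be checked mechanically once the runner's startup log has been collected (for example from the Cloud Run log viewer). The following is a small sketch under that assumption; the function name and the baseline handling are mine, not taken from the original scripts.</p>
<pre><code class="language-python">import re


def spills_to_cpu(log_text: str, cpu_weights_baseline_mib: float = 525.0) -> bool:
    """Return True if a startup-log excerpt shows work being placed on the CPU.

    The baseline is the nonrepeating-weights size that stays on the CPU even in
    the GPU-only case: 525.0 MiB for gemma3:4b and 787.5 MiB for gemma3:12b above.
    """
    # Signals 1 and 2: not all layers offloaded, or the output layer goes to the CPU.
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_text)
    if m and m.group(1) != m.group(2):
        return True
    if "offloading output layer to CPU" in log_text:
        return True
    # Signal 3: the CPU model-weights buffer grows beyond the baseline.
    m = re.search(r'msg="model weights" buffer=CPU size="([\d.]+) (MiB|GiB)"', log_text)
    if m:
        size_mib = float(m.group(1)) * (1024.0 if m.group(2) == "GiB" else 1.0)
        return size_mib > cpu_weights_baseline_mib
    return False</code></pre>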
<p>Here, I first decided to search for the smallest <code>num_ctx</code> at which either of the following holds:</p>
<ol>
<li>"offloading output layer to" is followed by CPU, or fewer than 35/35 layers are offloaded to the GPU</li>
<li>The model weights buffer=CPU size is larger than 525.0 MiB</li>
</ol>
<p>95934: 11.46 seconds (excerpt only)</p>
<pre><code class="language-text">time=2025-08-11T16:09:20.035Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383736 --batch-size 512 --n-gpu-layers 35 --threads 4 --parallel 4 --port 41987"
time=2025-08-11T16:09:20.228Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:20.301Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:376 msg="offloaded 35/35 layers to GPU"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="525.0 MiB"
time=2025-08-11T16:09:20.302Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="3.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T16:09:20.648Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:20.695Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="5.0 MiB"</code></pre>
<p>95936: 15.71 seconds (excerpt only)</p>
<pre><code class="language-text">time=2025-08-11T16:09:46.928Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --ctx-size 383744 --batch-size 512 --n-gpu-layers 34 --threads 4 --parallel 4 --port 44603"
time=2025-08-11T16:09:47.143Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:365 msg="offloading 34 repeating layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:376 msg="offloaded 34/35 layers to GPU"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="1.8 GiB"
time=2025-08-11T16:09:47.213Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="1.8 GiB"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T16:09:47.565Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="6.6 GiB"
time=2025-08-11T16:09:47.611Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"</code></pre>
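<p>The boundary just shown (95934 still GPU-only, 95936 not) was found by narrowing <code>num_ctx</code> step by step. That kind of search can be sketched as a plain bisection; <code>runs_gpu_only</code> is a hypothetical callable that issues one request at the given <code>num_ctx</code>, fetches the corresponding startup log, and applies a check such as <code>spills_to_cpu</code> from the earlier sketch.</p>
<pre><code class="language-python">from typing import Callable


def max_gpu_only_num_ctx(runs_gpu_only: Callable[[int], bool],
                         lo: int = 2048, hi: int = 262144) -> int:
    """Largest num_ctx between lo and hi for which runs_gpu_only() is still True.

    Assumes lo is known to run GPU-only and hi is known not to; each probe loads
    the model once, so the search costs O(log(hi - lo)) model loads.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if runs_gpu_only(mid):
            lo = mid
        else:
            hi = mid
    return lo</code></pre>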
<h3><code>gemma3:12b</code></h3>
<p>Upper limit for GPU-only processing<br />
28062: 41.2 seconds (excerpt only)<br />
49/49 layers, the output layer is also on the GPU, and model weights buffer=CPU is 787.5 MiB</p>
<pre><code class="language-text">time=2025-08-11T17:03:10.202Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=49 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.1 GiB" memory.required.partial="21.1 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[21.1 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:03:10.296Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112248 --batch-size 512 --n-gpu-layers 49 --threads 4 --parallel 4 --port 42727"
time=2025-08-11T17:03:10.513Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:371 msg="offloading output layer to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:376 msg="offloaded 49/49 layers to GPU"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="787.5 MiB"
time=2025-08-11T17:03:10.580Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="7.6 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="1.1 GiB"
time=2025-08-11T17:03:10.968Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="0 B"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:03:11.023Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="7.5 MiB"</code></pre>
<p>State in which the CPU starts to be used<br />
28064: 23.5 seconds (excerpt only)</p>
<p>This becomes 48/49, the output layer moves to the CPU, and model weights buffer=CPU is 2.3 GiB.</p>
<pre><code class="language-text">time=2025-08-11T17:12:17.380Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=49 layers.offload=48 layers.split="" memory.available="[22.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.0 GiB" memory.required.partial="19.4 GiB" memory.required.kv="8.3 GiB" memory.required.allocations="[19.4 GiB]" memory.weights.total="6.8 GiB" memory.weights.repeating="6.0 GiB" memory.weights.nonrepeating="787.5 MiB" memory.graph.full="3.7 GiB" memory.graph.partial="4.5 GiB" projector.weights="795.9 MiB" projector.graph="1.0 GiB"
time=2025-08-11T17:12:17.477Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/bin/ollama runner --ollama-engine --model /models/blobs/sha256-e8ad13eff07a78d89926e9e8b882317d082ef5bf9768ad7b50fcdbbcd63748de --ctx-size 112256 --batch-size 512 --n-gpu-layers 48 --threads 4 --parallel 4 --port 38065"
time=2025-08-11T17:12:17.705Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:365 msg="offloading 48 repeating layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:369 msg="offloading output layer to CPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:376 msg="offloaded 48/49 layers to GPU"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CPU size="2.3 GiB"
time=2025-08-11T17:12:17.772Z level=INFO source=ggml.go:379 msg="model weights" buffer=CUDA0 size="6.0 GiB"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="0 B"
time=2025-08-11T17:12:18.171Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CUDA0 buffer_type=CUDA0 size="3.7 GiB"
time=2025-08-11T17:12:18.223Z level=INFO source=ggml.go:668 msg="compute graph" backend=CPU buffer_type=CPU size="1.1 GiB"</code></pre>
<p>Although the elapsed times are reversed here (the 28064 run was faster), the logs clearly show the change from GPU-only processing to a state that also starts using the CPU.</p>
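<p>One detail worth noting in these logs: the <code>--ctx-size</code> passed to the runner is consistently <code>num_ctx</code> multiplied by <code>OLLAMA_NUM_PARALLEL</code> (4 in this setup), which would be consistent with Ollama reserving context for every parallel slot. A quick check against the values logged above:</p>
<pre><code class="language-python"># The --ctx-size values logged by the runner versus the num_ctx sent in the request.
OLLAMA_NUM_PARALLEL = 4
observed = {
    95934: 383736,   # gemma3:4b, GPU only
    95936: 383744,   # gemma3:4b, starts using the CPU
    28062: 112248,   # gemma3:12b, GPU only
    28064: 112256,   # gemma3:12b, starts using the CPU
}
for num_ctx, ctx_size in observed.items():
    assert ctx_size == num_ctx * OLLAMA_NUM_PARALLEL
print("ctx-size == num_ctx * OLLAMA_NUM_PARALLEL for all logged runs")</code></pre>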
<h2>Summary</h2>
<p>When a Cloud Run service combining the <code>gemma3:4b</code> or <code>gemma3:12b</code> model with an NVIDIA L4 GPU was created and accessed, processing appeared to be possible on the GPU alone for <code>num_ctx</code> values up to 95934 and 28064, respectively.</p>
<p>However, although the exact cause is unclear, it also turned out that starting to use the CPU does not necessarily make the processing time longer right away.</p>
<h2>References</h2>
<ul>
<li><a href="https://tako.nakano.net/blog/2025/08/cloud-run-gpu-context-part-1/">Context Window Limits When Using a GPU on Cloud Run, Part 1</a></li>
href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">Release Notes 2025-04-07<\/a> Configuring GPU in your Cloud Run service is now generally available (GA). <a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Measuring the performance of Cloud Run GPU + Ollama gemma2<\/a> by Keita Sato <a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google Releases Gemma 3! -Tried Running It on Cloud Run-<\/a> by Muramatsu. <a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e2 English follows Japanese. \u7d50\u8ad6 Cloud Run \u3067 Gemma 3 \u306e 4b, 12b \u306e\u5404\u30e2\u30c7\u30eb [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[26,13],"tags":[],"class_list":["post-614","post","type-post","status-publish","format-standard","hentry","category-cloud-run","category-google-cloud"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4dIdP-9U","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/614","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/comments?post=614"}],"version-history":[{"count":2,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/614\/revisions"}],"predecessor-version":[{"id":616,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/614\/revisions\/616"}],"wp:attachment":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/media?parent=614"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/categories?post=614"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/tags?post=614"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}