{"id":610,"date":"2025-08-11T23:04:00","date_gmt":"2025-08-11T14:04:00","guid":{"rendered":"https:\/\/tako.nakano.net\/blog\/?p=610"},"modified":"2025-08-12T10:56:24","modified_gmt":"2025-08-12T01:56:24","slug":"cloud-run-gpu-context-part-1","status":"publish","type":"post","link":"https:\/\/tako.nakano.net\/blog\/2025\/08\/cloud-run-gpu-context-part-1\/","title":{"rendered":"Context Window Limits When Using a GPU on Cloud Run, Part 1"},"content":{"rendered":"<h1>Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e1<\/h1>\n<p>English follows Japanese.<\/p>\n<h2>\u7d50\u8ad6<\/h2>\n<p>Cloud Run \u3067 <a href=\"https:\/\/ollama.com\/library\/llama3-gradient\"><code>llama3-gradient:8b<\/code><\/a> \u30e2\u30c7\u30eb\u3068\u3001NVIDIA L4 GPU \u3092\u7d44\u307f\u5408\u308f\u305b\u305f\u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u3066\u30a2\u30af\u30bb\u30b9\u3057\u305f\u5834\u5408\u3001<code>num_ctx<\/code> \u304c 15000 \u8fba\u308a\u307e\u3067\u306f\u300110\u79d2\u524d\u5f8c\u3067\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304f\u308b\u3053\u3068\u304c\u591a\u304b\u3063\u305f\u3002<\/p>\n<p>\u3055\u3089\u306b <code>num_ctx<\/code> \u3092\u5909\u5316\u3055\u305b\u305f\u5834\u5408\u300119000 \u8fba\u308a\u304b\u3089\u6027\u80fd\u306e\u52a3\u5316\u304c\u898b\u3089\u308c\u308b\u3053\u3068\u304c\u3042\u308a\u3001\u7c21\u5358\u306a\u8a66\u9a13\u3092\u3057\u305f\u7bc4\u56f2\u306b\u304a\u3044\u3066\u306f\u3001<code>num_ctx<\/code> \u3092 20970 \u4ee5\u4e0a\u306b\u8a2d\u5b9a\u3057\u306620\u79d2\u3092\u5207\u308b\u3053\u3068\u306f\u306a\u304b\u3063\u305f\u3002<\/p>\n<p>20970 \u8fba\u308a\u304b\u3001\u305d\u306e\u5c11\u3057\u624b\u524d\u306b\u6027\u80fd\u9650\u754c\u304c\u3042\u308b\u3068\u8a00\u3063\u3066\u826f\u3055\u305d\u3046\u3067\u3042\u308b\u3002<\/p>\n<p>\u30ed\u30b0\u51fa\u529b\u3092\u5143\u306b\u898b\u308b\u3068\u300121503 \u304b\u3089 22528 \u306e\u9593\u3067 GPU 
\u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b\u3068\u8003\u3048\u3089\u308c\u308b\u3002<\/p>\n<h2>\u306f\u3058\u3081\u306b<\/h2>\n<p>Cloud Run \u3067 GPU \u304c\u4f7f\u3048\u308b\u3068\u3044\u3046\u6a5f\u80fd\u304c\u3042\u308a\u307e\u3059\u30022024-08-21 \u306b\u30d7\u30ec\u30d3\u30e5\u30fc\u63d0\u4f9b\u304c\u958b\u59cb\u3055\u308c\u30012025-04-07 \u306b\u4e00\u822c\u63d0\u4f9b\u304c\u958b\u59cb\u3055\u308c\u307e\u3057\u305f\u30022024\u5e74\u5f8c\u534a\u9803\u304b\u3089\u30d7\u30ec\u30d3\u30e5\u30fc\u306b\u53c2\u52a0\u3057\u8abf\u67fb\u30fb\u30c6\u30b9\u30c8\u3092\u884c\u3063\u3066\u3044\u307e\u3057\u305f\u3002<\/p>\n<p>\u30b5\u30fc\u30d3\u30b9\u3068\u3057\u3066\u306e LLM \u3067\u306f\u306a\u304f\u3001\u81ea\u524d\u3067 LLM \u3092\u30db\u30b9\u30c8\u3059\u308b\u30e1\u30ea\u30c3\u30c8\u306b\u3001\u7279\u5b9a\u306e\u30d0\u30fc\u30b8\u30e7\u30f3\u3092\u4f7f\u3044\u7d9a\u3051\u3089\u308c\u308b\u3053\u3068\u3084\u3001\u30ab\u30b9\u30bf\u30de\u30a4\u30ba\u3067\u304d\u308b\u3053\u3068\u3001\u60c5\u5831\u304c\u6f0f\u308c\u306a\u3044\u3053\u3068\u306a\u3069\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n<p>Cloud Run \u3092\u4f7f\u3046\u3068\u300c\u5fc5\u8981\u306a\u3068\u304d\u306b\u5fc5\u8981\u306a\u3060\u3051\u300d\u30ea\u30bd\u30fc\u30b9\u3092\u4f7f\u3048\u308b\u306e\u3067\u3001\u30b3\u30b9\u30c8\u3092\u6291\u3048\u3064\u3064 LLM \u3092\u30db\u30b9\u30c8\u3067\u304d\u307e\u3059\u3002\u305d\u3053\u3067\u30cd\u30c3\u30af\u306b\u306a\u308b\u306e\u304c\u3001GPU \u306e\u30e1\u30e2\u30ea\u91cf\u3067\u3059\u3002\u30b3\u30fc\u30c9\u30ec\u30d3\u30e5\u30fc\u7b49\u306b\u304a\u3044\u3066\u306f\u3001\u3067\u304d\u308b\u3060\u3051\u5e83\u3044\u7bc4\u56f2\u306e\u30bd\u30fc\u30b9\u30b3\u30fc\u30c9\u3092\u53c2\u7167\u3055\u305b\u306a\u304c\u3089\u56de\u7b54\u3092\u5f97\u305f\u3044\u306e\u3067\u3001context window \u306e\u9577\u3055\u304c\u91cd\u8981\u306b\u306a\u308a\u307e\u3059\u3002\u4e00\u65b9\u3067\u3001GPU 
\u306e\u30e1\u30e2\u30ea\u91cf\u306f\u9650\u3089\u308c\u3066\u3044\u308b\u306e\u3067\u3001context window \u306e\u9577\u3055\u306b\u5236\u9650\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n<p>\u305d\u3082\u305d\u3082\u306e LLM \u306b\u3082 context window \u306e\u9650\u754c\u304c\u3042\u308a\u307e\u3059\u3002\u672c\u8a18\u4e8b\u3067\u306f\u3001\u7279\u5b9a\u306e\u30e2\u30c7\u30eb (<code>llama3-gradient:8b<\/code>) \u3092\u63a1\u7528\u3057\u3001<code>num_ctx<\/code> \u30d1\u30e9\u30e1\u30fc\u30bf\u3092\u5897\u3084\u3057\u3066\u3044\u304f\u3068\u3001\u3042\u308b\u7a0b\u5ea6\u306e\u3068\u3053\u308d\u304b\u3089\uff08\u304a\u305d\u3089\u304f GPU \u3060\u3051\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b\u3068\u3001CPU \u3082\u4f75\u7528\u3055\u308c\u308b\u3088\u3046\u306b\u306a\u308a\uff09\u51e6\u7406\u6642\u9593\u304c\u9577\u304f\u306a\u308b\u3053\u3068\u3092\u78ba\u8a8d\u3057\u307e\u3057\u305f\u3002\u63fa\u3089\u304e\u3082\u5927\u304d\u304b\u3063\u305f\u305f\u3081\u3001\u7d30\u304b\u306a\u30c7\u30fc\u30bf\u306f\u63b2\u8f09\u3057\u307e\u305b\u3093\u304c\u3001\u5927\u4f53\u306e\u50be\u5411\u306b\u3064\u3044\u3066\u7d39\u4ecb\u3057\u307e\u3059\u3002<\/p>\n<p>\u672c\u8a18\u4e8b\u300c\u305d\u306e1\u300d\u3067\u306f <code>llama3-gradient:8b<\/code> \u30e2\u30c7\u30eb\u306b\u3064\u3044\u3066\u306e\u8abf\u67fb\u7d50\u679c\u3092\u7d39\u4ecb\u3057\u307e\u3059\u3002\u6b21\u56de\u306f Gemma 3 \u3092\u6271\u3046\u4e88\u5b9a\u3067\u3059\u3002<\/p>\n<h2>\u8a73\u7d30<\/h2>\n<p><code>Dockerfile<\/code> \u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306b\u3057\u3066\u3044\u307e\u3059\u3002Ollama \u306e\u30a4\u30e1\u30fc\u30b8\u3092\u4f7f\u3044\u3001\u5fc5\u8981\u306a\u30e2\u30c7\u30eb\u3092\u30c0\u30a6\u30f3\u30ed\u30fc\u30c9\u3057\u3066\u304a\u304d\u307e\u3059\u3002\u30b3\u30f3\u30c6\u30ca\u30a4\u30e1\u30fc\u30b8\u306b\u30e2\u30c7\u30eb\u3092\u542b\u3081\u3066\u304a\u3044\u3066\u3044\u307e\u3059\u3002<code>gemma2:9b<\/code> \u3068 <code>codegemma:7b-instruct<\/code> 
\u306f\u3001\u4ed6\u306e\u30c6\u30b9\u30c8\u306e\u305f\u3081\u306b\u5165\u308c\u3066\u304a\u3044\u305f\u3082\u306e\u3067\u3059\u3002<\/p>\n<pre><code class=\"language-Dockerfile:Dockerfile\">FROM ollama\/ollama\n\n# Listen on all interfaces, port 8080\nENV OLLAMA_HOST 0.0.0.0:8080\n\n# Store model weight files in \/models\nENV OLLAMA_MODELS \/models\n\n# Reduce logging verbosity\nENV OLLAMA_DEBUG false\n\n# Never unload model weights from the GPU\nENV OLLAMA_KEEP_ALIVE -1 \n\n# Store the model weights in the container image\nENV MODEL1 gemma2:9b\nENV MODEL2 codegemma:7b-instruct\nENV MODEL3 llama3-gradient:8b\nRUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL1 &amp;&amp; ollama pull $MODEL2 &amp;&amp; ollama pull $MODEL3\n\n# Start Ollama\nENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]<\/code><\/pre>\n<p>\u3053\u3046\u3057\u3066\u4f5c\u6210\u3057\u305f\u30a4\u30e1\u30fc\u30b8\u306b\u5bfe\u3057\u3001\u4ee5\u4e0b\u306e\u8a2d\u5b9a\u3067 Cloud Run \u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u307e\u3057\u305f\u3002<\/p>\n<ul>\n<li>Startup CPU boost: Enabled<\/li>\n<li>Concurrency: 4<\/li>\n<li>CPU limit: 8<\/li>\n<li>Memory limit: 32 GiB<\/li>\n<li>GPU: 1 NVIDIA L4 (no zonal redundancy)<\/li>\n<li><code>OLLAMA_NUM_PARALLEL<\/code>: 4<\/li>\n<\/ul>\n<p><code>OLLAMA_NUM_PARALLEL<\/code> \u306b\u3064\u3044\u3066\u306f1\u306e\u65b9\u304c\u826f\u304b\u3063\u305f\u304b\u3082\u3057\u308c\u307e\u305b\u3093\u304c\u3001\u3059\u3067\u306b\u8a2d\u5b9a\u3057\u3066\u3057\u307e\u3063\u305f\u306e\u3067\u3001\u3053\u308c\u3067\u30c6\u30b9\u30c8\u3092\u884c\u3044\u307e\u3057\u305f\u3002<code>options<\/code> \u306b <code>num_ctx<\/code> \u3092\u4e0e\u3048\u308b\u3068\u3001context window \u3092\u5909\u3048\u3089\u308c\u308b\u30e2\u30c7\u30eb\u304c\u3042\u308a\u307e\u3059\u3002<code>llama3-gradient:8b<\/code> \u306f\u305d\u306e\u3088\u3046\u306a\u30e2\u30c7\u30eb\u306e\u4e00\u3064\u3067\u3059\u3002<\/p>\n<p><a 
href=\"https:\/\/ollama.com\/library\/llama3-gradient\">llama3-gradient<\/a> \u306b\u306f<\/p>\n<blockquote>\n<p>This model extends LLama-3 8B&#8217;s context length from 8k to over 1m tokens.<\/p>\n<\/blockquote>\n<p>\u3068\u66f8\u304b\u308c\u3066\u3044\u307e\u3059\u3002\u300c\u3053\u306e\u8a18\u4e8b\u300d\u306f\u30b3\u30fc\u30c9\u90e8\u5206\u3082\u542b\u3081\u308b\u306815000\u6587\u5b57\u3092\u8d85\u3048\u3066\u3044\u308b\u306e\u3067\u30018k \u30c8\u30fc\u30af\u30f3\u3067\u306f\u304a\u305d\u3089\u304f\u5168\u6587\u3092\u51e6\u7406\u3059\u308b\u306b\u306f\u4e0d\u8db3\u3057\u3066\u3044\u308b\u3067\u3057\u3087\u3046\u30021m \u306b\u306a\u308b\u3068\u3001\u304b\u306a\u308a\u4f59\u88d5\u304c\u3042\u308a\u307e\u3059\u3002<\/p>\n<p>\u4e2d\u5fc3\u3068\u306a\u308b\u30b3\u30fc\u30c9\u306f\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u3082\u306e\u3067\u3059\u3002<code>num_ctx<\/code> \u3092\u6307\u5b9a\u3057\u3066\u30ea\u30af\u30a8\u30b9\u30c8\u3092\u9001\u308a\u307e\u3059\u3002\u7d30\u304b\u3044\u90e8\u5206\u306b\u3064\u3044\u3066\u306f <a href=\"#code\">Code<\/a> \u306e\u90e8\u5206\u306b\u63b2\u8f09\u3057\u3066\u3044\u308b\u306e\u3067\u3001\u305d\u3061\u3089\u3092\u53c2\u7167\u3057\u3066\u304f\u3060\u3055\u3044\u3002<\/p>\n<pre><code class=\"language-python\">data = {\n    &quot;model&quot;: &quot;llama3-gradient:8b&quot;,\n    &quot;prompt&quot;: system + code,\n    &quot;stream&quot;: False,\n    &quot;options&quot;: {\n        &quot;num_ctx&quot;: num_ctx,\n    }\n}\n\nresponse = requests.post(\n    &quot;https:\/\/ service URL .us-central1.run.app\/api\/generate&quot;,\n    data=json.dumps(data),\n    headers={&quot;Content-Type&quot;: 
&quot;application\/json&quot;}\n)<\/code><\/pre>\n<p>\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u306b\u3064\u3044\u3066\u306f\u5b89\u5b9a\u3057\u306a\u3044\u3053\u3068\u3082\u3042\u308b\u306e\u3067\u3059\u304c\u3001\u65e9\u3044\u5834\u5408\u306f10\u79d2\u524d\u5f8c\u3067\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304d\u307e\u3059\u3002\u9045\u3044\u306830\u79d2\u4ee5\u4e0a\u304b\u304b\u308b\u3053\u3068\u3084\u3001300\u79d2\u8fd1\u304f\u306b\u306a\u308b\u3053\u3068\u3082\u3042\u308a\u307e\u3059\u3002\u4f8b\u3048\u3070\u3001<code>num_ctx<\/code> \u304c 20975 \u3067\u3001271\u79d2\u304b\u304b\u3063\u305f\u3053\u3068\u304c\u3042\u308a\u307e\u3057\u305f\u3002<\/p>\n<p>\u65e9\u3044\u3068\u304d\u306b10\u79d2\u524d\u5f8c\u3067\u8fd4\u3063\u3066\u304f\u308b\u306e\u3067\u3042\u308c\u3070\u300120\u79d2\u3088\u308a\u9577\u304f\u304b\u304b\u308b\u5834\u5408\u306f\u300c\u9045\u3044\u300d\u3068\u5224\u65ad\u3059\u308b\u3053\u3068\u306b\u3057\u3066\u307f\u307e\u3057\u305f\u3002\u306a\u304a\u3001\u9014\u4e2d\u3067\u300130\u79d2\u306b\u5909\u66f4\u3057\u3066\u30eb\u30fc\u30d7\u3092\u56de\u3057\u3066\u3044\u308b\u3053\u3068\u3082\u3042\u308a\u307e\u3059\u3002<\/p>\n<p>\u4f50\u85e4\u6167\u592a\u306b\u3088\u308b <a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Cloud Run GPU + Ollama gemma2 \u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u56f3\u3063\u3066\u307f\u308b<\/a> \u3067\u306f\u3001<code>gemma2:9b<\/code> \u306b\u5bfe\u3057\u3066 k6 \u3067\u8ca0\u8377\u30c6\u30b9\u30c8\u3092\u884c\u3063\u305f\u8a18\u4e8b\u3067\u3059\u304c\u3001context window \u306f\u6307\u5b9a\u305b\u305a\u4e26\u5217\u30a2\u30af\u30bb\u30b9\u3092\u884c\u3063\u305f\u969b\u306b<\/p>\n<blockquote>\n<p>response time P95 45s<\/p>\n<\/blockquote>\n<p>\u3068\u3044\u3046\u7d50\u679c\u3092\u5f97\u3066\u3044\u307e\u3057\u305f\u3002\u8a18\u4e8b\u4e2d\u306e\u753b\u50cf\u3092\u898b\u308b\u3068\u3001AVG \u304c 20 s 
\u3068\u3044\u3046\u4f8b\u304c\u898b\u3048\u308b\u306e\u3067\u3001\u300c\u901a\u5e38\u300d\u309220\u79d2\u524d\u5f8c\u3068\u60f3\u5b9a\u3059\u308b\u306e\u306f\u59a5\u5f53\u3060\u3068\u601d\u3044\u307e\u3059\u3002<\/p>\n<p>\u8907\u6570\u56de\u3001\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u3053\u306a\u3044 <code>num_ctx<\/code> \u3082\u3042\u308a\u307e\u3057\u305f\uff08\u4e0d\u601d\u8b70\u306a\u3053\u3068\u306b\u3001\u5358\u7d14\u306a\u7dda\u5f62\u3067\u306f\u306a\u3044\u3088\u3046\u3067\u3057\u305f\uff09\u3002\u4f8b\u3048\u3070\u300120966 \u3067\u8907\u6570\u56de20\u79d2\u3092\u5207\u3063\u3066\u304a\u308a\u30019.9\u79d2\u3067\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304d\u305f\u3053\u3068\u3082\u3042\u308a\u307e\u3057\u305f\u3002<\/p>\n<p>\u3068\u306f\u3044\u3048\u300118750 \u3067\u3082\u30bf\u30a4\u30e0\u30a2\u30a6\u30c8\u3057\u3066\u3057\u307e\u3063\u305f\u3053\u3068\u3082\u3042\u308a\u307e\u3057\u305f\u3002<br \/>\n20970 \u306b\u304a\u3044\u3066\u306f\u30013\u56de\u30a2\u30af\u30bb\u30b9\u3057\u30012\u56de\u306f\u30bf\u30a4\u30e0\u30a2\u30a6\u30c8\u30011\u56de\u306f32\u79d2\u304b\u304b\u308a\u307e\u3057\u305f\u3002<br \/>\n20964 \u306b\u304a\u3044\u3066\u306f\u30016\u56de\u30a2\u30af\u30bb\u30b9\u3057\u30012\u56de\u306f\u30bf\u30a4\u30e0\u30a2\u30a6\u30c8\u30014\u56de\u306f12, 13, 31, 26\u79d2\u304b\u304b\u308a\u307e\u3057\u305f\u3002\u5b89\u5b9a\u3057\u3066\u3044\u307e\u305b\u3093\u3002<\/p>\n<p>20971 \u306731\u79d2\u300120975 \u3067271\u79d2\uff08\u6ce8: \u6570\u5024\u306f\u5408\u3063\u3066\u3044\u307e\u3059\uff09\u300120991 \u306727\u79d2\u300121503 \u306746\u79d2\u300122528 \u306745\u79d2\u304b\u304b\u3063\u305f\u30c7\u30fc\u30bf\u304c\u53d6\u308c\u3066\u3044\u307e\u3059\u300220970 \u3092\u8d85\u3048\u3066\u3044\u3066\u300120\u79d2\u3092\u5207\u3063\u305f\u30c7\u30fc\u30bf\u306f\u3042\u308a\u307e\u305b\u3093\u3067\u3057\u305f\u3002\u3053\u3053\u304b\u3089\u300120970 
\u3042\u305f\u308a\u304c\u9650\u754c\u3060\u3068\u5224\u65ad\u3057\u307e\u3057\u305f\u3002<\/p>\n<p>\u30af\u30e9\u30a6\u30c9\u30a8\u30fc\u30b9\u306e\u6751\u677e\u306b\u3088\u308b <a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google \u304c Gemma 3 \u3092\u30ea\u30ea\u30fc\u30b9\uff01-Cloud Run \u3067\u52d5\u304b\u3057\u3066\u307f\u305f-<\/a> \u3067\u306f\u3001<code>gemma2:27b<\/code>, <code>gemma3:12b<\/code> \u3068\u3001\u672c\u8a18\u4e8b\u3068\u30e2\u30c7\u30eb\u306f\u7570\u306a\u3063\u3066\u3044\u307e\u3059\u304c <code>num_ctx<\/code> \u3092 16384 \u306b\u3057\u3066\u30c6\u30b9\u30c8\u3057\u3066\u7d50\u679c\u3092\u5f97\u3066\u3044\u307e\u3059\u3002\u3088\u3063\u3066\u3001\u3053\u308c\u3089\u306e\u30e2\u30c7\u30eb\u3067\u3042\u3063\u3066\u3082\u3001\u30c7\u30d5\u30a9\u30eb\u30c8\u306e300\u79d2\u3042\u308b\u3044\u306f\u3001\u305d\u308c\u3092\u3082\u3046\u5c11\u3057\u4f38\u3070\u3057\u305f\u6642\u9593\u3067\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304d\u3066\u3044\u308b\u3053\u3068\u304c\u793a\u5506\u3055\u308c\u307e\u3059\u3002<\/p>\n<h3>offload&#8230; to GPU<\/h3>\n<p>\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304f\u308b\u6642\u9593\u3068\u306f\u5225\u306b\u3001\u30ed\u30b0\u306b\u4ee5\u4e0b\u306e\u3088\u3046\u306a\u51fa\u529b\u304c\u898b\u3089\u308c\u307e\u3059\u3002<\/p>\n<pre><code class=\"language-text\">llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)\nllm_load_tensors: ggml ctx size = 0.27 MiB\nllm_load_tensors: offloading 32 repeating layers to GPU\nllm_load_tensors: offloading non-repeating layers to GPU\nllm_load_tensors: offloaded 33\/33 layers to GPU\nllm_load_tensors: CPU buffer size = 281.81 MiB\nllm_load_tensors: CUDA0 buffer size = 4155.99 MiB\nllama_kv_cache_init: CUDA0 KV buffer size = 10752.00 MiB\nllama_new_context_with_model: KV self size = 10752.00 MiB, K (f16): 5376.00 MiB, V (f16): 5376.00 MiB\nllama_new_context_with_model: CUDA_Host output buffer size = 2.02 
MiB\nllama_new_context_with_model: CUDA0 compute buffer size = 5576.00 MiB\nllama_new_context_with_model: CUDA_Host compute buffer size = 176.01 MiB\nllama_new_context_with_model: graph nodes = 1030<\/code><\/pre>\n<p>\u4ee5\u4e0b\u306e\u30ed\u30b0\u3067 &quot;offloaded 32\/33 layers to GPU&quot; \u3092\u898b\u308b\u3068\u30011\u30ec\u30a4\u30e4\u306f GPU <strong>\u3067\u306f\u306a\u3044<\/strong> \u3053\u3068\u304c\u793a\u5506\u3055\u308c\u307e\u3059\u3002\u5b9f\u969b\u3001&quot;offloading non-repeating layers to GPU&quot; \u3068\u3044\u3046\u884c\u304c\u306a\u304f\u306a\u3063\u3066\u3044\u307e\u3059\u3002<\/p>\n<pre><code class=\"language-text\">llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)\nllm_load_tensors: ggml ctx size = 0.27 MiB\nllm_load_tensors: offloading 32 repeating layers to GPU\nllm_load_tensors: offloaded 32\/33 layers to GPU\nllm_load_tensors: CPU buffer size = 4437.80 MiB\nllm_load_tensors: CUDA0 buffer size = 3745.00 MiB\nllama_kv_cache_init: CUDA0 KV buffer size = 11264.00 MiB\nllama_new_context_with_model: KV self size = 11264.00 MiB, K (f16): 5632.00 MiB, V (f16): 5632.00 MiB\nllama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB\nllama_new_context_with_model: CUDA0 compute buffer size = 5840.00 MiB\nllama_new_context_with_model: CUDA_Host compute buffer size = 184.01 MiB\nllama_new_context_with_model: graph nodes = 1030<\/code><\/pre>\n<p>1\u3064\u76ee\u306f\u3001<code>num_ctx<\/code> \u304c 21503 \u306e\u5834\u5408\u3067\u30012\u3064\u76ee\u306f <code>num_ctx<\/code> \u304c 22528 \u306e\u5834\u5408\u3067\u3059\u3002<br \/>\n\u3053\u3061\u3089\u304b\u898b\u308b\u3068\u3001\u6982\u306d 22000 \u7a0b\u5ea6\u306b\u306a\u308b\u3068\u3001GPU \u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b\u3068\u8003\u3048\u3089\u308c\u307e\u3059\u3002<\/p>\n<h2>\u307e\u3068\u3081<\/h2>\n<p>Cloud Run \u3067 <code>llama3-gradient:8b<\/code> \u30e2\u30c7\u30eb\u3068\u3001NVIDIA L4 GPU 
\u3092\u7d44\u307f\u5408\u308f\u305b\u305f\u30b5\u30fc\u30d3\u30b9\u3092\u4f5c\u6210\u3057\u3066\u30a2\u30af\u30bb\u30b9\u3057\u305f\u5834\u5408\u3001<code>num_ctx<\/code> \u304c 15000 \u8fba\u308a\u307e\u3067\u306f\u300110\u79d2\u524d\u5f8c\u3067\u30ec\u30b9\u30dd\u30f3\u30b9\u304c\u8fd4\u3063\u3066\u304f\u308b\u3053\u3068\u304c\u591a\u304b\u3063\u305f\u3067\u3059\u3002<\/p>\n<p>\u3055\u3089\u306b <code>num_ctx<\/code> \u3092\u5909\u5316\u3055\u305b\u305f\u5834\u5408\u300119000 \u8fba\u308a\u304b\u3089\u6027\u80fd\u306e\u52a3\u5316\u304c\u898b\u3089\u308c\u308b\u3053\u3068\u304c\u3042\u308a\u3001\u7c21\u5358\u306a\u8a66\u9a13\u3092\u3057\u305f\u7bc4\u56f2\u306b\u304a\u3044\u3066\u306f\u3001<code>num_ctx<\/code> \u3092 20970 \u4ee5\u4e0a\u306b\u8a2d\u5b9a\u3057\u306620\u79d2\u3092\u5207\u308b\u3053\u3068\u306f\u3042\u308a\u307e\u305b\u3093\u3067\u3057\u305f\u3002<\/p>\n<p>20970 \u8fba\u308a\u304b\u3001\u305d\u306e\u5c11\u3057\u624b\u524d\u306b\u6027\u80fd\u9650\u754c\u304c\u3042\u308b\u3068\u8a00\u3063\u3066\u826f\u3055\u305d\u3046\u3067\u3059\u3002<\/p>\n<p>\u307e\u305f\u3001\u30ed\u30b0\u51fa\u529b\u3092\u5143\u306b\u898b\u308b\u3068\u300121503 \u304b\u3089 22528 \u306e\u9593\u3067 GPU \u306e\u307f\u3067\u51e6\u7406\u3067\u304d\u308b\u7bc4\u56f2\u3092\u8d85\u3048\u308b\u3068\u8003\u3048\u3089\u308c\u307e\u3059\u3002<\/p>\n<p>\u305d\u3082\u305d\u3082\u51fa\u529b\u306b\u30e9\u30f3\u30c0\u30e0\u6027\u304c\u3042\u308b\u305f\u3081\u304b\u3001\u7d50\u679c\u306f\u5b89\u5b9a\u3057\u307e\u305b\u3093\u3002\u300c\u6027\u80fd\u306e\u52a3\u5316\u304c\u898b\u3089\u308c\u308b\u3053\u3068\u304c\u3042\u308a\u300d\u3068\u3044\u3046\u66d6\u6627\u3092\u3057\u305f\u306e\u306f\u305d\u308c\u304c\u7406\u7531\u3067\u3059\u3002<\/p>\n<h2>\u53c2\u8003<\/h2>\n<ul>\n<li><a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024\">Release Notes 2024-08-21<\/a> You can now configure GPU in your Cloud Run service (Preview). 
<a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024\">https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024<\/a><\/li>\n<li><a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">Release Notes 2025-04-07<\/a> Configuring GPU in your Cloud Run service is now generally available (GA). <a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Cloud Run GPU + Ollama gemma2 \u306e\u30d1\u30d5\u30a9\u30fc\u30de\u30f3\u30b9\u3092\u56f3\u3063\u3066\u307f\u308b<\/a> \u4f50\u85e4\u6167\u592a <a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google \u304c Gemma 3 \u3092\u30ea\u30ea\u30fc\u30b9\uff01-Cloud Run \u3067\u52d5\u304b\u3057\u3066\u307f\u305f-<\/a> \u6751\u677e <a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2<\/a><\/li>\n<\/ul>\n<h2>Code<\/h2>\n<pre><code class=\"language-python:create_request_loop.py\">import json\nimport requests\nimport datetime\nimport time\n\nsystem=\"\"\"\u3042\u306a\u305f\u306f\u6975\u3081\u3066\u512a\u79c0\u306a\u30b3\u30fc\u30c9\u30ec\u30d3\u30e5\u30fc\u30a2\u3067\u3059\u3002\u30d7\u30ed\u30b0\u30e9\u30df\u30f3\u30b0\u306e\u30b3\u30fc\u30c9\u3092\u30ec\u30d3\u30e5\u30fc\u3059\u308b\u305f\u3081\u306b\u8abf\u6574\u3055\u308c\u305f LLM 
\u3067\u3059\u3002\n\u30b3\u30fc\u30c9\u306b\u5bfe\u3057\u3066\u3001\u5efa\u8a2d\u7684\u304b\u3064\u7684\u78ba\u306a\u30d5\u30a3\u30fc\u30c9\u30d0\u30c3\u30af\u3092\u4e0e\u3048\u3066\u304f\u3060\u3055\u3044\u3002\u307e\u305f\u3001\u610f\u5473\u306e\u3042\u308b\u63d0\u6848\u3092\u3057\u3066\u304f\u3060\u3055\u3044\u3002\u3044\u304f\u3064\u304b\u306e\u30d5\u30a1\u30a4\u30eb\u304c\u3042\u308b\u5834\u5408\u3001\u30d5\u30a1\u30a4\u30eb\u6bce\u306b\u30b3\u30e1\u30f3\u30c8\u3057\u3066\u304f\u3060\u3055\u3044\u3002\n\nCode suggestions guidelines:\n- Provide code suggestions. Try to provide diverse and insightful suggestions.\n- Focus on important suggestions like fixing code problems, issues and bugs. As a second priority, provide suggestions for meaningful code improvements, like performance, vulnerability, modularity, and best practices.\n- Avoid making suggestions that have already been implemented in the code. For example, if you want to add logs, or change a variable to const, or anything else, make sure it isn't already in the code.\n- Don't suggest to add docstring, type hints, or comments.\n\nExtra instructions from the user, that should be taken into account with high priority:\n======\nPlease use Japanese in descriptions.\n\u300c\u65e5\u672c\u8a9e\u300d\u3067\u30b3\u30e1\u30f3\u30c8\u3057\u3066\u307b\u3057\u3044\u3067\u3059\u3002\n======\n\nExample output:\n```yaml\nreview:\n  relevant_tests: |\n    No\n  key_issues_to_review:\n    - relevant_file: |\n        directory\/xxx.py\n      issue_header: |\n        Possible Bug\n      issue_content: |\n        ...\n      start_line: 12\n      end_line: 14\n    - ...\n  security_concerns: |\n    No\n\ncode_feedback:\n- relevant_file: |\n    directory\/xxx.py\n  language: |\n    python\n  suggestion: |\n    xxx [important]\n  relevant_line: |\n    xxx&lt;\/code&gt;&lt;\/pre&gt;\n&lt;p&gt;Answer should be a valid YAML, and nothing else. 
Each YAML output MUST be after a newline, with proper indent, and block scalar indicator (&#039;|&#039;)&lt;\/p&gt;\n&lt;p&gt;&quot;&quot;&quot;&lt;\/p&gt;\n&lt;p&gt;code=&quot;&quot;&quot;&lt;\/p&gt;\n&lt;pre&gt;&lt;code class=&quot;language-p009.cc&quot;&gt;#include &lt;algorithm&gt;\n#include &lt;bitset&gt;\n#include &lt;complex&gt;\n#include &lt;deque&gt;\n#include &lt;exception&gt;\n#include &lt;fstream&gt;\n#include &lt;functional&gt;\n#include &lt;iomanip&gt;\n#include &lt;ios&gt;\n#include &lt;iosfwd&gt;\n#include &lt;iostream&gt;\n#include &lt;istream&gt;\n#include &lt;iterator&gt;\n#include &lt;limits&gt;\n#include &lt;list&gt;\n#include &lt;locale&gt;\n#include &lt;map&gt;\n#include &lt;memory&gt;\n#include &lt;new&gt;\n#include &lt;numeric&gt;\n#include &lt;ostream&gt;\n#include &lt;queue&gt;\n#include &lt;set&gt;\n#include &lt;sstream&gt;\n#include &lt;stack&gt;\n#include &lt;stdexcept&gt;\n#include &lt;streambuf&gt;\n#include &lt;string&gt;\n#include &lt;typeinfo&gt;\n#include &lt;utility&gt;\n#include &lt;valarray&gt;\n#include &lt;vector&gt;\n\n#if __cplusplus &gt;= 201103L\n#include &lt;array&gt;\n#include &lt;atomic&gt;\n#include &lt;chrono&gt;\n#include &lt;condition_variable&gt;\n#include &lt;forward_list&gt;\n#include &lt;future&gt;\n#include &lt;initializer_list&gt;\n#include &lt;mutex&gt;\n#include &lt;random&gt;\n#include &lt;ratio&gt;\n#include &lt;regex&gt;\n#include &lt;scoped_allocator&gt;\n#include &lt;system_error&gt;\n#include &lt;thread&gt;\n#include &lt;tuple&gt;\n#include &lt;typeindex&gt;\n#include &lt;type_traits&gt;\n#include &lt;unordered_map&gt;\n#include &lt;unordered_set&gt;\n#endif\n\nusing namespace std;\n\nusing ll = long long;\nusing ld = long double;\nconst ll MOD1 = 1e9+7;\nconst ll MOD9 = 998244353;\nconst ll INF = 1e18;\nusing P = pair&lt;ll, ll&gt;;\ntemplate&lt;typename T&gt; using PQ = priority_queue&lt;T&gt;;\ntemplate&lt;typename T&gt; using QP = 
priority_queue&lt;T,vector&lt;T&gt;,greater&lt;T&gt;&gt;;\n\nstruct p037 {\n  string run(ll N, ll S, vector&lt;ll&gt; &amp;An) {\n    vector&lt;bool&gt; dp(S+1, false);\n\n    dp[0] = true;\n\n    for (ll i=0; i&lt;N; ++i) {\n      for (ll j=S; j&gt;=An[i]; --j) {\n        if (dp[j-An[i]]) {\n          dp[j] = true;\n        }\n      }\n    }\n\n    if (dp[S]) {\n      return &quot;Yes&quot;;\n    }\n\n    return &quot;No&quot;;\n  }\n};\n\nint main(){\n  cin.tie(nullptr);\n  ios_base::sync_with_stdio(false);\n\n  p037 solver;\n  ll N, S;\n  cin &gt;&gt; N &gt;&gt; S;\n  vector&lt;ll&gt; An(N);\n  for(ll i=0; i&lt;N; ++i) {\n    cin &gt;&gt; An[i];\n  }\n\n  cout&lt;&lt;solver.run(N, S, An)&lt;&lt;endl;\n\n  return 0;\n}&lt;\/code&gt;&lt;\/pre&gt;\n&lt;pre&gt;&lt;code class=&quot;language-p026.cc&quot;&gt;#include &lt;algorithm&gt;\n#include &lt;bitset&gt;\n#include &lt;complex&gt;\n#include &lt;deque&gt;\n#include &lt;exception&gt;\n#include &lt;fstream&gt;\n#include &lt;functional&gt;\n#include &lt;iomanip&gt;\n#include &lt;ios&gt;\n#include &lt;iosfwd&gt;\n#include &lt;iostream&gt;\n#include &lt;istream&gt;\n#include &lt;iterator&gt;\n#include &lt;limits&gt;\n#include &lt;list&gt;\n#include &lt;locale&gt;\n#include &lt;map&gt;\n#include &lt;memory&gt;\n#include &lt;new&gt;\n#include &lt;numeric&gt;\n#include &lt;ostream&gt;\n#include &lt;queue&gt;\n#include &lt;set&gt;\n#include &lt;sstream&gt;\n#include &lt;stack&gt;\n#include &lt;stdexcept&gt;\n#include &lt;streambuf&gt;\n#include &lt;string&gt;\n#include &lt;typeinfo&gt;\n#include &lt;utility&gt;\n#include &lt;valarray&gt;\n#include &lt;vector&gt;\n\n#if __cplusplus &gt;= 201103L\n#include &lt;array&gt;\n#include &lt;atomic&gt;\n#include &lt;chrono&gt;\n#include &lt;condition_variable&gt;\n#include &lt;forward_list&gt;\n#include &lt;future&gt;\n#include &lt;initializer_list&gt;\n#include &lt;mutex&gt;\n#include &lt;random&gt;\n#include &lt;ratio&gt;\n#include &lt;regex&gt;\n#include 
&lt;scoped_allocator&gt;\n#include &lt;system_error&gt;\n#include &lt;thread&gt;\n#include &lt;tuple&gt;\n#include &lt;typeindex&gt;\n#include &lt;type_traits&gt;\n#include &lt;unordered_map&gt;\n#include &lt;unordered_set&gt;\n#endif\n\nusing namespace std;\n\nusing ll = long long;\nusing ld = long double;\nconst ll MOD1 = 1e9+7;\nconst ll MOD9 = 998244353;\nconst ll INF = 1e18;\nusing P = pair&lt;ll, ll&gt;;\ntemplate&lt;typename T&gt; using PQ = priority_queue&lt;T&gt;;\ntemplate&lt;typename T&gt; using QP = priority_queue&lt;T,vector&lt;T&gt;,greater&lt;T&gt;&gt;;\n\nstruct p037 {\n  ld run(ll N) {\n    ld ans = 0;\n    \/\/ \u3042\u308b\u30b3\u30a4\u30f3\u306b\u3064\u3044\u3066\u3001\u51fa\u308b\u78ba\u7387\u304c 1\/N \u3067\u3001\u51fa\u306a\u3044\u78ba\u7387\u304c (N-1)\/N \u3067\u3042\u308b\u3002\n    \/\/ \u3053\u306e\u30b3\u30a4\u30f3\u304c\u51fa\u308b\u307e\u3067\u306e\u8a66\u884c\u56de\u6570\u306e\u671f\u5f85\u5024\u306f\u30011\/N * (1 + 2 * ((N-1)\/N) + 3 * ((N-1)\/N)^2 + ...) \u3067\u3042\u308b\u3002\n    \/\/ f(x) = x + x^2 + x^3 + ... (0 &lt; x &lt; 1) \u3068\u3059\u308b\u3068\u3001f(x) = x\/(1-x) \u3067\u3042\u308b\u3002\n    \/\/ f&amp;#039;(x) = 1\/(1-x)^2 \u3067\u3042\u308b\u3002\n    \/\/ f&amp;#039;(x) = 1 + 2x + 3x^2 + ... 
\u3067\u3082\u3042\u308b\u3002\n    \/\/ f&amp;#039;((N-1)\/N) = 1 + 2 * ((N-1)\/N) + 3 * ((N-1)\/N)^2 + ...\n    \/\/ f&amp;#039;((N-1)\/N) = 1\/(1-(N-1)\/N)^2 = N^2\n\n    \/\/ return N * N\n\n    \/\/ 1\u7a2e\u985e\u76ee\u306f\u5fc5\u305a1\u56de\u3067\u51fa\u308b\n    \/\/ 2\u7a2e\u985e\u76ee\u304c\u51fa\u308b\u307e\u3067\u306e\u8a66\u884c\u56de\u6570\u306e\u671f\u5f85\u5024\u306f\u3001k \u56de1\u7a2e\u985e\u304c\u51fa\u7d9a\u3051\u3001\u305d\u306e\u6b21\u306b\u5225\u306e\u7a2e\u985e\uff082\u7a2e\u985e\u76ee\uff09\u304c\u51fa\u308b\u78ba\u7387\u3092\u5229\u7528\u3057\u3066\u8a08\u7b97\u3059\u308b\u3002\n    \/\/ f&amp;#039;( 1\/N ) = N^2 \/ (N-1)^2\n    \/\/ \u3088\u308a\u3001(N-1)\/N * N^2 \/ (N-1)^2 = N \/ (N-1)\n    \/\/ 3\u7a2e\u985e\u76ee\u304c\u51fa\u308b\u307e\u3067\u306e\u8a66\u884c\u56de\u6570\u306e\u671f\u5f85\u5024\u306f\u3001k \u56de2\u7a2e\u985e\u306e\u3069\u3061\u3089\u304b\u304c\u51fa\u7d9a\u3051\u3001\u305d\u306e\u6b21\u306b\u305d\u308c\u4ee5\u5916\u306e\u3044\u305a\u308c\u304b\uff083\u7a2e\u985e\u76ee\uff09\u304c\u51fa\u308b\u78ba\u7387\u3092\u5229\u7528\u3057\u3066\u8a08\u7b97\u3059\u308b\u3002\n    \/\/ (N-2)\/N * f&amp;#039;( 2\/N ) = N\/(N-2)\n    \/\/ N-1\u7a2e\u985e\u63c3\u3063\u305f\u5f8c\u3067\u3001N\u7a2e\u985e\u76ee\u304c\u51fa\u308b\u307e\u3067\u306e\u8a66\u884c\u56de\u6570\u306e\u671f\u5f85\u5024\u306f\u3001k \u56deN-1\u7a2e\u985e\u306e\u3069\u308c\u304b\u304c\u51fa\u7d9a\u3051\u3001\u305d\u306e\u6b21\u306b\u305d\u306eN\u7a2e\u985e\u76ee\u304c\u51fa\u308b\u78ba\u7387\u3092\u5229\u7528\u3057\u3066\u8a08\u7b97\u3059\u308b\u3002\n    \/\/ 1\/N * f&amp;#039;( (N-1)\/N ) = N\n    \/\/ \u3088\u308a\u3001N\/N + N\/(N-1) + N\/(N-2) + ... + N\/2 + N\/1 = N * (1 + 1\/2 + 1\/3 + ... 
+ 1\/N)\n    for (ll i = N; i &gt;= 1; i--) {\n      ans += 1.0 * N \/ (ld)i;\n    }\n    return ans;\n    }\n};\n\nint main(){\n  cin.tie(nullptr);\n  ios_base::sync_with_stdio(false);\n\n  p037 solver;\n  ll N;\n  cin &gt;&gt; N;\n\n  cout&lt;&lt;solver.run(N)&lt;&lt;endl;\n\n  return 0;\n}&lt;\/code&gt;&lt;\/pre&gt;\n&lt;pre&gt;&lt;code class=&quot;language-p037.cc&quot;&gt;#include &lt;algorithm&gt;\n#include &lt;bitset&gt;\n#include &lt;complex&gt;\n#include &lt;deque&gt;\n#include &lt;exception&gt;\n#include &lt;fstream&gt;\n#include &lt;functional&gt;\n#include &lt;iomanip&gt;\n#include &lt;ios&gt;\n#include &lt;iosfwd&gt;\n#include &lt;iostream&gt;\n#include &lt;istream&gt;\n#include &lt;iterator&gt;\n#include &lt;limits&gt;\n#include &lt;list&gt;\n#include &lt;locale&gt;\n#include &lt;map&gt;\n#include &lt;memory&gt;\n#include &lt;new&gt;\n#include &lt;numeric&gt;\n#include &lt;ostream&gt;\n#include &lt;queue&gt;\n#include &lt;set&gt;\n#include &lt;sstream&gt;\n#include &lt;stack&gt;\n#include &lt;stdexcept&gt;\n#include &lt;streambuf&gt;\n#include &lt;string&gt;\n#include &lt;typeinfo&gt;\n#include &lt;utility&gt;\n#include &lt;valarray&gt;\n#include &lt;vector&gt;\n\n#if __cplusplus &gt;= 201103L\n#include &lt;array&gt;\n#include &lt;atomic&gt;\n#include &lt;chrono&gt;\n#include &lt;condition_variable&gt;\n#include &lt;forward_list&gt;\n#include &lt;future&gt;\n#include &lt;initializer_list&gt;\n#include &lt;mutex&gt;\n#include &lt;random&gt;\n#include &lt;ratio&gt;\n#include &lt;regex&gt;\n#include &lt;scoped_allocator&gt;\n#include &lt;system_error&gt;\n#include &lt;thread&gt;\n#include &lt;tuple&gt;\n#include &lt;typeindex&gt;\n#include &lt;type_traits&gt;\n#include &lt;unordered_map&gt;\n#include &lt;unordered_set&gt;\n#endif\n\nusing namespace std;\n\nusing ll = long long;\nusing ld = long double;\nconst ll MOD1 = 1e9+7;\nconst ll MOD9 = 998244353;\nconst ll INF = 1e18;\nusing P = pair&lt;ll, ll&gt;;\ntemplate&lt;typename T&gt; using PQ = 
priority_queue&lt;T&gt;;\ntemplate&lt;typename T&gt; using QP = priority_queue&lt;T,vector&lt;T&gt;,greater&lt;T&gt;&gt;;\n\nstruct p037 {\n  string run(ld x1, ld y1, ld x2, ld y2, ld x3, ld y3, ld x4, ld y4) {\n    ld tcross1 = (x1-x2)*(y3-y1) + (y1-y2)*(x1-x3);\n    ld dcross1 = (x1-x2)*(y4-y1) + (y1-y2)*(x1-x4);\n    ld tcross2 = (x3-x4)*(y1-y3) + (y3-y4)*(x3-x1);\n    ld dcross2 = (x3-x4)*(y2-y3) + (y3-y4)*(x3-x2);\n\n    \/\/ \u5fd8\u308c\u3066\u3044\u305f\n    \/\/ \u30b3\u30fc\u30ca\u30fc\u30b1\u30fc\u30b9\n    \/\/ \u70b9\u306b\u9806\u5e8f\u3092\u5165\u308c\u305f\u65b9\u304c\u826f\u3044\n    if (tcross1 == 0 &amp;&amp; dcross1 == 0 &amp;&amp; tcross2 == 0 &amp;&amp; dcross2 == 0) {\n      if (x1 &gt; x2 || (x1 == x2 &amp;&amp; y1 &gt; y2)) {\n        swap(x1, x2);\n        swap(y1, y2);\n      }\n      if (x3 &gt; x4 || (x3 == x4 &amp;&amp; y3 &gt; y4)) {\n        swap(x3, x4);\n        swap(y3, y4);\n      }\n      ld lx, ly, rx, ry;\n      if (x1 == x3) {\n        lx = x1;\n        ly = max(y1, y3);\n      } else if (x1 &lt; x3) {\n        lx = x3;\n        ly = y3;\n      } else {\n        lx = x1;\n        ly = y1;\n      }\n      if (x2 == x4) {\n        rx = x2;\n        ry = min(y2, y4);\n      } else if (x2 &lt; x4) {\n        rx = x2;\n        ry = y2;\n      } else {\n        rx = x4;\n        ry = y4;\n      }\n      if (lx &lt; rx || (lx == rx &amp;&amp; ly &lt; ry)) {\n        return &quot;Yes&quot;;\n      }\n      return &quot;No&quot;;\n    }\n\n    \/\/ P1P2, P3P4 \u306e\u4ea4\u5dee\u5224\u5b9a\n    \/\/ \u76f4\u7ddaP1P2\u306b\u5bfe\u3057\u3066P3P4\u306e\u4e21\u7aef\u304c\u7570\u306a\u308b\u5074\u306b\u3042\u308b\u304b\u3069\u3046\u304b\u3092\u5224\u5b9a\n    if (tcross1*dcross1 &lt; 0) {\n      \/\/ P3P4, P1P2 \u306e\u4ea4\u5dee\u5224\u5b9a\n      \/\/ \u76f4\u7ddaP3P4\u306b\u5bfe\u3057\u3066P1P2\u306e\u4e21\u7aef\u304c\u7570\u306a\u308b\u5074\u306b\u3042\u308b\u304b\u3069\u3046\u304b\u3092\u5224\u5b9a\n      if (tcross2*dcross2 &lt; 0) 
return &quot;Yes&quot;;\n    }\n\n    return &quot;No&quot;;\n    }\n};\n\nint main(){\n  cin.tie(nullptr);\n  ios_base::sync_with_stdio(false);\n\n  p037 solver;\n  ld x1, x2, x3, x4, y1, y2, y3, y4;\n  cin &gt;&gt; x1 &gt;&gt; y1 &gt;&gt; x2 &gt;&gt; y2 &gt;&gt; x3 &gt;&gt; y3 &gt;&gt; x4 &gt;&gt; y4;\n\n  cout&lt;&lt;solver.run(x1, y1, x2, y2, x3, y3, x4, y4)&lt;&lt;endl;\n\n  return 0;\n}&lt;\/code&gt;&lt;\/pre&gt;\n&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;&quot;&quot;&quot;\n\ndef send_request(num_ctx):\n    data = {\n        &quot;model&quot;: &quot;llama3-gradient:8b&quot;,\n        &quot;prompt&quot;: system + code,\n        &quot;stream&quot;: False,\n        &quot;options&quot;: {\n            &quot;num_ctx&quot;: num_ctx,\n        }\n    }\n    now = datetime.datetime.now()\n    response = requests.post(\n        &quot;https:\/\/ service URL .us-central1.run.app\/api\/generate&quot;,\n        data=json.dumps(data),\n        headers={&quot;Content-Type&quot;: &quot;application\/json&quot;}\n    )\n    print(f&quot;num_ctx={num_ctx}, status_code={response.status_code}&quot;)\n    try:\n        res_json = response.json()\n    except Exception as e:\n        print(f&quot;num_ctx={num_ctx}, failed to decode response JSON: {e}&quot;)\n        return None, None, None\n    duration = res_json.get(&quot;total_duration&quot;, None)\n    # save the log output as-is\n    with open(f&quot;output_{num_ctx}_&quot; + now.strftime(&#039;%Y-%m-%d_%H%M%S&#039;) + &quot;.json&quot;, &quot;w&quot;) as data_file:\n        json.dump(res_json, data_file, indent=2)\n    with open(f&quot;output_{num_ctx}_&quot; + now.strftime(&#039;%Y-%m-%d_%H%M%S&#039;) + &quot;_response.txt&quot;, &quot;w&quot;) as data_file:\n        data_file.write(res_json.get(&quot;response&quot;, &quot;&quot;))\n    return duration, response.status_code, res_json\n\ndef is_gpu_only(duration_ns):\n    # 30 s = 30_000_000_000 ns\n    if duration_ns is None:\n        return None\n    return duration_ns &lt;= 30_000_000_000\n\ndef binary_search_num_ctx(min_ctx, max_ctx):\n    left = min_ctx\n    right = max_ctx\n    result = None\n    while left &lt;= right:\n        # mid = (left + right) \/\/ 2\n        # probe even values only\n        mid = ((left + right) \/\/ 2) \/\/ 2 * 2\n        print(f&quot;\\n=== sending request with num_ctx={mid} ===&quot;)\n        duration, status_code, res_json = send_request(mid)\n        if duration is None:\n            print(f&quot;num_ctx={mid}: failed to get a response, aborting.&quot;)\n            break\n        print(f&quot;\\ntook {duration\/1_000_000_000} s&quot;)\n        print(f&quot;num_ctx={mid}, total_duration={duration} ns, status_code={status_code}&quot;)\n        if is_gpu_only(duration):\n            print(f&quot;num_ctx={mid}: handled by the GPU alone (30 s or less) -&gt; increase num_ctx&quot;)\n            result = mid\n            left = mid + 2\n        else:\n            print(f&quot;num_ctx={mid}: CPU assisted (over 30 s) -&gt; decrease num_ctx&quot;)\n            right = mid - 2\n        time.sleep(15)  # wait 15 seconds to reduce server load\n    print(f&quot;\\nlargest num_ctx handled by the GPU alone: {result}&quot;)\n    return result\n\nif __name__ == &quot;__main__&quot;:\n    # example: search between 20480 and 24576\n    min_ctx = 20480\n    max_ctx = 24576\n    binary_search_num_ctx(min_ctx, max_ctx)\n&lt;\/code&gt;&lt;\/pre&gt;<\/code><\/pre>\n<hr \/>\n<h1>Context Window Limits When Using a GPU on Cloud Run, Part 1<\/h1>\n<h2>Conclusion<\/h2>\n<p>When accessing a service created on Cloud Run with the <a href=\"https:\/\/ollama.com\/library\/llama3-gradient\"><code>llama3-gradient:8b<\/code><\/a> model and an NVIDIA L4 GPU, responses often returned in about <strong>10 seconds<\/strong> for <code>num_ctx<\/code> values up to around <strong>15,000<\/strong>.<\/p>\n<p>When further increasing <code>num_ctx<\/code>, performance degradation could be observed starting around <strong>19,000<\/strong>. In the scope of these simple tests, setting <code>num_ctx<\/code> to <strong>20,970 or higher<\/strong> never resulted in a response time under 20 seconds.<\/p>\n<p>It seems safe to say that the <strong>performance limit is around 20,970<\/strong> or slightly below it.<\/p>\n<p>Based on the log output, it appears that the processing exceeds what the GPU alone can handle somewhere between <code>num_ctx<\/code> values of 21,503 and 22,528.<\/p>\n<h2>Introduction<\/h2>\n<p>Cloud Run has a feature that allows you to use GPUs. It became available in Preview on 2024-08-21 and became generally available on 2025-04-07. I have been participating in the preview since late 2024 to conduct research and tests.<\/p>\n<p>The advantages of self-hosting an LLM, rather than using an LLM as a service, include the ability to stick with a specific version, customize it, and prevent information leaks.<\/p>\n<p>Using Cloud Run allows you to use resources &quot;only when you need them, and only as much as you need,&quot; enabling you to host an LLM cost-effectively. The bottleneck here is the amount of GPU memory. For tasks like code reviews, you want to get answers while referencing the widest possible range of source code, so the length of the context window is important. On the other hand, since GPU memory is limited, there is a restriction on the context window length.<\/p>\n<p>The LLMs themselves also have inherent context window limits. In this article, I used a specific model (<code>llama3-gradient:8b<\/code>) and increased the <code>num_ctx<\/code> parameter, confirming that beyond a certain point, the processing time increases (likely because when the task exceeds what the GPU can handle alone, the CPU is also utilized). Due to large fluctuations, I won&#8217;t post detailed data, but I will introduce the general trend.<\/p>\n<p>This article presents the results of my investigation into the <code>llama3-gradient:8b<\/code> model. In the next article, I plan to cover Gemma 3.<\/p>\n<h2>Details<\/h2>\n<p>My <code>Dockerfile<\/code> is as follows. It uses the Ollama image and pre-downloads the necessary models, so they are included in the container image. <code>gemma2:9b<\/code> and <code>codegemma:7b-instruct<\/code> were included for other tests.<\/p>\n<pre><code class=\"language-dockerfile\">FROM ollama\/ollama\n\n# Listen on all interfaces, port 8080\nENV OLLAMA_HOST 0.0.0.0:8080\n\n# Store model weight files in \/models\nENV OLLAMA_MODELS \/models\n\n# Reduce logging verbosity\nENV OLLAMA_DEBUG false\n\n# Never unload model weights from the GPU\nENV OLLAMA_KEEP_ALIVE -1\n\n# Store the model weights in the container image\nENV MODEL1 gemma2:9b\nENV MODEL2 codegemma:7b-instruct\nENV MODEL3 llama3-gradient:8b\nRUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL1 &amp;&amp; ollama pull $MODEL2 &amp;&amp; ollama pull $MODEL3\n\n# Start Ollama\nENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]<\/code><\/pre>\n<p>I created a Cloud Run service with the image built from this Dockerfile using the following settings:<\/p>\n<ul>\n<li><strong>Startup CPU boost<\/strong>: Enabled<\/li>\n<li><strong>Concurrency<\/strong>: 4<\/li>\n<li><strong>CPU limit<\/strong>: 8<\/li>\n<li><strong>Memory limit<\/strong>: 32 GiB<\/li>\n<li><strong>GPU<\/strong>: 1 NVIDIA L4 (no zonal redundancy)<\/li>\n<li><code>OLLAMA_NUM_PARALLEL<\/code>: 
4<\/li>\n<\/ul>\n<p>Regarding <code>OLLAMA_NUM_PARALLEL<\/code>, a value of 1 might have been better, but I had already configured it this way, so I proceeded with the tests. Some models allow you to change the context window by passing <code>num_ctx<\/code> in the <code>options<\/code>. <code>llama3-gradient:8b<\/code> is one such model.<\/p>\n<p>The documentation for <a href=\"https:\/\/ollama.com\/library\/llama3-gradient\">llama3-gradient<\/a> states:<\/p>\n<blockquote>\n<p>This model extends LLama-3 8B&#8217;s context length from 8k to over 1m tokens.<\/p>\n<\/blockquote>\n<p>This article, including the code sections, exceeds 15,000 characters, so an 8k token limit would likely be insufficient to process the entire text. With 1 million, there is plenty of room.<\/p>\n<p>The core code is as follows. It sends a request specifying the <code>num_ctx<\/code>. For finer details, please refer to the <a href=\"#code\">Code<\/a> section.<\/p>\n<pre><code class=\"language-python\">data = {\n    &quot;model&quot;: &quot;llama3-gradient:8b&quot;,\n    &quot;prompt&quot;: system + code,\n    &quot;stream&quot;: False,\n    &quot;options&quot;: {\n        &quot;num_ctx&quot;: num_ctx,\n    }\n}\n\nresponse = requests.post(\n    &quot;https:\/\/ service URL .us-central1.run.app\/api\/generate&quot;,\n    data=json.dumps(data),\n    headers={&quot;Content-Type&quot;: &quot;application\/json&quot;}\n)<\/code><\/pre>\n<p>Performance can be unstable, but when it&#8217;s fast, responses come back in about 10 seconds. When it&#8217;s slow, it can take over 30 seconds, sometimes even close to 300 seconds. 
For example, with a <code>num_ctx<\/code> of 20,975, one request took 271 seconds.<\/p>\n<p>Given that fast responses are around 10 seconds, I decided to classify anything taking longer than 20 seconds as &quot;slow.&quot; Note that at one point, I changed the timeout to 30 seconds while running the loop.<\/p>\n<p>In Keita Sato&#8217;s article, &quot;<a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Testing the Performance of Cloud Run GPU + Ollama gemma2<\/a>,&quot; a load test was performed on <code>gemma2:9b<\/code> with k6 without specifying a context window during parallel access, and the result was:<\/p>\n<blockquote>\n<p>response time P95 45s<\/p>\n<\/blockquote>\n<p>Looking at the images in the article, there is an example where the AVG is 20s, so it seems reasonable to assume a &quot;normal&quot; response time is around 20 seconds.<\/p>\n<p>For some <code>num_ctx<\/code> values, a response never came back, even after multiple attempts; strangely, the behavior did not follow a simple linear relationship. With <code>num_ctx<\/code> at <strong>20,966<\/strong>, on the other hand, responses came back in under 20 seconds multiple times, once even in 9.9 seconds.<\/p>\n<p>However, a <code>num_ctx<\/code> of <strong>18,750<\/strong> also timed out on occasion.<br \/>\nWith <code>num_ctx<\/code> at <strong>20,970<\/strong>, I made three requests: two timed out, and one took 32 seconds.<br \/>\nWith <code>num_ctx<\/code> at <strong>20,964<\/strong>, I made six requests: two timed out, and four took 12, 13, 31, and 26 seconds. It&#8217;s unstable.<\/p>\n<p>I have data showing that <code>num_ctx<\/code> 20,971 took 31 seconds, 20,975 took 271 seconds (note: the number is correct), 20,991 took 27 seconds, 21,503 took 46 seconds, and 22,528 took 45 seconds. I had no data for a <code>num_ctx<\/code> over 20,970 that completed in under 20 seconds. 
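<\/p>
<p>The search over <code>num_ctx<\/code> in the Code section is a bisection over even values only. Here is a minimal sketch of that search; the <code>measure<\/code> parameter and the synthetic duration curve are illustrative stand-ins for the real request (the actual script calls the service directly and sleeps between probes), not measured data:<\/p>

```python
def is_gpu_only(duration_ns):
    # classify a request as GPU-only if it finished within 30 s (value in nanoseconds)
    if duration_ns is None:
        return None
    return duration_ns <= 30_000_000_000

def binary_search_num_ctx(measure, min_ctx, max_ctx):
    # largest even num_ctx whose measured duration stays under the threshold;
    # measure(num_ctx) returns total_duration in ns, or None on failure
    left, right, result = min_ctx, max_ctx, None
    while left <= right:
        mid = ((left + right) // 2) // 2 * 2  # probe even values only
        duration = measure(mid)
        if duration is None:
            break
        if is_gpu_only(duration):
            result = mid      # still GPU-only: try a larger context
            left = mid + 2
        else:
            right = mid - 2   # slow: try a smaller context
    return result

# synthetic stand-in: pretend requests stay fast up to num_ctx = 20970
fake_duration = lambda n: 10_000_000_000 if n <= 20_970 else 45_000_000_000
print(binary_search_num_ctx(fake_duration, 20_480, 24_576))  # → 20970
```

<p>In the actual script, the measurement is a POST to the service&#8217;s <code>\/api\/generate<\/code> endpoint, with a 15-second pause between probes to reduce server load. <p>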
From this, I concluded that the limit is around <strong>20,970<\/strong>.<\/p>\n<p>In &quot;<a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google Releases Gemma 3! -Tried Running It on Cloud Run-<\/a>&quot; by Muramatsu of Cloud Ace, the models tested (<code>gemma2:27b<\/code>, <code>gemma3:12b<\/code>) are different from this article, but tests were run with <code>num_ctx<\/code> set to 16,384. This suggests that even these models return a response within the default 300-second timeout or a slightly extended one.<\/p>\n<h3>offload&#8230; to GPU<\/h3>\n<p>Apart from the response time, you can also observe the following output in the logs:<\/p>\n<pre><code class=\"language-text\">llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)\nllm_load_tensors: ggml ctx size = 0.27 MiB\nllm_load_tensors: offloading 32 repeating layers to GPU\nllm_load_tensors: offloading non-repeating layers to GPU\nllm_load_tensors: offloaded 33\/33 layers to GPU\nllm_load_tensors: CPU buffer size = 281.81 MiB\nllm_load_tensors: CUDA0 buffer size = 4155.99 MiB\nllama_kv_cache_init: CUDA0 KV buffer size = 10752.00 MiB\nllama_new_context_with_model: KV self size = 10752.00 MiB, K (f16): 5376.00 MiB, V (f16): 5376.00 MiB\nllama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB\nllama_new_context_with_model: CUDA0 compute buffer size = 5576.00 MiB\nllama_new_context_with_model: CUDA_Host compute buffer size = 176.01 MiB\nllama_new_context_with_model: graph nodes = 1030<\/code><\/pre>\n<p>In the following log, seeing &quot;offloaded 32\/33 layers to GPU&quot; suggests that one layer is not on the GPU. 
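<\/p>
<p>As an aside, the &quot;CUDA0 KV buffer size&quot; lines in these logs can be reproduced arithmetically. This is a back-of-the-envelope sketch under two assumptions: the attention shape is the published Llama-3 8B configuration (32 layers, 8 grouped-query KV heads, head dimension 128, f16 cache), and Ollama multiplies the context by the service&#8217;s <code>OLLAMA_NUM_PARALLEL<\/code> of 4, with <code>num_ctx<\/code> 21,503 apparently rounded up to 21,504:<\/p>

```python
# back-of-the-envelope check of the "CUDA0 KV buffer size" log lines;
# layer/head numbers are the published Llama-3 8B configuration (assumption)
LAYERS, KV_HEADS, HEAD_DIM, F16_BYTES = 32, 8, 128, 2
NUM_PARALLEL = 4  # OLLAMA_NUM_PARALLEL configured on the service

def kv_cache_mib(num_ctx):
    # K and V each hold LAYERS * KV_HEADS * HEAD_DIM f16 values per token,
    # and the cache is sized for num_ctx tokens per parallel slot
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * F16_BYTES
    return per_token * num_ctx * NUM_PARALLEL / 2**20

print(kv_cache_mib(21_504))  # → 10752.0, matching the first log
print(kv_cache_mib(22_528))  # → 11264.0, matching the second log
```

<p>With the roughly 4.3 GiB of model weights and the 5.5&#8211;5.8 GiB compute buffer added on top, an 11 GiB-plus KV cache leaves little headroom on the L4, which squares with Ollama deciding it can no longer keep every layer on the GPU at this point. <p>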
In fact, the line &quot;offloading non-repeating layers to GPU&quot; is missing.<\/p>\n<pre><code class=\"language-text\">llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)\nllm_load_tensors: ggml ctx size = 0.27 MiB\nllm_load_tensors: offloading 32 repeating layers to GPU\nllm_load_tensors: offloaded 32\/33 layers to GPU\nllm_load_tensors: CPU buffer size = 4437.80 MiB\nllm_load_tensors: CUDA0 buffer size = 3745.00 MiB\nllama_kv_cache_init: CUDA0 KV buffer size = 11264.00 MiB\nllama_new_context_with_model: KV self size = 11264.00 MiB, K (f16): 5632.00 MiB, V (f16): 5632.00 MiB\nllama_new_context_with_model: CUDA_Host output buffer size = 2.02 MiB\nllama_new_context_with_model: CUDA0 compute buffer size = 5840.00 MiB\nllama_new_context_with_model: CUDA_Host compute buffer size = 184.01 MiB\nllama_new_context_with_model: graph nodes = 1030<\/code><\/pre>\n<p>The first log is for <code>num_ctx = 21503<\/code>, and the second is for <code>num_ctx = 22528<\/code>. From this, it appears that when <code>num_ctx<\/code> is around 22000, the processing exceeds what the GPU can handle.<\/p>\n<h2>Summary<\/h2>\n<p>When accessing a service created on Cloud Run with the <code>llama3-gradient:8b<\/code> model and an NVIDIA L4 GPU, responses often returned in about <strong>10 seconds<\/strong> for <code>num_ctx<\/code> values up to around <strong>15,000<\/strong>.<\/p>\n<p>When further increasing <code>num_ctx<\/code>, performance degradation could be observed starting around <strong>19,000<\/strong>. 
In the scope of these simple tests, setting <code>num_ctx<\/code> to <strong>20,970 or higher<\/strong> never resulted in a response time under 20 seconds.<\/p>\n<p>It seems safe to say that the <strong>performance limit is around 20,970<\/strong> or slightly below it.<\/p>\n<p>Perhaps due to the inherent randomness of the output, the results were not stable. This is why I used the ambiguous phrasing &quot;performance degradation could be observed.&quot;<\/p>\n<h2>References<\/h2>\n<ul>\n<li><a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024\">Release Notes 2024-08-21<\/a> You can now configure GPU in your Cloud Run service (Preview). <a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024\">https:\/\/cloud.google.com\/run\/docs\/release-notes#August_21_2024<\/a><\/li>\n<li><a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">Release Notes 2025-04-07<\/a> Configuring GPU in your Cloud Run service is now generally available (GA). <a href=\"https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025\">https:\/\/cloud.google.com\/run\/docs\/release-notes#April_07_2025<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">Testing the Performance of Cloud Run GPU + Ollama gemma2<\/a> by Keita Sato. <a href=\"https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74\">https:\/\/zenn.dev\/satohjohn\/articles\/912b4c718a8d74<\/a><\/li>\n<li><a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">Google Releases Gemma 3! -Tried Running It on Cloud Run-<\/a> by Muramatsu. 
<a href=\"https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2\">https:\/\/zenn.dev\/cloud_ace\/articles\/cloud-run-gpu-ollama-2<\/a><\/li>\n<\/ul>\n<h2>Code<\/h2>\n<p>See Japanese version for code snippets.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cloud Run \u3067 GPU \u3092\u4f7f\u3046\u3068\u304d\u306e context window \u306e\u9650\u754c \u305d\u306e1 English follows Japanese. \u7d50\u8ad6 Cloud Run \u3067 llama3-gradient:8b \u30e2\u30c7\u30eb\u3068 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_post_was_ever_published":false},"categories":[26,13],"tags":[],"class_list":["post-610","post","type-post","status-publish","format-standard","hentry","category-cloud-run","category-google-cloud"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p4dIdP-9Q","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/610","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/comments?post=610"}],"version-history":[{"count":3,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/610\/revisions"}],"predecessor-version":[{"id":613,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/posts\/610\/revisions\/613"}],"wp:attachment":[{"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/m
edia?parent=610"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/categories?post=610"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tako.nakano.net\/blog\/wp-json\/wp\/v2\/tags?post=610"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}