4 Pitfalls Discovered After Migrating from Anthropic to Gemini
Why the Switch?
The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.
After re-evaluating the pricing:
| Model | Input | Output |
|---|---|---|
| Claude Sonnet 4.6 | $3.00 / 1M | $15.00 / 1M |
| Claude Haiku 4.5 | $0.80 / 1M | $4.00 / 1M |
| Gemini 2.5 Flash (non-thinking) | $0.15 / 1M | $0.60 / 1M |
| Gemini Flash-Lite | $0.075 / 1M | $0.30 / 1M |
By list price, Gemini 2.5 Flash (non-thinking) is **20x cheaper** than Sonnet on input and 25x cheaper on output, and my own tests showed similar Korean language quality. I decided to switch.
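The ratios follow directly from the table. A quick sketch of the arithmetic, using only the list prices quoted above (the `PRICES` dictionary, its key names, and the `price_ratio` helper are illustrative, not part of the service):

```python
# USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "claude-haiku-4.5": {"input": 0.80, "output": 4.00},
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},  # non-thinking
    "gemini-flash-lite": {"input": 0.075, "output": 0.30},
}


def price_ratio(expensive: str, cheap: str) -> dict:
    """Return how many times more the first model costs, per token direction."""
    return {
        kind: PRICES[expensive][kind] / PRICES[cheap][kind]
        for kind in ("input", "output")
    }


ratio = price_ratio("claude-sonnet-4.6", "gemini-2.5-flash")
print(f"input: {ratio['input']:.1f}x, output: {ratio['output']:.1f}x")
```

The real-world saving falls somewhere between the two ratios, depending on how input-heavy or output-heavy the traffic is.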
The theory was clean. In reality, four traps awaited.
Trap 1: If `thinking_budget` isn't set to 0, search breaks
`gemini-2.5-flash` has thinking mode enabled by default. When this is on:
- Response speed slows down (~2x)
- Costs increase ($0.60 → $3.50 / 1M output)
- And most frustratingly, the model calls the `google_search` tool less reliably
The symptom: for time-sensitive questions like "What's today's exchange rate?", the model answered from its own training data instead of triggering a search.
After 3 hours of debugging, I found the solution:
from google.genai import types as gtypes  # assumed import (google-genai SDK)

config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
)
Explicitly setting `thinking_budget=0` turns thinking off completely. The model then responds about as quickly as Flash-Lite, and the search trigger works correctly.
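To confirm that the fix keeps holding after later model or config changes, it helps to check whether a response was actually grounded in a search. A minimal sketch, assuming the google-genai response exposes `candidates[0].grounding_metadata.web_search_queries` when Google Search was used (worth verifying against the SDK version in use). `used_search` is a hypothetical helper, and the stand-in objects below only mimic that assumed shape so the check can run offline:

```python
from types import SimpleNamespace


def used_search(response) -> bool:
    """Return True if the response carries at least one web search query.

    Assumes the shape response.candidates[0].grounding_metadata.web_search_queries;
    any missing attribute is treated as "no search happened".
    """
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return False
    metadata = getattr(candidates[0], "grounding_metadata", None)
    queries = getattr(metadata, "web_search_queries", None) or []
    return len(queries) > 0


# Offline stand-ins that mimic the assumed response shape.
grounded = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            grounding_metadata=SimpleNamespace(
                web_search_queries=["USD KRW exchange rate"]
            )
        )
    ]
)
ungrounded = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])

assert used_search(grounded)
assert not used_search(ungrounded)
```

In a regression test, the same check could be run on real responses to a handful of time-sensitive prompts such as the exchange-rate question.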
Trap 2: Nightly batch job analyzes new users every turn
This was a code bug unique to our service, but I've seen similar patterns often.
Problematic code:
last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if fewer than 5 new turns since the last analysis
This looks logical but contains a trap. For new users, `last_count` is 0, so `last_count > 0` fails and the whole condition always evaluates to `False`. The early return never fires, which means the analysis function runs on every chat turn.
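The failure is easy to reproduce in isolation. The sketch below wraps the original condition in a hypothetical `should_skip` function (the name and signature are mine) and evaluates it for a user whose `last_count` is still 0:

```python
def should_skip(last_count: int, message_count: int) -> bool:
    """Original skip condition, unchanged: skip when fewer than 5 new messages."""
    return last_count > 0 and message_count - last_count < 5


# A user whose stored count is still 0: the check never skips.
new_user = [should_skip(0, n) for n in range(1, 21)]
print(any(new_user))  # False

# A user analyzed at message 10: the next 4 messages are skipped as intended.
returning = [should_skip(10, n) for n in range(11, 16)]
print(returning)  # [True, True, True, True, False]
```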
The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.
If a few new users chat actively for two days:
- 1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
- The nightly batch also re-analyzes all users daily without interval checks → hundreds of won more
Over two days, we spent over 1,000 KRW.
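The per-user figure is simple multiplication. A small sketch of the estimate, using the numbers above (the ~3 KRW per call is the rough figure from this post, and `analysis_cost_krw` is an illustrative helper):

```python
def analysis_cost_krw(
    turns: int, calls_per_analysis: int = 2, krw_per_call: float = 3.0
) -> float:
    """Estimated cost when the analysis runs on every turn."""
    return turns * calls_per_analysis * krw_per_call


print(analysis_cost_krw(20))  # 120.0
```

At roughly 6 KRW per analysis, it takes only about 170 analysis runs across all users to pass 1,000 KRW, before counting the nightly batch.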
Correction:
if last_count == 0:
    if len(messages) < 10:  # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:  # After that, 20-turn interval
        return
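To see what the corrected gate does over a whole conversation, here is a minimal sketch that restates it as a function and replays 60 messages. The names `should_analyze` and `simulate` are hypothetical, and the simulation assumes `last_count` is set to the current message count after every analysis, which the snippet above does not show:

```python
FIRST_ANALYSIS_MIN_MESSAGES = 10  # first analysis only once 10+ messages exist
REANALYSIS_INTERVAL = 20          # afterwards, wait for 20 new messages


def should_analyze(last_count: int, message_count: int) -> bool:
    """Corrected gate: separate rules for the first analysis and later ones."""
    if last_count == 0:
        return message_count >= FIRST_ANALYSIS_MIN_MESSAGES
    return message_count - last_count >= REANALYSIS_INTERVAL


def simulate(total_messages: int) -> list[int]:
    """Message counts at which an analysis would run.

    Assumes last_count is updated to the current message count after
    every analysis (an assumption, not shown in the original code).
    """
    last_count, runs = 0, []
    for n in range(1, total_messages + 1):
        if should_analyze(last_count, n):
            runs.append(n)
            last_count = n
    return runs


print(simulate(60))  # [10, 30, 50]
```

Under these assumptions, a 60-message conversation triggers three analyses instead of one per turn.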
Additionally, I reduced the message input limit from 200 to 60 messages and the per-message truncation from 300 to 200 tokens. This resulted in about an 80-90% cost reduction.
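A rough upper bound shows where much of the saving comes from: 200 messages at up to 300 tokens each is 60,000 input tokens per call, while 60 messages at up to 200 tokens is 12,000, an 80% drop in the worst case. The sketch below shows that arithmetic plus one way to apply the limits. `trim_history` is a hypothetical helper; it keeps the most recent messages and approximates tokens by whitespace-separated words, both of which are assumptions rather than the service's actual behavior:

```python
def max_input_tokens(max_messages: int, max_tokens_per_message: int) -> int:
    """Worst-case input size when every message hits the truncation limit."""
    return max_messages * max_tokens_per_message


def trim_history(
    messages: list[str], max_messages: int = 60, max_tokens: int = 200
) -> list[str]:
    """Keep the most recent messages and truncate each one.

    Tokens are approximated as whitespace-separated words (an assumption;
    the model's real tokenizer will count differently).
    """
    recent = messages[-max_messages:]
    return [" ".join(m.split()[:max_tokens]) for m in recent]


before = max_input_tokens(200, 300)
after = max_input_tokens(60, 200)
print(before, after, f"{1 - after / before:.0%}")
```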
Trap 3: Incorrectly set `gemini-2.5-flash` pricing
I made a mistake when entering the pricing into the internal cost-tracking dictionary `MODEL_PRICING`:
# Incorrect value (thinking mode price)
"gemini-2.5-flash": {"input": 0.30, "output": 2.50},
# Correct value (non-thinking mode, with thinking_budget=0 applied)
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},
Google's pricing page lists both thinking and non-thinking prices together, which was confusing. Since I turned off thinking in Trap 1, I should have applied the non-thinking price.
If this isn't caught, the cost graph on the admin page overstates real spend: 2x on input tokens and roughly 4x on output tokens. That distortion feeds directly into decision-making.
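How far off the dashboard ends up depends on the input/output mix. A minimal sketch, assuming cost is computed as tokens × USD per 1M tokens from a `MODEL_PRICING`-style entry (the `cost_usd` and `overstatement` helpers and the token counts are illustrative, not taken from the service):

```python
WRONG = {"input": 0.30, "output": 2.50}    # value entered by mistake
CORRECT = {"input": 0.15, "output": 0.60}  # non-thinking price


def cost_usd(pricing: dict, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD given per-1M-token prices."""
    total = input_tokens * pricing["input"] + output_tokens * pricing["output"]
    return total / 1_000_000


def overstatement(input_tokens: int, output_tokens: int) -> float:
    """How many times higher the tracked figure is than the real cost."""
    wrong = cost_usd(WRONG, input_tokens, output_tokens)
    return wrong / cost_usd(CORRECT, input_tokens, output_tokens)


print(round(overstatement(1_000_000, 0), 2))          # input only
print(round(overstatement(0, 1_000_000), 2))          # output only
print(round(overstatement(1_000_000, 1_000_000), 2))  # equal mix
```

The overstatement ranges from 2x for input-only traffic to about 4.2x for output-only traffic, so the tracked total is inflated by a factor somewhere in between.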
Trap 4: Migrated, but credit deduction rate remained unchanged
The rate deducted from paid users was also hardcoded in a separate constant:
# Old — based on Flash-Lite
PAID_IN_KRW_PER_TOKEN = 0.075 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.30 * 1400 / 1_000_000 * 3
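The main model was upgraded to 2.5 Flash, but deductions were still based on Flash-Lite pricing. Measured against the 2.5 Flash non-thinking prices, the old rates worked out to only 1.5x our cost rather than the intended 3x, so half of the planned margin was silently lost. I didn't realize this for a long time.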
Correction:
# 2.5 Flash + 3x margin
PAID_IN_KRW_PER_TOKEN = 0.15 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.60 * 1400 / 1_000_000 * 3
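One way to keep these constants from drifting again is to derive them from the pricing table instead of retyping the numbers. A sketch under that assumption: `krw_per_token`, `MAIN_MODEL`, `KRW_PER_USD`, and `MARGIN` are names I introduce here, with the 1400 exchange rate and 3x margin taken from the constants above:

```python
import math

MODEL_PRICING = {
    "gemini-2.5-flash": {"input": 0.15, "output": 0.60},  # USD per 1M tokens
}
MAIN_MODEL = "gemini-2.5-flash"  # hypothetical single source of truth
KRW_PER_USD = 1400
MARGIN = 3


def krw_per_token(usd_per_million: float) -> float:
    """Convert a USD-per-1M-token price into a KRW-per-token rate with margin."""
    return usd_per_million * KRW_PER_USD / 1_000_000 * MARGIN


PAID_IN_KRW_PER_TOKEN = krw_per_token(MODEL_PRICING[MAIN_MODEL]["input"])
PAID_OUT_KRW_PER_TOKEN = krw_per_token(MODEL_PRICING[MAIN_MODEL]["output"])

# Same values as the hand-written constants.
assert math.isclose(PAID_IN_KRW_PER_TOKEN, 0.15 * 1400 / 1_000_000 * 3)
assert math.isclose(PAID_OUT_KRW_PER_TOKEN, 0.60 * 1400 / 1_000_000 * 3)
```

With this layout, switching `MAIN_MODEL` changes cost tracking and credit deduction together.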
Furthermore, cost records from the previous Claude era remained in `usage_logs`, making statistics inconsistent. I created a "Reset Claude Costs" button on the admin page to clean this up at once.
Summary: Model Migration Checklist
A checklist for anyone doing the same thing.
- [ ] Double-check model-specific pricing pages: Thinking/non-thinking prices might differ (e.g., Gemini 2.5 Flash).
- [ ] Explicitly set `thinking_budget`: Don't rely on defaults. Set to `0` to disable, or specify the exact token count to enable.
- [ ] Regression test search/tool triggers: After changing models, re-verify that the same input yields the same behavior.
- [ ] Synchronize internal pricing tables: Both the `MODEL_PRICING` dictionary and credit deduction rates.
- [ ] Policy for previous model cost data: Keep, delete, or separate into its own statistics.
- [ ] Inspect new user code paths: Check for bugs where a `count == 0` condition might disable interval checks.
- [ ] Check for overlap between batch jobs and real-time triggers: Running the same task in two places doubles costs.
Results
After migration and fixing the four traps:
- Average response speed: 1.7x faster (compared to Sonnet)
- Operational costs: ~80% reduction
- Search trigger: Works normally
- Korean language quality: No discernible difference in my own tests (blind comparison)
Discovering `thinking_budget=0` took the longest. I hope you don't fall into the same trap.
※ This setup runs in production on Riel Chatbot, and costs are monitored in real time from the administrator dashboard.