4 Pitfalls Discovered After Migrating from Anthropic to Gemini

📅 Written on 2026-05-03 — A log of real pitfalls encountered in a self-operated service

Why the Switch?

The monthly API costs for running Anthropic Claude Sonnet 4.6 became a significant burden. Even downgrading to Haiku within the same model family still left the cost per token prohibitively high.

After re-evaluating the pricing:

Model	Input	Output
Claude Sonnet 4.6	$3.00 / 1M	$15.00 / 1M
Claude Haiku 4.5	$0.80 / 1M	$4.00 / 1M
Gemini 2.5 Flash (non-thinking)	$0.15 / 1M	$0.60 / 1M
Gemini Flash-Lite	$0.075 / 1M	$0.30 / 1M

My own tests showed that Gemini 2.5 Flash was **20x cheaper** than Sonnet, with similar Korean language quality. The decision was made to switch.

The theory was clean. In reality, four traps awaited.

Trap 1: If `thinking_budget` isn't set to 0, search breaks

gemini-2.5-flash has thinking mode enabled by default. When this is on:

Response speed slows down (~2x)
Costs increase ($0.60 → $3.50 / 1M output)
And most frustratingly, the `google_search` tool trigger weakens

The symptom: For time-sensitive questions like "What's today's exchange rate?", it would answer using its own training data instead of triggering a search.

After 3 hours of debugging, I found the solution:

config = gtypes.GenerateContentConfig(
    system_instruction=system_prompt,
    tools=[gtypes.Tool(google_search=gtypes.GoogleSearch())],
    max_output_tokens=8192,
    temperature=0.7,
    thinking_config=gtypes.ThinkingConfig(thinking_budget=0),  # ← This
)

Explicitly setting thinking_budget=0 completely turns off thinking. The model responds quickly, like Flash-Lite, and the search trigger works correctly.

Trap 2: Nightly batch job analyzes new users every turn

This was a code bug unique to our service, but I've seen similar patterns often.

Problematic code:

last_count = (existing or {}).get("message_count_at_analysis") or 0
if last_count > 0 and len(messages) - last_count < 5:
    return  # ← Skip if less than 5 turns

This looks logical but contains a trap. For new users, `last_count` is 0, so the condition always evaluates to `False`. This means the analysis function runs on every chat turn.

The analysis function makes two Gemini API calls (profile JSON generation + injection text generation). With 200 messages as input, the cost per call is not insignificant.

If a few new users chat actively for two days:

1 user × 20 turns × 2 API calls × ~3 KRW = 120 KRW / user
The nightly batch also re-analyzes all users daily without interval checks → hundreds of won more

Over two days, we spent over 1,000 KRW.

Correction:

if last_count == 0:
    if len(messages) < 10:    # First analysis only if 10+ messages
        return
else:
    if len(messages) - last_count < 20:   # After that, 20-turn interval
        return

Additionally, I reduced the message input limit from 200 → 60 and the truncation per message from 300 → 200 tokens. This resulted in about an 80-90% cost reduction.

Trap 3: Incorrectly set `gemini-2.5-flash` pricing

I made a mistake when entering the pricing into the internal cost tracking dictionary MODEL_PRICING:

# Incorrect value (thinking mode price)
"gemini-2.5-flash": {"input": 0.30, "output": 2.50},
Correct value (non-thinking mode, with thinking_budget=0 applied)
"gemini-2.5-flash": {"input": 0.15, "output": 0.60},

Google's pricing page lists both thinking and non-thinking prices together, which was confusing. Since I turned off thinking in Trap 1, I should have applied the non-thinking price.

If this isn't caught, the cost graph on the admin page will show 4x higher than reality. This directly impacts decision-making.

Trap 4: Migrated, but credit deduction rate remained unchanged

The rate deducted from paid users was also hardcoded in a separate constant:

# Old — based on Flash-Lite
PAID_IN_KRW_PER_TOKEN  = 0.075 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.30  * 1400 / 1_000_000 * 3

The main model was upgraded to 2.5 Flash, but deductions were still based on Flash-Lite pricing. Users were charged less than actual cost, and we were losing money. I didn't realize this for a long time.

Correction:

# 2.5 Flash + 3x margin
PAID_IN_KRW_PER_TOKEN  = 0.15 * 1400 / 1_000_000 * 3
PAID_OUT_KRW_PER_TOKEN = 0.60 * 1400 / 1_000_000 * 3

Furthermore, cost records from the previous Claude era remained in `usage_logs`, making statistics inconsistent. I created a "Reset Claude Costs" button on the admin page to clean this up at once.

Summary: Model Migration Checklist

A checklist for anyone doing the same thing.

[ ] Double-check model-specific pricing pages: Thinking/non-thinking prices might differ (e.g., Gemini 2.5 Flash).
[ ] Explicitly set `thinking_budget`: Don't rely on defaults. Set to `0` to disable, or specify the exact token count to enable.
[ ] Regression test search/tool triggers: After changing models, re-verify that the same input yields the same behavior.
[ ] Synchronize internal pricing tables: Both the MODEL_PRICING dictionary and credit deduction rates.
[ ] Policy for previous model cost data: Keep, delete, or separate into its own statistics.
[ ] Inspect new user code paths: Check for bugs where a `count == 0` condition might disable interval checks.
[ ] Check for overlap between batch jobs and real-time triggers: Running the same task in two places doubles costs.

Results

After migration and fixing the four traps:

Average response speed: 1.7x faster (compared to Sonnet)
Operational costs: ~80% reduction
Search trigger: Works normally
Korean language quality: No discernible difference in my own tests (blind comparison)

Discovering thinking_budget=0 took the longest. I hope you don't fall into the same trap.

※ This system is actually applied to Riel Chatbot, and costs are monitored in real-time from the administrator dashboard.