AI Agents and Applesauce
I’ve been passively working on a recipe search engine over the past few years (code-named Foodie), which has led me into all sorts of interesting sub-problems.
One of those areas is parsing ingredient labels and calculating accurate nutrition information. On the surface this sounds straightforward.
A common label is 1/3 cup applesauce. A human being can immediately tell:
- the food is applesauce
- the measurement is 1/3 cup
Even this simple example has issues when trying to parse programmatically. For example:
- 1/3 may be represented different ways in different recipes, e.g. 1/3, .333, and "one third"
- Applesauce, applesauce, and APPLESAUCE are all equally valid and all equally found in the wild
I’ve gone through multiple revisions of ingredient parsing — naive regex, rules matching, and all the way to an ML pipeline using PyTorch. Each solution worked some of the time but had noticeable functionality gaps.
In a surprise to no one, LLMs (or as my children say, “your robot friend”) have been extremely helpful in this area and allowed for some exciting breakthroughs. That said, Foodie is a passion project with a limited budget, so I needed to plan for cost efficiency up-front.
OK, enough fluff and onto the nerd stuff.
The Real Problem
To give some scale to the problem, Foodie has indexed around 60,000 ingredient strings so far. Each one needs to be broken down into a quantity, a unit, and a food name, then matched against a nutrition database to pull calories, protein, fat, and 30+ other nutrient fields. That’s a lot of applesauce.
The thing about ingredient strings is that they're not structured data. They're just whatever the recipe author felt like typing. Here's a sample of what you actually run into:
| Input | What makes it hard |
|---|---|
| 0.5 cup olive oil | Nothing, this is the easy one |
| ¼ cup applesauce | Unicode fraction (U+00BC) instead of 1/4 |
| 1 (14 ounce) can EAGLE BRAND® Sweetened Condensed Milk | Nested quantity, brand name, trademark symbol |
| 0.25 lemon, juiced | The quantity is for the lemon but the food is lemon juice |
| salt and ground black pepper to taste | Two ingredients in one string, no measurable quantity |
| 0.5 (16 ounce) package frozen mixed vegetables, thawed | Fractional package with a parenthetical container size |
| 1 cup butter, divided | Preparation instruction that doesn't change the weight |
The straightforward cases like 2 cups flour or 3 large eggs make up maybe 40% of real ingredient strings. The rest need increasingly creative interpretation.
The Resolution Funnel
Rather than trying to build one parser that handles all of this, I ended up with a three-tier resolution funnel. Every ingredient passes through each tier in order:
- L1 cache — an exact-match lookup on the normalized label. If we've seen this exact string before, we already have the answer.
- L2 identity store — a food-level cache that knows portion tables. It doesn't need an exact label match, just a recognized food name and a known unit. So if we've resolved any butter label before, L2 can handle 2 tbsp butter without calling the agent.
- AI agent — a Claude Haiku 4.5 agent that searches nutrition databases, identifies the food, and estimates the weight. Only gets called when both caches miss.
Each tier is cheaper and faster than the next, so the expensive agent work only fires on ingredients that genuinely need it.
Here's what happens when 1/2 cup unsalted butter, softened goes through the funnel on a warm cache: the exact string may be new to L1, but L2 already recognizes butter and knows its cup portion, so the label resolves locally.
This one never touches the agent. That's the nice thing about this setup: one expensive resolution makes every future butter label free. The end result is that about 85% of all ingredients resolve without a single LLM call.
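The funnel logic itself is small. Here's a toy sketch of the three tiers in order; the cache structures and function names are illustrative assumptions, not the actual implementation:

```python
# Toy sketch of the three-tier resolution funnel. `l1` is an exact-match
# dict, `l2_lookup` consults the food identity store, and `agent` stands
# in for the (expensive) LLM call. All names here are illustrative.
def resolve(label: str, l1: dict, l2_lookup, agent):
    key = label.lower().strip()      # stands in for the full normalization pass
    if key in l1:                    # L1: exact-match cache hit
        return l1[key]
    result = l2_lookup(key)          # L2: known food + known unit
    if result is not None:
        l1[key] = result             # warm L1 so next time is an exact hit
        return result
    result = agent(key)              # L3: only now do we pay for the agent
    l1[key] = result
    return result
```

The key property is that the agent call sits behind two early returns, so its cost only applies to genuinely novel ingredients.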
Normalization
Before anything hits the cache or the agent, I run a normalization pass. The idea is simple: make sure equivalent inputs produce the same cache key. Take these three strings:
- 3 large eggs
- 3 large egg
- 3 Large Eggs

Same ingredient, three different strings. Without normalization two of them would miss the cache, resulting in unnecessary agent calls.
The normalization pipeline runs through a few transforms:
- Lowercase — 3 Large Eggs → 3 large eggs
- Unicode fraction expansion — ¼ cup → 1/4 cup, 1½ cups → 1 1/2 cups
- Whitespace collapse — multiple spaces, tabs, non-breaking spaces → single space
- Singular form — eggs → egg, tomatoes → tomato, cherries → cherry
- Trademark stripping — EAGLE BRAND® → eagle brand
The singularization step has an exception list for words that end in 's' but aren't plural: molasses, hummus, couscous, asparagus, lemongrass. Without it you'd normalize hummus to hummu, which is kind of funny but definitely wrong.
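The transforms above can be sketched in a few lines. This is a minimal illustration, not Foodie's actual code; the fraction table and singularization rules are simplified assumptions:

```python
import re

# Minimal sketch of the normalization pass. Tables are illustrative
# and deliberately incomplete.
FRACTIONS = {"¼": "1/4", "½": "1/2", "¾": "3/4", "⅓": "1/3", "⅔": "2/3"}
SINGULAR_EXCEPTIONS = {"molasses", "hummus", "couscous", "asparagus", "lemongrass"}

def singularize(word: str) -> str:
    if word in SINGULAR_EXCEPTIONS:
        return word                   # hummus stays hummus, not "hummu"
    if word.endswith("ies"):
        return word[:-3] + "y"        # cherries -> cherry
    if word.endswith("oes"):
        return word[:-2]              # tomatoes -> tomato
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]              # eggs -> egg
    return word

def normalize(label: str) -> str:
    label = label.lower()
    for glyph, ascii_frac in FRACTIONS.items():
        # "1½" -> "1 1/2": keep a space between whole number and fraction
        label = re.sub(rf"(?<=\d){glyph}", f" {ascii_frac}", label)
        label = label.replace(glyph, ascii_frac)
    label = label.replace("®", "").replace("™", "")
    label = re.sub(r"\s+", " ", label).strip()  # collapses tabs and NBSP too
    return " ".join(singularize(w) for w in label.split())
```

The important property for caching is idempotence: running `normalize` twice gives the same key as running it once.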
Stripping Preparation Instructions
Recipe authors love tacking cooking instructions onto ingredient strings:
1 cup butter, melted
2 cups chicken breast, diced and seasoned
1 cup fresh mozzarella, drained and sliced
½ cup pecans, toasted and roughly chopped

Everything after the comma is telling you what to do with the ingredient, not what the ingredient is. The system strips 45+ known preparation phrases before trying to identify the food:
, melted, chopped, divided, softened, or to taste, sifted, diced, minced, sliced, grated, shredded, thawed, at room temperature, beaten, peeled, plus more for serving, for garnish, rinsed and drained...
It's just string matching, nothing fancy. But it handles a surprising amount of variation. 1 cup butter, divided and 1 cup butter, softened both land on the same food with the same weight.
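The string matching can look something like the sketch below. The phrase set and function name are illustrative assumptions; the real list has 45+ entries:

```python
# Sketch of comma-suffix stripping. Only trailing parts that match a
# known prep phrase get dropped; unknown suffixes are preserved.
PREP_PHRASES = {
    "melted", "chopped", "divided", "softened", "sifted", "diced",
    "minced", "sliced", "grated", "shredded", "thawed", "beaten",
    "peeled", "or to taste", "at room temperature", "for garnish",
    "plus more for serving", "rinsed and drained",
    "toasted and roughly chopped",
}

def strip_prep(label: str) -> str:
    head, *rest = [part.strip() for part in label.split(",")]
    kept = [p for p in rest if p not in PREP_PHRASES]
    return ", ".join([head] + kept) if kept else head
```

Note that unknown suffixes survive, which matters for the `juiced`/`zested` cases described next.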
There's a gotcha though. Some trailing phrases actually change the food:
- 0.25 lemon, **juiced** — the food is lemon juice, not lemon
- 1 orange, **zested** — the food is orange zest, not orange
These get special treatment. The parser recognizes juiced and zested as transforms that change the food name itself rather than just describing a prep step.
Parentheticals
Let's talk parentheticals. Look at these:
1 (14 ounce) can sweetened condensed milk ← container size
1 cup cream of mushroom soup (condensed) ← descriptor
0.25 cup hot pepper sauce (such as Frank's) ← brand suggestion
0.25 cup mirin (Japanese sweet wine) ← definition

Each parenthetical means something totally different. The system uses a simple rule: if the parenthetical has a number + unit pattern (like 14 ounce), it's probably a container size and gets special parsing. Otherwise it gets stripped before food identification.
The container-size ones are especially tricky because the outer quantity is fractional packages:
0.5 (16 ounce) package frozen spinach

That means "half of a 16-ounce package": the quantity isn't 0.5 and the unit isn't package. The actual quantity is 0.5 × 16 = 8 ounces. The L2 cache can't handle this kind of nested arithmetic, so it punts directly to the agent.
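The number + unit rule and the fractional-package arithmetic together fit in one small parser. A hedged sketch, with an illustrative regex and a deliberately short unit list:

```python
import re

# Sketch of the container-size parenthetical rule. The regex and unit
# list are illustrative assumptions, not the project's actual parser.
CONTAINER_RE = re.compile(r"^([\d.]+)\s*\(([\d.]+)\s*(ounce|oz|pound|lb)\)")

def parse_container(label: str):
    """Return (total_amount, unit) for labels like
    '0.5 (16 ounce) package frozen spinach', else None."""
    m = CONTAINER_RE.match(label)
    if not m:
        return None
    outer, inner, unit = float(m.group(1)), float(m.group(2)), m.group(3)
    return outer * inner, unit  # 0.5 x 16 = 8 ounces
```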
Synonym Collapse
Nutrition databases use canonical names. Recipe authors use whatever they feel like, and every variant needs to resolve to the same entry.
There are 100+ of these mappings in the system. It's tedious work to build but it's the difference between a 60% match rate and a 95% match rate. Every missing synonym is either a wasted agent call or a wrong nutrient profile attached to your recipe.
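The mapping itself is just a dictionary applied after normalization. The entries below are plausible illustrations, not the project's actual table:

```python
# A handful of plausible synonym mappings; all entries here are
# illustrative guesses. The real table has 100+ of these.
SYNONYMS = {
    "green onion": "scallion",
    "spring onion": "scallion",
    "garbanzo bean": "chickpea",
    "powdered sugar": "confectioners sugar",
    "corn starch": "cornstarch",
}

def canonicalize(food_name: str) -> str:
    # Unknown names pass through unchanged
    return SYNONYMS.get(food_name, food_name)
```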
The AI Agent
When all else fails, throw an LLM at it. I built an AI agent with Pydantic AI and gave it access to two tools:
- Common foods lookup — ~4,000 foods with portion weights, loaded from an in-memory index
- USDA FoodData Central search — the comprehensive federal nutrition database, queried via API
The agent has a pretty narrow job:
- Identify the food — strip the noise, find the canonical name and database ID
- Estimate the weight in grams — using portion data from whichever database matched
- Return a confidence level — high, medium, or low
One thing I'm happy with in this design: the agent does not return nutrient data. That would mean way more output tokens and you'd risk the model hallucinating nutrient values. Instead it returns just the food identity (name + database ID) and a weight estimate. The actual nutrient profiles get attached in a separate step that reads directly from the database. Keeps the output small and the nutrition data trustworthy.
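Reconstructed from the example resolution shown later in the post, the agent's output schema is small enough to write out in full. This exact model definition is an assumption; only the field names come from the JSON example:

```python
from typing import Literal

from pydantic import BaseModel

# The agent's narrow output schema: identity and weight only, no
# nutrient fields. Field names match the example resolution JSON;
# the model itself is a reconstruction, not the actual code.
class Resolution(BaseModel):
    ingredient_label: str
    food_name: str
    fdc_id: int        # USDA FoodData Central ID
    quantity: float
    unit: str
    weight_grams: float
    confidence: Literal["high", "medium", "low"]
```

With Pydantic AI, a model like this can serve directly as the agent's structured output type, which is what keeps the per-resolution token count down.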
Weight estimation
Figuring out that 3 large eggs refers to "egg" in USDA is the easy part. The hard part is knowing that a "large egg" weighs 50 grams. The agent pulls portion data from the nutrition databases when it's available:
| Food | Portion | Weight |
|---|---|---|
| Egg | 1 large | 50g |
| Garlic | 1 clove | 3g |
| Butter | 1 stick | 113g |
| Lemon juice | 1 whole lemon | 84g |
| Green onions | 1 bunch | 100g |
| Olive oil | 1 cup | 216g |
When portion data isn't available, the agent falls back to rough defaults: 1 cup ≈ 240g for liquids, 1 tablespoon ≈ 15g, 1 teaspoon ≈ 5g. Not perfect, but better than nothing. The confidence field gets set to medium or low so downstream consumers know the estimate is fuzzy.
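The fallback logic is simple enough to sketch. The gram defaults come from the paragraph above; the function shape and the last-resort guess are assumptions:

```python
# Weight estimation with fallback. Prefer real portion data from the
# database; otherwise use the rough volume defaults and downgrade
# confidence so consumers know the number is fuzzy.
FALLBACK_GRAMS = {"cup": 240.0, "tablespoon": 15.0, "tbsp": 15.0,
                  "teaspoon": 5.0, "tsp": 5.0}

def estimate_weight(quantity: float, unit: str,
                    portions: dict[str, float]) -> tuple[float, str]:
    if unit in portions:                          # real portion data
        return quantity * portions[unit], "high"
    if unit in FALLBACK_GRAMS:                    # rough volume default
        return quantity * FALLBACK_GRAMS[unit], "medium"
    return quantity * 100.0, "low"                # last-resort guess, flagged
```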
Compound ingredients
Some ingredient strings have multiple foods crammed in:
salt and ground black pepper to taste
That's actually two ingredients with two separate nutrition profiles. The agent returns two resolution objects for this single label:
Label: "salt and ground black pepper to taste"
→ Resolution 1: salt, 3.0g, confidence: medium
→ Resolution 2: black pepper, 0.5g, confidence: medium

The "to taste" part means there's no explicit quantity. The agent picks reasonable defaults — 3 grams of salt, 0.5 grams of pepper — which are approximations, but sensible ones for a recipe serving 4-6 people.
The Multi-Layer Cache
Every time the agent resolves something, it produces two cache entries:
- L1 entry — the full resolution for this exact label, so we never ask about it again
- L2 identity — the food's nutrient profile and known portions, pulled from whatever database lookups the agent did
The L2 store is where things get interesting. When the agent resolves 1 cup butter, it doesn't just cache that specific label. It caches the fact that butter has known portions of 1 cup = 227g, 1 tbsp = 14.2g, 1 stick = 113g. Next time any label mentions butter in any quantity, L2 handles it locally without touching the agent.
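Concretely, an L2 hit is just a portion-table lookup. A minimal sketch, assuming a dict-of-dicts store keyed by food name (the butter portions are the ones quoted above):

```python
# Minimal sketch of the L2 identity store: food name -> portion table.
# Returning None signals "both caches missed, call the agent".
L2_STORE = {
    "butter": {"cup": 227.0, "tbsp": 14.2, "stick": 113.0},
}

def l2_resolve(food: str, quantity: float, unit: str):
    portions = L2_STORE.get(food)
    if portions is None or unit not in portions:
        return None
    return quantity * portions[unit]  # weight in grams
```

One agent resolution of any butter label populates the butter entry, and every future quantity or unit of butter resolves from this table.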
After processing ~6,000 unique labels, the L2 store has over 1,850 food identities covering most common cooking ingredients.
Cost
This is where the caching strategy really pays off. Let me walk through the actual numbers.
Per-call token budget
The agent runs on Claude Haiku 4.5, chosen mostly for speed and cost: $1/M input tokens, $5/M output tokens. The system prompt is ~750 tokens and stays constant. Each batch sends 20 ingredient labels (~600 input tokens) and gets back structured JSON (~500 output tokens).
| Component | Tokens | Cost |
|---|---|---|
| System prompt | ~750 | $0.00075 |
| 20 ingredient labels | ~600 | $0.00060 |
| Structured JSON output | ~500 | $0.00250 |
| Total per batch | ~1,850 | $0.00385 |
That works out to about $0.004 per batch, or roughly $0.0002 per ingredient when the agent actually gets called. But the whole point is that the agent doesn't get called very often.
How it played out
I went through several agent revisions before landing on the current design, so the total API spend across all iterations was around $20. Most of that was earlier experiments that got thrown away. The current agent built its full cache — 6,000+ resolved labels across ~333 batches — for an estimated $2-3 in API costs including retries and tool call overhead.
The important thing is the trajectory. Early runs are expensive because every ingredient hits the agent. Once the caches warm up, the hit rate climbs fast:
By steady state, 99%+ of ingredients resolve from cache and new pipeline runs cost almost nothing.
Keeping the output schema small
I mentioned earlier that the agent only returns identification and weight, not nutrient data. Here's what a typical resolution looks like:
{
"ingredient_label": "1 (14 ounce) can sweetened condensed milk",
"food_name": "Sweetened condensed milk",
"fdc_id": 171286,
"quantity": 14.0,
"unit": "oz",
"weight_grams": 396.9,
"confidence": "high"
}

That's ~60 output tokens. If I'd asked the agent to also return 35 nutrient fields (calories, protein, fat, fiber, sodium, vitamins...), each resolution would balloon to ~250 tokens — 4x the output cost. Across 6,000 agent-resolved labels, keeping the schema narrow saved an estimated $5-6 in output tokens alone. And more importantly, the nutrient data comes from the database rather than an LLM that might hallucinate 0 calories for olive oil.
Failure handling
The agent processes labels in batches of 20. When a batch fails (rate limit, timeout, ambiguous input), the system doesn't retry the whole batch. Instead it recursively splits in half. A batch of 20 becomes two of 10, then four of 5, isolating the bad label without wasting tokens on ones that would have succeeded.
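The split strategy is a short recursion. A sketch under the assumption that `process_batch` raises on any batch containing a bad label:

```python
# Sketch of the recursive split-in-half retry. A failing batch of 20
# becomes two of 10, then four of 5, and so on, until the single bad
# label is isolated and recorded. Names are illustrative.
def resolve_batch(labels, process_batch, failures):
    try:
        return process_batch(labels)
    except Exception:
        if len(labels) == 1:
            failures.append(labels[0])  # isolated the bad label
            return []
        mid = len(labels) // 2
        return (resolve_batch(labels[:mid], process_batch, failures)
                + resolve_batch(labels[mid:], process_batch, failures))
```

Every label that would have succeeded still gets processed exactly once on the retry path; only the bad label costs extra attempts.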
Rate-limited calls use exponential backoff starting at 15 seconds and scaling up by 10 seconds per attempt, with a max of 4 retries. Labels that fail across 3 separate runs get written to a permanent failure file and skipped going forward. No infinite retry loops.
Results
As of writing, the cache has 6,059 resolved ingredient labels and 1,852 food identities with a zero failure rate. Every label that's gone through the pipeline has come out the other side with a resolution.
That doesn't mean every resolution is perfect. The confidence breakdown looks roughly like this:
| Confidence | Share | What it means |
|---|---|---|
| High | ~83% | Exact database match with portion data |
| Medium | ~16% | USDA fallback, incomplete portion data |
| Low | ~1% | "To taste" items with heuristic weights |
What Went Well
The resolution funnel. Having cheap tiers handle the common cases and only sending the hard stuff to the agent was the right call from the start. 85% of ingredients never touch the LLM.
Caching knowledge, not just answers. The L1 cache is fine, but the L2 identity store is the real win. Caching butter's entire portion table instead of just 1 cup butter → 227g means every future butter label resolves for free regardless of quantity or unit. Single biggest cost optimization in the system.
Keeping the agent's output small. Having the agent return just identification and weight — no nutrient data — keeps output tokens low and means the nutrition numbers come from the database, not from the model.
Normalizing aggressively. Unicode fractions, plurals, whitespace, casing — all free to eliminate and they immediately collapse the problem space. This was easy to build and paid for itself many times over in cache hit rate.
Learnings
The first agent design was per-recipe. The original implementation resolved ingredients one recipe at a time. This meant 1 cup butter got sent to the agent dozens of times across different recipes before I added cross-recipe deduplication. It worked, but it was slow and expensive. The batch deduplication rewrite was a significant refactor.
USDA data is inconsistent. I assumed the USDA FoodData Central API would return portion data in a predictable format. It doesn't. Different food types (branded vs foundation vs survey) use different response structures, and some foods just don't have portion data at all. I ended up with multiple fallback chains and special cases for things like energy values using Atwater factors instead of standard calorie IDs.
Batch failures were painful to debug. When a batch of 20 ingredients fails, you don't know which one caused it. I didn't plan for this and initially just retried the whole batch, which wasted tokens. The recursive split-in-half strategy was a reactive fix after watching batches fail repeatedly on the same problematic label buried in a group of 19 good ones.
Multiple agent revisions. The current agent is the third or fourth major iteration. Earlier versions used different prompting strategies, different output schemas, and different batch sizes. Each revision invalidated parts of the cache, which is how the total API spend ended up around $20 despite the current agent only needing $2-3 to build its cache from scratch.