AI Engineering
Prototype. Error Analysis. MVP. Eval Loop.
I’ll help you grow strong prompt ideas into reliable AI features and apps — pragmatically, with a clear structure.
Why work with me
I’m a full-time Cloud Service Designer who builds AI solutions on the side (~10 hours/week). I’ll help you build your AI solution without agency overhead: fast prototypes, error analysis, measurable quality, and a clean path to production.
What I deliver
- Proof of Concept: a runnable app in days to validate whether AI can solve the task.
- Error Analysis: I help you set up tracing, build a custom eval app, and facilitate open/axial coding so your experts can label failure modes (a minimal labeling sketch follows this list). You own the calls; I make error analysis fast and repeatable. Error analysis is the highest-ROI activity in AI engineering.
- MVP Consulting: I help you turn the prototype into an MVP app or integrated feature and set up tracing to kickstart the eval loop.
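To make the labeling workflow concrete, here is a minimal sketch of how labeled traces could be structured in plain Python. The field names and failure-mode codes are illustrative assumptions, not a fixed schema; the real structure comes out of the open/axial coding session with your experts.

```python
# Minimal sketch of a labeled trace record for error analysis.
# Field names and failure modes are illustrative assumptions, not a fixed schema.
from dataclasses import dataclass
from collections import Counter


@dataclass
class LabeledTrace:
    trace_id: str                     # ID of the logged LLM call
    user_input: str                   # what the user asked
    model_output: str                 # what the model returned
    passed: bool                      # expert judgment: acceptable or not
    failure_mode: str | None = None   # open/axial code, e.g. "missed_constraint"
    notes: str = ""                   # free-text comment from the labeling session


def failure_mode_counts(traces: list[LabeledTrace]) -> Counter:
    """Aggregate labeled failures so the most frequent root causes surface first."""
    return Counter(t.failure_mode for t in traces if not t.passed)


if __name__ == "__main__":
    traces = [
        LabeledTrace("t-001", "Summarize the contract", "…", False, "missed_constraint"),
        LabeledTrace("t-002", "Extract invoice total", "…", True),
        LabeledTrace("t-003", "Summarize the contract", "…", False, "hallucinated_clause"),
    ]
    print(failure_mode_counts(traces).most_common())
```

Counting failure modes like this is what turns a pile of traces into a prioritized fix list.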
How I work
1) Build Prototype (Python)
Define the use case and data flow, ship a working prototype with explicit assumptions, and capture first metrics (accuracy, latency, cost per request).
2) Set Up Error Analysis
Create a labeled evaluation set (golden set), map error types, prioritize root causes; fix iteratively (prompt rewrites, tools, guardrails, retrieval tuning).
3) Launch MVP
Embed into a simple, usable flow (web/app/workflow), add observability (logs, traces, alerts), align on SLAs & success metrics.
4) Set Up Eval Loop
Create automated LLM-as-judge evaluations that run as regression tests before each prompt or model update, and maintain a live benchmark for continuous improvement (see the sketch below).
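As a rough illustration of step 4, the sketch below shows what an LLM-as-judge regression check over a golden set can look like. It assumes the OpenAI Python SDK; the judge model, judge prompt, and 90% pass threshold are placeholder assumptions to be tuned per project.

```python
# Rough sketch of an LLM-as-judge regression check over a golden set.
# Judge model, prompt, and pass threshold are assumptions; tune them per project.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a question, a reference answer, and a "
    "candidate answer, reply with exactly PASS if the candidate is factually "
    "consistent with the reference and answers the question, otherwise FAIL."
)


def judge(question: str, reference: str, candidate: str) -> bool:
    """Ask the judge model whether the candidate answer is acceptable."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"
            )},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")


def regression_test(golden_set: list[dict], generate, threshold: float = 0.9) -> bool:
    """Run the candidate system over the golden set and gate on the pass rate."""
    passed = sum(
        judge(item["question"], item["reference"], generate(item["question"]))
        for item in golden_set
    )
    pass_rate = passed / len(golden_set)
    print(f"pass rate: {pass_rate:.0%} (threshold {threshold:.0%})")
    return pass_rate >= threshold
```

Wired into CI, a gate like this blocks a prompt or model update whenever the pass rate drops below the agreed threshold.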
Engagement formats
- Prototype Sprint (1–2 weeks): working demo + first metrics.
- MVP Launch Sprint (2–4 weeks): user flow, observability, go-live checklist.
- Eval Care (ongoing, light-touch): scheduled regression tests, prompt library upkeep, KPI reviews.
I work evenings and selected blocks during the week (Europe/Berlin) with planned check-ins.
What you’ll get
- Tech setup: Python codebase, structured prompt library, evaluation scripts
- Artifacts: architecture sketch, metrics sheet, risk/fallback plan
- Ops: logging, monitoring, cost-sensible defaults, rollback strategy
- Handover: concise playbooks and a transfer session for your team
Security & Architecture
Your data stays under your control, on your systems.
Is this a fit?
- You want a small, focused engagement with measurable outcomes.
- You value concierge consulting over big teams.
- You’re fine with a roughly 10 h/week cadence and clear milestones.
FAQ
Do you work alone?
Mostly. When it adds speed or depth, I pull in specialists, e.g. for data engineering or front-end work. I stay the lead and keep scope transparent.
Do we need our own data?
Not necessarily, but real data is always preferred. We can begin with public or synthetic seed data while we assemble a golden set for robust evaluations.
Which tools do you use?
Python stack; web, CLI, or local UI apps; OpenAI AgentKit; OpenAI, Claude, and Gemini models; Pydantic AI, LangChain, LangGraph, or plain Python; Qdrant vector DB, SQLite, PostgreSQL, Redis.
After the MVP?
The eval loop keeps quality steady and de-risks updates as you scale.
How to get started
Ready to turn your prompt idea into a product or production-ready feature?
Contact me to schedule an intro call.