AI Engineering
Prototype. Error Analysis. MVP. Eval Loop.
I’ll help you grow strong prompt ideas into reliable AI features and apps — pragmatically, with a clear structure.
Why work with me
I’m a full-time Cloud Service Designer who builds AI solutions on the side (~10 hours/week). I’ll help you build your AI solution without agency overhead: fast prototypes, error analysis, measurable quality, and a clean path to production.
What I deliver
- Proof of Concept: a runnable app in days to validate whether AI can solve the task.
- Error Analysis: I help you set up tracing, build a custom eval app, and facilitate open/axial coding so your experts can label failure modes (a minimal labeling sketch follows this list). You own the calls; I make error analysis fast and repeatable. Error analysis is the highest-ROI activity in AI engineering.
- MVP Consulting: I help you turn the prototype into an MVP app or integrated feature and set up tracing to kickstart the eval loop.
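To make the labeling workflow concrete, here is a minimal sketch of how labeled traces could be structured in plain Python. The field names and failure-mode codes are illustrative assumptions, not a fixed schema; the real structure comes out of the open/axial coding session with your experts.

```python
# Minimal sketch of a labeled trace record for error analysis.
# Field names and failure modes are illustrative assumptions, not a fixed schema.
from dataclasses import dataclass
from collections import Counter


@dataclass
class LabeledTrace:
    trace_id: str                     # ID of the logged LLM call
    user_input: str                   # what the user asked
    model_output: str                 # what the model returned
    passed: bool                      # expert judgment: acceptable or not
    failure_mode: str | None = None   # open/axial code, e.g. "missed_constraint"
    notes: str = ""                   # free-text comment from the labeling session


def failure_mode_counts(traces: list[LabeledTrace]) -> Counter:
    """Aggregate labeled failures so the most frequent root causes surface first."""
    return Counter(t.failure_mode for t in traces if not t.passed)


if __name__ == "__main__":
    traces = [
        LabeledTrace("t-001", "Summarize the contract", "…", False, "missed_constraint"),
        LabeledTrace("t-002", "Extract invoice total", "…", True),
        LabeledTrace("t-003", "Summarize the contract", "…", False, "hallucinated_clause"),
    ]
    print(failure_mode_counts(traces).most_common())
```

Counting failure modes like this is what turns a pile of traces into a prioritized fix list.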
How I work
1) Build Prototype (Python)
Define the use case and data flow, ship a working prototype with explicit assumptions, and capture first metrics (accuracy, latency, cost per request).
2) Set Up Error Analysis
Create a labeled evaluation set (golden set), map error types, prioritize root causes; fix iteratively (prompt rewrites, tools, guardrails, retrieval tuning).
3) Launch MVP
Embed into a simple, usable flow (web/app/workflow), add observability (logs, traces, alerts), align on SLAs & success metrics.
4) Set Up Eval Loop
Create automated LLM-as-judge evaluations that run as regression tests before each prompt or model update, and maintain a live benchmark for continuous improvement (see the sketch below).
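As a rough illustration of step 4, the sketch below shows what an LLM-as-judge regression check over a golden set can look like. It assumes the OpenAI Python SDK; the judge model, judge prompt, and 90% pass threshold are placeholder assumptions to be tuned per project.

```python
# Rough sketch of an LLM-as-judge regression check over a golden set.
# Judge model, prompt, and pass threshold are assumptions; tune them per project.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a question, a reference answer, and a "
    "candidate answer, reply with exactly PASS if the candidate is factually "
    "consistent with the reference and answers the question, otherwise FAIL."
)


def judge(question: str, reference: str, candidate: str) -> bool:
    """Ask the judge model whether the candidate answer is acceptable."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"Question: {question}\nReference: {reference}\nCandidate: {candidate}"
            )},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")


def regression_test(golden_set: list[dict], generate, threshold: float = 0.9) -> bool:
    """Run the candidate system over the golden set and gate on the pass rate."""
    passed = sum(
        judge(item["question"], item["reference"], generate(item["question"]))
        for item in golden_set
    )
    pass_rate = passed / len(golden_set)
    print(f"pass rate: {pass_rate:.0%} (threshold {threshold:.0%})")
    return pass_rate >= threshold
```

Wired into CI, a gate like this blocks a prompt or model update whenever the pass rate drops below the agreed threshold.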
Engagement formats
- Prototype Sprint (1–2 weeks): working demo + first metrics.
- MVP Launch Sprint (2–4 weeks): user flow, observability, go-live checklist.
- Eval Care (ongoing, light-touch): scheduled regression tests, prompt library upkeep, KPI reviews.
I work evenings and selected blocks during the week (Europe/Berlin) with planned check-ins.
What you’ll get
- Tech setup: Python codebase, structured prompt library, evaluation scripts
- Artifacts: architecture sketch, metrics sheet, risk/fallback plan
- Ops: logging, monitoring, cost-sensible defaults, rollback strategy
- Handover: concise playbooks and a transfer session for your team
Security & Architecture
Your data stays under your control, on your systems.
Is this a fit?
- You want a small, focused engagement with measurable outcomes.
- You value concierge consulting over big teams.
- You’re fine with a roughly 10 h/week cadence and clear milestones.
FAQ
Do you work alone?
Mostly. When it adds speed or depth, I pull in specialists, e.g. for data engineering or front-end work. I stay the lead and keep scope transparent.
Do we need our own data?
Not necessarily, but real data is always preferred. We can begin with public or synthetic seed data while we assemble a golden set for robust evaluations.
Which tools do you use?
Python stack; web, CLI, or local UI apps; OpenAI AgentKit; OpenAI, Claude, and Gemini models; Pydantic AI, LangChain, LangGraph, or plain Python; Qdrant vector DB, SQLite, PostgreSQL, Redis.
After the MVP?
The eval loop keeps quality steady and de-risks updates as you scale.
How to get started
Ready to turn your prompt idea into a product or production-ready feature?
Contact me to schedule an intro call.