loss=0.001 · status=converged

I reduce loss functions
for a living.

Production ML systems, low-latency inference, and the quiet satisfaction of watching p99 drop. Currently deployed at scale.

2+ Years in ML
50M+ Daily Predictions
42% Latency Reduction
99.97% Uptime (and counting)

About

Model Card: bing-tan-v2.6

Architecture
Human (coffee-powered transformer variant)
Training Data
BSc & MSc Business Analytics @ VU Amsterdam (Computational Intelligence track), 2+ years industry ML
Intended Use
Production ML systems, inference optimization, scale
Fine-tuned On
PyTorch, Kubernetes, late-night debugging sessions
Hyperparameters
curiosity=0.95, persistence=0.92, coffee_intake=high
Known Limitations
Cannot resist optimizing things that are "fine"
License
Open to interesting problems

Most ML engineers come from computer science. I came from business analytics. My MSc at VU Amsterdam had me building deep learning models in the Computational Intelligence track, but my undergrad taught me something most engineers skip: why the model matters to the business. I think that's why I obsess over production — a model that doesn't ship is just a notebook.

I believe the best model is the one that's actually in production. Monitoring matters more than accuracy. A model you can't observe is a liability. And I'd rather ship something simple on Monday than something perfect never.

I get unreasonably excited about shaving milliseconds off inference latency, building systems that fail gracefully instead of silently, and the moment when a training loss curve finally bends.

Primary Capabilities

Inference & Serving

Low-latency GPU inference, dynamic batching, model optimization, autoscaling

Training & Pipelines

Distributed training, feature engineering, data pipelines, experiment tracking

MLOps & Reliability

CI/CD for ML, monitoring, A/B testing, gradual rollouts, incident response

How I Think

The best model is the one that's actually in production.
Monitoring > accuracy. A model you can't observe is a liability.
Ship simple on Monday. Ship perfect never.
If your feature store has drift, your model has drift. You just don't know it yet.
The gap between research and production isn't technical — it's operational.
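One way to make the feature-drift line above actionable: a population stability index (PSI) check between a training baseline and live traffic. A minimal sketch with NumPy and synthetic data; the cutoffs asserted here are illustrative (common rules of thumb flag PSI above 0.1 and investigate above 0.25):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live feature sample."""
    # Bin edges come from the baseline distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins to avoid log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 10_000)
drifted = rng.normal(0.5, 1.0, 10_000)  # a mean shift the model never saw

assert psi(baseline, baseline[:5_000]) < 0.05  # same distribution: stable
assert psi(baseline, drifted) > 0.1            # shifted: worth paging someone
```

The point is that the check is cheap enough to run on every serving batch, which is exactly why "you just don't know it yet" is a choice.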

Tech Stack

ML Frameworks

PyTorch · TensorFlow · HuggingFace · scikit-learn

Infrastructure

Kubernetes · Docker · AWS · GCP · Databricks

MLOps & Data

MLflow · Airflow · Spark · dbt

Languages

Python · Rust · C++ · Kotlin · SQL

Case Studies

How We Cut Inference Latency by 42% (Without Losing Sleep)

Our inference platform was handling 50M predictions a day on CPU. Latency was "acceptable" — until a traffic spike turned p99 from 200ms to 2 seconds. The on-call page woke me up at 2am.

By morning I had a prototype: dynamic batching with ONNX Runtime on GPU, predictive autoscaling triggered by Kafka traffic signals, and a fallback to cached predictions when the model was cold. Three weeks later we were in production.
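The core of that prototype is easier to show than to tell. Here's a toy dynamic batcher, pure Python with threads standing in for the GPU queue; the production version fronted ONNX Runtime, and `model_fn` here is a placeholder for the batched forward pass:

```python
import queue
import threading
import time

class DynamicBatcher:
    """Collect requests until the batch is full or a deadline passes,
    then run one batched forward pass for all of them."""

    def __init__(self, model_fn, max_batch=32, max_wait_ms=5):
        self.model_fn = model_fn
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.q = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def predict(self, x):
        slot = {"x": x, "done": threading.Event()}
        self.q.put(slot)
        slot["done"].wait()          # block the caller until the batch runs
        return slot["y"]

    def _loop(self):
        while True:
            batch = [self.q.get()]   # block for the first request
            deadline = time.monotonic() + self.max_wait
            while len(batch) < self.max_batch:
                timeout = deadline - time.monotonic()
                if timeout <= 0:
                    break
                try:
                    batch.append(self.q.get(timeout=timeout))
                except queue.Empty:
                    break
            ys = self.model_fn([s["x"] for s in batch])  # one batched call
            for slot, y in zip(batch, ys):
                slot["y"] = y
                slot["done"].set()

batcher = DynamicBatcher(lambda xs: [x * 2 for x in xs])
assert batcher.predict(21) == 42
```

The deadline is the whole trick: you trade a bounded few milliseconds of queueing for a much larger batch, which is where the throughput multiple comes from.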

p99 dropped 42%. Throughput tripled. And the annual cloud bill got $100k lighter.

42%
p99 Latency Reduction
3.1×
Throughput Increase
$100k
Annual Cost Savings

Fixing Train-Serve Skew Before It Fixed Our OKRs

Feature values were different in training and serving. The model didn't care. Our users did.

Challenge

Feature computation was a bottleneck, and inconsistent values between training and serving were silently degrading model performance in production.

Solution

Built a unified feature store with streaming ingestion, point-in-time correctness, and sub-10ms serving latency.

Tech Stack

Apache Kafka, Redis, Spark Streaming, Feast, Python, Go.
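Point-in-time correctness is the part people get wrong, so here it is in miniature: join each training label against the latest feature value known at that label's timestamp, never a future one. A pandas sketch with made-up data (the production system did this with Feast and streaming backfills, not a DataFrame):

```python
import pandas as pd

# Feature values as they became available over time (illustrative data).
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-01-02"]),
    "avg_spend": [10.0, 12.5, 7.0],
}).sort_values("event_time")

# Training labels, stamped with the time the prediction was made.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "event_time": pd.to_datetime(["2024-01-03", "2024-01-10"]),
    "label": [0, 1],
}).sort_values("event_time")

# merge_asof (default direction="backward") picks the latest feature value
# known at or before each label's timestamp, so nothing leaks from the future.
train = pd.merge_asof(labels, features, on="event_time", by="user_id")

assert train.loc[train.user_id == 1, "avg_spend"].item() == 10.0  # not the later 12.5
```

Training on the naive join (latest value per user) would have used 12.5 for user 1, a value that didn't exist when the label was generated. That's train-serve skew in one line.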

<10ms
Feature Serving Latency
15%
Model Accuracy Improvement
100+
Features in Production

Training Progress

[Chart: training loss over epochs, with milestone annotations: MSc @ VU Amsterdam, first model in prod, 42% latency cut, 50M daily predictions]

Featured Projects

LLM Serving Toolkit

High-performance inference server with continuous batching, KV cache management, and OpenAI-compatible API.

50M+ req/day
PyTorch · Rust · CUDA
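Continuous batching is what makes a req/day number like that possible: finished sequences leave the batch and waiting ones join at every decode step, so the GPU never idles waiting on the slowest request. A toy scheduler to show the shape of it (`step_fn` stands in for the batched decode; KV cache management is omitted):

```python
class Request:
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.budget = max_new_tokens
        self.tokens = []

def continuous_batch(pending, step_fn, max_batch=4):
    """Run decode steps, admitting new requests into freed slots each step."""
    active, finished = [], []
    while pending or active:
        # Admit waiting requests into any free batch slots.
        while pending and len(active) < max_batch:
            active.append(pending.pop(0))
        new_tokens = step_fn(active)          # one batched decode step
        still_active = []
        for req, tok in zip(active, new_tokens):
            req.tokens.append(tok)
            req.budget -= 1
            (finished if req.budget == 0 else still_active).append(req)
        active = still_active
    return finished

# Five requests with different lengths; a static batcher would pad
# everything out to the longest sequence before admitting the fifth.
reqs = [Request(i, max_new_tokens=n) for i, n in enumerate([1, 3, 2, 2, 1])]
done = continuous_batch(reqs, step_fn=lambda batch: [0] * len(batch))
assert sorted(len(r.tokens) for r in done) == [1, 1, 2, 2, 3]
```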

ML Pipeline Framework

Declarative ML pipeline orchestration with automatic versioning, caching, and distributed execution.

200+ daily runs
Python · Ray · K8s

Feature Store

Real-time feature serving with point-in-time correctness and streaming ingestion for ML models.

<10ms p99
Go · Redis · Kafka

Model Monitor

Drift detection, performance monitoring, and alerting system for production ML models.

40+ models tracked
Python · Prometheus · Grafana

View all projects on GitHub →

Writing

Technical deep-dives on ML systems, infrastructure, and lessons from production.

ML Playground

Every ML system needs an API. Here are some endpoints I'd never put in production.

Prediction Endpoints

/predict/shorts

Will you wear shorts today? Based on weather + thermal comfort.
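For flavor, here's roughly what such an endpoint looks like, stdlib only and with an entirely made-up comfort rule (the route table, wind penalty, and 18°C threshold are all illustrative):

```python
def predict_shorts(temp_c: float, wind_kmh: float = 0.0) -> dict:
    # Made-up thermal comfort model: warm enough, minus a wind penalty.
    feels_like = temp_c - 0.3 * wind_kmh
    return {"shorts": feels_like >= 18.0, "feels_like_c": round(feels_like, 1)}

# Toy router mapping playground paths to handlers.
ROUTES = {"/predict/shorts": predict_shorts}

def handle(path: str, **params):
    return ROUTES[path](**params)

assert handle("/predict/shorts", temp_c=24.0, wind_kmh=10.0)["shorts"] is True
assert handle("/predict/shorts", temp_c=15.0)["shorts"] is False
```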


/predict/coffee-need

Do you need coffee right now?


/predict/LLM-sanity

Is your LLM about to hallucinate?


/meme-generator

Generate a random ML meme for your stress level.


/predict/work-from-home-outfit

Perfect remote work outfit based on your day.


/predict/motivation

How motivated will you be today?


/predict/pizza-topping

Optimal pizza topping combo based on your mood and time.


/predict/pet-reaction

Predict your pet's reaction to your new haircut.


/predict/marathon-performance

Predict your marathon finish time (fun version).


/predict/perf-mode-chaos

How chaotic will Perf Mode get today?


Contact

My inference endpoint is always warm. Reach out about ML systems, interesting problems, or your hot take on batch normalization.