Ulf Aslak / works

I built netforensics for a one-day hackathon — a decision-lab pack that takes any labeled graph and tells you whether you can actually deploy a classifier on it, or whether the numbers you'd publish are an artifact of how you split the data. To stress-test it, I pointed it at the Elliptic Bitcoin dataset — the most-cited benchmark for cryptocurrency fraud detection. The pack reproduced, in about 20 minutes and under a dollar in agent calls, a methodological problem that several papers in the GNN literature have shipped right past.

The result, in plain terms: under valid evaluation, a boring gradient-boosted tree on tabular features beats the best graph neural network we tried by about 20 F1 points. The famous "near-perfect" numbers on this dataset are inflated by about 18 F1 points by a temporal leakage trap that random train/test splits walk straight into. And once you remove the train-time leakage from the GNNs as well, the edge-shuffle ablation flips direction: under honest training, GraphSAGE actually performs better on a degree-preserving random graph than on the real Bitcoin transaction graph. None of this is news if you've read the right papers; the demo is that the pack catches it without anyone telling it to look.

Why bother making this a pack?

Most supervised graph problems share the same handful of failure modes, and most of them are about evaluation rather than modeling. The pack encodes five things that should be on by default for any graph-with-labels problem, and that are almost never all on at once:

Always benchmark against a no-graph baseline (XGBoost on the raw features).
Always evaluate under the distribution shift the model will face at deployment. If the data has a time axis, split on time. Never randomly.
Always run an edge-shuffle ablation. Rewire edges while preserving degree. If performance doesn't drop, the graph isn't doing work.
Always report metrics that survive class imbalance. F1 on the positive class, precision-at-K. A headline AUROC is misleading when fewer than 5% of the labels are positive.
Always run multiple seeds and report variance.
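To make the fourth item concrete, here is a minimal sketch of the two imbalance-aware metrics named above, in plain Python. The function names and the toy data are illustrative assumptions, not the pack's API, and the example uses accuracy rather than AUROC as the flattering headline number because it is the simplest one to compute by hand.

```python
def f1_positive(y_true, y_pred, positive=1):
    """F1 score computed on the positive (here: illicit) class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def precision_at_k(y_true, scores, k, positive=1):
    """Fraction of true positives among the k highest-scored items."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for i in top if y_true[i] == positive) / len(top)


# Toy data (hypothetical): 2 positives out of 20.
y_true = [1, 1] + [0] * 18

# A useless classifier that predicts "negative" for everything.
y_pred = [0] * 20
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                     # 0.9
print(f1_positive(y_true, y_pred))  # 0.0

# A ranking that puts one positive and one negative in the top two.
scores = [0.9, 0.1, 0.8] + [0.0] * 17
print(precision_at_k(y_true, scores, k=2))  # 0.5
```

The all-negative classifier looks fine on accuracy and scores zero on positive-class F1, which is the kind of disagreement the diagnostic is meant to surface.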

Pointed at a dataset, the pack drives an LLM orchestrator through each of these as a sequence of tool calls, then writes a business report and a technical report. The framing is methodological rigor that holds across every graph ML problem, not a checker for any specific dataset — but the rigor is precisely what catches Elliptic.

What this network actually is.

Each node is a Bitcoin transaction. Each edge is a flow of Bitcoin from one transaction's output into the next transaction's input. Transactions are labeled by the real-world entity behind them: licit (exchanges, wallet providers, miners, licit services) or illicit (scams, malware, ransomware, ponzis, terrorist organizations). 77% are unlabeled. The task is to predict the label of the unlabeled ones from the structural and feature-level fingerprints of the labeled subset.

At the dataset level: 203,769 transactions, 234,355 payment edges, 165 features per transaction, spread across 49 sequential snapshots (≈ 2 weeks each). About 2.2% of nodes are illicit, 20.6% licit, 77% unlabeled. Crucially, every edge lives inside a single timestep, so the graph is a sequence of snapshots with no edges between them, not a single evolving graph. That structural detail is what makes the leakage trap so tempting: the snapshots look independent enough that a random node-level split feels fine, but the dataset has very strong temporal regime changes (a darknet market closes around timestep 43 and the illicit base rate halves).

Below is a single snapshot — time step 32, with 4,525 transactions and 342 of them labeled illicit. Red = illicit (scams/ransomware/etc.), green = licit (exchanges/miners/etc.), grey = unlabeled. Flip the toggle to color the unlabeled grey ones by what the XGBoost classifier predicts. To scrub through all 49 timesteps, open the full viz below.

Open the full interactive viz (all 49 timesteps, brush + prediction toggle) →

The shape of the data, once you've watched it for a minute, makes the rest of the story almost inevitable. A model trained on snapshots 1–34 and evaluated on snapshots 35–49 is being asked to predict in a different regime than the one it learned. A model trained on a random sample of nodes from across the whole timeline is being shown the future during training.
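The difference between the two splits can be sketched in a few lines. This is an illustration under stated assumptions, not the pack's code: the helper names are hypothetical, the toy timeline has 10 nodes per timestep, and the cutoff of 34 simply mirrors the 1–34 / 35–49 example above.

```python
import random


def temporal_split(timesteps, cutoff):
    """Train on nodes with timestep <= cutoff, test on strictly later ones."""
    train = [i for i, t in enumerate(timesteps) if t <= cutoff]
    test = [i for i, t in enumerate(timesteps) if t > cutoff]
    return train, test


def random_split(n_nodes, test_fraction, seed):
    """Node-level random split that ignores time entirely."""
    idx = list(range(n_nodes))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n_nodes * test_fraction))
    return idx[n_test:], idx[:n_test]


def sees_future(timesteps, train, test):
    """True if some training node is later than the earliest test node."""
    return max(timesteps[i] for i in train) > min(timesteps[i] for i in test)


# Toy timeline (hypothetical): 10 nodes in each of 49 timesteps.
timesteps = [t for t in range(1, 50) for _ in range(10)]

train_t, test_t = temporal_split(timesteps, cutoff=34)
train_r, test_r = random_split(len(timesteps), test_fraction=0.3, seed=0)

print(sees_future(timesteps, train_t, test_t))  # False
print(sees_future(timesteps, train_r, test_r))  # True for any realistic shuffle
```

The random split is not wrong as code; it simply answers a different question from the one deployment asks.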

The leakage gap.

Two evaluation protocols, same dataset, same model (XGBoost on all 165 features), five seeds with cutoff jitter on the temporal partition. Only the train/test split changes:

Protocol	F1 (illicit), median [min–max]	What it measures
Transductive (random split)	0.959 [0.958–0.960]	What most papers report
Temporal (train past, test future)	0.780 [0.771–0.783]	What deployment actually looks like

A gap of about 0.18 F1 (18 points), or roughly 19% of the random-split score. That is the cost of a single bad assumption about which split to use. The pack also supports an inductive split (held-out connected components), but that protocol ignores the temporal axis and isn't an honest deployment estimate when the data has timestamps, so it's skipped by default on Elliptic.
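For readers who want to check the arithmetic, the gap can be expressed three ways using only the two medians from the table. Which denominator counts as "relative" is a choice, so the sketch prints both.

```python
transductive_f1 = 0.959  # median, random split (from the table above)
temporal_f1 = 0.780      # median, temporal split (from the table above)

gap = transductive_f1 - temporal_f1

# Absolute gap, in F1 and in points.
print(f"absolute gap: {gap:.3f} F1 ({gap * 100:.1f} points)")
# Relative to the random-split score: how much of the headline disappears.
print(f"share of the random-split score: {gap / transductive_f1:.1%}")
# Relative to the temporal score: how inflated the headline is.
print(f"inflation over the temporal score: {gap / temporal_f1:.1%}")
```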

Then the embarrassing part for the GNN literature.

Under the honest temporal protocol, with strict edge constraints (training-time message-passing restricted to train-set edges, so the GNN can't peek at the future graph either), the model rankings flip:

Model	F1 (illicit), median [min–max]	Stack
XGBoost (tabular)	0.780 [0.771–0.783]	CPU, <1 s per seed
GraphSAGE	0.584 [0.544–0.596]	CPU PyG, ~100 s per seed
GCN	0.320 [0.302–0.337]	CPU PyG, ~100 s per seed

The tabular baseline is 20 F1 points ahead of the best GNN. The dataset comes with 71 pre-computed 1-hop neighborhood aggregates already baked into the feature vector, and those aggregates eat most of the structural signal the GNNs would otherwise have to learn.
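The strict edge constraint mentioned above amounts to filtering the edge list before training. A minimal sketch, assuming edges are stored as pairs of node ids; `strict_train_edges` is a hypothetical name, not a function from the pack or from PyG.

```python
def strict_train_edges(edges, train_nodes):
    """Keep only edges whose two endpoints are both training nodes.

    Message passing during training then never touches a test-region node.
    """
    train = set(train_nodes)
    return [(u, v) for u, v in edges if u in train and v in train]


# Toy chain 0 -> 1 -> 2 -> 3 -> 4, with nodes 0..2 in the training set.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
train_nodes = [0, 1, 2]
print(strict_train_edges(edges, train_nodes))  # [(0, 1), (1, 2)]
```

At evaluation time the full edge list can be used again, since by then the test-region graph is what the model would actually observe.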

The edge-shuffle ablation puts a finer point on it. Train each GNN on the real graph and again on a degree-preserving randomized version, then look at the F1 gap:

Model	F1 gap (real − shuffled), median [min, max]	What it means
GCN	+0.13 [+0.11, +0.14]	Real graph helps GCN a little
GraphSAGE	−0.07 [−0.08, −0.04]	Real graph hurts GraphSAGE

That negative number for GraphSAGE is the moment the methodology earns its keep. Under leakage-free training, GraphSAGE trains better on a random graph that only matches the degree distribution than on the actual Bitcoin transaction graph — by 7 F1 points, consistently across all five seeds. The structure in the real graph is, for this model, anti-informative. GCN does extract a small positive signal from the real topology (+0.13), but its absolute F1 of 0.32 is so far below the tabular baseline that the improvement is academic.
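For reference, a degree-preserving shuffle can be written as repeated target swaps on a directed edge list. This is a sketch of the general technique, assuming a simple graph (no duplicate edges) stored as pairs of node ids; it is not necessarily how the pack implements its ablation, and the function name is hypothetical.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Rewire a directed edge list by repeated target swaps.

    Each accepted swap turns (a, b), (c, d) into (a, d), (c, b), so every
    node keeps its in-degree and out-degree. Swaps that would create a
    self-loop or a duplicate edge are rejected.
    """
    rng = random.Random(seed)
    edges = list(edges)
    if len(edges) < 2:
        return edges
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new_1, new_2 = (a, d), (c, b)
        if a == d or c == b:
            continue  # would create a self-loop
        if new_1 in present or new_2 in present:
            continue  # would create a duplicate edge
        present.discard((a, b))
        present.discard((c, d))
        present.add(new_1)
        present.add(new_2)
        edges[i], edges[j] = new_1, new_2
    return edges


# Toy graph (hypothetical): a 12-node ring plus a chord from every node.
toy_edges = [(i, (i + 1) % 12) for i in range(12)]
toy_edges += [(i, (i + 3) % 12) for i in range(12)]
shuffled = degree_preserving_shuffle(toy_edges, n_swaps=200, seed=1)

out_before = Counter(u for u, _ in toy_edges)
out_after = Counter(u for u, _ in shuffled)
in_before = Counter(v for _, v in toy_edges)
in_after = Counter(v for _, v in shuffled)
print(out_before == out_after and in_before == in_after)  # True
```

The ablation then trains the same model on `toy_edges` and on `shuffled` and compares F1; whatever difference remains cannot come from degree alone.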

An earlier, sloppier version of this same pack ran the edge-shuffle without forcing strict-edges in the train-time message-passing — i.e., the GNN got to see test-region edges during training in both the real and the shuffled runs. In that leaky regime, GraphSAGE looked like it was using the graph: shuffling the edges cost it several F1 points. The "graph helps" reading was an artifact of leakage breaking down when the edges were rewired, not of genuine graph signal. Once the ablation is run honestly, the direction reverses. The leakage trap had two layers, not one.

Where the signal actually lives.

Feature ablation, all XGBoost, honest temporal protocol, five seeds:

Features	F1 (illicit), median [min–max]
All 165 features	0.780 [0.771–0.783]
Raw 94 per-transaction features only (no graph access)	0.704 [0.704–0.712]
Topology only (7 graph statistics: degree, PageRank, clustering, ...)	0.163 [0.153–0.165]

Roughly 90% of the deployable signal is in per-transaction metadata. The pre-computed 1-hop neighborhood aggregates contribute another ~8 F1 points on top — not via learned message-passing, just via tabular features that happen to summarize the local graph. Stripping features to topology alone collapses the model. For the operational question — do we need a graph database, a GPU, and a PyTorch Geometric pipeline — the answer is no. Compute the aggregates as a nightly batch job and ship the boring model.
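As a rough illustration of what "compute the aggregates as a batch job" could look like, here is a sketch that averages each feature over a node's 1-hop neighbours. The choice of the mean, the undirected treatment of edges, and the function name are all assumptions for the example; this post does not specify which statistics the dataset's 71 aggregate columns contain.

```python
from collections import defaultdict


def one_hop_mean_aggregates(features, edges):
    """Mean of each feature over a node's neighbours (edges treated as undirected).

    features: dict mapping node -> list of floats (all the same length)
    edges: iterable of (u, v) pairs
    Nodes with no neighbours get zeros.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    aggregates = {}
    for node, own in features.items():
        nbrs = [n for n in neighbours[node] if n in features]
        if not nbrs:
            aggregates[node] = [0.0] * len(own)
            continue
        aggregates[node] = [
            sum(features[n][k] for n in nbrs) / len(nbrs)
            for k in range(len(own))
        ]
    return aggregates


# Toy data (hypothetical): three transactions with two features each.
features = {"a": [1.0, 10.0], "b": [3.0, 30.0], "c": [5.0, 50.0]}
edges = [("a", "b"), ("a", "c")]
agg = one_hop_mean_aggregates(features, edges)
print(agg["a"])  # [4.0, 40.0]
print(agg["b"])  # [1.0, 10.0]
```

The output is just more columns for the tabular model, which is why no message-passing framework is needed at inference time.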

What is and isn't the point.

The point is not that GNNs are bad, or that Elliptic is a trap. The point is that with five always-on diagnostics, you get a defensible answer or a clear statement of why a defensible answer isn't possible yet. The pack didn't know anything about Elliptic; it ran the same protocol it would run on a Reddit interaction graph or an arXiv co-authorship network. The Elliptic finding falls out as a side effect.

I keep thinking about how many graph ML pipelines ship without any one of these diagnostics, never mind all five. Random split, single seed, AUROC headline, no edge-shuffle, no tabular baseline. The model could be doing nothing — or worse than nothing, as the GraphSAGE edge-shuffle reversal here shows — and you'd never know. A 20-minute, sub-dollar agent run is a cheap way to know.

Pack lives at pymc-labs/decision-lab. Point it at a folder with features.csv, edges.csv, and labels.csv:


dlab --dpack decision-packs/netforensics \
  --data path/to/your/graph \
  --env-file .env \
  --work-dir ./my-run \
  --prompt "Evaluate whether this dataset supports a [your task here] detector."

It will run, and tell you something true.