Python is slow for ...

Python is slow if you do millions of tiny ops per second. The trading system does not stress every part equally. Some parts of your code are extremely hot, while others are not. For a 15-minute BTC binary trading system, most code is not running millions of times per second. Python takes 100 microseconds or 1 ms for some operations like logging, placing orders, checking balance, updating state, etc. Expensive parts of code like JSON parsing happen thousands of times per second, where Python might hurt latency. Here are some good Python packages that are good out of the box.

orjson
msgspec
uvloop
httpx
websockets
numba
sortedcontainers

Here are all the optimization using packages

# ============================================================
# Optimization packages for a Python trading system
# ============================================================


# ------------------------------------------------------------
# orjson: faster JSON decoding than the standard json library
# ------------------------------------------------------------

import json
import orjson

# Standard library JSON
msg = json.loads(raw)

# Faster Rust-backed JSON
msg = orjson.loads(raw)


# ------------------------------------------------------------
# msgspec: fast structured messages
# Use this instead of Pydantic for hot-path tick/order book data.
# ------------------------------------------------------------

from pydantic import BaseModel
import msgspec

# Slower, better for configs / API validation / non-hot paths
class BookDeltaPydantic(BaseModel):
    price: int
    size: int
    side: str
    seq: int

# Faster, better for market data / ticks / hot paths
class BookDelta(msgspec.Struct):
    price: int
    size: int
    side: str
    seq: int


# ------------------------------------------------------------
# uvloop: replaces asyncio's default event loop with a faster
# libuv-backed event loop
# ------------------------------------------------------------

import asyncio
import uvloop

asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())


# ------------------------------------------------------------
# httpx with HTTP/2 and persistent connections
# Reuse a long-lived client so TCP/TLS connections stay warm.
# ------------------------------------------------------------

import httpx

url = "https://example.com/orders"

# Bad: creates a new client and connection for every order
async def place_order_slow(order):
    async with httpx.AsyncClient() as client:
        return await client.post(url, json=order)

# Better: reuse a long-lived HTTP/2 client
client = httpx.AsyncClient(http2=True)

async def place_order(order):
    return await client.post(url, json=order)


# ------------------------------------------------------------
# numba: useful for math-heavy functions
# ------------------------------------------------------------

from numba import njit

# Normal Python version
def fair_binary_price_python(btc_price, strike, time_left, vol):
    ...

# JIT-compiled version
@njit
def fair_binary_price(btc_price, strike, time_left, vol):
    ...


# ------------------------------------------------------------
# sortedcontainers: useful for maintaining sorted price levels
# in an order book
# ------------------------------------------------------------

from sortedcontainers import SortedDict

bids = SortedDict()
asks = SortedDict()

# Example book updates
bids[55] = 100
bids[56] = 80

asks[57] = 120
asks[58] = 90

# Best bid: highest bid price
best_bid_price, best_bid_size = bids.peekitem(-1)

# Best ask: lowest ask price
best_ask_price, best_ask_size = asks.peekitem(0)

BBA Data in Liquidity and Market Depth Workflows

BBA (best bid and best ask) is basically the smallest “unit” of market microstructure data that updates the top of book. A tight bid-ask spread is often used as an indicator of higher liquidity, while wider spreads can hint at lower liquidity or more volatile conditions (separate conversation though). On crypto platforms, this stream is delivered as WebSocket JSON messages (e.g. the bookTicker message on Binance’s derivatives API). Their Individual Symbol Book Ticker Stream is documented as “real time” updates containing bid/ask price and quantities.

There are two nuance points from the exchange provider side that matter for parser design. I assume that:

Schema can evolve. Binance derivatives changelog says that the all-symbol stream changed update speed while per-symbol remained unaffected, which is a reminder that feed behavior changes do happen.
Streams contain non-ASCII encoded UTF-8.

JSON Parsing Costs

Even when I only need ~2–6 fields from each JSON message, a general-purpose JSON parser still has to do the whole job, including:

Identify structural characters ({, }, :, ,, etc.), distinguish content inside vs. outside strings, and validate encoding + escaping rules. In strict parsers, this includes UTF-8 verification and rejecting invalid code points.
Convert the parsed representation into the runtime’s data model. In Python, that usually means materializing a full object graph of dict, list, str, int, float, bool, and None. orjson.loads() is explicit that it deserializes into those Python types.

Materializing Python objects is frequently the dominant cost when the document is small but the message rate is huge. A useful comparison point is modern “lazy / on-demand” parsing work: instead of building a full DOM/tree for everything, parsers can index structure and allow queries/iteration that only materialize required values, so the program skips irrelevant data faster. When my pipeline only needs {T, b, a, B, A}, parsing into a dict and allocating a str object for every key/value plus intermediate containers is doing more work than I actually need for a quick decision.

`orjson`

orjson is a Rust backend CPython extension. For deserialization, the current build configuration shows it vendors and compiles the C library yyjson (unless explicitly disabled by an environment variable), which is one reason its parse stage can be extremely fast. yyjson is a high-perf C JSON library that’s strict by default and supports custom allocators. The core benefit of orjson.loads() is that:

It deserializes JSON into native Python types and is generally faster than the standard library json (for regular workloads).
It accepts bytes, bytearray, memoryview, and str, and it recommends passing bytes-like input when I already have it to avoid creating a str.
It’s strict about UTF-8 and rejects invalid JSON that the standard library could potentially allow. It also maintains a process-wide key cache for map keys to reduce duplicate string allocations across repeated workloads.
The docs note the GIL is held for the duration of loads(), which limits parallelism if multiple threads are parsing.

orjson is an excellent baseline and makes the general case fast. But it doesn’t change the fundamental fact that I still pay for allocating Python objects for the whole parsed structure.

Specialized Parsing

I can shift the paradigm of what parsing really means. General JSON parsing is typically “validate + build the complete representation.” A specialized feed parser means I locate known keys and parse known numeric formats to return only the values I need. That skips irrelevant data by skipping construction of a full representation, which collapses instruction count and allocation requirements.

Fixed Format TOS Parser#

A specialized parser is faster mainly because it can:

Avoid allocating intermediate containers and str objects.
Avoid general parsing branches (don’t support arbitrary whitespace, arbitrary nesting, etc.) if I can safely assume my upstream data won’t contain those patterns.
Allow CPU-specific optimizations that help branch prediction and prefetching. Using a two-stage “index then consume” strategy (SIMDJSON / on-demand style) is a way to exploit predictable structure and SIMD indexing to reduce branching.

Tradeoffs: the parser will break under whatever the market-data feed gives it. If it’s wrong, it can be worse than being slow. A default strategy is smart here: a pragmatic pattern is fast path → if it fails invariants, fallback. Average latency then becomes something like:

\text{Success Rate} \cdot (\text{Bridge Speed}) + \text{Failure Rate} \cdot (\text{Failure Speed} + \text{orjson Speed})

Here’s the checklist I keep in mind for a cpp parser:

False matches: Don’t just search for b/"b" anywhere — match the actual JSON key pattern ("b"\s*:), so I don’t hit the b inside "bookTicker" or other strings. I should test with adversarial inputs.
Schema drift: Real feeds change (field order, extra fields, whitespace, "b"/"a" switching from strings to numbers). The fast parser should either tolerate this or detect it and bail.
Numbers: My float parser must handle max real precision and guard bounds (e.g., fractional digits > my pow10 table → reject/fallback, not UB/wrong).
UTF-8/escapes: If JSON escaping + UTF-8 validation isn’t fully implemented, I should treat any unexpected escape/byte as reject/fallback, not “best-effort parse.”
Best practice: Fast path parse → if any invariant fails, fall back to a standards-compliant parser (or drop + count).

Cython Wrapping, GIL, Boundary Costs

My current design is to bridge the C++ code using Cython:

PyUnicode_AsUTF8AndSize() returns a pointer to the cached UTF-8 representation of a Python str. That UTF-8 buffer is cached on the string object, so repeated calls return the same pointer. The buffer lifetime is tied to the str object (invalid once the object is garbage-collected).
Because the WebSocket payload is text, I typically start from a Python str. Converting to UTF-8 bytes isn’t inherently “free,” but CPython’s API makes “borrow a const char* UTF-8 view” possible, and it caches that representation to amortize repeated conversions.

The performance advantage of the wrapper is that I cross the Python/native boundary once, parse in native code, and then return a minimal Python representation (tuple like (ts, bid, ask) or (ts, bid, bid_sz, ask, ask_sz)). That’s structurally different from orjson.loads(), which must return a full dict and all child objects.

When using Cython, one improvement worth considering (especially if I want parsing on worker threads) is restructuring the wrapper so that:

I call PyUnicode_AsUTF8AndSize() while holding the GIL (required),
then I release the GIL around the pure C++ parse (no Python API calls),
then reacquire the GIL only to build the return tuple.

This matters because orjson.loads() explicitly holds the GIL during parsing. A custom extension can sometimes do better for multi-threaded ingestion if it avoids Python object work in the hot section.

Compiler Speedups#

Using -O3 and CPU-specific flags can help, but portability becomes a real constraint once I want to distribute wheels across heterogeneous fleets. As an example, orjson explicitly discusses wheel baseline compatibility and runtime acceleration (AVX-512 where available), which is the general direction I’d copy if I intend to ship binaries broadly rather than compile on every machine.

Rust Adjacent: `pyO3/maturin`

There are two Rust-adjacent paths that fit with my cpp design because they preserve the core idea: don’t build the Python dict when deserializing. maturin is designed to build and publish Rust-based Python packages for pyo3. pyo3 provides Rust bindings for the Python interpreter and is used to create native Python modules. Also, pyo3 provides an explicit way to detach and release the GIL while running Rust-only code (then reacquire when needed).

Here are additional designs I’ve looked at:

Typed deserialization: use serde_json to deserialize into a struct, which can be faster than deserializing into a generic dict-like representation because field mapping is fixed and avoids dynamic work.
SIMD-accelerated JSON crates: simd-json is a Rust port of simdjson with serde compatibility. sonic-rs is another crate with SIMD-driven performance. Tradeoffs depend on workload, mainly around struct deserialization.

Benchmarking

To measure perf.

keep I/O out of the timed region,
isolate numeric conversion cost,
run with pyperf to reduce noise and report distributions.

I also call out CPU flags and SIMD behavior separately because results differ across CPUs. It’s probably good to look into C++ “fast float” logic around the exact decimal formats I actually observe in captured messages.

Additional Information#

GIL#

Python 3.13 and 3.14 have a free-threaded build, basically a no-GIL Python. This allows threads to run in parallel on a multi-core machine. If you install regular Python, threads won't speed up CPU-bound code. A GIL is the Global Intrepreter Lock. In CPython, the GIL is a mutex/lock that ensures only one thread executes Python bytecode at a time within a single process. It exists because CPython's memory management is simpler and faster when protected by one global lock.