From the speaker who got kicked off the stage after 54 minutes of his 45-minute PyParallel talk at PyData NYC 2013, comes a new talk foaming about the virtues of Python’s new free-threaded support!
2025-11-08
Background: Greg Stein's free-threading patch, Greg's free-threading email in 2001, Guido opining on GIL removal in "It isn't Easy to Remove the GIL", and Dave Beazley's review 15 years later, "An Inside Look at the GIL Removal Patch of Lore".
Simple Example
Real World Example
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

def do_work(chunk):
    ...

work = [...]  # Some list of work items.
errors = []
results = []
max_workers = min(os.cpu_count(), len(work))
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = {
        executor.submit(do_work, item): item
        for item in work
    }
    for future in as_completed(futures):
        try:
            result = future.result()
            results.append(result)
        except Exception as e:
            errors.append(e)

Setting up a free-threaded Python 3.14 environment with conda:

conda create -n py314 python=3.14 python-freethreading
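A quick sanity check that the environment really is free-threaded (an aside, not part of the original example):

import sys
import sysconfig

# 1 when this CPython was built as a free-threaded build.
print(sysconfig.get_config_var('Py_GIL_DISABLED'))

# False when the GIL is actually disabled at runtime (Python 3.13+).
print(sys._is_gil_enabled())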
The model: model_19072.pt, a locally trained GPT-2 (124M) checkpoint produced with build-nanogpt's train_gpt2.py. The classes involved: CausalSelfAttention, MLP (multi-layer perceptron), Block, and GPT (including GPT.generate()).

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # Key, query, value projections for all heads, but in a batch.
        self.c_attn = NoInitLinear(config.n_embd, 3 * config.n_embd)
        # Output projection.
        self.c_proj = NoInitLinear(config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1
        # Regularization.
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        # Batch size, sequence length, embedding dimensionality.
        B, T, C = x.size()
        # Calculate query, key, values for all heads in batch and move head
        # forward to be the batch dim.
        #
        # N.B. nh is "number of heads", hs is "head size", and C (number of
        # channels) is nh * hs. E.g. in GPT-2 (124M), n_head=12, hs=64,
        # so nh*hs=C=768 channels in the Transformer.
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        head_dim = C // self.n_head
        # (B, nh, T, hs)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # Flash attention.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Re-assemble all head outputs side by side.
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # Output projection.
        y = self.c_proj(y)
        return y
# Multi-Layer Perceptron.
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = NoInitLinear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = NoInitLinear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x
class GPT(nn.Module):
    ...

    def generate(
        self, text: str, max_length: int = 1024, top_k: int = 50,
        seed: int = None, save_rate: callable = None
    ) -> str:
        """
        Generate text from the model.

        Args:
            text (str): Supplies the prompt to condition on.
            max_length (int): Maximum total length (prompt + generated).
            top_k (int): Number of tokens to consider at each generation step.
            seed (int): Optionally supplies the manual seed to use for the
                generator.  If None, the model's manual seed will be used.
            save_rate (callable): Optionally supplies a callable that will be
                called with the tokens per second rate.

        Returns:
            str: The generated text (including the initial prompt).
        """
        enc = self.enc
        device = self.device
        stop_token = self.stop_token

        # Encode prompt -> tensor of shape (1, T).
        tokens = enc.encode(text)
        x = torch.tensor(
            tokens,
            dtype=torch.long,
            device=device
        ).unsqueeze(0)

        # Create a random generator for reproducibility.
        sample_rng = torch.Generator(device=device)
        if seed is None:
            seed = self.manual_seed
        sample_rng.manual_seed(seed)

        output = []

        # Generate tokens up to our max length, or until we hit the stop
        # token.
        start = time.perf_counter()
        count = 0
        while x.size(1) < max_length:
            count += 1
            with torch.no_grad():
                # Forward pass, ignoring the returned loss.
                (logits, _) = self(x)

            # Take the logits at the last time-step (shape: (1, vocab_size)).
            logits = logits[:, -1, :]

            # Convert to probabilities.
            probs = F.softmax(logits, dim=-1)

            # Top-k sampling.
            topk_probs, topk_indices = torch.topk(probs, k=top_k, dim=-1)

            # Sample the next token.
            next_idx = torch.multinomial(
                topk_probs,
                num_samples=1,
                generator=sample_rng,
            )
            next_token = torch.gather(topk_indices, -1, next_idx)  # (1, 1)

            # If the next token is the stop token, we're done.
            next_token_item = next_token.item()
            if next_token_item == stop_token:
                break

            # Append token to current sequence.  Although we only yield a
            # singular decoded token below, we still need to keep track of
            # the entire sequence for subsequent generation steps.
            x = torch.cat((x, next_token), dim=1)

            # Decode the newly-generated token.
            new_text_fragment = enc.decode([next_token.item()])

            # If the next token isn't printable, terminate generation.  (With
            # our locally-trained GPT2 124M model, this happens quite often.)
            if not all(c in self.printable for c in new_text_fragment):
                break

            output.append(new_text_fragment)

        end = time.perf_counter()
        elapsed = end - start
        tokens_per_sec = float(count) / elapsed
        if save_rate:
            save_rate(tokens_per_sec)
        msg = (
            f'Generated {count} tokens in {elapsed:.2f} seconds '
            f'({tokens_per_sec:.2f} tokens/sec)'
        )
        logging.info(msg)
        return text + ''.join(output)
>>> print(repr(model))
GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50304, 768)
    (wpe): Embedding(1024, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='tanh')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50304, bias=False)
)

Albert Einstein’s Theory of Relativity stated that the speed of light was approximately 10 000 of parsecs, whereas quantum physicists have suggested that, as we move further into the universe, the universe might grow older. The new experiment, conducted by researchers at the University of New Jersey, New York, and the University of California, Berkeley shows that photons travelling at the speed of light will be around 30 to 65 kilometres per second.
Albert Einstein’s Theory of Relativity stated that rosterExc Willis occasional297 coveted narrowerggle antibioticleyVG}; sentencesble defenderWrit382ooooooteen Phone368 painting appointedExc Strawberry endorsementsfrequencyatographycesbyssDrDr photoDoug bargain weeds belongings drain effectiveness Ron toyVG summarized discrete adaptingmetry raysrethmareinel Placesinqu Killed hotline Property Conc,plin RadeonCHR grippedcommunityICspread relentless 1886 nat natmoremoreInstructasin368 rays f&%#@&# FRI archaic everybody psychiatrists effectiveness Rudduedworldly Cul messenger Cou mark mark Breakfast reincarn alienatedinately deepestiana induction resign effectiveness sucks 153chelladdin UFC psychiatrists targeted excellent seals psychiatrists Ud depended Fibbrook preced contributors
model = GPT.from_local_pretrained('model_19072.pt', map_location='cuda')

prompt = "Albert Einstein's Theory of Relativity stated that"

def generate(seed):
    return model.generate(prompt, seed=seed)

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [
        executor.submit(generate, seed)
        for seed in range(8)
    ]
    for future in as_completed(futures):
        print(future.result())

Calling model.generate() from multiple threads
A /generate-esque GET endpoint for doing inference
asyncio Python libraries
Setup
Launch UI
Launch Server
asyncio-based PyTorch model generation routine that yields a single (decoded) token at a time
Leverages HTTP chunked encoding for the streaming effect
Transfer-Encoding: chunked header
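For a rough idea of what that looks like on the wire, here's a minimal sketch (not the actual Parallelopedia server; the toy generate_tokens() below stands in for the async, one-token-at-a-time model generation routine):

import asyncio

async def generate_tokens():
    # Stand-in for the model's async generation routine, which yields one
    # decoded token at a time.
    for token in ('The quick brown fox', ' is', ' a', ' sub'):
        await asyncio.sleep(0)   # Pretend work.
        yield token

async def handle_generate(reader: asyncio.StreamReader,
                          writer: asyncio.StreamWriter) -> None:
    await reader.readuntil(b'\r\n\r\n')   # Consume request line + headers.
    writer.write(
        b'HTTP/1.1 200 OK\r\n'
        b'Content-Type: text/plain\r\n'
        b'Transfer-Encoding: chunked\r\n'
        b'\r\n'
    )
    async for token in generate_tokens():
        data = token.encode('utf-8')
        # Each chunk is: <hex length>\r\n<payload>\r\n
        writer.write(f'{len(data):X}\r\n'.encode() + data + b'\r\n')
        await writer.drain()
    writer.write(b'0\r\n\r\n')             # Zero-length terminating chunk.
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle_generate, '0.0.0.0', 4444)
    async with server:
        await server.serve_forever()

The hex chunk sizes and the trailing 0 are exactly what shows up in the raw netcat output below.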
Normal curl:
But if we pipe a manual HTTP GET request via netcat:
We can see the chunked response (without curl reassembling it):
HTTP/1.1 200 OK
Server: Parallelopedia Web Server v1.0
Date: Fri, 07 Feb 2025 23:32:02 GMT
Accept-Ranges: bytes
Content-Type: text/plain
Access-Control-Allow-Origin: *
Connection: close
Transfer-Encoding: chunked
Access-Control-Expose-Headers: X-Max-Length, X-Top-K, X-Seed, X-Model-Name, X-Model-Device
X-Max-Length: 20
X-Top-K: 50
X-Seed: 42
X-Model-Name: gpt2
X-Model-Device: cuda:0
13
The quick brown fox
3
is
2
a
4
sub
7
species
5
that
B
originated
3
in
9
southern
9
Scotland
3
as
2
a
8
variety
3
of
4
fox
1
.
5
This
0

A tmux session of six boxes running curl against the /generate endpoint simultaneously (via tmux :synchronize-panes); btop running in the background; and a foreground wrk load test session that uses 14 threads to issue back-to-back requests for 30 seconds (i.e. as fast as possible).
Latency Distribution
Requests Per Second
Add the @torch.compile() decorator to your generate() routine, and voilà: usually a big speed boost (assuming no graph breaks, etc.).
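In sketch form (mirroring the class stub above; whether generate() compiles cleanly depends on avoiding graph breaks in the sampling loop, hence the caveat):

import torch
from torch import nn

class GPT(nn.Module):
    ...

    # As described above: decorate the generation routine.  (Alternatively,
    # compile just the forward pass, e.g. model.forward =
    # torch.compile(model.forward), to keep the Python-level sampling loop
    # out of the compiler's way.)
    @torch.compile()
    def generate(self, text: str, max_length: int = 1024, top_k: int = 50,
                 seed: int = None, save_rate: callable = None) -> str:
        ...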
async def generate() routine
multiprocessing
For every production, public-facing ChatGPT-style chatbot website, there are probably hundreds of thousands of internal apps teams use to get their work done.
Common pattern I observed for Python projects in my consultancy days:
In the multiprocessing days, each process would need its own copy of all the read-only reference data: huge memory overhead, and huge startup cost because each Python process had to load everything serially. Python free-threading is perfect for these use cases!
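The free-threaded shape of that pattern, roughly (the reference data and lookup here are toy placeholders for whatever the app actually loads):

from concurrent.futures import ThreadPoolExecutor

def load_reference_data():
    # Placeholder for the expensive, read-only reference data an internal
    # app typically needs (think: many GB of dicts/arrays/tries).
    return {'widget-123': 'frobnicate'}

# Loaded exactly once, in one process; every worker thread shares it.
REFERENCE = load_reference_data()

def handle(query):
    # Pure reads against shared data: no fork, no pickling, no per-process
    # copies; on a free-threaded build these lookups run in parallel.
    return REFERENCE.get(query)

queries = ['widget-123'] * 1000
with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(handle, queries))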
mmap
datrie module (which wraps the C library libdatrie)
The “trie” for the prefix search is actually 83 tries:

(Ordinal of first character, number of titles)
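Conceptually, the prefix search just picks the trie for the first character and lets datrie do the rest. A sketch with toy data (the real server loads 83 pre-built tries; the offsets below are the real ones from the JSON output further down):

import string
import datrie

# One trie per first-character ordinal, mapping title -> (start, end) byte
# offsets.
TRIES: dict[int, datrie.Trie] = {}

def add_title(title: str, start: int, end: int) -> None:
    trie = TRIES.setdefault(ord(title[0]), datrie.Trie(string.printable))
    trie[title] = (start, end)

def titles_starting_with(prefix: str):
    trie = TRIES.get(ord(prefix[0]))
    if trie is None:
        return []
    # datrie supports prefix-constrained iteration over (key, value) pairs.
    return [(title, *offsets) for title, offsets in trie.items(prefix)]

add_title('NVIDIA CUDA Compiler', 44569709658, 44569713061)
add_title('NVIDIA Corporation', 651080622, 651081295)
print(titles_starting_with('NVIDIA C'))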
Title byte offsets are kept in a NumPy array (title_offsets.npy, ~120MB). We use offsets.searchsorted() to find the absolute index (binary search, O(log n)); offsets[index+1] gives us the next article's start offset, so we can infer our end offset from that. Once we have the start:end byte range:

article = WIKI_XML_MMAP[start:end]
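Put together, the offset lookup and the mmap slice look roughly like this (a sketch; the TITLE_OFFSETS name and the wiki XML file name are assumptions, while WIKI_XML_MMAP and title_offsets.npy come from the description above):

import mmap
import numpy as np

TITLE_OFFSETS = np.load('title_offsets.npy')   # Sorted article start offsets.
wiki_xml = open('enwiki-pages-articles.xml', 'rb')
WIKI_XML_MMAP = mmap.mmap(wiki_xml.fileno(), 0, access=mmap.ACCESS_READ)

def article_bytes(start: int) -> bytes:
    # Binary search (O(log n)) for the article that starts at `start`;
    # the next entry is the following article's start offset.
    index = int(TITLE_OFFSETS.searchsorted(start))
    end = int(TITLE_OFFSETS[index + 1])
    return WIKI_XML_MMAP[start:end]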
# Prefix search for all Wikipedia titles starting with "NVIDIA"
% curl -s 'http://localhost:4444/wiki/offsets?name=NVIDIA' | jq
[
[
"NVIDIA",
22766678654,
22766679438
],
[
"NVIDIA APX 2500",
23352741597,
23352742253
],
[
"NVIDIA BR02",
13596637221,
13596638658
],
[
"NVIDIA CUDA Compiler",
44569709658,
44569713061
],
[
"NVIDIA Corp.",
5788837214,
5788837833
],
[
"NVIDIA Corporation",
651080622,
651081295
],
[
"NVIDIA Demos",
22809850380,
22809851014
],
[
"NVIDIA Fermi architecture",
48728474350,
48728475044
],
[
"NVIDIA GPU",
11121527047,
11121527771
],
[
"NVIDIA GeForce",
9962883941,
9962884521
],
[
"NVIDIA GeForce 2",
19183001103,
19183001759
],
[
"NVIDIA GeForce GT 325M",
33767058820,
33767059557
],
[
"NVIDIA GeForce GT 330M",
40152066548,
40152067339
],
[
"NVIDIA GeForce2",
19183798188,
19183798843
],
[
"NVIDIA Geforce",
20134644772,
20134645360
],
[
"NVIDIA Geforce 2",
19183010738,
19183011394
],
[
"NVIDIA Gelato",
14528272767,
14528273352
],
[
"NVIDIA ION",
31045259177,
31045259818
],
[
"NVIDIA Ion",
29311428186,
29311428809
],
[
"NVIDIA N40",
10682812474,
10682813122
],
[
"NVIDIA NV40",
10682807609,
10682808258
],
[
"NVIDIA Optimus",
47942318714,
47942319313
],
[
"NVIDIA PhysX",
24815008675,
24815009220
],
[
"NVIDIA PureVideo",
22804676127,
22804676810
],
[
"NVIDIA Quadro",
22809845936,
22809846572
],
[
"NVIDIA Quadro Plex",
22812333404,
22812334055
],
[
"NVIDIA Riva 128",
14127586968,
14127587526
],
[
"NVIDIA SLI",
18802268476,
18802269118
],
[
"NVIDIA Shield",
46112329042,
46112329690
],
[
"NVIDIA System Tools",
31044107641,
31044108309
],
[
"NVIDIA Tegra",
24621784228,
24621784882
],
[
"NVIDIA Tegra 2",
37853790611,
37853791235
],
[
"NVIDIA Tesla",
22804600742,
22804601403
],
[
"NVIDIA and FOSS",
16092716596,
16092717279
],
[
"NVIDIA demos",
34541153992,
34541154671
],
[
"NVIDIA n40",
10682813126,
10682813774
],
[
"NVIDIA nv40",
10682817796,
10682818445
]
]

# Issue a ranged request for a given article in XML (native/raw) format:
# [
# "NVIDIA CUDA Compiler",
# 44569709658,
# 44569713061
# ],
% curl -i -sS -H 'Range: bytes=44569709658-44569713061' http://localhost:4444/wiki/xml
HTTP/1.1 206 Partial Content
Server: Parallelopedia Web Server v1.0
Date: Fri, 07 Nov 2025 20:47:51 GMT
Accept-Ranges: bytes
Content-Type: text/xml; charset=utf-8
Access-Control-Allow-Origin: *
Last-Modified: Sun, 02 Nov 2025 00:33:19 GMT
Content-Range: 44569709658-44569713061/51642517367
Content-Length: 3404
<page>
<title>NVIDIA CUDA Compiler</title>
<ns>0</ns>
<id>37864839</id>
<revision>
<id>611673801</id>
<parentid>602261027</parentid>
<timestamp>2014-06-05T12:48:54Z</timestamp>
<contributor>
<username>ScotXW</username>
<id>19568210</id>
</contributor>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">{{Infobox software
| name =
| title =
| logo = <!-- Image name is enough -->
| logo caption =
| logo_size =
| logo_alt =
| screenshot = <!-- Image name is enough -->
| caption =
| screenshot_size =
| screenshot_alt =
| collapsible =
| author = [[Nvidia]]
| developer =
| released = <!-- {{Start date and age|YYYY|MM|DD|df=yes/no}} -->
| discontinued =
| latest release version =
| latest release date = <!-- {{Start date and age|YYYY|MM|DD|df=yes/no}} -->
| latest preview version =
| latest preview date = <!-- {{Start date and age|YYYY|MM|DD|df=yes/no}} -->
| status =
| programming language =
| operating system =
| platform =
| size =
| language =
| language count = <!-- DO NOT include this parameter unless you know what it does -->
| language footnote =
| genre = [[compiler]]
| license = [[proprietary software]]
| website = {{URL|http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#introduction}}
}}
''' Nvidia CUDA Compiler''' ('''NVCC''') is a [[proprietary software|proprietary]] [[compiler]] by [[Nvidia]] intended for use with [[CUDA]]. CUDA codes runs on both the [[CPU]] and [[GPU]]. NVCC separates these two parts and sends host code (the part of code which will be run on the [[CPU]]) to a [[C (programming language)|C]] compiler like [[GNU Compiler Collection|GCC]] or [[Intel C++ Compiler]] (ICC) or [[Microsoft Visual C]] Compiler, and sends the device code (the part which will run on the GPU) to the GPU. The device code is further compiled by NVCC.
Any source file containing CUDA language extensions (.cu) must be compiled with nvcc. NVCC is a compiler driver which works by invoking all the necessary tools and compilers like cudacc, g++, cl, etc. NVCC can output either C code (CPU Code) that must then be compiled with the rest of the application using another tool or PTX or object code directly. An executable with CUDA code requires: the CUDA core library (cuda) and the CUDA runtime library (cudart).
Other widely used libraries:
* CUBLAS: BLAS implementation
* CUFFT: FFT implementation
* CUDPP (Data Parallel Primitives): Reduction, Scan, Sort.
* Thrust: Reduction, Scan, Sort.
== See also ==
* [[OpenCL]]
* [[Heterogeneous System Architecture]]
== References ==
# David B. Kirk, and Wen-mei W. Hwu. Programming massively parallel processors: a hands-on approach. Morgan Kaufmann, 2010.
# Nvidia Documentation on nvcc. http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/
# CUDPP. http://gpgpu.org/developer/cudpp
[[Category:Nvidia]]
[[Category:Compilers]]
{{computer-stub}}</text>
<sha1>pl6clr73ogqryucxi4cbtrr730jehfz</sha1>
</revision>
</page>

# Exact title lookup and post-process via Pandoc to HTML in one go:
% curl -s 'http://localhost:4444/wiki/wiki?name=NVIDIA%20CUDA%20Compiler'
<p><strong>Nvidia CUDA Compiler</strong> (<strong>NVCC</strong>) is a <a
href="proprietary_software" class="wikilink"
title="proprietary">proprietary</a> <a href="compiler" class="wikilink"
title="compiler">compiler</a> by <a href="Nvidia" class="wikilink"
title="Nvidia">Nvidia</a> intended for use with <a href="CUDA"
class="wikilink" title="CUDA">CUDA</a>. CUDA codes runs on both the <a
href="CPU" class="wikilink" title="CPU">CPU</a> and <a href="GPU"
class="wikilink" title="GPU">GPU</a>. NVCC separates these two parts and
sends host code (the part of code which will be run on the <a href="CPU"
class="wikilink" title="CPU">CPU</a>) to a <a
href="C_(programming_language)" class="wikilink" title="C">C</a>
compiler like <a href="GNU_Compiler_Collection" class="wikilink"
title="GCC">GCC</a> or <a href="Intel_C++_Compiler" class="wikilink"
title="Intel C++ Compiler">Intel C++ Compiler</a> (ICC) or <a
href="Microsoft_Visual_C" class="wikilink"
title="Microsoft Visual C">Microsoft Visual C</a> Compiler, and sends
the device code (the part which will run on the GPU) to the GPU. The
device code is further compiled by NVCC.</p>
<p>Any source file containing CUDA language extensions (.cu) must be
compiled with nvcc. NVCC is a compiler driver which works by invoking
all the necessary tools and compilers like cudacc, g++, cl, etc. NVCC
can output either C code (CPU Code) that must then be compiled with the
rest of the application using another tool or PTX or object code
directly. An executable with CUDA code requires: the CUDA core library
(cuda) and the CUDA runtime library (cudart).</p>
<p>Other widely used libraries:</p>
<ul>
<li>CUBLAS: BLAS implementation</li>
<li>CUFFT: FFT implementation</li>
<li>CUDPP (Data Parallel Primitives): Reduction, Scan, Sort.</li>
<li>Thrust: Reduction, Scan, Sort.</li>
</ul>
<h2 id="see_also">See also</h2>
<ul>
<li><a href="OpenCL" class="wikilink" title="OpenCL">OpenCL</a></li>
<li><a href="Heterogeneous_System_Architecture" class="wikilink"
title="Heterogeneous System Architecture">Heterogeneous System
Architecture</a></li>
</ul>
<h2 id="references">References</h2>
<ol>
<li>David B. Kirk, and Wen-mei W. Hwu. Programming massively parallel
processors: a hands-on approach. Morgan Kaufmann, 2010.</li>
<li>Nvidia Documentation on nvcc. <a
href="http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/">http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/</a></li>
<li>CUDPP. <a
href="http://gpgpu.org/developer/cudpp">http://gpgpu.org/developer/cudpp</a></li>
</ol>
<p><a href="Category:Nvidia" class="wikilink"
title="Category:Nvidia">Category:Nvidia</a> <a href="Category:Compilers"
class="wikilink" title="Category:Compilers">Category:Compilers</a></p>/wiki/offsets endpoint, which returns the title, start byte offset and end byte offset/wiki/html and /wiki/xml endpoints require a Range: bytes=<start>:<end> header, and response with the corresponding bytes at that range, optionally post-processed via Pandoc for the HTML case/wiki/wiki?name=<exact-title> will skip the prefix search and just do a single title lookup in the appropriate trie and then return the post-processed Pandoc HTML (or raw XML if /wiki/wiki_xml is used)multiprocessing days, assuming this was your only expensive data structure:
The /wiki/offsets endpoint returns the title, start byte offset, and end byte offset.
The /wiki/html and /wiki/xml endpoints require a Range: bytes=<start>-<end> header, and respond with the corresponding bytes at that range, optionally post-processed via Pandoc for the HTML case.
/wiki/wiki?name=<exact-title> will skip the prefix search and just do a single title lookup in the appropriate trie and then return the post-processed Pandoc HTML (or raw XML if /wiki/wiki_xml is used).
Back in the multiprocessing days, assuming this was your only expensive data structure:

Python backend (src/parallelopedia) and React frontend (ui).
model_19072.pt: https://huggingface.co/datasets/trentnelson/parallelopedia-data-gpt2