Order-Preserving Radix-64 Encoding 1
Terminology. Here “Radix-64” means a positional numeral system in base \(64\) over a chosen, ordered alphabet \(\Sigma\). This is not the RFC 4648 “Base64” byte codec.
Alphabet and code map
Let \( (A,\le_A) \) be a totally ordered finite alphabet of size \( |A|\le 64\). Choose a monotone (order-embedding) code map \( \sigma:A\to\{0,\dots,63\},\qquad a\le_A b \;\Rightarrow\; \sigma(a)\le \sigma(b). \) (Injective and order-preserving is ideal; ties collapse symbols.)
Encoding fixed-length strings
For \(N\)-character strings \(s=s_0s_1\ldots s_{N-1}\in A^N\), define the big-endian radix-64 integer \( E(s)\;=\;\sum_{i=0}^{N-1}\sigma(s_i)\,64^{\,N-1-i}. \tag{1} \) This treats the first character as the most-significant 6-bit digit.
Order equivalence
Let \(x,y\in A^N\) and let \(k\) be the first index where they differ. Then
because the residual tail \(\sum_{i>k}\big(\cdot\big)64^{\,N-1-i}\) is \(<64^{\,N-1-k}\). Hence
So unsigned integer comparison of \(E(\cdot)\) is exactly lexicographic comparison on \(A^N\).
Variable length via padding
For variable-length strings, pick a sentinel \(p\in A\) with the smallest code \(\sigma(p)=0\) and pad right to a common length \(N\):
Then (2) continues to hold and prefixes sort before their extensions (e.g., “A” < “AA”).
Bit budget (48-bit symbol + 16-bit index)
An \(N\)-char code occupies exactly \(6N\) bits. For \(N=8\),
so \(E(s)\) fits in 48 bits. You can pack a 16-bit auxiliary field (e.g., an integer index) into the low bits:
Comparing 64-bit keys \(K\) remains order-preserving on the symbol part because the symbol lives in the upper 48 bits (MSBs).
Practical checklist
- Monotone \(\sigma\). Respect the chosen alphabet order.
- Fixed length or pad with min-code sentinel. Ensures prefix ordering.
- Big-endian digits. First character in the MSBs; then unsigned integer compare \(\equiv\) lex compare.
Reference Algorithm for Order-Preserving Radix-64 Encoding
# ===============================
# Radix-64 (order-preserving) spec
# ===============================
CONST BASE := 64
CONST WIDTH := 8 # number of 6-bit digits (chars)
# REQUIRE: ALPHABET[0] is the padding sentinel (min code), e.g. ' ' or '\0'
INPUT ALPHABET[0..63] : array of char, total order matches desired lex order
# -------------------------------
# Build lookup: char -> code in [0..63]
# -------------------------------
PROC build_lookup(ALPHABET):
TABLE[0..127] := -1 # ASCII only; extend if needed
FOR i IN 0..63:
TABLE[ ord(ALPHABET[i]) ] := i
RETURN TABLE
TABLE := build_lookup(ALPHABET)
PAD_CODE := 0 # since ALPHABET[0] is the smallest
# -------------------------------
# Map a single character to code
# -------------------------------
FUNC char_to_code(c: char) -> int:
IF ord(c) < 0 OR ord(c) > 127 THEN RETURN -1
RETURN TABLE[ord(c)] # -1 if not in alphabet
# -------------------------------
# Encode symbol s (length ≤ WIDTH) to 48-bit integer E(s)
# Big-endian 6-bit digits: first character is most significant digit
# -------------------------------
FUNC encode_symbol(s: string) -> uint64:
IF len(s) > WIDTH: RAISE "symbol too long"
E := 0
FOR i IN 0..(WIDTH-1):
IF i < len(s) THEN
code := char_to_code(s[i])
IF code = -1: RAISE "invalid character"
ELSE
code := PAD_CODE # right-pad with smallest code
E := (E << 6) OR code # append 6-bit digit
RETURN E # 0 ≤ E < 2^(6*WIDTH) = 2^48
# -------------------------------
# Pack 48-bit symbol + 16-bit index into a 64-bit key
# (symbol in MSBs preserves lex order under unsigned <)
# -------------------------------
FUNC make_key(s: string, idx16: uint16) -> uint64:
E := encode_symbol(s)
RETURN (E << 16) OR idx16
# -------------------------------
# Decode 48-bit integer back to WIDTH characters (padded form)
# To recover the unpadded string, strip trailing PAD_CHARs on the right
# -------------------------------
FUNC decode_symbol(E: uint64) -> string:
out[0..(WIDTH-1)] : array of char
FOR i IN (WIDTH-1) DOWNTO 0:
code := E AND 0x3F # extract least-significant 6 bits
out[i] := ALPHABET[code]
E := E >> 6
RETURN concat(out) # caller may rstrip PAD_CHAR if desired
# -------------------------------
# Invariants / Preconditions
# -------------------------------
# 1) Monotone map: the order of ALPHABET defines the character order AND
# its indices 0..63 (via build_lookup) respect that order.
# 2) Padding uses the smallest code: PAD_CODE = 0 => prefixes sort before extensions.
# 3) Big-endian digits: shifting left then OR code means s[0] occupies the top 6 bits.
# RESULT: For fixed WIDTH, unsigned integer compare on encode_symbol(s) equals lex order.
That’s it: monotone \(\sigma\), min-code padding, big-endian digits ⇒ lex order ↔ unsigned integer order, perfectly within a 64-bit key.
-
Author: Steven Varga, 2025 Toronto, ON ↩