traksht2
Yet another compressor.
Implementation
The compressor uses a binary-indexed .t2db database to map long hex sequences to short address tokens, enabling iterative compression over multiple cycles.
Components
- traksht2-db-gen: Generates the .t2db binary database.
- traksht2-db-index: Generates optimized separate index files (.idx, .rev).
- traksht2-db-view: TUI browser for the database file.
- traksht2: Compresses a file using the database.
- extraksht2: Decompressor for .t2 files.
- traksht2-db-stats: Displays database statistics.
DB File Format (.t2db, version 2)
[Header: 20 bytes]
[0:8] Magic "TRAKSHT2"
[8:12] Version int32 little-endian = 2
[12:20] IndexOffset int64 little-endian = 20 (index starts immediately after header)
[Index section: at byte 20]
[0:8] EntryCount int64 little-endian
For each entry (sequential, no seeks required):
[2 bytes] KeyLen int16 — length of the hex sequence string
[KeyLen bytes] HexKey ASCII hex sequence (e.g. "9e377")
[1 byte] TokenLen uint8 — length of the token string
[TokenLen bytes] Token ASCII address token (e.g. "511")
Tokens are stored inline in the index — there is no separate data section. This makes loading a single sequential read, which is critical for large databases. For even faster access, separate index files can be generated.
Separate Index Files
- .idx: Forward index (Hex Sequence -> Token), sorted by sequence.
- .rev: Reverse index (Token -> Hex Sequence), sorted by token.
These files are loaded by traksht2 and extraksht2 if they exist next to the .t2db file.
Address token format
Each hex sequence is assigned a compact token NaA:
| Field | Width | Meaning |
|---|---|---|
| N | 1 hex digit | Length of the original hex sequence (5–f) |
| a | 1 hex digit | Number of hex digits in the address value |
| A | a digits | Address counter value in hex |
Example — depth-5, counter=1: token = "511" (N=5, a=1, A=1)
Example — depth-5, counter=16: token = "5210" (N=5, a=2, A=10)
Generation stops for a given depth once len(token) >= depth (the token is no longer shorter than what it represents).
Sequences with 5 or more consecutive identical hex digits are skipped (e.g. 00000, 111110).
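The token rules above fit in a few lines of Go. This is a sketch under the stated rules only; makeToken and hasLongRun are hypothetical names, and the real generator may structure this differently.

```go
package main

import "fmt"

// makeToken builds the NaA token for a depth (5-15) and a counter.
// ok is false once the token is no longer shorter than the sequence,
// which is the point where generation stops for that depth.
// Assumes the counter needs at most 15 hex digits, so "a" stays one digit.
func makeToken(depth int, counter uint64) (token string, ok bool) {
	addr := fmt.Sprintf("%x", counter)
	token = fmt.Sprintf("%x%x%s", depth, len(addr), addr)
	return token, len(token) < depth
}

// hasLongRun reports whether seq contains 5 or more consecutive
// identical characters; such sequences are skipped.
func hasLongRun(seq string) bool {
	run := 1
	for i := 1; i < len(seq); i++ {
		if seq[i] == seq[i-1] {
			run++
			if run >= 5 {
				return true
			}
		} else {
			run = 1
		}
	}
	return false
}

func main() {
	for _, c := range []uint64{1, 16, 256} {
		tok, ok := makeToken(5, c)
		fmt.Println(c, tok, ok)
	}
	fmt.Println(hasLongRun("00000"), hasLongRun("111110"), hasLongRun("01234"))
}
```

At depth 5 the counter 256 yields a 5-character token, so it no longer saves space and ok is false.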
Sequence distribution
Sequences within each depth are assigned using a multiplicative permutation:
seqIndex = (counter × 0x9e3779b97f4a7c15) mod 16^depth
The Fibonacci/Knuth multiplier is odd, so it is coprime to 16^depth = 2^(4·depth), and multiplication by it modulo 16^depth is a bijection: every counter below 16^depth maps to a distinct sequence index. This spreads the covered sequences across the full hex space, so the limited address slots per depth (e.g. ~65K for depth 7) form a broad statistical sample instead of clustering near 000...0.
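A minimal sketch of the formula as written, assuming depths of at most 15 so that 16^depth fits in a uint64. The names seqIndex and seqHex are hypothetical, and the actual generator may derive or format its sequences differently.

```go
package main

import "fmt"

const multiplier uint64 = 0x9e3779b97f4a7c15

// seqIndex returns (counter * multiplier) mod 16^depth for 1 <= depth <= 15.
// uint64 multiplication wraps modulo 2^64, and 16^depth divides 2^64,
// so masking the wrapped product gives the same result as the exact formula.
func seqIndex(counter uint64, depth int) uint64 {
	mask := uint64(1)<<(4*uint(depth)) - 1
	return (counter * multiplier) & mask
}

// seqHex formats the index as a zero-padded hex string of length depth.
func seqHex(counter uint64, depth int) string {
	return fmt.Sprintf("%0*x", depth, seqIndex(counter, depth))
}

func main() {
	// Exhaustively check the bijection on a small space (depth 3, 4096 values).
	seen := make(map[uint64]bool)
	for c := uint64(0); c < 4096; c++ {
		seen[seqIndex(c, 3)] = true
	}
	fmt.Println("distinct indices at depth 3:", len(seen))
	fmt.Println(seqHex(1, 5), seqHex(2, 5), seqHex(3, 5))
}
```

The exhaustive check at depth 3 is only a sanity test; the bijection holds for every depth because the multiplier is odd.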
Compressed File Format (.t2)
CV[(P)NaA ...]
| Field | Meaning |
|---|---|
| C | Cycle count encoded as abcex base-62 (variable length) |
| V | Version separator: !=1, !!=2, !?=4, *=10 |
| P | Optional prefix: 0=Raw, 1=Repeated, 2=Backward, 3=Negative |
| N | Hex length of the original chunk (5–f) |
| a | Hex length of the address value (1–f) |
| A | Address in hex |
Prefix Logic
Prefixes allow for more compact representation of data that doesn't perfectly match a forward DB entry.
0: Raw Data Literal (0[len][data])
If no match is found for a chunk (minimum length 3), it is stored as raw hex.
- Format: 0 + len (1 hex digit, 1–f) + data (hex string).
- Merging: Sequential non-matching chunks are merged into single raw tokens up to length 15 (e.g., 123456789 -> 09123456789).
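A short sketch of the raw-literal encoding, assuming a run longer than 15 digits is simply split into consecutive raw tokens. encodeRaw is a hypothetical name, and the sketch does not enforce the minimum chunk length mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeRaw emits one or more raw tokens ("0" + len + data),
// each carrying at most 15 hex digits of unmatched data.
func encodeRaw(data string) string {
	const hexDigits = "0123456789abcdef"
	var sb strings.Builder
	for len(data) > 0 {
		n := len(data)
		if n > 15 {
			n = 15
		}
		sb.WriteByte('0')          // raw prefix
		sb.WriteByte(hexDigits[n]) // length as one hex digit (1-f)
		sb.WriteString(data[:n])
		data = data[n:]
	}
	return sb.String()
}

func main() {
	fmt.Println(encodeRaw("123456789"))
	fmt.Println(encodeRaw("0123456789abcdef"))
}
```

Each raw token costs two characters of overhead, which is why merging adjacent unmatched chunks matters.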
1: Repeated Sequence (1[charLen][count][chars])
Replaces repeated hex digit sequences (up to length 4) that total 5 or more characters.
- Format: 1 + charLen (1 hex digit, 1–4) + count (1 hex digit, repeat count, 2–f) + chars (charLen hex digits).
- Example: 55555 -> 1155 (charLen=1, count=5, chars=5), 0f0f0f -> 1230f (charLen=2, count=3, chars=0f).
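The repeated-sequence token can be sketched as an encode/decode pair. The function names are hypothetical and the range checks simply mirror the limits listed above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeRepeat builds "1" + charLen + count + chars.
// ok is false when the inputs fall outside the documented limits.
func encodeRepeat(chars string, count int) (token string, ok bool) {
	if len(chars) < 1 || len(chars) > 4 || count < 2 || count > 15 || len(chars)*count < 5 {
		return "", false
	}
	return fmt.Sprintf("1%x%x%s", len(chars), count, chars), true
}

// decodeRepeat expands a repeat token back into the repeated hex string.
func decodeRepeat(tok string) (string, error) {
	if len(tok) < 4 || tok[0] != '1' {
		return "", fmt.Errorf("not a repeat token: %q", tok)
	}
	charLen, err := strconv.ParseUint(tok[1:2], 16, 8)
	if err != nil {
		return "", err
	}
	count, err := strconv.ParseUint(tok[2:3], 16, 8)
	if err != nil {
		return "", err
	}
	if len(tok) != 3+int(charLen) {
		return "", fmt.Errorf("length mismatch in %q", tok)
	}
	return strings.Repeat(tok[3:], int(count)), nil
}

func main() {
	tok, _ := encodeRepeat("0f", 3)
	out, _ := decodeRepeat(tok)
	fmt.Println(tok, out)
}
```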
2: Backward Transformation (2NaA)
Represents the reverse of a hex sequence in the DB.
- Process: If a chunk's reverse (e.g., 43210 for chunk 01234) exists in the DB at address NaA, the chunk is represented as 2NaA.
- Decompression: The sequence for token NaA is retrieved from the DB and then reversed.
3: Negative Transformation (3NaA)
Represents the bitwise NOT (15 - digit) of a hex sequence.
- Process: If a chunk's complement (e.g., fedcb for chunk 01234) exists in the DB at address NaA, the chunk is represented as 3NaA.
- Decompression: The sequence for token NaA is retrieved from the DB and each digit d is replaced with 15-d.
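The backward and negative transforms can be illustrated together. This is a sketch that uses plain Go maps as stand-ins for the forward and reverse indexes; the lookup order (direct, then reversed, then complemented) and all function names are assumptions, not taken from the project.

```go
package main

import (
	"fmt"
	"strings"
)

const hexDigits = "0123456789abcdef"

// reverseHex returns the hex string with its digits in reverse order.
func reverseHex(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

// complementHex replaces each lowercase hex digit d with 15-d.
// Characters that are not lowercase hex digits are left unchanged.
func complementHex(s string) string {
	b := []byte(s)
	for i, c := range b {
		if d := strings.IndexByte(hexDigits, c); d >= 0 {
			b[i] = hexDigits[15-d]
		}
	}
	return string(b)
}

// encodeChunk tries a direct match, then the reversed chunk (prefix 2),
// then the complemented chunk (prefix 3). fwd maps sequence -> token.
func encodeChunk(chunk string, fwd map[string]string) (string, bool) {
	if tok, ok := fwd[chunk]; ok {
		return tok, true
	}
	if tok, ok := fwd[reverseHex(chunk)]; ok {
		return "2" + tok, true
	}
	if tok, ok := fwd[complementHex(chunk)]; ok {
		return "3" + tok, true
	}
	return "", false
}

// decodeChunk undoes encodeChunk. rev maps token -> sequence.
func decodeChunk(tok string, rev map[string]string) (string, bool) {
	switch {
	case strings.HasPrefix(tok, "2"):
		seq, ok := rev[tok[1:]]
		return reverseHex(seq), ok
	case strings.HasPrefix(tok, "3"):
		seq, ok := rev[tok[1:]]
		return complementHex(seq), ok
	}
	seq, ok := rev[tok]
	return seq, ok
}

func main() {
	fwd := map[string]string{"43210": "511"}
	rev := map[string]string{"511": "43210"}
	tok, _ := encodeChunk("01234", fwd)
	seq, _ := decodeChunk(tok, rev)
	fmt.Println(tok, seq)
}
```

Both transforms are their own inverse, so the same helper serves compression and decompression.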
Decompression runs C cycles: each cycle replaces every NaA token with its hex sequence from the DB, applies any prefix transform, and expands raw and repeated literals. After all cycles the hex string is decoded to bytes.
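Before a cycle can replace tokens, the stream has to be split into them. The sketch below assumes the first character of each token is enough to tell the kinds apart (0–3 are prefixes, 5–f start a plain NaA token) and that the CV header has already been stripped; splitTokens is a hypothetical helper, not the project's parser.

```go
package main

import (
	"fmt"
	"strconv"
)

// splitTokens splits the body of a .t2 stream into individual tokens.
func splitTokens(body string) ([]string, error) {
	var out []string
	for i := 0; i < len(body); {
		// lenPos is the index of the digit giving the variable part's length;
		// fixed is the number of characters before the variable part.
		var lenPos, fixed int
		switch c := body[i]; {
		case c == '0': // 0 + len + data
			lenPos, fixed = i+1, 2
		case c == '1': // 1 + charLen + count + chars
			lenPos, fixed = i+1, 3
		case c == '2' || c == '3': // prefix + N + a + A
			lenPos, fixed = i+2, 3
		case (c >= '5' && c <= '9') || (c >= 'a' && c <= 'f'): // N + a + A
			lenPos, fixed = i+1, 2
		default:
			return nil, fmt.Errorf("unexpected character %q at offset %d", c, i)
		}
		if lenPos >= len(body) {
			return nil, fmt.Errorf("truncated token at offset %d", i)
		}
		n, err := strconv.ParseUint(body[lenPos:lenPos+1], 16, 8)
		if err != nil {
			return nil, fmt.Errorf("bad length digit at offset %d", lenPos)
		}
		end := i + fixed + int(n)
		if end > len(body) {
			return nil, fmt.Errorf("truncated token at offset %d", i)
		}
		out = append(out, body[i:end])
		i = end
	}
	return out, nil
}

func main() {
	// A plain token, a repeat, a raw literal, a backward and a negative token.
	toks, err := splitTokens("511" + "1155" + "09123456789" + "25210" + "3511")
	if err != nil {
		panic(err)
	}
	for _, t := range toks {
		fmt.Println(t)
	}
}
```

Each token would then be expanded according to its first character, and the whole pass repeated C times.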
Usage
# Generate database (default: depth 7, up to 1G)
go run cmd/traksht2-db-gen/main.go db.t2db
go run cmd/traksht2-db-gen/main.go -depth 10 -size 2G db.t2db
go run cmd/traksht2-db-gen/main.go -depth 6 -size 2M -max-counter ff db.t2db
go run cmd/traksht2-db-gen/main.go --help
# Generate separate index files
go run cmd/traksht2-db-index/main.go db.t2db
# View database (loads up to 10M entries by default)
go run cmd/traksht2-db-view/main.go db.t2db
go run cmd/traksht2-db-view/main.go -max-entries 50000000 db.t2db
# Arrow keys / hjkl: navigate
# Right/Enter/l: enter subdirectory
# Left/h: go back
# /: search by hex sequence
# q: quit
# Database statistics
go run cmd/traksht2-db-stats/main.go db.t2db
# Compress
go run cmd/traksht2/main.go db.t2db input.txt output.t2
go run cmd/traksht2/main.go -debug db.t2db input.txt output.t2
# Decompress
go run cmd/extraksht2/main.go db.t2db output.t2 restored.txt