Arnas Udovic a5f721fce1 more stats
2026-04-16 08:30:46 +03:00

traksht2

Yet another compressor.

Implementation

The compressor uses a binary-indexed .t2db database to map long hex sequences to short address tokens, enabling iterative compression over multiple cycles.

Components

  1. traksht2-db-gen: Generates the .t2db binary database.
  2. traksht2-db-index: Generates optimized separate index files (.idx, .rev).
  3. traksht2-db-view: TUI browser for the database file.
  4. traksht2: Compresses a file using the database.
  5. extraksht2: Decompressor for .t2 files.
  6. traksht2-db-stats: Displays database statistics.

DB File Format (.t2db, version 2)

[Header: 20 bytes]
  [0:8]   Magic        "TRAKSHT2"
  [8:12]  Version      int32 little-endian = 2
  [12:20] IndexOffset  int64 little-endian = 20 (index starts immediately after header)

[Index section: at byte 20]
  [0:8]   EntryCount   int64 little-endian
  For each entry (sequential, no seeks required):
    [2 bytes]        KeyLen    int16 — length of the hex sequence string
    [KeyLen bytes]   HexKey    ASCII hex sequence (e.g. "9e377")
    [1 byte]         TokenLen  uint8 — length of the token string
    [TokenLen bytes] Token     ASCII address token (e.g. "511")

Tokens are stored inline in the index — there is no separate data section. This makes loading a single sequential read, which is critical for large databases. For even faster access, separate index files can be generated.
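
The layout above can be read in one sequential pass. The following is a minimal sketch, not the project's actual loader: the names `readDB`, `sampleDB` and `loadSample` are hypothetical, and it assumes `KeyLen` is little-endian like the other integer fields (the format description does not say so explicitly).

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// header mirrors the 20-byte file header described above.
type header struct {
	Magic       [8]byte
	Version     int32
	IndexOffset int64
}

// readDB parses a version-2 .t2db stream into a map of hex sequence -> token.
func readDB(r io.Reader) (map[string]string, error) {
	br := bufio.NewReader(r)
	var h header
	if err := binary.Read(br, binary.LittleEndian, &h); err != nil {
		return nil, err
	}
	if string(h.Magic[:]) != "TRAKSHT2" || h.Version != 2 || h.IndexOffset < 20 {
		return nil, errors.New("not a version-2 .t2db stream")
	}
	// Skip any gap between header and index (zero bytes when IndexOffset is 20).
	if _, err := io.CopyN(io.Discard, br, h.IndexOffset-20); err != nil {
		return nil, err
	}
	var count int64
	if err := binary.Read(br, binary.LittleEndian, &count); err != nil {
		return nil, err
	}
	db := make(map[string]string)
	for i := int64(0); i < count; i++ {
		var keyLen int16 // assumption: little-endian
		if err := binary.Read(br, binary.LittleEndian, &keyLen); err != nil {
			return nil, err
		}
		if keyLen < 0 {
			return nil, errors.New("negative key length")
		}
		key := make([]byte, keyLen)
		if _, err := io.ReadFull(br, key); err != nil {
			return nil, err
		}
		tokLen, err := br.ReadByte()
		if err != nil {
			return nil, err
		}
		tok := make([]byte, tokLen)
		if _, err := io.ReadFull(br, tok); err != nil {
			return nil, err
		}
		db[string(key)] = string(tok)
	}
	return db, nil
}

// sampleDB builds a one-entry database in memory for demonstration.
func sampleDB() []byte {
	var buf bytes.Buffer
	buf.WriteString("TRAKSHT2")
	binary.Write(&buf, binary.LittleEndian, int32(2))
	binary.Write(&buf, binary.LittleEndian, int64(20))
	binary.Write(&buf, binary.LittleEndian, int64(1)) // EntryCount
	binary.Write(&buf, binary.LittleEndian, int16(5)) // KeyLen
	buf.WriteString("9e377")
	buf.WriteByte(3) // TokenLen
	buf.WriteString("511")
	return buf.Bytes()
}

// loadSample parses the in-memory sample database.
func loadSample() (map[string]string, error) {
	return readDB(bytes.NewReader(sampleDB()))
}

func main() {
	db, err := loadSample()
	if err != nil {
		panic(err)
	}
	fmt.Println(db["9e377"])
}
```

Because every entry is length-prefixed, the reader never needs to seek, which is what makes a single buffered pass sufficient.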

Separate Index Files

  1. .idx: Forward index (Hex Sequence -> Token), sorted by sequence.
  2. .rev: Reverse index (Token -> Hex Sequence), sorted by token.

These files are loaded by traksht2 and extraksht2 if they exist next to the .t2db file.

Address token format

Each hex sequence is assigned a compact token NaA:

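| Field | Width       | Meaning                                      |
|-------|-------------|----------------------------------------------|
| N     | 1 hex digit | Length of the original hex sequence (5 to f) |
| a     | 1 hex digit | Number of hex digits in the address value    |
| A     | a digits    | Address counter value in hex                 |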

Example (depth 5, counter=1): token = "511" (N=5, a=1, A=1)

Example (depth 5, counter=16): token = "5210" (N=5, a=2, A=10)

Generation stops for a given depth once len(token) >= depth (the token is no longer shorter than what it represents).
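
As an illustration of the token layout and the stop rule, here is a small sketch. The function name `makeToken` is hypothetical and is not taken from the project's source.

```go
package main

import (
	"fmt"
	"strconv"
)

// makeToken builds the NaA token for a sequence of length depth and the given
// address counter. ok is false once the token is no longer shorter than the
// sequence it represents, which is where generation for that depth stops.
func makeToken(depth int, counter uint64) (token string, ok bool) {
	if depth < 5 || depth > 15 {
		return "", false // N must fit in one hex digit in the range 5 to f
	}
	addr := strconv.FormatUint(counter, 16)
	if len(addr) > 15 {
		return "", false // a must fit in one hex digit
	}
	token = strconv.FormatInt(int64(depth), 16) +
		strconv.FormatInt(int64(len(addr)), 16) +
		addr
	return token, len(token) < depth
}

func main() {
	for _, c := range []uint64{1, 16, 256} {
		tok, ok := makeToken(5, c)
		fmt.Println(c, tok, ok)
	}
}
```

For depth 5 the counter 256 yields a five-character token, so it is rejected: it would save nothing over the sequence itself.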

Sequences with 5 or more consecutive identical hex digits are skipped (e.g. 00000, 111110).
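
The skip rule amounts to a run-length check. A minimal sketch follows; `hasRun` is a hypothetical helper name, and the real generator may implement the check differently.

```go
package main

import "fmt"

// hasRun reports whether seq contains minRun or more consecutive identical
// characters.
func hasRun(seq string, minRun int) bool {
	run := 0
	for i := 0; i < len(seq); i++ {
		if i > 0 && seq[i] == seq[i-1] {
			run++
		} else {
			run = 1
		}
		if run >= minRun {
			return true
		}
	}
	return false
}

func main() {
	for _, s := range []string{"00000", "111110", "01234", "1111"} {
		fmt.Println(s, hasRun(s, 5))
	}
}
```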

Sequence distribution

Sequences within each depth are assigned using a multiplicative permutation:

seqIndex = (counter × 0x9e3779b97f4a7c15) mod 16^depth

The Fibonacci/Knuth multiplier is odd and therefore coprime to any power-of-2 modulus, so the mapping from counter to seqIndex is a full-period bijection on the range 0 to 16^depth - 1. This spreads the covered sequences uniformly across the full hex space, so the limited address slots per depth (e.g. ~65K for depth 7) represent a broad statistical sample rather than all clustering near 000...0.
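
The formula can be sketched directly with 64-bit arithmetic. This is an illustration of the formula as written above, not the generator's actual code; `seqIndex` and `isBijection` are hypothetical names. Since 16^depth divides 2^64 for depth up to 16, the wrap-around of `uint64` multiplication does not change the result of the modulo.

```go
package main

import "fmt"

const multiplier uint64 = 0x9e3779b97f4a7c15

// seqIndex computes (counter * multiplier) mod 16^depth. The modulus is a
// power of two, so the reduction is a bit mask.
func seqIndex(counter uint64, depth int) uint64 {
	mask := uint64(1)<<(4*uint(depth)) - 1
	return (counter * multiplier) & mask
}

// isBijection checks by brute force that every counter in [0, 16^depth)
// maps to a distinct index. Only practical for small depths.
func isBijection(depth int) bool {
	n := uint64(1) << (4 * uint(depth))
	seen := make([]bool, n)
	for c := uint64(0); c < n; c++ {
		idx := seqIndex(c, depth)
		if seen[idx] {
			return false
		}
		seen[idx] = true
	}
	return true
}

func main() {
	fmt.Println("depth 3 is a bijection:", isBijection(3))
	for c := uint64(1); c <= 4; c++ {
		// Zero-pad to the full depth so leading zeros are preserved.
		fmt.Printf("counter=%d seq=%0*x\n", c, 5, seqIndex(c, 5))
	}
}
```

The brute-force check is only there to make the bijection claim tangible; the odd multiplier guarantees it for every depth.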


Compressed File Format (.t2)

CV[(P)NaA ...]
| Field | Meaning                                                  |
|-------|----------------------------------------------------------|
| C     | Cycle count encoded as abcex base-62 (variable length)   |
| V     | Version separator: !=1, !!=2, !?=4, *=10                 |
| P     | Optional prefix: 0=Raw, 1=Repeated, 2=Backward, 3=Negative |
| N     | Hex length of the original chunk (5 to f)                |
| a     | Hex length of the address value (1 to f)                 |
| A     | Address in hex                                           |

Prefix Logic

Prefixes allow for more compact representation of data that doesn't perfectly match a forward DB entry.

0: Raw Data Literal (0[len][data])

If no match is found for a chunk (minimum length 3), it is stored as raw hex.

  • Format: 0 + len (1 hex digit, 1 to f) + data (hex string).
  • Merging: Sequential non-matching chunks are merged into single raw tokens up to length 15 (e.g., 123456789 -> 09123456789).
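
A raw literal can be produced as in the sketch below. `encodeRaw` is a hypothetical name; splitting data longer than 15 digits into several consecutive raw tokens is an assumption, since the text only states the per-token limit.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// encodeRaw wraps unmatched hex data in raw tokens of the form 0[len][data],
// where len is a single hex digit (1 to f). Longer data is emitted as
// several tokens (assumption, see the note above).
func encodeRaw(data string) string {
	var sb strings.Builder
	for len(data) > 0 {
		n := len(data)
		if n > 15 {
			n = 15
		}
		sb.WriteByte('0')
		sb.WriteString(strconv.FormatInt(int64(n), 16))
		sb.WriteString(data[:n])
		data = data[n:]
	}
	return sb.String()
}

func main() {
	fmt.Println(encodeRaw("123456789"))
	fmt.Println(encodeRaw("0123456789abcdef0123"))
}
```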

1: Repeated Sequence (1[charLen][count][chars])

Replaces runs of a repeated unit of 1 to 4 hex digits whose total length is 5 or more characters.

  • Format: 1 + charLen (1 hex digit, 1 to 4) + count (1 hex digit, repeat count, 2 to f) + chars (charLen hex digits).
  • Example: 55555 -> 1155 (charLen=1, count=5, chars=5), 0f0f0f -> 1230f (charLen=2, count=3, chars=0f).
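
The repeated-sequence token can be built and expanded as in the sketch below. The names `encodeRepeat` and `expandRepeat` are hypothetical; detecting where repeats occur in the input is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// encodeRepeat builds 1[charLen][count][chars] for a unit repeated count
// times. ok is false when the limits described above are not met.
func encodeRepeat(unit string, count int) (string, bool) {
	if len(unit) < 1 || len(unit) > 4 || count < 2 || count > 15 || len(unit)*count < 5 {
		return "", false
	}
	return "1" + strconv.Itoa(len(unit)) + strconv.FormatInt(int64(count), 16) + unit, true
}

// expandRepeat reverses encodeRepeat for a single, complete token.
func expandRepeat(tok string) (string, error) {
	if len(tok) < 4 || tok[0] != '1' {
		return "", errors.New("not a repeat token")
	}
	charLen, err := strconv.ParseUint(tok[1:2], 16, 8)
	if err != nil {
		return "", err
	}
	count, err := strconv.ParseUint(tok[2:3], 16, 8)
	if err != nil {
		return "", err
	}
	if charLen < 1 || charLen > 4 || count < 2 || len(tok) != 3+int(charLen) {
		return "", errors.New("malformed repeat token")
	}
	return strings.Repeat(tok[3:], int(count)), nil
}

func main() {
	a, _ := encodeRepeat("5", 5)
	b, _ := encodeRepeat("0f", 3)
	fmt.Println(a, b)
	x, _ := expandRepeat(a)
	y, _ := expandRepeat(b)
	fmt.Println(x, y)
}
```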

2: Backward Transformation (2NaA)

Represents the reverse of a hex sequence in the DB.

  • Process: If a chunk's reverse (e.g., 43210 for chunk 01234) exists in the DB at address NaA, it is represented as 2NaA.
  • Decompression: The hex sequence for token NaA is retrieved from the DB and then reversed.
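
A round trip through the backward transform might look like the following sketch. The two maps stand in for the forward and reverse indexes, the token "511" for sequence 43210 is made up for the example, and the function names are hypothetical.

```go
package main

import "fmt"

// reverseHex returns s with its characters in reverse order.
func reverseHex(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

// encodeBackward returns "2"+token when the reversed chunk is in the
// forward map (hex sequence -> token).
func encodeBackward(chunk string, fwd map[string]string) (string, bool) {
	tok, ok := fwd[reverseHex(chunk)]
	if !ok {
		return "", false
	}
	return "2" + tok, true
}

// decodeBackward looks the token up in the reverse map (token -> hex
// sequence) and reverses the stored sequence.
func decodeBackward(enc string, rev map[string]string) (string, bool) {
	if len(enc) < 2 || enc[0] != '2' {
		return "", false
	}
	seq, ok := rev[enc[1:]]
	if !ok {
		return "", false
	}
	return reverseHex(seq), true
}

func main() {
	fwd := map[string]string{"43210": "511"} // illustrative entry only
	rev := map[string]string{"511": "43210"}
	enc, _ := encodeBackward("01234", fwd)
	dec, _ := decodeBackward(enc, rev)
	fmt.Println(enc, dec)
}
```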

3: Negative Transformation (3NaA)

Represents the bitwise NOT (15 - digit) of a hex sequence.

  • Process: If a chunk's complement (e.g., fedcb for chunk 01234) exists in the DB at address NaA, it is represented as 3NaA.
  • Decompression: The hex sequence for token NaA is retrieved from the DB and each digit d is replaced with 15-d.
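
The digit-wise complement is its own inverse, so the same helper serves compression and decompression. A minimal sketch, assuming lowercase hex input; `complementHex` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

const hexDigits = "0123456789abcdef"

// complementHex replaces every hex digit d with 15-d.
// It returns an error for characters outside lowercase hex.
func complementHex(s string) (string, error) {
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		d := strings.IndexByte(hexDigits, s[i])
		if d < 0 {
			return "", fmt.Errorf("invalid hex digit %q", s[i])
		}
		out[i] = hexDigits[15-d]
	}
	return string(out), nil
}

func main() {
	c, _ := complementHex("01234")
	back, _ := complementHex(c)
	fmt.Println(c, back)
}
```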

Decompression runs C cycles: each cycle replaces every NaA token with its hex sequence from the DB, applies any prefix transform, and expands raw and repeated literals. After all cycles the hex string is decoded to bytes.
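
One decompression cycle over the token stream (without the CV header) could be sketched as follows. This is an illustration under stated assumptions rather than the extraksht2 implementation: the names are hypothetical, the `rev` map stands in for the reverse index, and the first digit of each token is taken to identify its kind (0 to 3 is a prefix, 5 to f is N).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const hexDigits = "0123456789abcdef"

var errMalformed = errors.New("malformed token stream")

func hexVal(c byte) int { return strings.IndexByte(hexDigits, c) }

// transform applies the prefix transform: '2' reverses, '3' complements.
func transform(prefix byte, seq string) string {
	b := []byte(seq)
	for i := range b {
		c := seq[i]
		if prefix == '2' {
			c = seq[len(seq)-1-i]
		} else if v := hexVal(c); v >= 0 {
			c = hexDigits[15-v]
		}
		b[i] = c
	}
	return string(b)
}

// lookup reads one NaA token at the start of s and returns the hex sequence
// stored for it together with the number of characters consumed.
func lookup(s string, rev map[string]string) (string, int, error) {
	if len(s) < 3 || hexVal(s[0]) < 5 {
		return "", 0, errMalformed
	}
	a := hexVal(s[1])
	if a < 1 || 2+a > len(s) {
		return "", 0, errMalformed
	}
	seq, ok := rev[s[:2+a]]
	if !ok {
		return "", 0, fmt.Errorf("token %q not in DB", s[:2+a])
	}
	return seq, 2 + a, nil
}

// expandOnce performs a single cycle over body.
func expandOnce(body string, rev map[string]string) (string, error) {
	var out strings.Builder
	for i := 0; i < len(body); {
		rest := body[i:]
		switch rest[0] {
		case '0': // raw literal: 0[len][data]
			if len(rest) < 2 {
				return "", errMalformed
			}
			n := hexVal(rest[1])
			if n < 1 || 2+n > len(rest) {
				return "", errMalformed
			}
			out.WriteString(rest[2 : 2+n])
			i += 2 + n
		case '1': // repeat: 1[charLen][count][chars]
			if len(rest) < 4 {
				return "", errMalformed
			}
			charLen, count := hexVal(rest[1]), hexVal(rest[2])
			if charLen < 1 || charLen > 4 || count < 2 || 3+charLen > len(rest) {
				return "", errMalformed
			}
			out.WriteString(strings.Repeat(rest[3:3+charLen], count))
			i += 3 + charLen
		case '2', '3': // backward or negative transform of a DB entry
			seq, n, err := lookup(rest[1:], rev)
			if err != nil {
				return "", err
			}
			out.WriteString(transform(rest[0], seq))
			i += 1 + n
		default: // plain NaA token
			seq, n, err := lookup(rest, rev)
			if err != nil {
				return "", err
			}
			out.WriteString(seq)
			i += n
		}
	}
	return out.String(), nil
}

func main() {
	rev := map[string]string{"511": "9e377"}
	fmt.Println(expandOnce("51103abc115525113511", rev))
}
```

Running `expandOnce` C times, feeding each result into the next call, gives the hex string that is finally decoded to bytes.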


Usage

# Generate database (default: depth 7, up to 1G)
go run cmd/traksht2-db-gen/main.go db.t2db
go run cmd/traksht2-db-gen/main.go -depth 10 -size 2G db.t2db
go run cmd/traksht2-db-gen/main.go -depth 6 -size 2M -max-counter ff db.t2db
go run cmd/traksht2-db-gen/main.go --help

# Generate separate index files
go run cmd/traksht2-db-index/main.go db.t2db

# View database (loads up to 10M entries by default)
go run cmd/traksht2-db-view/main.go db.t2db
go run cmd/traksht2-db-view/main.go -max-entries 50000000 db.t2db
#   Arrow keys / hjkl: navigate
#   Right/Enter/l:     enter subdirectory
#   Left/h:            go back
#   /:                 search by hex sequence
#   q:                 quit

# Database statistics
go run cmd/traksht2-db-stats/main.go db.t2db

# Compress
go run cmd/traksht2/main.go db.t2db input.txt output.t2
go run cmd/traksht2/main.go -debug db.t2db input.txt output.t2

# Decompress
go run cmd/extraksht2/main.go db.t2db output.t2 restored.txt