Compressing Text in Fishguts
A comment on Adams blog for his CRPG design on text (here, made me realize I didn’t expound on my own text handling features.
I think Adam is spot on with saying “I’ve realized a CRPG is defined by the quality and amount of the dialogue” and I agree.
There is only so much you can communicate with graphics that a simple paragraph can replace.
One of my early articles explained that I had dropped lowercase letters from my font to make things more readable on a small handheld screen.
That was only part of the reason I dropped lowercase.
There are several factors involved
- Small screen, bigger fonts are easier to read
- Need a lot of dialogue for NPC’s to build atmosphere.
- Less textual variation is easier to pack (read upper vs mixed case)
Because we are dealing with text, we have already narrowed down what we will be compression. Plain old ASCII. No funky foreign characters or accents, etc. Everything fits in a predefined range of known values.
In Fishguts, game text is compressed on two levels.
1. There is the on-disk compression to save storage size.
2. There is the in memory compression of the text. Further compressed by the fact that every string is only ever represented once, no matter how many times it is referenced or used in the script
For example;
map_castle_lexington.lua : Text encoded into 4027 bytes from 6020 (66%) map_castle_lexington.luac : lz encoded in (26,889) out (14,262) (53%)
The text only inside the script is compacted to 66% of its original size saving 1993 bytes. These are compacted inside the script so the script can run and will only decompact a string as it is used on the fly.
Secondly, the file is compressed to 53% of its original size on disk saving 12,627 bytes (This is modified + compiled lua bytecode)
The bigger win is really having the script only decode text as its needed and storing multiple copies of each string only once. This script alone saves 2k of ram nearly for just that.
This is achieved by using 5bit ascii with a two level dictionary.
ABCDEFGHIJKLMNOPQRSTUVWXYZ .,_ =0123456789-+?%~$!();:"'#/*&@|
5 bits allows you to encode 32 characters.
The last two characters in the lookup are used to switch between differences or mark the end of a string.
Most text that NPC’s spout will wholly consist of the first level of the dictionary (A-Z .,_).
Everything is packed and shuffled into 5 bits. This lets you pack 3 characters into two bytes with 1 bit left to spare. With the dictionary control code I can swap between my normal and my special dictionary and print whatever was needed. Were I to include a lowercase dictionary, things would get a bit unwieldy.
I’m debating right now to see if I should move to using the same method Magnetic Scrolls used for their text, in that I could huffman encode the strings, and each script could use a common dictionary. The problem with that is including scripts from one and other would screw up the dynamic dictionary. Having a static dictionary across the entire app might not compress as well.
(The above paragraph in uppercase, encoded into 239 bytes from 378 (63%), saving 139 bytes).
Decoding is fast and doesn’t require an output buffer bigger than the original string.
This is a really simple dictionary based compression. Ive used other similar systems in the past. A more common variation on this dictionary method is one that takes the most common 16 letters, and pairs them with the next most common 16 letters. This lets you pack two letters into one. (This is what was used in Eye of the Beholder and other Westwood games).
The first part of the dictionary is
" etainosrlhcdupm" (note the space at the start)
and the substrings are
(noted the embedded spaces) "tasio wb", " rnsdalm", "h ieoras", "nrtlc sy", "nstcloer", " dtgesio", "nr ufmsw", " tep.ica", "e oiadur", " laeiyod", "eia otru", "etoakhlr", " eiu,.oa", "nsrctlai", "leoiratp", "eaoip bm"
By using the top bit as a marker, the next 4 bits are the primary index, then the next 3 bits are the secondary index. If the top bit is not set, just output the 7 bits of non-encoded character.
This can be good or bad, depending on your dictionaries (The above is fairly optimal for a primary index) but like all dictionary routines you never achieve super compression.
Moving away from dictionaries gets you into tree’s (Splay / Huffman / Shannon-Fano) or LZ77/78 derived schemes that require extra overhead to decompress, build tree’s and tables and other baggage.
With ram limitations and speed concerns, I’ll be sticking to my dictionary encoding.
Filed Under : Computers • Development • Fishguts •
Comments are closed Commented on by (3) people. Read those Comments Here