Stu's Rusty Bucket

Compressing Text in Fishguts

A comment on Adams blog for his CRPG design on text (here, made me realize I didn’t expound on my own text handling features.

I think Adam is spot on with saying “I’ve realized a CRPG is defined by the quality and amount of the dialogue” and I agree.

There is only so much you can communicate with graphics that a simple paragraph can replace.

One of my early articles explained that I had dropped lowercase letters from my font to make things more readable on a small handheld screen.

That was only part of the reason I dropped lowercase.

There are several factors involved

- Small screen, bigger fonts are easier to read
- Need a lot of dialogue for NPC’s to build atmosphere.
- Less textual variation is easier to pack (read upper vs mixed case)

Because we are dealing with text, we have already narrowed down what we will be compression. Plain old ASCII. No funky foreign characters or accents, etc. Everything fits in a predefined range of known values.

In Fishguts, game text is compressed on two levels.

1. There is the on-disk compression to save storage size.
2. There is the in memory compression of the text. Further compressed by the fact that every string is only ever represented once, no matter how many times it is referenced or used in the script

For example;

map_castle_lexington.lua : Text encoded into 4027 bytes from 6020 (66%)
map_castle_lexington.luac : lz encoded in (26,889) out (14,262) (53%)

The text only inside the script is compacted to 66% of its original size saving 1993 bytes. These are compacted inside the script so the script can run and will only decompact a string as it is used on the fly.

Secondly, the file is compressed to 53% of its original size on disk saving 12,627 bytes (This is modified + compiled lua bytecode)

The bigger win is really having the script only decode text as its needed and storing multiple copies of each string only once. This script alone saves 2k of ram nearly for just that.

This is achieved by using 5bit ascii with a two level dictionary.

ABCDEFGHIJKLMNOPQRSTUVWXYZ .,_
=0123456789-+?%~$!();:"'#/*&@|

5 bits allows you to encode 32 characters.

The last two characters in the lookup are used to switch between differences or mark the end of a string.

Most text that NPC’s spout will wholly consist of the first level of the dictionary (A-Z .,_).

Everything is packed and shuffled into 5 bits. This lets you pack 3 characters into two bytes with 1 bit left to spare. With the dictionary control code I can swap between my normal and my special dictionary and print whatever was needed. Were I to include a lowercase dictionary, things would get a bit unwieldy.

I’m debating right now to see if I should move to using the same method Magnetic Scrolls used for their text, in that I could huffman encode the strings, and each script could use a common dictionary. The problem with that is including scripts from one and other would screw up the dynamic dictionary. Having a static dictionary across the entire app might not compress as well.

(The above paragraph in uppercase, encoded into 239 bytes from 378 (63%), saving 139 bytes).

Decoding is fast and doesn’t require an output buffer bigger than the original string.

This is a really simple dictionary based compression. Ive used other similar systems in the past. A more common variation on this dictionary method is one that takes the most common 16 letters, and pairs them with the next most common 16 letters. This lets you pack two letters into one. (This is what was used in Eye of the Beholder and other Westwood games).

The first part of the dictionary is

" etainosrlhcdupm" (note the space at the start)

and the substrings are

(noted the embedded spaces)
"tasio wb",
" rnsdalm",
"h ieoras",
"nrtlc sy",
"nstcloer",
" dtgesio",
"nr ufmsw",
" tep.ica",
"e oiadur",
" laeiyod",
"eia otru",
"etoakhlr",
" eiu,.oa",
"nsrctlai",
"leoiratp",
"eaoip bm"

By using the top bit as a marker, the next 4 bits are the primary index, then the next 3 bits are the secondary index. If the top bit is not set, just output the 7 bits of non-encoded character.

This can be good or bad, depending on your dictionaries (The above is fairly optimal for a primary index) but like all dictionary routines you never achieve super compression.

Moving away from dictionaries gets you into tree’s (Splay / Huffman / Shannon-Fano) or LZ77/78 derived schemes that require extra overhead to decompress, build tree’s and tables and other baggage.

With ram limitations and speed concerns, I’ll be sticking to my dictionary encoding.

Posted by on 02/19 at 11:03 AM    
Filed Under : ComputersDevelopmentFishguts
Comments are closed Commented on by (3) people. Read those Comments Here


How much text do you want to have that you need to compress it?

Posted by purestrain  on  02/19  at  12:46 PM

Lots! lol. Not every NPC is unique but NPC’s must have responses to a lot of different keywords. Most responses are just a sentence long to a paragraph long. To keep things spiced up, each castle/town might have 3 different ‘stock’ Guards responses (which could be in the order of 40+ keywords, each one a sentence or two x 3)…

see old screenshots here.

Each NPC will have a good amount of knowledge about things local to that NPC. Castles have more NPC’s than towns.

I’m trying to get away from Ultima III’s two word responses being the same for every character. Ultima IV didnt really go far enough with each NPC only knowing about 1 unique thing and 3 generic things or something.

Text needed to be compressed because one of my target devices was originally the GP32 which has meager amounts of ram (in comparison to the GP2X, etc).

In the start location, a basic NPC has so far about 11 specific responses and 25 generic responses. Thats without everything else written down.

Once I start writing the rest of the story into the game data, this will swell from 11 specific responses. Now multiply those by the uniques in each town. First town might have 10 or so uniques.

My initial saving of 2k of memory at runtime will increase. I expect between 4-8k of savings per location. This is without duplicate strings.

Waste not want not.

Posted by  on  02/19  at  01:09 PM

Thanks for the mention, I’ll be elaborating my point in a future article. smile

Text compression’s always a good idea, even in modern games. They rarely do bit-packing as manipulating anything under a byte is less efficient. On most architectures, a byte is the smallest value that is handled in an optimal fashion.

With my own text compression, I wanted lower-case, punctuation, and some symbols, so I kept it at 96 characters. Values 0-31 are 2-3 common letter combinations, and the eighth bit (128-255) signifies a space preceding. That gets me a decent 20-30% compression on average. I’m avoiding bit-packing techniques for now; I need to keep the decompression fairly simple. In assembly, a complex compression algorithm can get pretty messy.

Posted by Adamantyr  on  02/20  at  12:30 PM
Page 1 of 1 pages

Next entry: Again with the small steps

Previous entry: Ongoing development

<< Back to main