The Art of
ASSEMBLY LANGUAGE PROGRAMMING

CHAPTER FIFTEEN:
STRINGS AND CHARACTER SETS (Part 3)
15.2 - Character Strings
15.2.1 - Types of Strings
15.2.2 - String Assignment
15.2.3 - String Comparison

15.2 Character Strings

Since you'll encounter character strings more often than other types of strings, they deserve special attention. The following sections describe character strings and various types of string operations.

15.2.1 Types of Strings

At the most basic level, the 80x86's string instructions operate only upon arrays of characters. However, since most string data types contain an array of characters as a component, the 80x86's string instructions are handy for manipulating that portion of the string.

Probably the biggest difference between a character string and an array of characters is the length attribute. An array of characters contains a fixed number of characters. Never any more, never any less. A character string, however, has a dynamic run-time length, that is, the number of characters contained in the string at some point in the program. Character strings, unlike arrays of characters, have the ability to change their size during execution (within certain limits, of course).

To complicate things even more, there are two generic types of strings: statically allocated strings and dynamically allocated strings. Statically allocated strings are given a fixed, maximum length at program creation time. The length of the string may vary at run-time, but only between zero and this maximum length. Most systems allocate storage for dynamically allocated strings from a memory pool at run-time and return that storage to the pool when the string is no longer needed. Such strings may be any length (up to some reasonable maximum value). Accessing such strings is usually less efficient than accessing statically allocated strings. Furthermore, garbage collection[5] may take additional time. Nevertheless, dynamically allocated strings are much more space efficient than statically allocated strings and, in some instances, accessing dynamically allocated strings is faster as well. Most of the examples in this chapter will use statically allocated strings.

A string with a dynamic length needs some way of keeping track of this length. While there are several possible ways to represent string lengths, the two most popular are length-prefixed strings and zero-terminated strings. A length-prefixed string begins with a single byte or word that contains the length of the string. The characters that make up the string immediately follow this length value. Assuming the use of byte prefix lengths, you could define the string "HELLO" as follows:

HelloStr        byte    5,"HELLO"

Length-prefixed strings are often called Pascal strings since this is the type of string variable supported by most versions of Pascal[6].

Another popular way to specify string lengths is to use zero-terminated strings. A zero-terminated string consists of a string of characters terminated with a zero byte. These types of strings are often called C-strings since they are the type used by the C and C++ programming languages. The UCR Standard Library, since it mimics the C standard library, also uses zero-terminated strings.
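
For comparison with the length-prefixed definition shown earlier, the same "HELLO" string in zero-terminated form might be declared as follows (the name HelloCStr is made up for this illustration):

```asm
HelloCStr       byte    "HELLO",0       ;Five characters plus a zero byte.
```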

Pascal strings are much better than C/C++ strings for several reasons. First, computing the length of a Pascal string is trivial. You need only fetch the first byte (or word) of the string and you've got the length of the string. Computing the length of a C/C++ string is considerably less efficient. You must scan the entire string (e.g., using the scasb instruction) for a zero byte. If the C/C++ string is long, this can take a long time. Furthermore, C/C++ strings cannot contain the NUL character (a zero byte), since that byte marks the end of the string. On the other hand, C/C++ strings can be any length, yet require only a single extra byte of overhead. Pascal strings, however, can be no longer than 255 characters when using only a single length byte. For strings longer than 255 bytes, you'll need two bytes to hold the length for a Pascal string. Since most strings are fewer than 256 characters in length, this isn't much of a disadvantage.
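
The following sketch contrasts the two length computations. It assumes that ds and es both point at the segment holding the variables PasStr and CStr, which are hypothetical names introduced only for this illustration. Each fragment leaves the length (5) in cx; the two fragments are independent of one another.

```asm
; Length of a Pascal string: fetch the prefix byte.

                mov     cl, PasStr      ;Get the length byte.
                mov     ch, 0           ;Extend length to 16 bits.

; Length of a zero-terminated string: scan for the zero byte.

                lea     di, CStr        ;ES:DI points at the string.
                mov     cx, 0ffffh      ;Scan for as long as it takes.
                mov     al, 0           ;Scan for a zero.
                cld                     ;Scan in the forward direction.
        repne   scasb
                neg     cx              ;cx = (# of bytes scanned) + 1.
                dec     cx              ;cx = # of bytes scanned.
                dec     cx              ;Don't count the zero byte.
                 .
                 .
                 .
PasStr          byte    5,"HELLO"
CStr            byte    "HELLO",0
```

The Pascal version costs two instructions regardless of the string's length, while the scan must touch every byte of the zero-terminated string.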

An advantage of zero-terminated strings is that they are easy to declare in an assembly language program: you simply append a zero byte to the text. A Pascal string, by contrast, requires that you supply the length, and counting up every character in a string is so tedious that it's not even worth considering. This is particularly true of strings that are so long they require multiple source code lines in your assembly language programs. However, you can write a macro which will easily build Pascal strings for you:

PString         macro   String
                local   StringLength, StringStart
                byte    StringLength
StringStart     byte    String
StringLength    =       $-StringStart
                endm
                 .
                 .
                 .
                PString "This string has a length prefix"

As long as the string fits entirely on one source line, you can use this macro to generate Pascal style strings.
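
For a string that spans several source lines, one possible workaround is to apply the same "$" arithmetic by hand rather than through the macro. In this sketch the names LongStr, LongStrStart, and LongStrLen are hypothetical:

```asm
LongStr         byte    LongStrLen      ;Length prefix, computed below.
LongStrStart    byte    "This string is too long to fit "
                byte    "on a single source line"
LongStrLen      =       $-LongStrStart  ;Counts only the characters.
```

As with the macro, the assembler computes the length, so the prefix stays correct when the text changes.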

Common string functions like concatenation, length, substring, index, and others are much easier to write when using length-prefixed strings, so we'll use Pascal strings unless otherwise noted. Furthermore, the UCR Standard Library provides a large number of C/C++ string functions, so there is no need to replicate those functions here.

15.2.2 String Assignment

You can easily assign one string to another using the movsb instruction. For example, if you want to assign the length-prefixed string String1 to String2, use the following:

; Presumably, ES and DS are set up already

                lea     si, String1
                lea     di, String2
                mov     ch, 0           ;Extend len to 16 bits.
                mov     cl, String1     ;Get string length.
                inc     cx              ;Include length byte.
                cld                     ;Copy in the forward direction.
        rep     movsb

This code increments cx by one before executing movsb because the length byte contains the length of the string exclusive of the length byte itself.

Generally, string variables can be initialized to constants at assembly time by using the PString macro described earlier. However, if you need to set a string variable to some constant value at run-time, you can write a StrAssign subroutine which assigns the zero-terminated string immediately following the call. The following program demonstrates such a procedure:

                include         stdlib.a
                includelib      stdlib.lib

cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg, es:dseg, ss:sseg

; String assignment procedure

MainPgm         proc    far
                mov     ax, seg dseg
                mov     ds, ax
                mov     es, ax

                lea     di, ToString
                call    StrAssign
                byte    "This is an example of how the " 
                byte    "StrAssign routine is used",0
                nop
                ExitPgm
MainPgm         endp

StrAssign       proc    near
                push    bp
                mov     bp, sp
                pushf
                push    ds
                push    si
                push    di
                push    cx
                push    ax
                push    di              ;Save again for use later.
                push    es
                cld

; Get the address of the source string

                mov     ax, cs
                mov     es, ax
                mov     di, 2[bp]       ;Get return address.
                mov     cx, 0ffffh      ;Scan for as long as it takes.
                mov     al, 0           ;Scan for a zero.
        repne   scasb                   ;Compute the length of string.
                neg     cx              ;Convert length to a positive #.
                dec     cx              ;Because we started with -1, not 0.
                dec     cx              ;skip zero terminating byte.

; Now copy the strings

                pop     es              ;Get destination segment.
                pop     di              ;Get destination address.
                mov     al, cl          ;Store length byte.
                stosb

; Now copy the source string.

                mov     ax, cs
                mov     ds, ax
                mov     si, 2[bp]
        rep     movsb

; Update the return address and leave:

                inc     si              ;Skip over zero byte.
                mov     2[bp], si

                pop     ax
                pop     cx
                pop     di
                pop     si
                pop     ds
                popf
                pop     bp
                ret
StrAssign       endp

cseg            ends

dseg            segment para public 'data'
ToString        byte    255 dup (0)
dseg            ends

sseg            segment para stack 'stack'
                word    256 dup (?)
sseg            ends
                end     MainPgm

This code uses the scas instruction to determine the length of the string immediately following the call instruction. Once the code determines the length, it stores this length into the first byte of the destination string and then copies the text following the call to the string variable. After copying the string, this code adjusts the return address so that it points just beyond the zero terminating byte. Then the procedure returns control to the caller.

Of course, this string assignment procedure isn't very efficient, but it's very easy to use. Setting up es:di is all that you need to do to use this procedure. If you need fast string assignment, simply use the movs instruction as follows:

; Presumably, DS and ES have already been set up.

                lea     si, SourceString
                lea     di, DestString
                mov     cx, LengthSource ;Includes the length byte.
                cld                     ;Copy in the forward direction.
        rep     movsb
                 .
                 .
                 .
SourceString    byte    LengthSource-1
                byte    "This is an example of how the "
                byte    "StrAssign routine is used"
LengthSource    =       $-SourceString 

DestString      byte    256 dup (?)

Using in-line instructions requires considerably more setup (and typing!), but it is much faster than the StrAssign procedure. If you don't like the typing, you can always write a macro to do the string assignment for you.
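
One possible shape for such a macro is sketched below. The name StrCopy and its parameter order are assumptions made for this illustration. Like the in-line code, it expects ds and es to be set up already and both operands to be length-prefixed string variables. It overwrites si, di, and cx, and it clears the direction flag.

```asm
StrCopy         macro   Dest, Source
                lea     si, Source
                lea     di, Dest
                mov     cl, Source      ;Get string length.
                mov     ch, 0           ;Extend len to 16 bits.
                inc     cx              ;Include length byte.
                cld                     ;Copy in the forward direction.
        rep     movsb
                endm
                 .
                 .
                 .
                StrCopy DestString, SourceString
```

Because the macro reads the length byte at run-time, it also works for strings whose length is not known when the program is assembled.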

15.2.3 String Comparison

Comparing two character strings was already beaten to death in the section on the cmps instruction. Other than providing some concrete examples, there is no reason to consider this subject any further.

Note: all the following examples assume that es and ds are pointing at the proper segments containing the destination and source strings.

Comparing Str1 to Str2:

                lea     si, Str1+1      ;Skip the length bytes so that
                lea     di, Str2+1      ; cmpsb compares only characters.

; Get the minimum length of the two strings.

                mov     al, Str1
                mov     cl, al
                cmp     al, Str2
                jb      CmpStrs
                mov     cl, Str2

; Compare the two strings.

CmpStrs:        mov     ch, 0
                cld
        repe    cmpsb
                jne     StrsNotEqual

; If CMPS thinks they're equal, compare their lengths 
; just to be sure.

                cmp     al, Str2
StrsNotEqual:

At label StrsNotEqual, the flags will contain all the pertinent information about the ranking of these two strings. You can use the conditional jump instructions to test the result of this comparison.
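
As an illustration, the code at StrsNotEqual might dispatch on the result as follows. The labels Str1IsLess and Str1IsGreater are hypothetical. The unsigned jumps are used on the assumption that character codes and length bytes are unsigned values.

```asm
StrsNotEqual:   jb      Str1IsLess      ;Str1 < Str2.
                ja      Str1IsGreater   ;Str1 > Str2.

; Falling through to here means the two strings are equal.
```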


[5] Reclaiming unused storage.

[6] At least those versions of Pascal which support strings.

Chapter Fifteen: Strings And Character Sets (Part 3)
28 SEP 1996