CHAPTER SIXTEEN: PATTERN MATCHING (Part 6)

The Art of ASSEMBLY LANGUAGE PROGRAMMING

Chapter Sixteen (Part 5)	Table of Content	Chapter Sixteen (Part 7)

CHAPTER SIXTEEN:
PATTERN MATCHING (Part 6)

16.4 - Designing Your Own Pattern Matching Routines
16.5 - Extracting Substrings from Matched Patterns

16.4 Designing Your Own Pattern Matching Routines

Although the UCR Standard Library provides a wide variety of matching functions, there is no way to anticipate the needs of all applications. Therefore, you will probably discover that the library does not support some particular pattern matching function you need. Fortunately, it is very easy for you to create your own pattern matching functions to augment those available in the UCR Standard Library. When you specify a matching function name in the pattern data structure, the match routine calls the specified address using a far call and passing the following parameters:

es:di- Points at the next character in the input string. You should not look at any characters before this address. Furthermore, you should never look beyond the end of the string (see cx below).

ds:si- Contains the four byte parameter found in the matchparm field.

cx- Contains the last position, plus one, in the input string you're allowed to look at. Note that your pattern matching routine should not look beyond location es:cx or the zero terminating byte; whichever comes first in the input string.

On return from the function, ax must contain the offset into the string (di's value) of the last character matched plus one, if your matching function is successful. It must also set the carry flag to denote success. After your pattern matches, the match routine might call another matching function (the one specified by the next pattern field) and that function begins matching at location es:ax.

If the pattern match fails, then you must return the original di value in the ax register and return with the carry flag clear. Note that your matching function must preserve all other registers.

There is one very important detail you must never forget with writing your own pattern matching routines - ds does not point at your data segment, it contains the H.O. word of the matchparm parameter. Therefore, if you are going to access global variables in your data segment you will need to push ds, load it with the address of dseg, and pop ds before leaving. Several examples throughout this chapter demonstrate how to do this.

There are some obvious omissions from (the current version of) the UCR Standard Library's repertoire. For example, there should probably be matchtoistr, matchichar, and matchtoichar pattern functions. The following example code demonstrates how to add a matchtoistr (match up to a string, doing a case insensitive comparison) routine.

                .xlist

                include         stdlib.a
                includelib      stdlib.lib
                matchfuncs
                .list

dseg            segment para public 'data'

TestString      byte    "This is the string 'xyz' in it",cr,lf,0

TestPat         pattern {matchtoistr,xyz}
xyz             byte    "XYZ",0

dseg            ends


cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

; MatchToiStr-  Matches all characters in a string up to, and including, the
;               specified parameter string. The parameter string must be
;               all upper case characters. This guy matches string using
;               a case insensitive comparison.
;
; inputs:
;               es:di-  Source string
;               ds:si-  String to match
;               cx-     Maximum match position
;
; outputs:
;               ax-     Points at first character beyond the end of the
;                       matched string if success, contains the initial DI
;                       value if failure occurs.
;               carry-  0 if failure, 1 if success.

MatchToiStr     proc    far
                pushf
                push    di
                push    si
                cld

; Check to see if we're already past the point were we're allowed
; to scan in the input string.

                cmp     di, cx
                jae     MTiSFailure

; If the pattern string is the empty string, always match.

                cmp     byte ptr ds:[si], 0
                je      MTSsuccess

; The following loop scans through the input string looking for
; the first character in the pattern string.

ScanLoop:       push    si
                lodsb                   ;Get first char of string

                dec     di
FindFirst:      inc     di              ;Move on to next (or 1st) char.
                cmp     di, cx          ;If at cx, then we've got to
                jae CantFind1st ; fail.

                mov     ah, es:[di]     ;Get input character.
                cmp     ah, 'a'         ;Convert input character to
                jb      DoCmp           ; upper case if it's a lower
                cmp     ah, 'z'         ; case character.
                ja      DoCmp
                and     ah, 5fh
DoCmp:          cmp     al, ah          ;Compare input character against
                jne     FindFirst       ; pattern string.


; At this point, we've located the first character in the input string
; that matches the first character of the pattern string. See if the
; strings are equal.

                push    di              ;Save restart point.

CmpLoop:        cmp     di, cx          ;See if we've gone beyond the
                jae     StrNotThere     ; last position allowable.
                lodsb                   ;Get next input character.
                cmp     al, 0           ;At the end of the parameter
                je      MTSsuccess2     ; string? If so, succeed.

                inc     di
                mov     ah, es:[di]     ;Get the next input character.
                cmp     ah, 'a'         ;Convert input character to
                jb      DoCmp2          ; upper case if it's a lower
                cmp     ah, 'z'         ; case character.
                ja      DoCmp2
                and     ah, 5fh
DoCmp2:         cmp     al, ah          ;Compare input character against
                je      CmpLoop
                pop     di
                pop     si
                jmp     ScanLoop

StrNotThere:    add     sp, 2           ;Remove di from stack.
CantFind1st:    add     sp, 2           ;Remove si from stack.
MTiSFailure:    pop     si
                pop     di
                mov     ax, di          ;Return failure position in AX.
                popf
                clc                     ;Return failure.
                ret

MTSSuccess2:    add     sp, 2           ;Remove DI value from stack.
MTSSuccess:     add     sp, 2           ;Remove SI value from stack.
                mov     ax, di          ;Return next position in AX.
                pop     si
                pop     di
                popf
                stc                     ;Return success.
                ret
MatchToiStr     endp

Main            proc
                mov     ax, dseg
                mov     ds, ax
                mov     es, ax
                meminit

                lesi    TestString
                ldxi    TestPat
                xor     cx, cx
                match
                jnc     NoMatch
                print
                byte    "Matched",cr,lf,0
                jmp     Quit

NoMatch:        print
                byte    "Did not match",cr,lf,0

Quit:           ExitPgm
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             db      1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
                end     Main

16.5 Extracting Substrings from Matched Patterns

Often, simply determining that a string matches a given pattern is insufficient. You may want to perform various operations that depend upon the actual information in that string. However, the pattern matching facilities described thus far do not provide a mechanism for testing individual components of the input string. In this section, you will see how to extract portions of a pattern for further processing.

Perhaps an example may help clarify the need to extract portions of a string. Suppose you are writing a stock buy/sell program and you want it to process commands described by the following regular expression:

(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)

While it is easy to devise a Standard Library pattern that recognizes strings of this form, calling the match routine would only tell you that you have a legal buy or sell command. It does not tell you if you are to buy or sell, who to buy or sell, or how many shares to buy or sell. Of course, you could take the cross product of (buy | sell) with (ibm | apple | hp | dec) and generate eight different regular expressions that uniquely determine whether you're buying or selling and whose stock you're trading, but you can't process the integer values this way (unless you willing to have millions of regular expressions). A better solution would be to extract substrings from the legal pattern and process these substrings after you verify that you have a legal buy or sell command. For example, you could extract buy or sell into one string, the digits into another, and the company name into a third. After verifying the syntax of the command, you could process the individual strings you've extracted. The UCR Standard Library patgrab routine provides this capability for you.

You normally call patgrab after calling match and verifying that it matches the input string. Patgrab expects a single parameter - a pointer to a pattern recently processed by match. Patgrab creates a string on the heap consisting of the characters matched by the given pattern and returns a pointer to this string in es:di. Note that patgrab only returns a string associated with a single pattern data structure, not a chain of pattern data structures. Consider the following pattern:

PatToGrab       pattern {matchstr, str1, 0, Pat2}
Pat2            pattern {matchstr, str2}
str1            byte    "Hello",0
str2            byte    " there",0

Calling match on PatToGrab will match the string "Hello there". However, if after calling match you call patgrab and pass it the address of PatToGrab, patgrab will return a pointer to the string "Hello".

Of course, you might want to collect a string that is the concatenation of several strings matched within your pattern (i.e., a portion of the pattern list). This is where calling the sl_match2 pattern matching function comes in handy. Consider the following pattern:

Numbers         pattern {sl_match2, FirstNumber}
FirstNumber     pattern {anycset, digits, 0, OtherDigs}
OtherDigs       pattern {spancset, digits}

This pattern matches the same strings as

Numbers         pattern {anycset, digits, 0, OtherDigs}
OtherDigs       pattern {spancset, digits}

So why bother with the extra pattern that calls sl_match2? Well, as it turns out the sl_match2 matching function lets you create parenthetical patterns. A parenthetical pattern is a pattern list that the pattern matching routines (especially patgrab) treat as a single pattern. Although the match routine will match the same strings regardless of which version of Numbers you use, patgrab will produce two entirely different strings depending upon your choice of the above patterns. If you use the latter version, patgrab will only return the first digit of the number. If you use the former version (with the call to sl_match2), then patgrab returns the entire string matched by sl_match2, and that turns out to be the entire string of digits.

The following sample program demonstrates how to use parenthetical patterns to extract the pertinent information from the stock command presented earlier. It uses parenthetical patterns for the buy/sell command, the number of shares, and the company name.

                .xlist
                include                 stdlib.a
                includelib              stdlib.lib
                matchfuncs
                .list

dseg            segment para public 'data'

; Variables used to hold the number of shares bought/sold, a pointer to
; a string containing the buy/sell command, and a pointer to a string
; containing the company name.

Count           word    0
CmdPtr          dword   ?
CompPtr         dword   ?

; Some test strings to try out:

Cmd1            byte    "Buy 25 shares of apple stock",0
Cmd2            byte    "Sell 50 shares of hp stock",0
Cmd3            byte    "Buy 123 shares of dec stock",0
Cmd4            byte    "Sell 15 shares of ibm stock",0
BadCmd0         byte    "This is not a buy/sell command",0

; Patterns for the stock buy/sell command:
;
; StkCmd matches buy or sell and creates a parenthetical pattern
; that contains the string "buy" or "sell".

StkCmd          pattern {sl_match2, buyPat, 0, skipspcs1}

buyPat          pattern {matchistr,buystr,sellpat}
buystr          byte    "BUY",0

sellpat         pattern {matchistr,sellstr}
sellstr         byte    "SELL",0

; Skip zero or more white space characters after the buy command.

skipspcs1       pattern {spancset, whitespace, 0, CountPat}

; CountPat is a parenthetical pattern that matches one or more
; digits.

CountPat        pattern {sl_match2, Numbers, 0, skipspcs2}
Numbers         pattern {anycset, digits, 0, RestOfNum}
RestOfNum       pattern {spancset, digits}

; The following patterns match " shares of " allowing any amount
; of white space between the words.

skipspcs2       pattern {spancset, whitespace, 0, sharesPat}

sharesPat       pattern {matchistr, sharesStr, 0, skipspcs3}
sharesStr       byte    "SHARES",0

skipspcs3       pattern {spancset, whitespace, 0, ofPat}

ofPat           pattern {matchistr, ofStr, 0, skipspcs4}
ofStr           byte    "OF",0

skipspcs4       pattern {spancset, whitespace, 0, CompanyPat}

; The following parenthetical pattern matches a company name.
; The patgrab-available string will contain the corporate name.

CompanyPat      pattern {sl_match2, ibmpat}

ibmpat          pattern {matchistr, ibm, applePat}
ibm             byte    "IBM",0

applePat        pattern {matchistr, apple, hpPat}
apple           byte    "APPLE",0

hpPat           pattern {matchistr, hp, decPat}
hp              byte    "HP",0

decPat          pattern {matchistr, decstr}
decstr          byte    "DEC",0

                include stdsets.a
dseg            ends

cseg            segment para public 'code'
                assume  cs:cseg, ds:dseg

; DoBuySell-    This routine processes a stock buy/sell command.
;               After matching the command, it grabs the components
;               of the command and outputs them as appropriate.
;               This routine demonstrates how to use patgrab to
;               extract substrings from a pattern string.
;
;               On entry, es:di must point at the buy/sell command
;               you want to process.

DoBuySell       proc    near
                ldxi StkCmd
                xor     cx, cx
                match
                jnc     NoMatch

                lesi    StkCmd
                patgrab
                mov     word ptr CmdPtr, di
                mov     word ptr CmdPtr+2, es

                lesi    CountPat
                patgrab
                atoi                    ;Convert digits to integer
                mov     Count, ax
                free                    ;Return storage to heap.

                lesi    CompanyPat
                patgrab
                mov     word ptr CompPtr, di
                mov     word ptr CompPtr+2, es

                printf
                byte    "Stock command: %^s\n"
                byte    "Number of shares: %d\n"
                byte    "Company to trade: %^s\n\n",0
                dword   CmdPtr, Count, CompPtr

                les     di, CmdPtr
                free
                les     di, CompPtr
                free
                ret

NoMatch:                print
                byte    "Illegal buy/sell command",cr,lf,0
                ret
DoBuySell       endp


Main            proc
                mov     ax, dseg
                mov     ds, ax
                mov     es, ax

                meminit

                lesi    Cmd1
                call    DoBuySell
                lesi    Cmd2
                call    DoBuySell
                lesi    Cmd3
                call    DoBuySell
                lesi    Cmd4
                call    DoBuySell
                lesi    BadCmd0
                call    DoBuySell

Quit:           ExitPgm
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             db      1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       db      16 dup (?)
zzzzzzseg       ends
                end     Main


Sample program output:
Stock command: Buy
Number of shares: 25
Company to trade: apple

Stock command: Sell
Number of shares: 50
Company to trade: hp

Stock command: Buy
Number of shares: 123
Company to trade: dec

Stock command: Sell
Number of shares: 15
Company to trade: ibm

Illegal buy/sell command


Chapter Sixteen (Part 5)	Table of Content	Chapter Sixteen (Part 7)

Chapter Sixteen: Pattern Matching (Part 6)
29 SEP 1996