Chapter Sixteen (Part 5) |
Table of Content | Chapter Sixteen (Part 7) |
CHAPTER SIXTEEN: PATTERN MATCHING (Part 6) |
16.4 -
Designing Your Own Pattern Matching Routines 16.5 - Extracting Substrings from Matched Patterns |
16.4 Designing Your Own Pattern Matching Routines |
Although the UCR Standard Library provides a wide variety of matching functions, there is no way to anticipate the needs of all applications. Therefore, you will probably discover that the library does not support some particular pattern matching function you need. Fortunately, it is very easy for you to create your own pattern matching functions to augment those available in the UCR Standard Library. When you specify a matching function name in the pattern data structure, the match routine calls the specified address using a far call and passing the following parameters:
es:di
- Points at the next character in the
input string. You should not look at any characters before this address. Furthermore, you
should never look beyond the end of the string (see cx
below).
ds:si
- Contains the four byte parameter found
in the matchparm
field.
cx
- Contains the last position, plus one, in
the input string you're allowed to look at. Note that your pattern matching routine should
not look beyond location es:cx
or the zero terminating byte; whichever comes
first in the input string.
On return from the function, ax
must contain
the offset into the string (di
's value) of the last character matched plus
one, if your matching function is successful. It must also set the carry flag to denote
success. After your pattern matches, the match routine might call another matching
function (the one specified by the next pattern field) and that function begins matching
at location es:ax
.
If the pattern match fails, then you must return the
original di
value in the ax
register and return with the carry
flag clear. Note that your matching function must preserve all other registers.
There is one very important detail you must never forget
with writing your own pattern matching routines - ds
does not point at your
data segment, it contains the H.O. word of the matchparm
parameter.
Therefore, if you are going to access global variables in your data segment you will need
to push ds
, load it with the address of dseg
, and pop ds
before leaving. Several examples throughout this chapter demonstrate how to do this.
There are some obvious omissions from (the current version
of) the UCR Standard Library's repertoire. For example, there should probably be matchtoistr
,
matchichar
, and matchtoichar
pattern functions. The following
example code demonstrates how to add a matchtoistr
(match up to a string,
doing a case insensitive comparison) routine.
.xlist include stdlib.a includelib stdlib.lib matchfuncs .list dseg segment para public 'data' TestString byte "This is the string 'xyz' in it",cr,lf,0 TestPat pattern {matchtoistr,xyz} xyz byte "XYZ",0 dseg ends cseg segment para public 'code' assume cs:cseg, ds:dseg ; MatchToiStr- Matches all characters in a string up to, and including, the ; specified parameter string. The parameter string must be ; all upper case characters. This guy matches string using ; a case insensitive comparison. ; ; inputs: ; es:di- Source string ; ds:si- String to match ; cx- Maximum match position ; ; outputs: ; ax- Points at first character beyond the end of the ; matched string if success, contains the initial DI ; value if failure occurs. ; carry- 0 if failure, 1 if success. MatchToiStr proc far pushf push di push si cld ; Check to see if we're already past the point were we're allowed ; to scan in the input string. cmp di, cx jae MTiSFailure ; If the pattern string is the empty string, always match. cmp byte ptr ds:[si], 0 je MTSsuccess ; The following loop scans through the input string looking for ; the first character in the pattern string. ScanLoop: push si lodsb ;Get first char of string dec di FindFirst: inc di ;Move on to next (or 1st) char. cmp di, cx ;If at cx, then we've got to jae CantFind1st ; fail. mov ah, es:[di] ;Get input character. cmp ah, 'a' ;Convert input character to jb DoCmp ; upper case if it's a lower cmp ah, 'z' ; case character. ja DoCmp and ah, 5fh DoCmp: cmp al, ah ;Compare input character against jne FindFirst ; pattern string. ; At this point, we've located the first character in the input string ; that matches the first character of the pattern string. See if the ; strings are equal. push di ;Save restart point. CmpLoop: cmp di, cx ;See if we've gone beyond the jae StrNotThere ; last position allowable. lodsb ;Get next input character. cmp al, 0 ;At the end of the parameter je MTSsuccess2 ; string? If so, succeed. inc di mov ah, es:[di] ;Get the next input character. cmp ah, 'a' ;Convert input character to jb DoCmp2 ; upper case if it's a lower cmp ah, 'z' ; case character. ja DoCmp2 and ah, 5fh DoCmp2: cmp al, ah ;Compare input character against je CmpLoop pop di pop si jmp ScanLoop StrNotThere: add sp, 2 ;Remove di from stack. CantFind1st: add sp, 2 ;Remove si from stack. MTiSFailure: pop si pop di mov ax, di ;Return failure position in AX. popf clc ;Return failure. ret MTSSuccess2: add sp, 2 ;Remove DI value from stack. MTSSuccess: add sp, 2 ;Remove SI value from stack. mov ax, di ;Return next position in AX. pop si pop di popf stc ;Return success. ret MatchToiStr endp Main proc mov ax, dseg mov ds, ax mov es, ax meminit lesi TestString ldxi TestPat xor cx, cx match jnc NoMatch print byte "Matched",cr,lf,0 jmp Quit NoMatch: print byte "Did not match",cr,lf,0 Quit: ExitPgm Main endp cseg ends sseg segment para stack 'stack' stk db 1024 dup ("stack ") sseg ends zzzzzzseg segment para public 'zzzzzz' LastBytes db 16 dup (?) zzzzzzseg ends end Main
Often, simply determining that a string matches a given pattern is insufficient. You may want to perform various operations that depend upon the actual information in that string. However, the pattern matching facilities described thus far do not provide a mechanism for testing individual components of the input string. In this section, you will see how to extract portions of a pattern for further processing.
Perhaps an example may help clarify the need to extract portions of a string. Suppose you are writing a stock buy/sell program and you want it to process commands described by the following regular expression:
(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)
While it is easy to devise a Standard Library pattern that
recognizes strings of this form, calling the match
routine would only tell
you that you have a legal buy or sell command. It does not tell you if you are to buy or
sell, who to buy or sell, or how many shares to buy or sell. Of course, you could take the
cross product of (buy | sell) with (ibm | apple | hp | dec) and generate eight different
regular expressions that uniquely determine whether you're buying or selling and whose
stock you're trading, but you can't process the integer values this way (unless you
willing to have millions of regular expressions). A better solution would be to extract
substrings from the legal pattern and process these substrings after you verify that you
have a legal buy or sell command. For example, you could extract buy or sell into one
string, the digits into another, and the company name into a third. After verifying the
syntax of the command, you could process the individual strings you've extracted. The UCR
Standard Library patgrab
routine provides this capability for you.
You normally call patgrab
after calling match
and verifying that it matches the input string. Patgrab
expects a single
parameter - a pointer to a pattern recently processed by match. Patgrab
creates a string on the heap consisting of the characters matched by the given pattern and
returns a pointer to this string in es:di
. Note that patgrab
only returns a string associated with a single pattern data structure, not a chain of
pattern data structures. Consider the following pattern:
PatToGrab pattern {matchstr, str1, 0, Pat2} Pat2 pattern {matchstr, str2} str1 byte "Hello",0 str2 byte " there",0
Calling match
on PatToGrab
will
match the string "Hello there". However, if after calling match
you
call patgrab
and pass it the address of PatToGrab
, patgrab
will return a pointer to the string "Hello".
Of course, you might want to collect a string that is the
concatenation of several strings matched within your pattern (i.e., a portion of the
pattern list). This is where calling the sl_match2
pattern matching function
comes in handy. Consider the following pattern:
Numbers pattern {sl_match2, FirstNumber} FirstNumber pattern {anycset, digits, 0, OtherDigs} OtherDigs pattern {spancset, digits}
This pattern matches the same strings as
Numbers pattern {anycset, digits, 0, OtherDigs} OtherDigs pattern {spancset, digits}
So why bother with the extra pattern that calls sl_match2
?
Well, as it turns out the sl_match2
matching function lets you create
parenthetical patterns. A parenthetical pattern is a pattern list that the pattern
matching routines (especially patgrab
) treat as a single pattern. Although
the match
routine will match the same strings regardless of which version of Numbers
you use, patgrab
will produce two entirely different strings depending upon
your choice of the above patterns. If you use the latter version, patgrab
will only return the first digit of the number. If you use the former version (with the
call to sl_match2
), then patgrab
returns the entire string
matched by sl_match2
, and that turns out to be the entire string of digits.
The following sample program demonstrates how to use parenthetical patterns to extract the pertinent information from the stock command presented earlier. It uses parenthetical patterns for the buy/sell command, the number of shares, and the company name.
.xlist include stdlib.a includelib stdlib.lib matchfuncs .list dseg segment para public 'data' ; Variables used to hold the number of shares bought/sold, a pointer to ; a string containing the buy/sell command, and a pointer to a string ; containing the company name. Count word 0 CmdPtr dword ? CompPtr dword ? ; Some test strings to try out: Cmd1 byte "Buy 25 shares of apple stock",0 Cmd2 byte "Sell 50 shares of hp stock",0 Cmd3 byte "Buy 123 shares of dec stock",0 Cmd4 byte "Sell 15 shares of ibm stock",0 BadCmd0 byte "This is not a buy/sell command",0 ; Patterns for the stock buy/sell command: ; ; StkCmd matches buy or sell and creates a parenthetical pattern ; that contains the string "buy" or "sell". StkCmd pattern {sl_match2, buyPat, 0, skipspcs1} buyPat pattern {matchistr,buystr,sellpat} buystr byte "BUY",0 sellpat pattern {matchistr,sellstr} sellstr byte "SELL",0 ; Skip zero or more white space characters after the buy command. skipspcs1 pattern {spancset, whitespace, 0, CountPat} ; CountPat is a parenthetical pattern that matches one or more ; digits. CountPat pattern {sl_match2, Numbers, 0, skipspcs2} Numbers pattern {anycset, digits, 0, RestOfNum} RestOfNum pattern {spancset, digits} ; The following patterns match " shares of " allowing any amount ; of white space between the words. skipspcs2 pattern {spancset, whitespace, 0, sharesPat} sharesPat pattern {matchistr, sharesStr, 0, skipspcs3} sharesStr byte "SHARES",0 skipspcs3 pattern {spancset, whitespace, 0, ofPat} ofPat pattern {matchistr, ofStr, 0, skipspcs4} ofStr byte "OF",0 skipspcs4 pattern {spancset, whitespace, 0, CompanyPat} ; The following parenthetical pattern matches a company name. ; The patgrab-available string will contain the corporate name. CompanyPat pattern {sl_match2, ibmpat} ibmpat pattern {matchistr, ibm, applePat} ibm byte "IBM",0 applePat pattern {matchistr, apple, hpPat} apple byte "APPLE",0 hpPat pattern {matchistr, hp, decPat} hp byte "HP",0 decPat pattern {matchistr, decstr} decstr byte "DEC",0 include stdsets.a dseg ends cseg segment para public 'code' assume cs:cseg, ds:dseg ; DoBuySell- This routine processes a stock buy/sell command. ; After matching the command, it grabs the components ; of the command and outputs them as appropriate. ; This routine demonstrates how to use patgrab to ; extract substrings from a pattern string. ; ; On entry, es:di must point at the buy/sell command ; you want to process. DoBuySell proc near ldxi StkCmd xor cx, cx match jnc NoMatch lesi StkCmd patgrab mov word ptr CmdPtr, di mov word ptr CmdPtr+2, es lesi CountPat patgrab atoi ;Convert digits to integer mov Count, ax free ;Return storage to heap. lesi CompanyPat patgrab mov word ptr CompPtr, di mov word ptr CompPtr+2, es printf byte "Stock command: %^s\n" byte "Number of shares: %d\n" byte "Company to trade: %^s\n\n",0 dword CmdPtr, Count, CompPtr les di, CmdPtr free les di, CompPtr free ret NoMatch: print byte "Illegal buy/sell command",cr,lf,0 ret DoBuySell endp Main proc mov ax, dseg mov ds, ax mov es, ax meminit lesi Cmd1 call DoBuySell lesi Cmd2 call DoBuySell lesi Cmd3 call DoBuySell lesi Cmd4 call DoBuySell lesi BadCmd0 call DoBuySell Quit: ExitPgm Main endp cseg ends sseg segment para stack 'stack' stk db 1024 dup ("stack ") sseg ends zzzzzzseg segment para public 'zzzzzz' LastBytes db 16 dup (?) zzzzzzseg ends end Main Sample program output: Stock command: Buy Number of shares: 25 Company to trade: apple Stock command: Sell Number of shares: 50 Company to trade: hp Stock command: Buy Number of shares: 123 Company to trade: dec Stock command: Sell Number of shares: 15 Company to trade: ibm Illegal buy/sell command
Chapter Sixteen (Part 5) |
Table of Content | Chapter Sixteen (Part 7) |
Chapter Sixteen: Pattern Matching
(Part 6)
29 SEP 1996