CountStr and ChangeStr for Classic REXX

Next Meeting: Sat, TBD
Meeting Directions

Be a Member
Join SCOUG

Navigation:

20 Most Recent Documents
Search Archives
Index by date, title, author, category.

Features:

Mr. Know-It-All
Ink
Download!

Supporting Warpstock Little Rock 2025

Supporting Warpstock Phoenix 2023

Supporting Warpstock Orlando 2022

SCOUG:

Home

Email Lists

SIGs (Internet, General Interest, Programming, Network, more..)

Online Chats

Pictures from Sept. 1999

The views expressed in articles on this site are those of their authors.

warptech
SCOUG was there!

SCOUG, Warp Expo West, and Warpfest are trademarks of the Southern California OS/2 User Group. OS/2, Workplace Shell, and IBM are registered trademarks of International Business Machines Corporation. All other trademarks remain the property of their respective owners.

The Southern California OS/2 User Group
USA

February 2002

CountStr and ChangeStr for Classic REXX

by Dallas Legan

Digging around in the documentation for the Regina (a GPL licensed REXX interpreter), I ran across reference to two functions described as part of the ANSI standard for REXX that I could find no documentation for on my Warp system or in the late Dick Goran's classic 'REXX Reference Summary Handbook' (ISBN 0-9639854-3-4). Browsing the REXX Language Association web site, I found a PDF draft of the standard that backed up the description in the Regina documentation. CountStr counts all non-overlapping instances of a substring (the "needle") in a larger string of characters (the "haystack"). For instance,

say CountStr( 'aa', 'aabcdaaaaefgaa' ) should print out '4', since there are four non-overlapping instances of needle string 'aa' in the haystack string 'aabcdaaaaefgaa' (and not '5').

Similarly, ChangeStr changes those non-overlapping substrings to a target string:

say ChangeStr( 'zz', 'abzzcdzzefzzghzzijzzkl', 'yy' ) would return 'abyycdyyefyyghyyijyykl'.

A few questions on mailing lists and browsing Object REXX books turned up that these functions had been implemented as methods in Object REXX and some suggestions that they might be implemented with various REXX standard library functions such as Pos etc. This seemed like a good project to chew over some and see what I could come up with. The mention of the standard library function Pos seemed to miss the mark to me. It implied using numbers (and not particularly large ones at that), to keep track of location in the 'haystack' string. This was therefore going straight to probably the weakest point in REXX, speed of handling modestly sized numbers.

One of the intellectual highlights of my life was at the 2000 11th Annual International REXX Symposium, sitting beside REXX creator Mike Cowlishaw, hearing his comments while REXX L.A. president Chip Davis gave his presentation on the Parse command. My goal was simply to listen and absorb as much as possible. The most important point of Davis's talk was that Parse should be used as much as possible. A little bit of fumbling around turned up this possible implementation of CountStr:

CountStr: Procedure ; PARSE ARG needle, haystack =1 fpart (needle) lpart ; SELECT WHEN fpart == haystack THEN /* No needle found */ RETURN 0 ; WHEN '' == lpart THEN /* Needle at very end of Haystack */ RETURN 1 ; OTHERWISE RETURN 1 + CountStr( needle, lpart, switches ) ; /* All other cases */ END /* SELECT */ SAY 'Something is drasticly wrong in CountStr()' ; EXIT ; /* CountStr() */

This was good, since it didn't use any other function calls, with plenty usage of Parse, and the recursion (a function calling itself) gives a nice textbook exercise quality to it. Successful use of recursion always seems to produce a feeling of being levitated after using it, because of the mystique it has been imbued with, but it really shouldn't. (I even got inspired to re-read Douglas Hofstadter's "Goedel, Escher, Bach: An Eternal Golden Braid", which sparked the literary genre of 'Computer Aided Scholasticism.')

Recursion works on many problems because it goes straight to the heart of the engineering method of solving problems: simplifying things until you get a problem you can solve, and then step-by-step dealing with the reintroduction of complications 'til you return to the problem you originally were trying to solve. I remember reading about this in the introduction to a friends copy of Charles Proteus Steinmetz's historically important books on electrical engineering. Obviously then, some idea along these lines predates the conscious concept of modern 'computer science.'

The distinguishing point in programming is that recursion can be used to either simplify the problem data, as above (each recursive call to ChangeStr is guaranteed to have one fewer case of the needle string), or in the classic Quick Sort algorithm, simplifying the algorithm itself. (Rather than trying to do a complete, comprehensive sort of the data in one subroutine, can you sort it into two piles of data that can then be recursively sorted further until the desired result pops out?) All of this is almost a direct embodiment of the concept. Fortunately, the data structures usually dealt with in Computer Science are usually not continuous, or the act of a function calling itself in the hope of simplifying things would be tantamount to driving off a cliff. The number of points in a half-inch line seems to be the same as the number of points in a one-inch line, and trying to simplify dealing with each of the points in a line by dealing with a shorter segment of a line is, as far as we know, futile. You may see reference to 'partially recursive' functions, which go through the motion of calling themselves, but make no testable simplification of the problem with each call, and may not be guaranteed of solving the problem.

There is a concept which corrupts and upsets all others. I refer not to Evil, whose limited realm is that of ethics; I refer to the infinite. - Jorge Luis Borges, essay "The Avatars of the Tortoise."

While conceptually elegant, recursion can frequently be more efficiently done if the solution can be restated as an iterative loop of some kind. Besides getting rid of function call overhead, some more subtle advantages are that simple iteration avoids the drag of having to process any switches with every recursive call, or using a wrapper function to handle the switches, (still incurring another function call). Nonetheless, the fact that the problem could be done recursively provided some hope for a simple iterative solution with few function calls, and after some more fumbling around this came up:

CountStr: Procedure ; PARSE ARG needle, haystack ; /* this might fail in a pathological case: */ /* guard = D2C( Verify( XRange(), haystack || needle, 'NOMATCH' ) - 1 ) ; haystack = haystack || guard ; */ /* This case doesn't fail in the event of a haystack with all possible byte/characters as elements, and forces under controlled circumstances the problem case of needle being at the end of haystack, where it doesn't get counted. Now, that doesn't matter, since it is an extra needle that shouldn't be counted anyway: */ haystack = haystack || needle ; DO I = 0 UNTIL '' == haystack PARSE VAR haystack . (needle) haystack ; END I RETURN I ;

As the comments indicate, problems cropped up with the case of needle being exactly at the end of haystack, a last pass through the loop necessary for a correct count would be missed. Just putting any special string at the end of haystack to indicate to start special treatment because you are at the end of the haystack string opened up a can worms of possible problem cases.

A simple solution to this was to put a bogus extra case of needle on the end, which never gets counted as the DO loop comes to the problem case. In effect, this directly confronts the problem under control, and contrived circumstances, rather than waiting for it to come up with the random use of the function. One thing I think about this solution is that, to me anyway, it seems simpler than even the recursive version - it doesn't have the lines of code consumed with the SELECT statement. The DO loop compactly handles counting, all the logic of when to stop as well as the iteration for each needle string.

While going through all this, parallel development went on using the algorithms for ChangeStr and with adding all sorts of features not in the standard. Things got as far as submission to the editor for this web page, and was bounced back for correcting a confusing typo. Checking it over, something didn't seem quit right, and looking at the regression tests, a problem cropped up.

What happens if the haystack string ends, in say '.....ba', and the needle is 'aa'? Or generalizing a bit, needle is 'baba'? After I thought about this a bit, I realized that the needle string could be in a periodic string built up from smaller substrings. So with a trailing substring (I'll call it a 'cell'), the haystack ends in cell fragment of the needle, and a false finding of the needle would show up. The false needle straddles the true end of the haystack. I finaly recognized these repetative (perhaps even recursive?) strings as character sequences produced by the 'quinning' process discused in Hofstadter's book, named after American philosopher/logician/cryptologyst Willard van Orman Quine, who first brought attention to them. In this case there didn't seem to be any semantic (meaning) importance as in 'Godel, Escher, Bach:...', just an annoying problem in my attempt to count substrings. (This problem does have some vague resemblence to crystal dissolution/solidification processes, where the border between haystack/needle becomes indistinct.)

Somewhat discouraging to find a problem at this point, the flip side was that finding such a bizarre problem suggests that it was being analyzed carefully. After all, with prior knowledge of the input data under some circumstances, the simpler initial methods might work fine for some uses.

Attempts to solve this problem by some kind of strict comparison of the haystack to needle, trying to stop when the haystack is finally 'strictly' less than needles seemed hopeful at first, (DO I = 0 UNTIL haystack << needle ) but was found to be a false hope because the peculiarities of how strict comparison was carried out. It depends too much on the characters in the strings and not enough on the lengths of the strings. Another approach that worked solidly, but made too many function calls for my taste was:

.... Lneedle = Length( needle ) ; DO I = 0 UNTIL Length( haystack ) << lneedle ....

After a nights sleep on this, I decided to bite the bullet, and allow one function call but avoid having one on each DO loop iteration. This is done by testing if a needle was on the end of the haystack, and only then appending a needle to avoid the missed needle syndrome:

CountStr: Procedure ; PARSE ARG needle, haystack ; ltest = Length( needle ) ; PARSE VAR haystack . '' -(ltest) qtest ; /* '.' = dummy variable for PARSE, '' is interpretted by PARSE to be past the end of the string being parsed. -(ltest) backs-up 'ltest' characters from the end of haystack to put the last 'ltest' characters in the 'qtest' variable. */ IF qtest == needle THEN haystack = haystack || needle ; DO I = 0 UNTIL '' == haystack PARSE VAR haystack . (needle) haystack ; END I RETURN I ;

With this version, I threw out all previous functions for ChangeStr() etc., and rewrote them to simply count the number of needle strings in haystack, and use this information as the basis of decisions on the handling of input switch information and parse results.

A finally satisfactory version

After considerable further thought, the following version was dreamed up:

CountStr: Procedure ; PARSE UPPER ARG needle, haystack, . 'I' +0 ignorecase , . ; ignorecase = '' << ignorecase ; /* Booleanate the value */ usecase = \ ignorecase ; IF usecase THEN PARSE ARG needle, haystack, . ; IF '' == needle THEN RETURN 0 ; /* Parse interpets a null string to be past the end of whatever string is being parsed. */ DO total = 0 BY 1 UNTIL '' == haystack PARSE VAR haystack oldhaystack =1 deltahaystack (needle) haystack ; END total /* Using the information of what the haystack was, and what would be thrown away with each iteration, the variables oldhaystack and deltahaystack allow determination of whether a needle was exactly on the end of haystack or not, which adds no extra steps to the DO loop, only a more complex PARSE statement, and adds an IF/ELSE to the return. This also eliminates one function call. It seems to average out to slightly faster than checking for a needle on the end beforehand, and then calling the recursive version if there is. Also, in cleans things up by needing no recursive version as support. One criticism of PARSE is that it never tells directly if the data being parsed was split up or not, if the delimiter strings were found in the data. ('Programming in Rexx', Charles Daney, McGraw-Hill, Inc., (C) 1992, ISBN 0-07-015305-1, p. 99-100, 124, 162) This method demonstrates one method of checking the results to determine whether or not the targeted data was parsed, by saving all possible results and checking afterward. */ IF oldhaystack >> deltahaystack THEN RETURN total + 1 ; /* There *was* a needle on the end of the haystack. Which would of been uncounted by the DO loop part of this function */ RETURN total ; /* There was no needle on the end of the haystack. */ /* End of CountStr */

It brings out a few points that should be made.

While the Parse command doesn't provide any kind of return code to indicate what happened, you can always save the original string being parsed, and all the parts it is broken up into, to determine what happened. This will generally mean not throwing anything away in '.', the placeholder dummy variable.
By using '=1' to 'rewind' the parse process to the start of the string, to make another pass over it, making multiple assignments, you can save the original string without having to add another statement.
One thing reusable in converting some code from recursion to interation is the test for completion. It's got to be there, in some form, and sticking with it you keep close to the original algorithm. The recursion may hide the need for some variables (which are stored on the stack or whatever in the chain of function calls), but this is simply something to keep alert to.

Of course, besides making it work right and reasonably efficiently, the other goal was to add as many bells and whistles as possible! Typically these include:

Ignoring case
For ChangeStr:
Only doing a limited count of changes
Counting the number of changes from the end (Perl/Python style)
Only changing one instance of needle inside haystack
Making changes within a range on old-needle instances
etc.

These functions are obviously useful for writing filter programs and such, so to me seemed worthwhile spending time on. Especially when adding all the extra features. There are probably better ways to do a lot of this and I'm certain some people will think they should be written in C and compiled. That's fine. Meanwhile, here's a file of several prototypes for you to work from!

The Southern California OS/2 User Group
P.O. Box 26904
Santa Ana, CA 92799-6904, USA