Thursday, January 15, 2015

Experimenting with string manipulation

Division using my MathLibrary.cmd script has always been slow. Really, really slow:
/01 OK 11:21:39.08 11:21:37.99 0:0:1.9
/02 OK 11:21:40.89 11:21:39.09 0:0:1.80
/03 OK 11:21:43.92 11:21:40.90 0:0:3.2
/04 OK 11:21:48.57 11:21:43.93 0:0:4.64
/05 OK 11:21:54.78 11:21:48.58 0:0:6.20
/06 OK 11:22:12.96 11:21:54.79 0:0:18.17
/07 OK 11:22:53.36 11:22:12.98 0:0:40.38
/08 OK 11:28:33.39 11:22:53.37 0:5:40.2
/09 OK 11:50:13.04 11:28:33.40 0:21:39.64
/10 OK 12:12:14.42 11:50:13.05 0:22:1.37
The bolded numbers are how long each unit test (/01, /02, etc.) took to complete in HH:MM:SS.mS format. Each test is a series of identical numbers (meaning the answer should always be "1") in increasing size - test /01, for example, is 2/2, while test /10 - which took over 22 minutes - is dividing the same 101-digit number by itself.

This is a problem.

It's a problem because, at the end of the day, the goal of this script is to solve the third Project Euler problem - find the largest prime factor of a large enough number for CMD to choke on it - and in order to do that, I need to be able to calculate the square root of that number so I have an upper bound of prime factors to consider. In order to determine a square root by scratch, however, the most straightforward method that I can wrap my mind around uses iterative floating point division to get closer and closer to the square root until it's found. Obviously, an algorithm that takes over 20 minutes to divide two numbers isn't going to work if I have to divide two numbers over and over again.

While running my unit tests, however, I discovered something curious:
 Testing subtraction operations...
-01 OK 10:55:41.34 10:55:41.16 0:0:0.18
-02 OK 10:55:41.82 10:55:41.35 0:0:0.47
-03 OK 10:55:43.21 10:55:41.82 0:0:1.39
-04 OK 10:55:46.00 10:55:43.22 0:0:2.78
-05 OK 10:55:59.50 10:55:46.02 0:0:13.48
-06 OK 10:55:59.61 10:55:59.52 0:0:0.9
-07 OK 10:55:59.88 10:55:59.63 0:0:0.25
-08 OK 10:56:00.93 10:55:59.88 0:0:1.5
-09 OK 10:56:02.84 10:56:00.94 0:0:1.90
-10 OK 10:56:05.72 10:56:02.84 0:0:2.88
-11 OK 10:56:05.82 10:56:05.72 0:0:0.10
-12 OK 10:56:06.07 10:56:05.82 0:0:0.25
-13 OK 10:56:07.13 10:56:06.09 0:0:1.4
-14 OK 10:56:09.04 10:56:07.13 0:0:1.91
-15 OK 10:56:11.93 10:56:09.04 0:0:2.89
-16 NO 10:56:12.05 10:56:11.93 0:0:0.12
-17 NO 10:56:12.24 10:56:12.05 0:0:0.19
-18 NO 10:56:13.56 10:56:12.25 0:0:1.31
-19 NO 10:56:16.14 10:56:13.58 0:0:2.56
-20 NO 10:56:27.52 10:56:16.16 0:0:11.36
Note the "NO" results on the last five tests - these are tests that subtract two identical floating point numbers, which should return 0. Well, it turns out it wasn't quite doing that...

C:\Users\ \Documents\GitHub\MathLibrary.cmd>MathLibrary.cmd 0.2 - 0.2
0.0

C:\Users\ \Documents\GitHub\MathLibrary.cmd>MathLibrary.cmd 14.320 - 14.320
0.000


Notice that the results - bolded for readability - aren't actually "0" - they're 0.however many significant digits were on the right of the number. This led me to put together a quick "strip trailing zeroes" routine, which is near-identical to the one used by the division subroutine:

    :ExtSubtractStripTrailingZeroes
        CALL :ExtDim %_SubtractResult% _SubtractResLen _SubtractResDec
        IF %_SubtractResDec% GTR %_SubtractResLen% GOTO ExtSubtractDoneStrippingTrailingZeroes
        SET _SubtractRInt=%_SubtractResult:~-1%
        IF %_SubtractRInt%==. (
            SET _SubtractResult=%_SubtractResult:~0,-1%
            GOTO ExtSubtractDoneStrippingTrailingZeroes
        )
        IF NOT %_SubtractRInt%==0 GOTO ExtSubtractDoneStrippingTrailingZeroes
        SET _SubtractResult=%_SubtractResult:~0,-1%
        GOTO ExtSubtractStripTrailingZeroes
    :ExtSubtractDoneStrippingTrailingZeroes


Then I re-ran my tests:
-16 OK 13:45:15.31 13:45:15.15 0:0:0.16
-17 OK 13:45:15.57 13:45:15.31 0:0:0.26
-18 OK 13:45:18.60 13:45:15.59 0:0:3.1
-19 OK 13:45:25.90 13:45:18.61 0:0:7.29
-20 OK 13:51:18.70 13:45:25.90 0:5:52.80
The good news is they all passed. The bad news is they started taking longer - much longer in the case of the last test, which went from taking an already slow 11 seconds to taking close to 6 minutes. Something needed to be done. So, instead of looping via GOTO and truncating the string one digit at a time, I decided I'd try finding the position of the first trailing zero and then cut all of them at once:

    :ExtSubtractStripTrailingZeroes
        CALL :ExtDim %_SubtractResult% _SubtractResLen _SubtractResDec
        IF %_SubtractResDec% GTR %_SubtractResLen% GOTO ExtSubtractDoneStrippingTrailingZeroes
        SET /A _SubStripMax=_SubtractResLen-_SubtractResDec+1
        SET _SubStripOff=0
        FOR /L %%G IN (1,1,%_SubStripMax%) DO (
            SET _SubStripPos=%%G
            CALL SET _SubZChk=%%_SubtractResult:~-!_SubStripPos!,%_One%%%
            IF NOT !_SubZChk!==0 (
                GOTO ExtSubtractFoundTrailingZeroes
            ) ELSE SET _SubStripOff=%%G
        )
    :ExtSubtractFoundTrailingZeroes
        IF %_SubZChk%==. SET /A _SubStripOff+=1
        CALL SET _SubtractResult=%%_SubtractResult:~%_Zero%,-%_SubStripOff%%%
    :ExtSubtractDoneStrippingTrailingZeroes

The result?
-16 OK 15:12:51.19 15:12:51.05 0:0:0.14
-17 OK 15:12:51.41 15:12:51.19 0:0:0.22
-18 OK 15:12:52.83 15:12:51.42 0:0:1.41
-19 OK 15:12:55.59 15:12:52.84 0:0:2.75
-20 OK 15:13:08.87 15:12:55.61 0:0:13.26
Much better.  Granted, it's a little slower than it was when I wasn't cutting the zeroes out at all, but this was at least manageable. Then I took a second look at the first line of code from the original algorithm:

    :ExtSubtractStripTrailingZeroes
        CALL :ExtDim %_SubtractResult% _SubtractResLen _SubtractResDec

ExtDim iterates through each character of the string, manually counting its length and decimal position along the way. Realistically, we only need to run that once, not once per character. Moving things around a bit and reverting to the original algorithms gave me this:
-16 OK 16:55:20.62 16:55:20.47 0:0:0.15
-17 OK 16:55:20.85 16:55:20.62 0:0:0.23
-18 OK 16:55:22.49 16:55:20.87 0:0:1.62
-19 OK 16:55:25.67 16:55:22.51 0:0:3.16
-20 OK 16:55:41.43 16:55:25.68 0:0:15.75

So, even with my original algorithm fixed, the For loop method still remained faster, though nowhere near as much faster as I was hoping since the division zero stripping algorithm isn't as broken as my original subtraction zero stripping algorithm.

I have some other ideas to move division along, though. Stay tuned...

No comments:

Post a Comment