db48x/doc/6-Performance.md

# Performance measurements

This section tracks some performance measurements across releases.

## NQueens (DM42)

Performance recorded for various releases on the DM42, built with the `small`
option (the only one that fits all releases). Every row uses the same `NQueens`
benchmark. All times are in milliseconds, best of 5 runs, on USB power, and
presumably with no garbage collection (GC) during the run.
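As a host-side illustration of what "best of 5 runs" means, the following Python sketch times a stand-in workload five times and keeps the minimum. The `best_of` helper and the `workload` function are hypothetical names introduced for this sketch; the figures recorded in this section were measured on the calculator itself, not with this script.

```python
import time

def best_of(func, runs=5):
    """Return the shortest wall-clock time of `runs` calls to func, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append((time.perf_counter() - start) * 1000.0)
    return min(timings)

def workload():
    # Hypothetical stand-in for a benchmark program such as NQueens
    return sum(i * i for i in range(10_000))

if __name__ == "__main__":
    print(f"best of 5: {best_of(workload):.3f} ms")
```

Keeping the minimum rather than the mean tends to discard runs that were slowed down by incidental events such as a garbage collection cycle.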


| Version | Time    | PGM Size  | QSPI Size | Note                    |
|---------|---------|-----------|-----------|-------------------------|
| 0.5.2   | 1310    | 711228    | 1548076   |                         |
| 0.5.1   |         |           |           |                         |
| 0.4.10+ | 1205    | 651108    |           | RPL stack runloop       |
| 0.4.10  | 1070    | 650116    |           | Focused optimizations   |
| 0.4.9+  | 1175    |           |           | Range-based type checks |
| 0.4.9+  | 1215    |           |           | Remove busy animation   |
| 0.4.9   | 1447    | 646028    | 1531868   | No LastArgs in progs    |
| 0.4.8   | 1401    | 633932    | 1531868   |                         |
| 0.4.7   | 1397    | 628188    | 1531868   |                         |
| 0.4.6   | 1380    | 629564    | 1531868   |                         |
| 0.4.5   | 1383    | 624572    | 1531868   |                         |
| 0.4.4   | 1377    | 624656    | 1531868   | Implements Undo/LastArg |
| 0.4.3S  | 1278    | 617300    | 1523164   | 0.4.3 build "small"     |
| 0.4.3   | 1049    | 717964    | 1524812   | Switch to -Os           |
| 0.4.2   | 1022    | 708756    | 1524284   |                         |
| 0.4.1   | 1024    | 687444    | 1522788   |                         |
| 0.4     |  998    | 656516    | 1521748   | Feature tests 7541edf   |
| 0.3.1   |  746    | 618884    | 1517620   | Faster busy 3f3ab4b     |
| 0.3     |  640    | 610820    | 1516900   | Busy anim 4ab3c97       |
| 0.2.4   |  522    | 597372    | 1514292   |                         |
| 0.2.3   |  526    | 594724    | 1514276   | Switching to -O2        |
| 0.2.2   |  723    | 540292    | 1512980   |                         |


## NQueens (DM32)

Performance recording for various releases on DM32 with `fast` build option.
This is for the same `NQueens` benchmark, all times in milliseconds,
best of 5 runs. There is no GC column, because it's harder to trigger given how
much more memory the calculator has. Also, experimentally, the numbers for the
USB and battery measurements are almost identical at the moment. As I understand
it, there are plans for a USB overclock like on the DM42, but at the moment it
is not there.


| Version | Time    | PGM Size  | QSPI Size | Note                    |
|---------|---------|-----------|-----------|-------------------------|
| 0.5.2   | 1752    |           |           |                         |
| 0.5.1   | 1746    |           |           |                         |
| 0.5.0   | 1723    |           |           |                         |
| 0.4.10+ | 1804    | 761252    |           | RPL stack runloop       |
| 0.4.10  | 1803    | 731052    |           | Focused optimizations   |
| 0.4.9   | 2156    | 772732    | 1534316   | No LastArg in progs     |
| 0.4.8   | 2201    | 749892    | 1534316   |                         |
| 0.4.7   | 2209    | 742868    | 1534316   |                         |
| 0.4.6   | 2204    | 743492    | 1534316   |                         |
| 0.4.5   | 2171    | 730092    | 1534316   |                         |
| 0.4.4   | 2170    | 730076    | 1534316   | Implements Undo/LastArg |
| 0.4.3   | 2081    | 718020    | 1527092   |                         |
| 0.4.2   | 2242    | 708756    | 1524284   |                         |
| 0.4.1   | 2152    | 687500    | 1522788   |                         |
| 0.4     |         |           |           | Feature tests 7541edf   |
| 0.3.1   |         |           |           |                         |
| 0.3     |         |           |           |                         |
| 0.2.4   |         |           |           |                         |
| 0.2.3   |         |           |           |                         |


## Collatz conjecture check

This test checks the tail recursion optimization in the RPL interpreter.
The code can be found in the `CBench` program in the `Demo.48S` state.
The HP48 cannot run the benchmark because it does not have integer arithmetic.

Timings on 0.4.10 are:

* HP50G: 397.438s
* DM32: 28.507s (14x faster)
* DM42: 15.769s (25x faster)
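
The "x faster" figures above are the ratio of the HP50G time to each DB48X time, rounded to the nearest integer. A minimal Python check, using only the three timings listed (the `speedup` helper is a name made up for this sketch):

```python
def speedup(reference_s, measured_s):
    """How many times faster `measured_s` is than `reference_s`."""
    return reference_s / measured_s

HP50G_S = 397.438  # seconds, from the list above
TIMINGS_S = {"DM32": 28.507, "DM42": 15.769}

for name, seconds in TIMINGS_S.items():
    print(f"{name}: {speedup(HP50G_S, seconds):.1f}x faster")
```

The unrounded ratios are about 13.9 for the DM32 and 25.2 for the DM42, which round to the 14x and 25x quoted.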

| Version | DM32 ms | DM42 ms |
|---------|---------|---------|
| 0.5.2   | 26733   |  15695  |
| 0.4.10  | 28507   |  15769  |


## SumTest (decimal performance)

| Version | DM32 ms | DM42 ms |
|---------|---------|---------|
| 0.5.2   | 215421  |  143412 |

## Drawing `sin X` with `FunctionPlot`

* DM32 Intel Decimal: 2332 - 5140
* DM32 variable precision (6): 2423 -
* DM32 variable precision (24): 3863 - 6005
* DM32 variable precision (36): 6567 - 10186
* DM32 variable precision (48): 8377 - 10259

Crash at precision 3