Thanks JCHD. Your input is very welcome. I took my information from here: http://www.fileformat.info/info/unicode/category/index.htm I'll look at the differences you have posted. Some of the punctuation symbols may only be used in one language. I see there are still issues with some Unicode standards.


There are 103 discrepancies. Perhaps some symbols have been recategorized. I'm quite happy for this to be a greedy function. It's very good for stripping away code - leaving mainly text.

Looking at the code, I believe it has the same issue that is exposed in

That particular issue should not be affecting any of the functions above - although I have now renamed one of them. My function to strip unicode is actually converting text to ANSI. It's not quite the same thing. :whistle:

See related followup in the other thread: Chr() vs ChrW() explains the pseudo-issue.

It appears that some string functions are not compatible with Unicode: _StringReverse() fails if the string contains non-ANSI characters. Here's an alternative which works with Unicode, but not with surrogate pairs. That also needs addressing. I'll look at it later.

Func _StringReverseEx($sString)
    If Not (IsString($sString) Or StringLen($sString)) Then Return SetError(1, 0, $sString)
    Local $sNewString = ""
    ; Append one UTF-16 code unit at a time, starting from the end.
    For $i = StringLen($sString) To 1 Step -1
        $sNewString &= StringMid($sString, $i, 1)
    Next
    Return $sNewString
EndFunc   ;==>_StringReverseEx
_StringReverse() fails if the string contains non ANSI characters

Fixed in the beta version of course as you know.

I'm glad to hear that. I used to follow the changes by reading about them in the latest beta thread (which seems to have died a death). I haven't got enough machines to run all these different versions, and every time I follow advice about configuration, things end up in a different and confused state. The most important thing to me is to get some of my projects off the ground. It's a long enough haul as it is. I envisage several more releases of AutoIt before I finish one particular long-standing project (actually it's my first AutoIt project that is taking longer than 10 years). If I had a team working for me I could finish it faster than you could say Jack Robinson.

Here's a small sample of the code from that first project. Everything needs to be rewritten, and I intend to write some of it in trinary instead of binary. Trinary will run alongside binary to get the best of both worlds. This is just a tiny sample taken from my first attempts on this project - a snapshot taken from several thousand lines of unreadable code. It won't be any easier to read in trinary: just easier to cope with (or so I believe). These strings actually represent guitar chord shapes, but I'm not entirely happy with the Forsyth-Edwards-like notation I created, although it's not too bad as far as notation goes. The problem is handling the bulk of variants. A better system is needed.


Before I attempt to rewrite this, I must first write another language after AutoMathEdit script has been rewritten (that's two new languages - both will need documenting). Then there's an interpreter to create for the second language. Only then am I likely to find time to make decisions about storing complicated harmonic data. The AutoMathEdit (editor) program is now my main app for handling these complicated monsters, and even that is unfinished. I do not have as much free time as you might think. If all goes well, I will be able to resume work on this project in 2014.

The second language I am writing will look something like the following example. It will be case sensitive, space sensitive, and hard to learn, but possible - and that is the point - it will be possible to learn. The complication stems from being limited to ASCII. The details are not yet fully defined. This pseudo code relies heavily on inheritance - always looking back to previous commands to interpret consecutive new commands. It's the only way forward, unless someone radically turns the computing world on its head big time. That's not likely to happen to such an extent as would be required to simplify this any further.

|+Ab !G F|: !C D 1.. o >~"C oD | -'G' G ~'Gbb' *3:|

Actually the above string reads as garbage. It's readable garbage though (which is about as good as it gets under the heavy constraints). :) Something like this is certainly needed. Compare it to what is available (LilyPond) and you'll understand.

Lilypond is cool but totally unreadable. No way would you ever envisage the output judging by that input. Perhaps you could say that the code suffers from being in typical script format - reading (vertically) like Chinese. Computers can read it, and that's that. Lilypond is basically too difficult for a musician to read, never mind a programmer. Think about the order of words in that last sentence. If a programmer can't read it, that's a big problem. You will never find one who can, I'm pretty sure. Reading music is hard enough without spreading it around the page in various functions. <rant> :laser:

Sometimes I think I've bitten off far more than I can chew, although I certainly don't intend to abandon my goals. Ten years is nothing in reality. Learning a programming language is just a single rung on a much larger ladder. Sometimes you just have to take the bull by the horns, regardless of its size.

After I first posted this I thought I'd wait a while and then delete it. Instead I just keep thinking of new things to say. AutoIt has given me the opportunity to pursue my goals. I know many here focus on automating office apps and that kind of thing. I suppose if I had a paid job, I'd break the back of that MSO, but it just doesn't interest me.

I think computers should be used to benefit mankind, not to automatically print triplicate copies of the world's financial misdemeanors, or turn the world's youth into couch potato gamesters. All I can say is that I'm trying to use my computer as a computer. I desire to calculate everything to the last digit. I might not achieve what I set out to do, but it won't stop me trying.

Despite it being fixed in the beta, I decided to add changes to the above function anyway. It now reverses strings without corrupting surrogate pairs. Perhaps that is also in the beta, I don't know. It's a tricky question how to reverse such strings - whether to keep surrogates intact or whether to allow them to be corrupted. Perhaps the second choice is the more natural.

Surrogate pairings remain intact.

Func _ReverseWithSurrogates($sString)
    If Not (IsString($sString) And StringLen($sString)) Then Return SetError(1, 0, "")
    Local $sNextChar, $sSurrogate, $sNewString = ""

    For $i = StringLen($sString) To 1 Step -1
        $sNextChar = StringMid($sString, $i, 1)
        Switch AscW($sNextChar)
            Case 0xDC00 To 0xDFFF ; low (trailing) surrogate
                If $i > 1 Then
                    $sSurrogate = StringMid($sString, $i - 1, 1)
                    Switch AscW($sSurrogate)
                        Case 0xD800 To 0xDBFF ; high (leading) surrogate: keep the pair in order
                            $sNextChar = $sSurrogate & $sNextChar
                            $i -= 1
                    EndSwitch
                EndIf
        EndSwitch
        $sNewString &= $sNextChar
    Next
    Return $sNewString
EndFunc   ;==>_ReverseWithSurrogates

Function has been tested and is working. It probably can be improved.

I thought you were opposed to this approach due to an increase in execution?

Looks good to me, but I was just thinking of our previous discussion about _StringReverse.

Hmm, I don't remember the discussion. Necessity for speed depends on context. General functions should really be fast: they'll get used in loops quite often. It's unlikely you will need to worry too much about surrogates unless you intend to display them as characters (probably). You need to test for their existence if you want to reverse such strings, unless you purposefully choose to ignore their possible presence. They will become corrupted in this case and no longer display correctly. If speed is an issue then I would probably substitute the surrogates for single characters first, and replace them after processing.

I also remember this discussion about speed. But speed is nothing when correctness is compromised.

Reversing a Unicode string is much harder than reversing codepoints, even doing that correctly with surrogates.

One must reverse grapheme clusters and that is way harder. A very common example of a grapheme cluster is @CRLF (albeit not exactly a grapheme in the linguistic sense): reversing it shouldn't give LF followed by CR.
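
To make the @CRLF point concrete, here is a minimal sketch, not a full grapheme-cluster implementation. The function name _ReverseKeepPairs is hypothetical (it is not part of AutoIt or of any UDF in this thread). It reverses by UTF-16 code unit but treats a CR+LF pair and a high+low surrogate pair as one unit each. Combining marks and other real grapheme clusters are still not handled.

```autoit
; Hypothetical helper, for illustration only.
; Reverses a string while keeping CR+LF and surrogate pairs in their original order.
Func _ReverseKeepPairs($sString)
    Local $sOut = "", $sUnit, $iCode, $iPrev
    Local $i = StringLen($sString)
    While $i >= 1
        $sUnit = StringMid($sString, $i, 1)
        If $i > 1 Then
            $iCode = AscW($sUnit)
            $iPrev = AscW(StringMid($sString, $i - 1, 1))
            ; LF preceded by CR, or low surrogate preceded by high surrogate
            If ($iCode = 10 And $iPrev = 13) Or _
                    ($iCode >= 0xDC00 And $iCode <= 0xDFFF And $iPrev >= 0xD800 And $iPrev <= 0xDBFF) Then
                $sUnit = StringMid($sString, $i - 1, 2) ; take both code units together
                $i -= 1
            EndIf
        EndIf
        $sOut &= $sUnit
        $i -= 1
    WEnd
    Return $sOut
EndFunc   ;==>_ReverseKeepPairs
```

With this sketch, reversing "ab" & @CRLF & "cd" should give "dc" & @CRLF & "ba", with the line break still CR followed by LF.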

EDIT: I've already exposed the issue


This is the discussion I was talking about >>

Ah, that was not long ago at all. I remember it being a good discussion indeed, and also my speed tests. I know a little more about the subject now thanks to those who have helped me. I still need speed for music playback. It is the highest priority. An out of tune note is better than one played out of rhythm. Anyhow, those types of strings will not contain Unicode.

Any discussion with the three/four of us (jchd, you, me + BrewManNH) always ends as a huge learning curve. I've noticed you and BrewManNH gang up on me sometimes.


Would this work for Unicode in a general sense?

#include <Constants.au3>
#include <array.au3>
MsgBox($MB_SYSTEMMODAL, '', stringReverse("Klüft skräms inför på fédéral électoral große"))

Func stringReverse($input)
    If $input = "" Then Return SetError(1, 0, 0)
    $input = StringSplit($input, "", 2)
    Local $output[UBound($input)]
    $inputIndex = UBound($input) - 1
    For $outputIndex = 0 To $inputIndex
        If ($input[$inputIndex] >= 0xDC00 And $input[$inputIndex] <= 0xDFFF And _
                $inputIndex > 0 And $input[$inputIndex - 1] >= 0xD800 And $input[$inputIndex - 1] <= 0xDBFF) Then
            ; preserve the order of the surrogate pair code units
            $output[$outputIndex + 1] = $input[$inputIndex]
            $output[$outputIndex] = $input[$inputIndex - 1]
            $outputIndex += 1
            $inputIndex -= 2
        Else
            $output[$outputIndex] = $input[$inputIndex]
            $inputIndex -= 1
        EndIf
    Next
    Return _ArrayToString($output, "")
EndFunc   ;==>stringReverse

I modified the code found on this page to work in AutoIt. I THINK I got it converted correctly.


In theory it should work for surrogates, but it won't work for grapheme clusters, which JCHD has brought to everyone's attention. However, your code is not working because it contains errors. I notice I also ought to change my code to test for high and low order surrogates. Having said this, it seems they should not appear out of sequence, although sometimes they might. I need to read more. The grapheme clusters are interesting - that's something I need to look at too.


What errors are you seeing?


#include <Constants.au3>
#include <array.au3>
MsgBox($MB_SYSTEMMODAL, '', stringReverse("Klüft skräms inför på fédéral électoral große"))

Func stringReverse($input)
    If $input = "" Then Return SetError(1, 0, 0)
    $input = StringSplit($input, "", 2)
    Local $output[UBound($input)]
    $inputIndex = UBound($input) - 1
    For $outputIndex = 0 To $inputIndex
        ; AscW() turns each one-character string into its code unit so the range tests compare numbers
        If (AscW($input[$inputIndex]) >= 0xDC00 And AscW($input[$inputIndex]) <= 0xDFFF And _
                $inputIndex > 0 And AscW($input[$inputIndex - 1]) >= 0xD800 And AscW($input[$inputIndex - 1]) <= 0xDBFF) Then
            ; preserve the order of the surrogate pair code units
            $output[$outputIndex + 1] = $input[$inputIndex]
            $output[$outputIndex] = $input[$inputIndex - 1]
            $outputIndex += 1
            $inputIndex -= 2
        Else
            $output[$outputIndex] = $input[$inputIndex]
            $inputIndex -= 1
        EndIf
    Next
    Return _ArrayToString($output, "")
EndFunc   ;==>stringReverse

Am I missing something? Isn't that the same script I posted except with the extra AscW function? Where's the error in the first post, because the output is identical.

