Correct regex syntax for hex characters

leuce · March 18, 2013

G'day everyone

I'm trying to test whether strings contain only certain characters, which I specify in hexadecimal format. However, I have no idea how to write the regular expression, and all my tinkering produces the wrong results.

What I'm ultimately trying to accomplish is to test if a string contains only valid XML 1.0 characters.

The valid characters that I'm trying to evaluate are:

\x0009

\x000A

\x000D

\x0020-\xD7FF

\xE000-\xFFFD

\x10000-\x10FFFF

I want to specify them all as a single variable ($sValid), which I will then include in the regular expression, as follows:

If StringRegExp($aArray[$i], "\A[" & $sValid & "]*\Z") Then
    ; Then $aArray[$i] contains only valid characters
EndIf

The file that I read the input from is UTF8 (but does that matter?).

How would I have to write the variable $sValid to let the regular expression work?

Thanks

Samuel

czardas · March 18, 2013

I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF).

Edited March 18, 2013 by czardas

leuce · March 18, 2013

Quoting czardas: "I take it the last three sets include all values within the range (x0020-xD7FF ==> from x0020 to xD7FF)."

If I understand correctly, the last two sets are outside the range of the third last set. Or... what don't I understand? :-)

czardas · March 18, 2013

I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate? Edit: Perhaps a silly question.

Edited March 18, 2013 by czardas

Melba23 · March 18, 2013

leuce,

Can you please post an example string so that we can see the format.

M23

leuce · March 18, 2013

Quoting czardas: "I'm asking if that expression indicates one match or multiple matches within the hexadecimal range. What does the minus symbol indicate? Edit: Perhaps a silly question."

The minus is my attempt at indicating a range. Ah, now I understand your original question: yes, the minus means "to". In the same way as one might have [a-zA-Z] in a regular expression, where the minus means "to".

I need to find out if the string I'm evaluating contains any character that is not any of those valid characters.

leuce · March 18, 2013

Quoting Melba23: "Can you please post an example string so that we can see the format."

Sure, but I don't know if it will help. It is a TMX file, which is XML. The original file is actually UTF16LE, and the XML is 1.0. But I resaved it as UTF8 because I thought that that was required for AutoIt's regex.

Here is one string that will be evaluated:

<tu>
<tuv xml:lang="NL-NL">
<seg>&lt;cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"&gt;&lt;cf size="32"&gt;Van bachelor naar master&lt;/cf&gt;&lt;cf size="28"&gt; &lt;br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/&gt;</seg>
</tuv>
<tuv xml:lang="EN-GB">
<seg>&lt;cf font="Arial" symboltypeface="Arial" ansitypeface="Arial" color="0xFFFFCC"&gt;&lt;cf size="32"&gt;From Bachelor's to Master's&lt;/cf&gt;&lt;cf size="28"&gt; &lt;br indentation="0" leftmargin="0" alignment="ppAlignLeft" spacewithin="1" spacebefore="0.5" spaceafter="0" mask="0"/&gt;</seg>
</tuv>
</tu>

The file I'm trying to process is an XML file with invalid characters in it. I want to save as much of the file as possible while removing the invalid characters. Perhaps there is a freeware program somewhere on the internet that can do it too.

Added: Just in case the example is confusing, let me say that the hexadecimal colour codes in that text are not the characters that I'm trying to match. I'm trying to match individual characters. The above text contains only valid characters, but some of the strings may contain invalid ones.

Edited March 18, 2013 by leuce

AZJIO · March 18, 2013

leuce,

[a-zA-Z] ???

[a-fA-F] Yes?

[a-fA-F0-9]+

(?i)[0-9A-F]+

x10000-x10FFFF ???

x[0-9A-Fa-f]+ Yes?

Melba23 · March 18, 2013

leuce,

This appears to do what you want - but I am not totally confident as I have never used the multiple digit Hex pattern before. :wacko:

Note that I have reversed the logic - if you get a match then there is an unwanted character in the string; no match means that the string is good. That should allow you to use it as a RegExpReplace pattern to strip the unwanted characters if you decide to go that way:

(?i)(\x{000[12345678bcef]}|\x{001\d}|\x{d[89abcdef]\d\d}|\x{fff[ef]})

(?i)            - Case insensitive
( | | | | | )        - Look for any of these alternatives, which are:

\x{000[12345678bcef]}   - Any 000# character other than 0009, 000A, 000D
\x{001\d}        - Any 001# character
\x{d[89abcdef]\d\d}    - Any D8## to DF## character
\x{fff[ef]}        - FFFE and FFFF

Give it a try and see how you get on.

M23

leuce · March 18, 2013

Quoting Melba23: "Note that I have reversed the logic - if you get a match then there is an unwanted character in the string; no match means that the string is good."

I was actually thinking the same thing -- then I don't have to match the entire string, only one character in the string.

The curly brackets are what I was after -- I did not know exactly how to write the hex characters in a regular expression. I had thought that AutoIt would treat a hexadecimal character as a single unit in regular expressions, so I'm surprised to learn that something like "x{fff[ef]})" is possible.

Do you know if these regular expressions work on all files, or only on UTF8 files?

Thanks

Samuel

jchd · March 18, 2013

leuce,

FYI, whatever text encoding is used, once you load text into AutoIt strings the data is converted to a subset of UTF-16LE called UCS-2. It's (roughly) the restriction of Unicode to plane 0, i.e. codepoints that fit into a single 16-bit unit. Hence the extra range x10000-x10FFFF is irrelevant to AutoIt (and, honestly, I seriously doubt you would encounter such codepoints in real-world data). Also, surrogates will end up being remapped to an invalid character in the conversion that occurs when the file is read.

The following pattern will match any invalid or excluded codepoint:

[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]
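For anyone who wants to try the pattern right away, here is a minimal sketch of how it could be used both to detect and to strip the excluded characters. The sample string and the variable names are made up for illustration, and this has not been run against the 350 MB file discussed in this thread.

```autoit
; Sketch only: the pattern is the one given just above; the sample string
; and the variable names are assumptions made up for illustration.
Local $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
Local $sSample = "good" & Chr(0x1A) & "text" & Chr(0x0B)

; With the default flag, StringRegExp returns 1 on a match and 0 otherwise.
Local $bHasInvalid = StringRegExp($sSample, $sInvalid)

; Replace every flagged character with an empty string.
Local $sClean = StringRegExpReplace($sSample, $sInvalid, "")
Local $iRemoved = @extended ; number of replacements, read straight after the call

ConsoleWrite("Invalid found: " & $bHasInvalid & ", removed: " & $iRemoved & ", result: " & $sClean & @CRLF)
```

Because the class is negated, a single match is enough to know that the string is not clean, so there is no need to anchor the pattern to the whole string.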

leuce · March 18, 2013

Quoting AZJIO: "x10000-x10FFFF ???"

The reason for that notation is that I had thought that AutoIt would treat the hexadecimal character as a unit within regex... in other words that it would treat "x10000" as a single entity and "x10FFFF" as a single entity, and that something like [x10000-x10FFFF] would be the same as something like [a-z].

Melba23 · March 18, 2013

leuce,

Quoting leuce: I'm surprised to learn that something like "x{fff[ef]})" is possible

To be honest, so was I!

The curly brackets are described on the Help file page for StringRegExp.

You will need to ask someone like jchd about the encoding - well above my hobbyist level that.

M23

Edit: I see he has already responded and given you a much more compact pattern.

jchd · March 18, 2013

That's correct, but you need to use the hex syntax that PCRE understands: either \xhh (at most two hex digits) or \x{hhhh} (hex digits inside curly brackets, as many as the codepoint needs).
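To tie this back to the original question, here is a rough sketch of how $sValid could be written with that syntax and dropped into the whitelist test from the first post. The test data is made up, and the x10000-x10FFFF range is left out on the assumption (from the earlier post about UCS-2) that it is irrelevant to AutoIt strings.

```autoit
; Sketch: the whitelist test from the first post, with the ranges written
; using PCRE hex escapes. The array contents are made-up test data.
Local $sValid = "\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}"
Local $aArray[2] = ["plain text", "bad" & Chr(0x1A) & "text"]

For $i = 0 To UBound($aArray) - 1
    ; The string passes only if every character is inside the class.
    If StringRegExp($aArray[$i], "\A[" & $sValid & "]*\Z") Then
        ConsoleWrite("Item " & $i & ": only valid characters" & @CRLF)
    Else
        ConsoleWrite("Item " & $i & ": contains an invalid character" & @CRLF)
    EndIf
Next
```

This form has to scan the whole string before it can succeed, whereas the negated class can stop at the first bad character, which is one reason the reversed logic is attractive for large inputs.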

jchd · March 18, 2013

Sorry to add to myself.

Note that while "x{fff[ef]})" is indeed possible, either by itself or within an alternation (like "abc|def|x{fff[ef]}"), the syntax won't work inside a character class (inside square brackets). Also, alternation is much slower than a character class: alternation works on complete sub-expressions while character classes work on individual characters.

Edited March 18, 2013 by jchd

Melba23 · March 18, 2013

jchd,

Nice pattern - I did not realise that you could do this:

\x20-\x{D7FF}

Learning point for today.

M23

czardas · March 18, 2013

I have to agree, very informative jchd.

leuce · March 18, 2013

Thanks, jchd, for the expression.

Perhaps one of you can tell me what is wrong with my script, because I know for a fact that there is at least one x1A character in it (possibly more, or other non-valid characters), but the script doesn't catch it. I know that there is an x1A character because I've seen it (it took a while to track it down in my 350 MB XML file, but it is there).

My script is this:

#cs

TMX Fixer (per-segment)

1. Read input file (TMX), then split by </tu>.
2. For each array item:
2.1 If it contains an invalid character, write it to an error file (TXT, one file per error).
2.2 If it does not contain an invalid character, write it to the output file (TMX).

Note: splitting by </tu> means that the head and the first TU are both in array item 1, but we assume that there are no invalid characters in the head or in the first TU.

#ce

MsgBox (0, "TMX Fixer (per-segment)", "The per-segment version of TMX Fixer examines every TU (aka translation unit, aka segment) individually and saves only segments that contain no invalid characters to the output TMX file, and saves removed segments with invalid characters to separate error files.", 0)

$pathtotmxfile = FileOpenDialog ("Select TMX file", @ScriptDir, "TMX (*.tmx)|All files (*.*)")

$tmxfileopen = FileOpen ($pathtotmxfile, 32)
$tmxfileread = FileRead ($tmxfileopen)

MsgBox (0, "TMX file read", @extended & " characters were read.", 0)

$tmxfilearray = StringSplit ($tmxfileread, "</tu>", 1)

MsgBox (0, "TMX file split", "TMX was split into approximately " & $tmxfilearray[0] - 1 & " translation units.", 0)

$outputtmxfileopen = FileOpen ($pathtotmxfile & "_output.tmx", 34)

Global $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
; Global $sInvalid = "[\x00-\x08\x0B\x0C\x0E-\x1F\x{FFFE}\x{FFFF}]"

For $i = 1 to $tmxfilearray[0]

    If StringRegExp ($tmxfilearray[$i], $sInvalid) Then
        ; $tmxfilearray[$i] = StringRegExpReplace ($tmxfilearray[$i], $sInvalid, "!!!$0!!!") ; when the rest works ($0 = whole match, the pattern has no capture group)
        $roguefileopen = FileOpen ($pathtotmxfile & "_broken segment_" & $i, 34)
        FileWrite ($roguefileopen, $tmxfilearray[$i] & "</tu>")
        FileClose ($roguefileopen)
    Else
        FileWrite ($outputtmxfileopen, $tmxfilearray[$i] & "</tu>")
    EndIf

    If IsInt ($i/10000) Then
        ToolTip ("Currently at unit " & $i)
    EndIf

Next

; Close the input and output handles so the output is flushed to disk
FileClose ($outputtmxfileopen)
FileClose ($tmxfileopen)

Edited March 18, 2013 by leuce

czardas · March 18, 2013

Are you sure you are reading all of the file? 350MB is big and I'm not sure if StringSplit can handle this. Although I would expect some out of memory message.

leuce · March 18, 2013

Quoting czardas: "Are you sure you are reading all of the file? 350MB is big and I'm not sure if StringSplit can handle this. Although I would expect some out of memory message."

It reads it all (that's why the script tells the user how many characters are read, etc). The TMX file has about 650 000 </tu> tags in it, and the script reports that number to the user. The script reports about 185 million characters for both a UTF16LE file (350 MB) and a UTF8 file (180 MB), which sounds about right.

Running the whole script takes about a minute (I'm not sure how much the fact that I have 6 GB RAM on a quad core 64-bit computer has to do with it). Anyway, the first x1A character occurs at string number 115003, so the script should write an error file by then.

Could it be that the script is running too fast and therefore "misses" the match? That would be very odd...

Edited March 18, 2013 by leuce
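One way to narrow this down might be to check what actually arrives in the string after FileRead, separately from the regular expression. The sketch below is a diagnostic idea only: _ListInvalid is a hypothetical helper, not part of the script above, and it simply reports the code points of whatever the pattern flags in a given string.

```autoit
; Diagnostic sketch. _ListInvalid is a hypothetical helper name.
Func _ListInvalid($sText)
    Local $sInvalid = "[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}]"
    ; Flag 3 returns an array of all matches; @error is set when there are none.
    Local $aHits = StringRegExp($sText, $sInvalid, 3)
    If @error Then Return ""
    Local $sReport = ""
    For $i = 0 To UBound($aHits) - 1
        ; AscW gives the code point of the matched character.
        $sReport &= "U+" & Hex(AscW($aHits[$i]), 4) & " "
    Next
    Return StringStripWS($sReport, 2) ; 2 = strip trailing whitespace
EndFunc

; Made-up sample containing two control characters.
ConsoleWrite(_ListInvalid("ok" & Chr(0x1A) & "ok" & Chr(0x0B)) & @CRLF)

; In the real script, one could also test the raw data directly, for example:
; ConsoleWrite(StringInStr($tmxfileread, Chr(0x1A)) & @CRLF) ; 0 means not found
```

If StringInStr finds no x1A in $tmxfileread at all, the character is being lost or changed before the pattern ever sees it, which would point at the file read rather than at the regular expression.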
