Mun Posted January 30, 2013 Share Posted January 30, 2013 (edited) I need a bit of help parsing few large excel files about 8mb each file filled with these variation.I want to capture what in bold only. These are all narrow down variation. Most of the time wording are misspelled or abbreviated.^(?:.*?)(\d+|\$\d+|\d+.\d+|\$\d+.\d+|\$\s\d+.\d+|\d+)invoice 928.00 paid 880.00 pricing.Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricinginv 1681.00 pd 1575.00 no payInvoice $80.00 Paid $79.50 paid per g(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricinginv 220.89 pd 212.10 paid per pricing less.Invoiced Amt $76, Paid $64.48 - paid as per flat feeInvice64.00 Paid 63.50 Paid per admin pricingInvoiced: $32.00 Paid: $30.00Inv. $136 Pd. $126 per flat rate of $50 forCan anyone guide me down or point me to the right direction? I am willing to learn.(Sorry didn't mean to hijack other thread.) Edited January 30, 2013 by Mun Link to comment Share on other sites More sharing options...
Robjong Posted January 30, 2013 Share Posted January 30, 2013 (edited) As ugly as the sample text, but it should work (for the sample at least): $s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF $s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF $s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF $s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF $s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF $s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF $s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF $s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF $s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF $s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF ;~ $a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\K(?:\$\h*)?\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?(\$?\h*\d+(?:[,.]\d{1,2})?))', 4) ; strict as per the sample text $a = StringRegExp($s, '(?im)^.*\bin[voiced]{0,6}(?:\h+[a-z]+\b)?\W*?\$?\h*\K\d+(?:[,.]\d{1,2})?(?=.*\bp[aid]{1,3}\W*?\$?\h*(\d+(?:[,.]\d{1,2})?))', 4) ; suggested pattern, does not capture $ For $i = 0 To UBound($a) - 1 Step 1 $b = $a[$i] ConsoleWrite(StringFormat('Invoiced: %.2f Paid: %.2f\n', $b[0], $b[1])) Next To understand this pattern we break it down. expandcollapse popup(?im) Options: i = case-insensetive matching, m = ^ and $ also match the start of a line and the end of a line respectively. ^.* ^ matches the start of the line. .* matches any character except line feed 0 or more times. This anchors our match to a line. \b \b matches a word boundary, which is a point between a non-word and word character, this helps to avoid 'in' being part of another word. in[voiced]{0,6} 'in' matches the start of the invoice-like word we want to precede the amount. [voiced]{0,6} matches v, o, i, c, e or d, 0 to 6 times. (?: ( opens a group, ?: makes it non-capturing. This group allows one word after 'invoiced'. \h+[a-z]+ \h+ matches 1 or more horizontal space. [a-z]+ matches any letter, 1 or more times. \b \b matches a word boundary, this makes sure that if this word is present one of the following optional parts (\W*?, \$? or \h*) matches. )? ) closes the group. ? makes it optional, match 0 or 1 times. \W*?\$?\h* \W* matches 0 or more non-word characters, ? makes the match lazy, as small as possible. \$? optionally matches a $ dollar sign, \ escapes the dollar sign, $ is the dollar sign, and again, the ? makes it optional. This allows the for spaces, dots and colons seen in the sample text, as well as the dollar sign and space. \K This is an interesting one, and it's not in the AutoIt help file (yet). It resets our global match, everything matched so far in the global match is discarded. Because we only match the first amount in the global match from here on, only the first amount will be in the global match (b[0]). We capture the second amount in a lookahead, which does not end up in the global match. In short \K basically turns the part of the pattern before it into a less restricted lookbehind. \d+(?:[,.]\d{1,2})? \d+ matches 1 or more digits. (?: open non-capturing group. [,.] matches comma or dot. \d{1,2} matches 1 or 2 digits. ) closes the group. ? makes it optional. This matches the first amount, e.g. 1, 23, 45.67 or 78,90. (?= ( opens a group, ?= makes it a positive lookahead, meaning the subpattern must match the subject ahead of this point. .*\b .* matches any character except line feed 0 or more times. \b matches a word boundary to help avoid p (see below) being part of another word. p[aid]{1,3} p[aid]{1,3} matches a, i or d 1 2 or 3 times. \W*?\$?\h* Covered this already. ( ( Opens a capturing group, which will capture the second amount (b[1]). \d+(?:[,.]\d{1,2})? Covered this already. ) Close capturing group. ) Close lookahead. Edited January 30, 2013 by Robjong Mun 1 Link to comment Share on other sites More sharing options...
Mun Posted January 30, 2013 Author Share Posted January 30, 2013 Ok Robjong. How did you come up with that pattern? I AM IMPRESSED!!! Did you use any regex builder? That's some intertwine algorithm to extract info. Link to comment Share on other sites More sharing options...
PhoenixXL Posted January 30, 2013 Share Posted January 30, 2013 How did you come up with that pattern?wondering!! Very hard regex Yet another attempt $s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF $s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF $s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF $s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF $s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF $s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF $s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF $s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF $s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF $s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF $s = StringRegExpReplace($s, '(?im)^[^i]+', '');Make the start from "I" $a = StringRegExp($s, '(?m)^(?:[[:alpha:].,: $]+)([\d.]+)(?:[[:alpha:].,: $]+)([\d.]+)', 4) If @error Then Exit @extended For $i = 0 To UBound($a) - 1 Step 1 $b = $a[$i] ConsoleWrite(StringFormat('Invoiced: %.2f Paid: %.2f\n', $b[1], $b[2])) Next Mun 1 My code: PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners. MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression. Link to comment Share on other sites More sharing options...
Mun Posted January 30, 2013 Author Share Posted January 30, 2013 I'm reading up and trying to learn from both of your expression. Both seems to use look ahead anchor starting point and group those number. What does this expression does? (?im) [:alpha:] Look ahead for alpha? $]+)([d.]+) Group digit but $ not include since it's being omit. Am I on right path? Lastly I tried to look at the array "$a" through _Arraydisplay, the array seem to be empty? $b = $a[$i] How did $a[] pull out information? *head scratched* Link to comment Share on other sites More sharing options...
Robjong Posted January 30, 2013 Share Posted January 30, 2013 (edited) It's a pretty basic pattern, and I just wrote it, no special tools. I will explain the pattern later today when I have more time. [ opens a character class [:alpha:] match uppercase and lowercase letters (POSIX character class) .,: $ matches dot, comma, semicolon, space or dollar sign. ] close the character class ( open capturing group [\d.]+ matches digits or dot, 1 or more times ) close capturing group Look in the help file for StringRegExp, see flag 4 for return values, it returns an array with arrays, not a 2D array. Edited January 30, 2013 by Robjong Mun 1 Link to comment Share on other sites More sharing options...
PhoenixXL Posted January 30, 2013 Share Posted January 30, 2013 (edited) Edit: Robjong explained it, (?m) makes ^ to match the SOF as well as the the point after a Vertical Space, and $ to match the EOF as well as the point before the Vertical Space(?i) makes case insensitiveboth can be clubbed as (?im) or (?mi) Edited January 30, 2013 by PhoenixXL My code: PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners. MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression. Link to comment Share on other sites More sharing options...
guinness Posted January 30, 2013 Share Posted January 30, 2013 Robjong is back to claim their title. -_0 UDF List: _AdapterConnections() • _AlwaysRun() • _AppMon() • _AppMonEx() • _ArrayFilter/_ArrayReduce • _BinaryBin() • _CheckMsgBox() • _CmdLineRaw() • _ContextMenu() • _ConvertLHWebColor()/_ConvertSHWebColor() • _DesktopDimensions() • _DisplayPassword() • _DotNet_Load()/_DotNet_Unload() • _Fibonacci() • _FileCompare() • _FileCompareContents() • _FileNameByHandle() • _FilePrefix/SRE() • _FindInFile() • _GetBackgroundColor()/_SetBackgroundColor() • _GetConrolID() • _GetCtrlClass() • _GetDirectoryFormat() • _GetDriveMediaType() • _GetFilename()/_GetFilenameExt() • _GetHardwareID() • _GetIP() • _GetIP_Country() • _GetOSLanguage() • _GetSavedSource() • _GetStringSize() • _GetSystemPaths() • _GetURLImage() • _GIFImage() • _GoogleWeather() • _GUICtrlCreateGroup() • _GUICtrlListBox_CreateArray() • _GUICtrlListView_CreateArray() • _GUICtrlListView_SaveCSV() • _GUICtrlListView_SaveHTML() • _GUICtrlListView_SaveTxt() • _GUICtrlListView_SaveXML() • _GUICtrlMenu_Recent() • _GUICtrlMenu_SetItemImage() • _GUICtrlTreeView_CreateArray() • _GUIDisable() • _GUIImageList_SetIconFromHandle() • _GUIRegisterMsg() • _GUISetIcon() • _Icon_Clear()/_Icon_Set() • _IdleTime() • _InetGet() • _InetGetGUI() • _InetGetProgress() • _IPDetails() • _IsFileOlder() • _IsGUID() • _IsHex() • _IsPalindrome() • _IsRegKey() • _IsStringRegExp() • _IsSystemDrive() • _IsUPX() • _IsValidType() • _IsWebColor() • _Language() • _Log() • _MicrosoftInternetConnectivity() • _MSDNDataType() • _PathFull/GetRelative/Split() • _PathSplitEx() • _PrintFromArray() • _ProgressSetMarquee() • _ReDim() • _RockPaperScissors()/_RockPaperScissorsLizardSpock() • _ScrollingCredits • _SelfDelete() • _SelfRename() • _SelfUpdate() • _SendTo() • _ShellAll() • _ShellFile() • _ShellFolder() • _SingletonHWID() • _SingletonPID() • _Startup() • _StringCompact() • _StringIsValid() • _StringRegExpMetaCharacters() • _StringReplaceWholeWord() • _StringStripChars() • _Temperature() • _TrialPeriod() • _UKToUSDate()/_USToUKDate() • _WinAPI_Create_CTL_CODE() • _WinAPI_CreateGUID() • _WMIDateStringToDate()/_DateToWMIDateString() • Au3 script parsing • AutoIt Search • AutoIt3 Portable • AutoIt3WrapperToPragma • AutoItWinGetTitle()/AutoItWinSetTitle() • Coding • DirToHTML5 • FileInstallr • FileReadLastChars() • GeoIP database • GUI - Only Close Button • GUI Examples • GUICtrlDeleteImage() • GUICtrlGetBkColor() • GUICtrlGetStyle() • GUIEvents • GUIGetBkColor() • Int_Parse() & Int_TryParse() • IsISBN() • LockFile() • Mapping CtrlIDs • OOP in AutoIt • ParseHeadersToSciTE() • PasswordValid • PasteBin • Posts Per Day • PreExpand • Protect Globals • Queue() • Resource Update • ResourcesEx • SciTE Jump • Settings INI • SHELLHOOK • Shunting-Yard • Signature Creator • Stack() • Stopwatch() • StringAddLF()/StringStripLF() • StringEOLToCRLF() • VSCROLL • WM_COPYDATA • More Examples... Updated: 22/04/2018 Link to comment Share on other sites More sharing options...
Mun Posted January 30, 2013 Author Share Posted January 30, 2013 After I asked about flag 4 returns, I went to help file and read up on it. That's smart to use flag 4 because you can keep the line index the same (and sub array for fetched info). Opened my mind* I always use flag 3 before. As for expression, I think I understood both expression now. I still think that is a cleverly impressive use of expression. Link to comment Share on other sites More sharing options...
Robjong Posted January 30, 2013 Share Posted January 30, 2013 (edited) ...use flag 4 because you can keep the line index the same (and sub array for fetched info)...Exactly.Added an explanation of the pattern to my as I said I would. Edited January 30, 2013 by Robjong Link to comment Share on other sites More sharing options...
Mun Posted January 31, 2013 Author Share Posted January 31, 2013 +1 Robjong for a detail explanation and thank you for your patience in teaching me. Your explanation is so in plain English and make it seem too easy - that make me like wtf why couldn't I do that or thought of that. >_< To be honest I learn a lot about regex within these 10 posts. I have documented for save keeping. Kudo to everyone here to spend your free time helping. Link to comment Share on other sites More sharing options...
Malkey Posted January 31, 2013 Share Posted January 31, 2013 For this recently deceased but still warm topic. Using Flag 3 and a tidy StringFormat layout. $s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF $s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF $s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF $s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF $s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF $s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF $s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF $s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF $s &= '"Invoiced: $32.00 Paid: $30.00"' & @CRLF $s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF ;Positive leading and trailing characters that must be present to capture the required, in-between number. ;Leading characters are "[$a-z\h]", and, trailing characters are "[\ha-z'"",]" ;$a = StringRegExp($s, "(?:.+?[$a-z\h])(\d+\.?\d*)(?:[\ha-z'"",].+?[$a-z\h])(\d+\.?\d*)(?:[\ha-z'"",]?.*)", 3) ;or ;Negative leading and trailing characters that must not be present to capture the required, in-between number. ;Leading and trailing characters are "[^():\-]". $a = StringRegExp($s, "(?:.+?[^():\-])(\d+\.?\d*)(?:[^():\-].+?[^():\-])(\d+\.?\d*)(?:[^():\-]?.*)", 3) ;Both regular expression pattern above capture the first two matching numbers in each line. A third matching number is ignored. ;The first matching number is the "Invoiced". The second matching number is the "Paid" value. For $i = 0 To UBound($a) - 1 Step 2 ConsoleWrite(StringFormat('Invoiced:%9s\tPaid: %8.2f\r\n', StringFormat("$%.2f", $a[$i]), $a[$i + 1])) Next Mun 1 Link to comment Share on other sites More sharing options...
PhoenixXL Posted January 31, 2013 Share Posted January 31, 2013 With the spirit of flag 3 a replace can further shorten the code;Original Code by Malkey $s = 'invoice 928.00 paid 880.00 pricing.' & @CRLF $s &= 'Invoice $ 35.20 Paid $ 31.12 Paid invoice per system pricing' & @CRLF $s &= 'inv 1681.00 pd 1575.00 no pay' & @CRLF $s &= 'Invoice $80.00 Paid $79.50 paid per g' & @CRLF $s &= '(2012-10-08:61516 ) Invoice $ 218.50 Paid $ 164.30 Paid invoice per system pricing' & @CRLF $s &= 'inv 220.89 pd 212.10 paid per pricing less.' & @CRLF $s &= 'Invoiced Amt $76, Paid $64.48 - paid as per flat fee' & @CRLF $s &= 'Invice64.00 Paid 63.50 Paid per admin pricing' & @CRLF $s &= 'Invoiced: $32.00 Paid: $30.00' & @CRLF $s &= 'Inv. $136 Pd. $126 per flat rate of $50 for' & @CRLF $s = StringRegExpReplace($s, '(?im)^[^i]+', '');Make the start from "I" $a = StringRegExp($s, '(?m)^(\D+)([\d.]+)(\D+)([\d.]+)', 3) For $i = 0 To UBound($a) - 1 Step 4 ConsoleWrite(StringFormat('Invoiced: %.2f Paid: %.2f\n', $a[$i + 1], $a[$i + 3])) Next Mun 1 My code: PredictText: Predict Text of an Edit Control Like Scite. Remote Gmail: Execute your Scripts through Gmail. StringRegExp:Share and learn RegExp.Run As System: A command line wrapper around PSEXEC.exe to execute your apps scripts as System (LSA). Database: An easier approach for _SQ_LITE beginners. MathsEx: A UDF for Fractions and LCM, GCF/HCF. FloatingText: An UDF for make your text floating. Clipboard Extendor: A clipboard monitoring tool. Custom ScrollBar: Scroll Bar made with GDI+, user can use bitmaps instead. RestrictEdit_SRE: Restrict text in an Edit Control through a Regular Expression. Link to comment Share on other sites More sharing options...
Mun Posted January 31, 2013 Author Share Posted January 31, 2013 Thank you Malkey/PhoenixXL for the alternate pattern. I think in this scenario I would rather use flag 4 to keep the index correct and I can look up reference invoices to match the pricing if I need to without wondering where the paid fall in that array. This was a great lesson for me with the alternate flag 3 default and flag 4 and I have learn to appreciate the use of it. You guys are tremendous teacher/helper. Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now