RegEx Extra Capturing Groups


I've built this regular expression, which works great on https://regex101.com/, but when I run it against the same data in AutoIt, I get 4 extra empty capturing groups before each set of data. Can someone explain why this happens and, if possible, how to fix it?

(?i)(?m)
(?(DEFINE)

(?<NS>(?:[^ \n]+))
(?<PaymentType>(?:Pharmacy|Hospital Costs|Physical Therapy costs|Medical Payment|Physician Payment|Medical Supplies, DME|Bill Review|Network Access Fee|Chiropractic Expenses))
(?<Money>\$[\d,]*\.\d*)
(?<Date>\d{1,2}\/\d{1,2}\/\d{2,4})

)

^((?&NS)) ((?&NS)) ((?&Date)) ((?&NS)) ((?&NS)) (.*) ((?&PaymentType)) ((?&Money)) ((?&Money))

NS stands for "no spaces". I added the newlines to help make the definitions more visible. I'm splitting data out of a PDF and use this RegEx to turn it into a CSV.

Here's some random test data

Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9
784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00
321538 697220 10/16/2019 584345 157837 90 Some text Medical Supplies, DME $4,893.00 $4,893.00
717049 131510 11/24/2019 591540 434357 80 Some text Hospital Costs $9,890.00 $9,890.00
441658 578030 1/6/2019 920334 593618 92 Some text Network Access Fee $2,912.00 $2,912.00
934772 726402 12/27/2019 262470 659210 41 Some text Network Access Fee $3,515.00 $3,515.00
456371 782567 3/22/2019 232286 569047 76 Some text Bill Review $845.00 $845.00
733793 243027 10/24/2019 827310 509902 30 Some text Physician Payment $9,401.00 $9,401.00
446456 289749 12/14/2019 399924 975049 73 Some text Physical Therapy costs $5,212.00 $5,212.00
657106 762255 6/13/2019 858558 157695 53 Some text Medical Payment $5,931.00 $5,931.00
631262 523757 12/10/2019 221874 270665 85 Some text Medical Supplies, DME $592.00 $592.00
705439 821105 7/9/2019 807562 429778 32 Some text Bill Review $1,802.00 $1,802.00

 

All my code provided is Public Domain... but it may not work. ;) Use it, change it, break it, whatever you want.

My Humble Contributions:
Personal Function Documentation - A personal HelpFile for your functions
Acro.au3 UDF - Automating Acrobat Pro
ToDo Finder - Find #ToDo: lines in your scripts
UI-SimpleWrappers UDF - Use UI Automation more Simply-er
KeePass UDF - Automate KeePass, a password manager
InputBoxes - Simple Input boxes for various variable types


It seems that each named group inside your DEFINE block adds an empty group. You have 4 definitions => 4 empty groups.

Examples with fewer definitions, matched against a single line of text:

784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00

2 definitions => 2 empty groups

(?i)(?m)(?(DEFINE)(?<NS>(?:[^ \n]+))(?<Date>\d{1,2}\/\d{1,2}\/\d{2,4}))^((?&NS)) ((?&NS)) ((?&Date))
0:
1:
2: 784070
3: 475086
4: 2/21/2019

1 definition => 1 empty group

(?i)(?m)(?(DEFINE)(?<NS>(?:[^ \n]+)))^((?&NS))
0:
1: 784070

I tried the following to get rid of the empty group. It works, but I don't know why:

1 definition => no empty group

(?i)(?m)(?(DEFINE)(?<NS>(?:[^ \n]+)))^(?&NS)
0: 784070

Hope it will help you to solve your problem :)

Edit: the following shows that the 4 empty groups are not caused by the 4 ((?&NS)) in your search pattern.

1 definition => 1 empty group (even though there are 2 ((?&NS)) in the search pattern)

(?i)(?m)(?(DEFINE)(?<NS>(?:[^ \n]+)))^((?&NS)) ((?&NS))
0: 
1: 784070
2: 475086
Edited by pixelsearch
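
A minimal sketch for reproducing the results above in AutoIt, assuming the single test line and the two-definition pattern from this post (the variable names are only illustrative). One plausible reading of the behaviour: in PCRE, named groups are numbered as well as named, so the groups declared inside (?(DEFINE)...) would take capture numbers 1 and 2 here. They never take part in the match, which would leave them empty at the front of the returned array.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Single test line taken from the sample data.
Local $sLine = "784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00"

; Two named definitions inside DEFINE, then three ordinary capturing groups.
; Assumption: the definitions count as capture groups 1 and 2,
; so the ordinary groups would be numbered 3, 4 and 5.
Local $sPattern = '(?im)(?(DEFINE)(?<NS>[^ \n]+)(?<Date>\d{1,2}/\d{1,2}/\d{2,4}))' & _
        '^((?&NS)) ((?&NS)) ((?&Date))'

Local $aResult = StringRegExp($sLine, $sPattern, $STR_REGEXPARRAYMATCH)
If @error Then
    ConsoleWrite("No match (or invalid pattern)" & @CRLF)
    Exit
EndIf

; Show every returned element, including the empty leading ones.
_ArrayDisplay($aResult, "2 definitions")
```

This would also fit the last example with no empty group: there, no numbered group takes part in the match at all, and only the whole match comes back.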

Thanks! At least I can just replace all of the groups for now with their definitions. I would love to know why they're being captured though 😐

Edited by seadoggie01
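
What "replacing the groups with their definitions" could look like, as a rough sketch: the sub-patterns from the first post are kept in ordinary AutoIt string variables and concatenated into the final pattern, so no DEFINE block is needed. The variable names and the two-line sample are illustrative assumptions, not code from the thread.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

; Sub-patterns as plain strings (hypothetical variable names).
Local $sNS = '[^ \n]+'
Local $sDate = '\d{1,2}/\d{1,2}/\d{2,4}'
Local $sMoney = '\$[\d,]*\.\d*'
; The alternation is wrapped in a non-capturing group so it can be embedded safely.
Local $sPaymentType = '(?:Pharmacy|Hospital Costs|Physical Therapy costs|Medical Payment|' & _
        'Physician Payment|Medical Supplies, DME|Bill Review|Network Access Fee|Chiropractic Expenses)'

; Only the nine wanted capturing groups remain in the pattern.
Local $sPattern = '(?im)^(' & $sNS & ') (' & $sNS & ') (' & $sDate & ') (' & $sNS & ') (' & $sNS & ') (.*) (' & _
        $sPaymentType & ') (' & $sMoney & ') (' & $sMoney & ')'

Local $sText = "784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00" & @CRLF & _
        "321538 697220 10/16/2019 584345 157837 90 Some text Medical Supplies, DME $4,893.00 $4,893.00"

; Flat array of all captures from all matches.
Local $aResult = StringRegExp($sText, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If @error Then
    ConsoleWrite("No match (or invalid pattern)" & @CRLF)
    Exit
EndIf

_ArrayDisplay($aResult, "Inlined definitions")
```

The trade-off is that the pattern string gets longer and harder to read in one piece, but the definitions still live in one place each.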


That's a "feature" of the legacy PCRE versions.  The currently supported branch is PCRE2 and is a complete rewrite of the library.

Unfortunately AutoIt is still using the legacy version.

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


3 hours ago, jchd said:

feature

Ugh, okay, thank you! I'll just write my code around it 🙄
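
One way to write the code around it, sketched under the assumption stated in the first post: every match comes back as 4 empty entries followed by the 9 real captures. The constants, variable names and two-line sample are illustrative. The flat result is copied into a 2D array and the placeholders are skipped.

```autoit
#include <Array.au3>
#include <StringConstants.au3>

Local Const $iDefined = 4 ; named definitions inside (?(DEFINE)...), assumed to come back empty
Local Const $iWanted = 9  ; real capturing groups in the pattern
Local Const $iStride = $iDefined + $iWanted

Local $sText = "784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00" & @CRLF & _
        "321538 697220 10/16/2019 584345 157837 90 Some text Medical Supplies, DME $4,893.00 $4,893.00"

Local $sPattern = '(?im)(?(DEFINE)' & _
        '(?<NS>[^ \n]+)' & _
        '(?<PaymentType>Pharmacy|Hospital Costs|Physical Therapy costs|Medical Payment|Physician Payment|' & _
        'Medical Supplies, DME|Bill Review|Network Access Fee|Chiropractic Expenses)' & _
        '(?<Money>\$[\d,]*\.\d*)' & _
        '(?<Date>\d{1,2}/\d{1,2}/\d{2,4}))' & _
        '^((?&NS)) ((?&NS)) ((?&Date)) ((?&NS)) ((?&NS)) (.*) ((?&PaymentType)) ((?&Money)) ((?&Money))'

Local $aFlat = StringRegExp($sText, $sPattern, $STR_REGEXPARRAYGLOBALMATCH)
If @error Then
    ConsoleWrite("No match (or invalid pattern)" & @CRLF)
    Exit
EndIf

; One row per match, one column per real capture.
Local $iRows = Int(UBound($aFlat) / $iStride)
Local $aRows[$iRows][$iWanted]
For $r = 0 To $iRows - 1
    For $c = 0 To $iWanted - 1
        ; Skip the $iDefined placeholder entries at the start of each match.
        $aRows[$r][$c] = $aFlat[$r * $iStride + $iDefined + $c]
    Next
Next

_ArrayDisplay($aRows, "Without the DEFINE placeholders")
```

If the number of definitions changes, only $iDefined needs updating. A 2D array like this should also be convenient to hand on to Excel.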


A fun way to work around it (and, BTW, get a CSV) :)

#Include <Array.au3>

$txt = "Column 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9" & @crlf & _ 
    "784070 475086 2/21/2019 951612 774400 19 Some text Network Access Fee $9,818.00 $9,818.00" & @crlf & _ 
    "321538 697220 10/16/2019 584345 157837 90 Some text Medical Supplies, DME $4,893.00 $4,893.00" & @crlf & _ 
    "717049 131510 11/24/2019 591540 434357 80 Some text Hospital Costs $9,890.00 $9,890.00" & @crlf & _ 
    "441658 578030 1/6/2019 920334 593618 92 Some text Network Access Fee $2,912.00 $2,912.00" & @crlf & _ 
    "934772 726402 12/27/2019 262470 659210 41 Some text Network Access Fee $3,515.00 $3,515.00" & @crlf & _ 
    "456371 782567 3/22/2019 232286 569047 76 Some text Bill Review $845.00 $845.00" & @crlf & _ 
    "733793 243027 10/24/2019 827310 509902 30 Some text Physician Payment $9,401.00 $9,401.00" & @crlf & _ 
    "446456 289749 12/14/2019 399924 975049 73 Some text Physical Therapy costs $5,212.00 $5,212.00" & @crlf & _ 
    "657106 762255 6/13/2019 858558 157695 53 Some text Medical Payment $5,931.00 $5,931.00" & @crlf & _ 
    "631262 523757 12/10/2019 221874 270665 85 Some text Medical Supplies, DME $592.00 $592.00" & @crlf & _ 
    "705439 821105 7/9/2019 807562 429778 32 Some text Bill Review $1,802.00 $1,802.00"

; Groups 1-4 are the named definitions inside DEFINE; the 9 real captures are groups 5-13.
$p = '(?(DEFINE)' & _
    '(?<NS>\d+)' & _
    '(?<Date>\d{1,2}/\d{1,2}/\d{2,4})' & _
    '(?<PaymentType>Pharmacy|Hospital Costs|Physical Therapy costs|Medical Payment|Physician Payment|Medical Supplies, DME|Bill Review|Network Access Fee|Chiropractic Expenses)' & _
    '(?<Money>\$[\d,]*\.\d*))' & _
    '(?im)^((?&NS)) ((?&NS)) ((?&Date)) ((?&NS)) ((?&NS)) (.+) ((?&PaymentType)) ((?&Money)) ((?&Money))'

; Rebuild each matching line from groups 5-13, joined with semicolons.
; The header line does not match the pattern, so it is left unchanged.
$s = StringRegExpReplace($txt, $p, "$5;$6;$7;$8;$9;${10};${11};${12};${13}")
MsgBox(0, "", $s)

 


On 1/10/2020 at 4:53 PM, mikell said:

A fun way to work around it (and, BTW, get a CSV) :)

Thanks! I never would've thought of that. I was trying to keep the data in an array to put it into Excel, but I could just use Text to Columns for that.
