  • Moderators

It is no secret I suck hard when it comes to regex. Usually I can get by with String functions just fine, but I am struggling at the moment. Hoping someone out there can provide some regex guidance for a simple solution.

I am extracting all the text from a PDF file for manipulation. The text, when extracted, comes out like this:

Agency Delegated Admin Request BDC Name: John Doe
Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &
Taylor Ins. Agency Number: 123456
Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator Jim Jones

I am basically looking to pull each field (BDC Name, Request Date, Agency Name, etc.), but as the formatting is off I am not finding an easy way of capturing this. The length of each field is not going to be consistent either, so String functions with StringLen are getting me nowhere. Is there a simple method of pulling the fields? I thought about pulling everything between the colons, and then just removing the excess, so I would get this:

: John Doe Request Date
: 12/02/2014 Agency Name
: FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number
: 123456 Agency State
: TN Administrator Name
: Lu Ann Smith Administrator Phone
: 900-111-2222 Administrator Extension
: 111 Administrator Email
: myemail@ins.com Back-up Administrator Jim Jones

..and then would have to remove the next field's name from the string. But perhaps there is a better way that I am missing.
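
The "split on the colons, then trim off the next field's name" idea above could look roughly like the sketch below. It assumes the header names are known in advance and uses a shortened, made-up sample string and header list rather than the real PDF text.

```autoit
; Shortened sample text and header list (assumptions for illustration only)
Local $sText = "BDC Name: John Doe Request Date: 12/02/2014 Agency State: TN"
Local $aHeaders[3] = ["BDC Name", "Request Date", "Agency State"]

; Flag 1 = treat ": " as one whole delimiter string
Local $aParts = StringSplit($sText, ": ", 1)

Local $sValue, $sNext
; $aParts[1] is the first header; every later piece is "value + next header"
For $i = 2 To $aParts[0]
    $sValue = $aParts[$i]
    If $i - 1 < UBound($aHeaders) Then
        ; Cut the next field's name off the end of this piece
        $sNext = $aHeaders[$i - 1]
        If StringRight($sValue, StringLen($sNext)) = $sNext Then
            $sValue = StringTrimRight($sValue, StringLen($sNext))
        EndIf
    EndIf
    ; Flag 3 = strip leading and trailing whitespace
    ConsoleWrite($aHeaders[$i - 2] & " = " & StringStripWS($sValue, 3) & @CRLF)
Next
```

This only holds up if no value contains ": " itself and the headers always appear in the same order.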

  • Moderators


Are the various heading texts (i.e. the words before the colon but after the previous data) always the same? :huh:


  • Moderators


Then this seems to work:

$sString = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

; Remove all @CRLF and then replace the headers with @CRLF
$sExtract = StringRegExpReplace(StringReplace($sString, @CRLF, " "), "Agency Delegated Admin Request BDC Name: |Request Date: |Agency Name: |Agency Number: |Agency State: |Administrator Name: |Administrator Phone: |Administrator Extension: |Administrator Email: |Back-up Administrator: ", @CRLF)

ConsoleWrite($sExtract & @CRLF)
No doubt a real guru will give you a better solution, but that should get you going. :)
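
If the values are needed in an array rather than as console output, the same replace-then-split idea could be extended along these lines. This is only a sketch on a shortened, made-up sample with three headers; the variable names are illustrative.

```autoit
#include <Array.au3>

; Shortened sample input (an assumption, not the full PDF text)
Local $sString = "BDC Name: John Doe" & @CRLF & "Request Date: 12/02/2014 Agency State: TN"

; Flatten the line breaks, then turn each known header into a line break
Local $sExtract = StringRegExpReplace(StringReplace($sString, @CRLF, " "), "BDC Name: |Request Date: |Agency State: ", @CRLF)

; Drop the leading @CRLF, then split on @CRLF (flag 1 = whole-string delimiter)
Local $aValues = StringSplit(StringStripWS($sExtract, 3), @CRLF, 1)

For $i = 1 To $aValues[0]
    ; Trim the space left in front of each removed header
    $aValues[$i] = StringStripWS($aValues[$i], 3)
Next

_ArrayDisplay($aValues) ; element 0 holds the count, 1..n hold the values
```

The values come out in header order, so element 1 lines up with the first header, element 2 with the second, and so on.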


  • Moderators
Posted (edited)

Sorry, the vertical spaces were screwing it up, try this:  



You are going to have to replace the vertical spaces regardless to get a proper layout.

Local $aPatt = "(?s)(?:(.+?)\:\s*|\z)"
Local $aRegex = StringRegExp(StringRegExpReplace(ClipGet(), "\v+", " "), $aPatt, 3)
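
To see what the vertical-space replacement does on its own, here is a small sketch using one wrapped field from the sample text (the variable names are just for illustration):

```autoit
; One field that wraps across two lines, as in the extracted PDF text
Local $sRaw = "Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & "Taylor Ins. Agency Number: 123456"

; \v+ matches a run of vertical whitespace (CR, LF, etc.), so each line break becomes one space
Local $sFlat = StringRegExpReplace($sRaw, "\v+", " ")

ConsoleWrite($sFlat & @CRLF)
```

After this step the wrapped value sits on a single line, so a pattern no longer has to account for line breaks inside a field.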
Edited by SmOke_N

Posted (edited)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[11] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator", "\z"]
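For $i = 0 to 9
     ; Column 0: field name; column 1: the text between this header and the next one
     $res[$i][0] = $items[$i]
     $res[$i][1] = StringRegExpReplace($txt, '(?s).*' & $items[$i] & ':\s*([^\r\n]+)\R?([^\r\n]+)?\s*' & $items[$i+1] & '.*', "$1$2")
Next
_ArrayDisplay($res)
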
This one will work even in case of missing info(s)

Edited by mikell

BTW this can be done with String* funcs and without regex  :)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[10] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator"]

$txt = StringReplace($txt, @crlf, " ")
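; Put every header (except the first) on its own line
For $i = 1 to 9
   $txt = StringReplace($txt, $items[$i], @crlf & $items[$i])
Next
Msgbox(0,"", $txt)
; One "Header: value" pair per line now, so split each line on ": "
$lines = StringSplit($txt, @crlf, 1)
For $i = 1 to $lines[0]
    $tmp = StringSplit($lines[$i], ": ", 1)
    $res[$i-1][0] = $tmp[1]
    $res[$i-1][1] = $tmp[2]
Next
_ArrayDisplay($res)
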
  • Moderators

Yes, I did much the same, mikell: StringSplit each line and then capture my content from there. But the PDFs are large and it was getting unwieldy.

Thanks, all, for the direction. I believe I've found what will work best.

Posted (edited)

Another way:

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $ret = StringRegExp(StringRegExpReplace($str, "\R", " "), "(?s)(Agency Delegated Admin Request BDC Name|Request Date|Agency Name|Agency Number|Agency State|Administrator Name|Administrator Phone|Administrator Extension|Administrator Email|Back-up Administrator): (.+?)\h*(?=(?1)|$)", 3)

; $ret2D = _Array1DTo2D($ret, 2) ; http://www.autoitscript.com/forum/topic/165600-array1dto2d/
; _ArrayDisplay($ret2D)

; $ret alternates header, value, header, value...; pair them up in a 2D array
Local $aResult[ UBound($ret) / 2 ][2]
For $i = 0 To UBound($ret) - 1 Step 2
    $aResult[$i / 2][0] = $ret[$i]
    $aResult[$i / 2][1] = $ret[$i + 1]
Next
_ArrayDisplay($aResult)
Edited by jguinch
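
Once the name/value pairs sit in a 2D array like the ones built in this thread, pulling a single field by name could be done with a small lookup loop. The helper name `_GetField` and the two-row demo array below are hypothetical, just to show the shape of it.

```autoit
; Demo data in the same [name][value] layout (made-up subset of the sample)
Local $aDemo[2][2] = [["Agency State", "TN"], ["Agency Number", "123456"]]

ConsoleWrite(_GetField($aDemo, "Agency Number") & @CRLF)

; Hypothetical helper: return the value stored next to $sName,
; or "" with @error = 1 when the name is not in the array
Func _GetField(ByRef $aPairs, $sName)
    For $i = 0 To UBound($aPairs) - 1
        If $aPairs[$i][0] = $sName Then Return $aPairs[$i][1]
    Next
    Return SetError(1, 0, "")
EndFunc
```

Note that `=` compares strings case-insensitively in AutoIt; use `==` in the helper if the field names must match exactly.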

