  • Moderators
Posted

It is no secret I suck hard when it comes to regex. Usually I can get by with String functions just fine, but I am struggling at the moment. Hoping someone out there can provide some regex guidance for a simple solution.

I am extracting all the text from a PDF file for manipulation. The text, when extracted, comes out like this:

Agency Delegated Admin Request BDC Name: John Doe
Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &
Taylor Ins. Agency Number: 123456
Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator Jim Jones

I am basically looking to pull each field (BDC Name, Request Date, Agency Name, etc.), but because the formatting is off I am not finding an easy way of capturing this. The length of each field is not going to be consistent, so String functions with StringLen are getting me nowhere. Is there a simple method of pulling the fields? I thought about pulling everything between the colons and then removing the excess, so I would get this:

: John Doe Request Date
: 12/02/2014 Agency Name
: FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number
: 123456 Agency State
: TN Administrator Name
: Lu Ann Smith Administrator Phone
: 900-111-2222 Administrator Extension
: 111 Administrator Email
: myemail@ins.com Back-up Administrator Jim Jones

...and then I would have to remove the next field's name from each string. But perhaps there is a better way that I am missing.

"Profanity is the last vestige of the feeble mind. For the man who cannot express himself forcibly through intellect must do so through shock and awe" - Spencer W. Kimball

How to get your question answered on this forum!

  • Moderators
Posted

JLogan3o13,

Are the various heading texts (i.e. the words before the colon but after the previous data) always the same? :huh:

M23

Any of my own code posted anywhere on the forum is available for use by others without any restriction of any kind

Open spoiler to see my UDFs:

Spoiler

ArrayMultiColSort ---- Sort arrays on multiple columns
ChooseFileFolder ---- Single and multiple selections from specified path treeview listing
Date_Time_Convert -- Easily convert date/time formats, including the language used
ExtMsgBox --------- A highly customisable replacement for MsgBox
GUIExtender -------- Extend and retract multiple sections within a GUI
GUIFrame ---------- Subdivide GUIs into many adjustable frames
GUIListViewEx ------- Insert, delete, move, drag, sort, edit and colour ListView items
GUITreeViewEx ------ Check/clear parent and child checkboxes in a TreeView
Marquee ----------- Scrolling tickertape GUIs
NoFocusLines ------- Remove the dotted focus lines from buttons, sliders, radios and checkboxes
Notify ------------- Small notifications on the edge of the display
Scrollbars ----------Automatically sized scrollbars with a single command
StringSize ---------- Automatically size controls to fit text
Toast -------------- Small GUIs which pop out of the notification area

 

  • Moderators
Posted

JLogan3o13,

Then this seems to work:

$sString = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

; Remove all @CRLF and then replace the headers with @CRLF
$sExtract = StringRegExpReplace(StringReplace($sString, @CRLF, " "), "Agency Delegated Admin Request BDC Name: |Request Date: |Agency Name: |Agency Number: |Agency State: |Administrator Name: |Administrator Phone: |Administrator Extension: |Administrator Email: |Back-up Administrator: ", @CRLF)

ConsoleWrite($sExtract & @CRLF)

No doubt a real guru will give you a better solution, but that should get you going. :)

M23


  • Moderators
Posted

(.+?)(?::\s*|\z)

Common sense plays a role in the basics of understanding AutoIt... If you're lacking in that, do us all a favor, and step away from the computer.

  • Moderators
Posted (edited)

Sorry, the vertical spaces were screwing it up, try this:  

"(?s)(?:(.+?)\:\s*|\z)" 

Edit:

You are going to have to replace the vertical spaces regardless to get a proper layout.

Local $aPatt = "(?s)(?:(.+?)\:\s*|\z)"
Local $aRegex = StringRegExp(StringRegExpReplace(ClipGet(), "\v+", " "), $aPatt, 3)
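
One thing to watch with the pattern above: the colon sits inside the same alternative as the capture group, so the text after the last colon is never captured. A minimal sketch of a variant that moves the alternation after the capture, so the trailing value is returned as well. The short sample text and the variable names are made up for illustration, and values that themselves contain a colon (times, URLs) would still be split.

```autoit
#include <Array.au3>

; Hypothetical shortened sample, not the full PDF text
Local $sText = "BDC Name: John Doe" & @CRLF & "Request Date: 12/02/2014 Agency State: TN"

; Flatten vertical whitespace so fields that wrap onto a new line stay together
Local $sFlat = StringRegExpReplace($sText, "\v+", " ")

; Lazily capture everything up to the next colon, or up to the end of the text
Local $aParts = StringRegExp($sFlat, "(?s)(.+?)(?::\s*|\z)", 3)
If @error Then
    ConsoleWrite("No match" & @CRLF)
Else
    ; Should list: "BDC Name", "John Doe Request Date", "12/02/2014 Agency State", "TN"
    _ArrayDisplay($aParts)
EndIf
```

Each element except the first and last still has the next field's name on the end, so this only gets you the "between the colons" chunks from the first post.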
Edited by SmOke_N

Posted (edited)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[11] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator", "\z"]
For $i = 0 to 9
     $res[$i][0] = $items[$i]
     $res[$i][1] = StringRegExpReplace($txt, '(?s).*' & $items[$i] & ':\s*([^\r\n]+)\R?([^\r\n]+)?\s*' & $items[$i+1] & '.*', "$1$2")
Next

 _ArrayDisplay($res)

:)

Edit

This one will work even if some of the info is missing.

Edited by mikell
Posted

BTW this can be done with String* funcs and without regex  :)

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $res[10][2], $txt = $str  ;FileRead("1.txt")
Local $items[10] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator"]

$txt = StringReplace($txt, @crlf, " ")
For $i = 1 to 9
   $txt = StringReplace($txt, $items[$i], @crlf & $items[$i])
Next
Msgbox(0,"", $txt)
$lines = StringSplit($txt, @crlf, 1)
 _ArrayDisplay($lines)
For $i = 1 to $lines[0]
    $tmp = StringSplit($lines[$i], ": ", 1) 
    $res[$i-1][0] = $tmp[1]
    $res[$i-1][1] = $tmp[2]
Next
 _ArrayDisplay($res)
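
The values split out this way keep the space that replaced each line break, and a line without ": " would leave $tmp with a single element. A small sketch of trimming and guarding one line, assuming the same ": " delimiter as above. The sample line and variable names are only for illustration.

```autoit
; One flattened line as the loop above would produce it (note the trailing space)
Local $sLine = "Agency State: TN "

; Flag 1 = treat ": " as a single delimiter string rather than two separate characters
Local $aTmp = StringSplit($sLine, ": ", 1)

If $aTmp[0] >= 2 Then
    ; Flag 3 = strip leading (1) plus trailing (2) whitespace
    ConsoleWrite($aTmp[1] & " = [" & StringStripWS($aTmp[2], 3) & "]" & @CRLF)
Else
    ; Delimiter not found: StringSplit returns the whole line in $aTmp[1]
    ConsoleWrite("No ': ' found in: " & $sLine & @CRLF)
EndIf
```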
  • Moderators
Posted

Yes, I did much the same, mikell: StringSplit each line and then capture my content from there. But the PDFs are large and it was getting unwieldy.

Thanks, all, for the direction. I believe I've found what will work best.

Posted (edited)

Another way :

#Include <Array.au3>

Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
    "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
    "Taylor Ins. Agency Number: 123456" & @CRLF & _
    "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

Local $ret = StringRegExp(StringRegExpReplace($str, "\R", " "), "(?s)(Agency Delegated Admin Request BDC Name|Request Date|Agency Name|Agency Number|Agency State|Administrator Name|Administrator Phone|Administrator Extension|Administrator Email|Back-up Administrator): (.+?)\h*(?=(?1)|$)", 3)
_ArrayDisplay($ret)

; Optional: convert to 2D with _Array1DTo2D, see http://www.autoitscript.com/forum/topic/165600-array1dto2d/
; $ret2D = _Array1DTo2D($ret, 2)
; _ArrayDisplay($ret2D)

; Pair up the flat result: even indexes hold the labels, odd indexes hold the values
Local $aResult[UBound($ret) / 2][2]
For $i = 0 To UBound($ret) - 1 Step 2
    $aResult[$i / 2][0] = $ret[$i]
    $aResult[$i / 2][1] = $ret[$i + 1]
Next
_ArrayDisplay($aResult)
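
If the list of field names changes between PDFs, the alternation could be built from an array instead of being typed into the pattern. A rough sketch of that idea, assuming the labels are known up front. The label list, sample text and variable names here are hypothetical, and \Q...\E is used so that any regex metacharacters in a label are matched literally.

```autoit
#include <Array.au3>

; Hypothetical label list; extend it to match the fields in your own text
Local $aLabels[3] = ["Request Date", "Agency Name", "Agency Number"]
Local $sText = "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. Agency Number: 123456"

; Join the labels into \Qlabel1\E|\Qlabel2\E|... for use as capture group 1
Local $sAlt = "\Q" & _ArrayToString($aLabels, "\E|\Q") & "\E"
Local $sPattern = "(?s)(" & $sAlt & "): (.+?)\h*(?=(?1)|$)"

Local $aPairs = StringRegExp($sText, $sPattern, 3)
If @error Then
    ConsoleWrite("No fields found" & @CRLF)
Else
    ; Elements alternate: label, value, label, value, ...
    For $i = 0 To UBound($aPairs) - 2 Step 2
        ConsoleWrite($aPairs[$i] & " = " & $aPairs[$i + 1] & @CRLF)
    Next
EndIf
```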
Edited by jguinch
