Moderators JLogan3o13 Posted December 11, 2014

It is no secret I suck hard when it comes to regex. Usually I can get by with String functions just fine, but I am struggling at the moment. Hoping someone out there can provide some regex guidance for a simple solution.

I am extracting all the text from a PDF file for manipulation. The text, when extracted, comes out like this:

Agency Delegated Admin Request BDC Name: John Doe Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number: 123456 Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator Jim Jones

I am basically looking to pull each field (BDC Name, Request Date, Agency Name, etc.), but as the formatting is off I am not finding an easy way of capturing this. And the length of each field is not going to be consistent, so String functions with StringLen are getting me nowhere. Is there a simple method of pulling the fields?

I thought about pulling everything between the colons, and then just removing the excess, so I would get this:

: John Doe Request Date : 12/02/2014 Agency Name : FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number : 123456 Agency State : TN Administrator Name : Lu Ann Smith Administrator Phone : 900-111-2222 Administrator Extension : 111 Administrator Email : myemail@ins.com Back-up Administrator Jim Jones

...and then I would have to remove the next field's name from each string. But perhaps there is a better way that I am missing.
Moderators Melba23 Posted December 11, 2014

JLogan3o13,

Are the various heading texts (i.e. the words before the colon but after the previous data) always the same?

M23
Moderators JLogan3o13 Posted December 11, 2014

Yes, they are.
Moderators Melba23 Posted December 11, 2014

JLogan3o13,

Then this seems to work:

    $sString = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
            "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
            "Taylor Ins. Agency Number: 123456" & @CRLF & _
            "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

    ; Remove all @CRLF and then replace the headers with @CRLF
    $sExtract = StringRegExpReplace(StringReplace($sString, @CRLF, " "), "Agency Delegated Admin Request BDC Name: |Request Date: |Agency Name: |Agency Number: |Agency State: |Administrator Name: |Administrator Phone: |Administrator Extension: |Administrator Email: |Back-up Administrator: ", @CRLF)

    ConsoleWrite($sExtract & @CRLF)

No doubt a real guru will give you a better solution, but that should get you going.

M23
Moderators SmOke_N Posted December 11, 2014

    (.+?)(?::\s*|\z)
Moderators JLogan3o13 Posted December 11, 2014

That is awesome, thanks Melba.
Moderators JLogan3o13 Posted December 11, 2014

Thanks, SmOke_N, I will try that out as well.
Moderators SmOke_N Posted December 11, 2014 (edited)

Sorry, the vertical spaces were screwing it up, try this:

    "(?s)(?:(.+?)\:\s*|\z)"

Edit: You are going to have to replace the vertical spaces regardless to get a proper layout.

    Local $aPatt = "(?s)(?:(.+?)\:\s*|\z)"
    Local $aRegex = StringRegExp(StringRegExpReplace(ClipGet(), "\v+", " "), $aPatt, 3)

Edited December 11, 2014 by SmOke_N
Moderators JLogan3o13 Posted December 11, 2014

Thanks. You are right, it will take some massaging, but I can definitely work with it.
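The "massaging" in question, that is, removing the next field's name from the end of each colon-delimited chunk, might look roughly like the sketch below. It is only an illustration and not code from any poster: it assumes the header names are fixed (as confirmed earlier in the thread), that every header is followed by ": ", and that no value itself contains ": ". The shortened sample string and the variable names are made up for the example.

```autoit
#include <Array.au3>

; Shortened sample, already flattened to a single line (assumption for the example).
Local $sText = "Agency Delegated Admin Request BDC Name: John Doe Request Date: 12/02/2014 " & _
        "Agency Name: FUG Insurance Inc. dba Claribell & Taylor Ins. Agency Number: 123456"

; Known field names, in the order they appear in the text.
Local $aHeaders[4] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number"]

; Split on the whole delimiter string ": " (flag 1); element [0] holds the chunk count.
Local $aChunks = StringSplit($sText, ": ", 1)

; Expect one chunk more than there are headers: the first header, then one chunk per value.
If $aChunks[0] <> UBound($aHeaders) + 1 Then
    ConsoleWrite("Unexpected layout: " & $aChunks[0] & " chunks" & @CRLF)
    Exit
EndIf

Local $aResult[UBound($aHeaders)][2]
Local $sValue, $sNext

; Chunk $i (from 2 on) is the value of header $i - 2, followed by the name of the next header.
For $i = 2 To $aChunks[0]
    $sValue = $aChunks[$i]
    If $i <= UBound($aHeaders) Then
        $sNext = $aHeaders[$i - 1]
        ; Trim the next header's name off the end of the chunk, if it is there.
        If StringRight($sValue, StringLen($sNext)) = $sNext Then
            $sValue = StringTrimRight($sValue, StringLen($sNext))
        EndIf
    EndIf
    $aResult[$i - 2][0] = $aHeaders[$i - 2]
    $aResult[$i - 2][1] = StringStripWS($sValue, 3) ; strip leading and trailing whitespace
Next

_ArrayDisplay($aResult)
```

The guard on the chunk count is there because a stray ": " inside a value, or a header missing its colon, would shift every later field by one.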
mikell Posted December 11, 2014 (edited)

    #Include <Array.au3>

    Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
            "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
            "Taylor Ins. Agency Number: 123456" & @CRLF & _
            "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

    Local $res[10][2], $txt = $str ;FileRead("1.txt")
    Local $items[11] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator", "\z"]

    For $i = 0 to 9
        $res[$i][0] = $items[$i]
        $res[$i][1] = StringRegExpReplace($txt, '(?s).*' & $items[$i] & ':\s*([^\r\n]+)\R?([^\r\n]+)?\s*' & $items[$i+1] & '.*', "$1$2")
    Next
    _ArrayDisplay($res)

Edit: This one will work even in case of missing info(s).

Edited December 11, 2014 by mikell
mikell Posted December 11, 2014

BTW this can be done with String* funcs and without regex:

    #Include <Array.au3>

    Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
            "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
            "Taylor Ins. Agency Number: 123456" & @CRLF & _
            "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

    Local $res[10][2], $txt = $str ;FileRead("1.txt")
    Local $items[10] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number", "Agency State", "Administrator Name", "Administrator Phone", "Administrator Extension", "Administrator Email", "Back-up Administrator"]

    $txt = StringReplace($txt, @crlf, " ")
    For $i = 1 to 9
        $txt = StringReplace($txt, $items[$i], @crlf & $items[$i])
    Next
    Msgbox(0,"", $txt)

    $lines = StringSplit($txt, @crlf, 1)
    _ArrayDisplay($lines)

    For $i = 1 to $lines[0]
        $tmp = StringSplit($lines[$i], ": ", 1)
        $res[$i-1][0] = $tmp[1]
        $res[$i-1][1] = $tmp[2]
    Next
    _ArrayDisplay($res)
Moderators JLogan3o13 Posted December 11, 2014

Yes, I did much the same, mikell: StringSplit each line and then captured my content from there. But the PDFs are large and it was getting unwieldy. Thanks, all, for the direction. I believe I've found what will work best.
Bert Posted December 11, 2014

Have you looked at the RegEx tool script in the help file? That thing makes using RegEx much easier.
jguinch Posted December 11, 2014 (edited)

Another way:

    #Include <Array.au3>

    Local $str = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
            "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
            "Taylor Ins. Agency Number: 123456" & @CRLF & _
            "Agency State: TN Administrator Name: Lu Ann Smith Administrator Phone: 900-111-2222 Administrator Extension: 111 Administrator Email: myemail@ins.com Back-up Administrator: Jim Jones"

    Local $ret = StringRegExp(StringRegExpReplace($str, "\R", " "), "(?s)(Agency Delegated Admin Request BDC Name|Request Date|Agency Name|Agency Number|Agency State|Administrator Name|Administrator Phone|Administrator Extension|Administrator Email|Back-up Administrator): (.+?)\h*(?=(?1)|$)", 3)
    _ArrayDisplay($ret)

    ; $ret2D = _Array1DTo2D($ret, 2) ; http://www.autoitscript.com/forum/topic/165600-array1dto2d/
    ; _ArrayDisplay($ret2D)

    Local $aResult[ UBound($ret) / 2 ][2]
    Local $iIndex = 0
    For $i = 0 To UBound($ret) - 1 Step 2
        $aResult[$i / 2][0] = $ret[$i]
        $aResult[$i / 2][1] = $ret[$i + 1]
    Next
    _ArrayDisplay($aResult)

Edited December 12, 2014 by jguinch
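Since the header names are fixed, the long alternation in a pattern like this could also be built from an array, so the list of fields lives in one place. The following is a minimal sketch under that assumption, not part of any poster's code: the names `$aHeaders`, `$sAlt`, `$sPattern` and the shortened sample text are hypothetical, and each name is wrapped in `\Q...\E` so it is matched literally.

```autoit
#include <Array.au3>

; Field names, assumed fixed; edit this one list when the form changes.
Local $aHeaders[4] = ["Agency Delegated Admin Request BDC Name", "Request Date", "Agency Name", "Agency Number"]

; Shortened sample text with the line breaks still in it (made up for the example).
Local $sText = "Agency Delegated Admin Request BDC Name: John Doe" & @CRLF & _
        "Request Date: 12/02/2014 Agency Name: FUG Insurance Inc. dba Claribell &" & @CRLF & _
        "Taylor Ins. Agency Number: 123456"

; Join the names into "\Qname1\E|\Qname2\E|..." so any regex metacharacters are taken literally.
Local $sAlt = "\Q" & _ArrayToString($aHeaders, "\E|\Q") & "\E"

; Group 1 = a header, group 2 = its value, which lazily runs up to the next header or the end.
Local $sPattern = "(?s)(" & $sAlt & "): (.+?)\h*(?=(?1)|$)"

; Flatten line breaks first, then collect all matches (flag 3) as header, value, header, value...
Local $aFlat = StringRegExp(StringRegExpReplace($sText, "\R", " "), $sPattern, 3)
If @error Then
    ConsoleWrite("No fields matched" & @CRLF)
    Exit
EndIf

_ArrayDisplay($aFlat)
```

The flat result can then be folded into a two-column array with the same Step 2 loop used in the post above.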