
Organize a text file by comparing it to another


GimK

This is my first post, so first of all, hello everyone!

I have already searched for an answer to my question, but if I missed it, please redirect me :)

I am trying to organize a text file into an array by comparing it to another file. One of my files is a messy document full of text, spaces, and tabs that I got by copying a form. The other file is a list of every title in the form.

To make it clear, here is an example:

File 1:

Name:Antony    Lastname : Kob
Age    15       height   :1.95      Hobbies  football, tennis, autoit

File 2:

Name
Lastname
Age
Height
Hobbies

This is of course much simpler than what I have, but the principle is the same. In the end, I would like an array with all the content of file 1 organised like this:

Name
Antony
Lastname
Kob
Age
15
Height
1.95
Hobbies
football
tennis
autoit

How can I do that? Thanks!

EDIT: I forgot to say that sometimes the title is composed of multiple words, like "Owned By :" for example, and the text that follows it can be empty.

Edited by GimK

Up !

I have actually managed to write part of the code.

#include <MsgBoxConstants.au3>
#include <StringConstants.au3>
#include <AutoItConstants.au3>
#include <FileConstants.au3>
#include <Array.au3>
#include <File.au3>

HotKeySet("{END}", "Terminate")

Local $formTitlesPath = @ScriptDir & "\FormTitles.txt"
Local $formTitles
Local $all
Local $current


If NOT (_FileReadToArray($formTitlesPath, $formTitles)) Then
  fileMsgBox(@error, "FormTitles.txt")
  Terminate()
EndIf

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If ($testFile == -1)  Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

$all = FileRead($testFile)


Local $charPos
Local $finalSize = 2*UBound($formTitles)
Local $finalArray[$finalSize]

While 1
  
    For $i = 1 To UBound($formTitles)-1
        $current = $formTitles[$i]
        $finalArray[2*$i] = $current
    
        $charPos = StringinStr($all, $current) + StringLen($current)
        $finalArray[(2*$i)+1] = $charPos
     Next
_ArrayDisplay($finalArray)

WEnd
Terminate()


Func Terminate()
  Exit
EndFunc   ;==>Terminate

;File opening error function
Func fileMsgBox($error, $file)
  MsgBox(0, "Oops, there's an error type " & $error, "Can't open the '" & $file & "' file.")
EndFunc

But this should only create a duplicate of the $formTitles array with a gap after each title, filled, I believe, with the starting position of the text that follows that title.

However, judging by the result, the positions seem wrong. And I can't figure out how to capture the text that sits between the titles.

Edited by GimK
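
For the position-based idea described in the post above, a minimal sketch of how the text between two titles could be captured with StringInStr and StringMid. The sample string comes from "File 1" in the first post; the variable names are made up for illustration, and the sketch assumes the two titles appear in this order in the text.

```autoit
; Sample data taken from "File 1"; all variable names are hypothetical.
Local $sText = "Name:Antony    Lastname : Kob"
Local $sTitle = "Name"
Local $sNextTitle = "Lastname"

; Position of the first character after the title.
Local $iStart = StringInStr($sText, $sTitle) + StringLen($sTitle)
; Position where the next title begins, searching from $iStart onward.
Local $iEnd = StringInStr($sText, $sNextTitle, 0, 1, $iStart)

; Cut out the text in between, drop a leading colon, then trim whitespace.
Local $sValue = StringMid($sText, $iStart, $iEnd - $iStart)
$sValue = StringStripWS(StringRegExpReplace($sValue, "^\h*:?", ""), 3)

ConsoleWrite($sValue & @CRLF) ; prints Antony
```

Note that StringInStr is case-insensitive by default and would also find "Name" inside "Lastname", so this approach only works when the titles are searched in document order.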

Just a try

#Include <Array.au3>

$txt = "  Owned By :        Name:Antony    Lastname : Kob" & @crlf & _
    "Age    15       height   :1.95      Hobbies  football, tennis, autoit"

$ref = "Owned By|Name|Lastname|Age|Height|Hobbies"

$txt = StringReplace(StringStripWS($txt, 3), @crlf, @TAB)
$txt1 = StringRegExpReplace($txt, '(?i)(?<!^|\w)(?=' & $ref & ')|(?<=' & $ref & ')\h*:?', @crlf) 
; Msgbox(0,"1", $txt1)

$res = StringSplit($txt1, @crlf, 3)
Local $array[UBound($res)/2][2]
For $i = 0 to UBound($res)-1 step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)

 


Hi! Thank you for the answer.

Sorry, I'm pretty new to AutoIt, so I don't understand everything. Could you roughly explain what you do? Even with the function reference for StringRegExp, I don't really understand your pattern, nor the code that follows it.

Thanks for your help !

 


Regular Expressions aren't easy to understand until you work with them on a daily basis. That's at least my impression.


water, your impression is sooo correct  :)

GimK,
The first String* functions are easy to understand.
Explanation of the StringRegExpReplace pattern:

'(?i)(?<!^|\w)(?=' & $ref & ')|(?<=' & $ref & ')\h*:?'

(?i)    : case insensitive
(?<! )  : negative lookbehind, means 'not preceded by'
    ^|\w  : beginning of string OR a word char
(?=' & $ref & ')  : positive lookahead, means 'followed by' (by the content of the $ref variable)
|    : or (alternation)
(?<=' & $ref & ')  : positive lookbehind, means 'preceded by' (by the content of the $ref variable)
\h*:?  : 0 or more horizontal whitespace + an optional colon


$ref = "Owned By|Name|Lastname|Age|Height|Hobbies"  :
    This string contains the subpattern with the keywords alternation
    It means ("Owned By" OR "Name" OR "Lastname" ... etc )

So in usual language this regex says :

" Find
- positions (not preceded by the beginning of string OR by a word char)  ; because "name" must not match in "Lastname"
                 and (followed by a keyword)
or
- some horizontal spaces (or none) with a colon (or not) preceded by a keyword

And replace them by a @crlf "
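
To see the two halves of the pattern at work, here is a small sketch that runs the same expression on a shortened sample and inserts a visible "#" marker instead of @crlf. The shortened text, the two-keyword list and the variable names are assumptions made for illustration only.

```autoit
; Shortened sample text and keyword list (hypothetical names).
Local $sText = "Name:Antony    Lastname : Kob"
Local $sRef = "Name|Lastname"

; Same pattern as explained above: a break before each keyword, and a break
; after each keyword that also swallows optional spaces and an optional colon.
Local $sPattern = '(?i)(?<!^|\w)(?=' & $sRef & ')|(?<=' & $sRef & ')\h*:?'

; Use "#" as the replacement so the inserted breaks are easy to see.
Local $sMarked = StringRegExpReplace($sText, $sPattern, "#")

ConsoleWrite($sMarked & @CRLF)
```

The marker should show up before and after each title, while the "name" inside "Lastname" stays untouched because it is preceded by a word character.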

 

Edited by mikell

Alright, thanks!

I think I understood; the rest is clear now! (Sorry for the delay, I couldn't work on it this weekend.)

water, you must be right, because this looks a little bit like Brainfuck to me at the moment ;)


If NOT (_FileReadToArray($formTitlesPath, $formTitles)) Then
  fileMsgBox(@error, "FormTitles.txt")
  Terminate()
EndIf

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If ($testFile == -1)  Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

$all = FileRead($testFile)
Local $fAll
Local $res

$formTitles = _ArrayToString($formTitles, "|")
$all = StringReplace(StringStripWS($all, 3), @crlf, @TAB)
$fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $formTitles & ')|(?<=' & $formTitles & ')\h*:?', @crlf)
Msgbox(0,"1", $formTitles)
MsgBox(0,"1", $fAll)

$res = StringSplit($fAll, @crlf, 3)
_ArrayDisplay($res)

Local $array[UBound($res)/2][2]
For $i = 0 to UBound($res)-1 step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)

Terminate()

Well, I still have an issue. The list of titles seems okay, as does the $fAll string (= $txt1). But the $res array has only its first column filled, with all titles and answers without any whitespace. And I guess that is why I get this error:

"Array variable has incorrect number of subscripts or subscript dimension range exceeded : $array[$i/2][0] = $res[$i]
^ ERROR"

I don't see where it's coming from.
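
One way such a subscript error can arise (an assumption here, since the actual test.txt is not shown) is that the split result does not contain complete title/value pairs, for example when the pattern matches nothing and StringSplit returns everything in a single element. A minimal sketch of a guard, using a hypothetical helper named PairUp that is not part of the code in this thread:

```autoit
; Hypothetical helper: pair up a flat title/value list, refusing input
; that cannot form complete pairs instead of indexing past the end.
Func PairUp(Const ByRef $aFlat)
    Local $iCount = UBound($aFlat)
    ; Fewer than two elements or an odd count cannot be paired.
    If $iCount < 2 Or Mod($iCount, 2) <> 0 Then Return SetError(1, $iCount, 0)

    Local $aPairs[Int($iCount / 2)][2]
    For $i = 0 To $iCount - 1 Step 2
        $aPairs[Int($i / 2)][0] = $aFlat[$i]
        $aPairs[Int($i / 2)][1] = StringStripWS($aFlat[$i + 1], 3)
    Next
    Return $aPairs
EndFunc   ;==>PairUp

; Made-up input that mimics a failed split: one element holding everything.
Local $aBad[1] = ["everything in one element"]
PairUp($aBad)
If @error Then ConsoleWrite("Cannot pair " & @extended & " element(s)" & @CRLF)
```

With a check like this, a pattern that fails to match shows up as a clear message rather than as a dimension error in the pairing loop.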

 


I still have the same error.

I changed this line

$res = StringSplit($fAll, @crlf, 3)

to this

$res = StringSplit($fAll, @TAB, 3)

And now I have a readable array in $res, even if there are a lot of blank lines, but I still get the same dimension error with $array.

But I have to admit I don't understand what is happening, since this

$all = StringReplace(StringStripWS($all, 3), @crlf, @TAB)
$fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $formTitles & ')|(?<=' & $formTitles & ')\h*:?', @crlf)

should put @crlf between each item, and not @TAB, right? Unless StringRegExpReplace() isn't working correctly.

Edited by GimK

Hmm, regex needs accuracy.
The pattern in post #3 was intended to work on your sample text 'File 1' in post #1.
So if you are currently using a different text, could you please post an exact copy of the current content of "test.txt"?

BTW the regex uses @crlf as the delimiter for the output, so if one or more @crlf already exist in the original text they must be removed first (which is why I replaced them with a tab).
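
A small sketch of that normalization step, under the assumption that the input may also contain lone @LF or @CR characters. It swaps the plain StringReplace for StringRegExpReplace with \R, which matches any line-break sequence; the sample string and variable names are made up.

```autoit
; Hypothetical two-line input, as it might come from test.txt.
Local $sRaw = "Name:Antony" & @CRLF & "Age 15"

; Replace every existing line break with a tab so that @CRLF is free to act
; as the only delimiter later on. \R matches @CRLF as well as a lone @LF or @CR.
Local $sFlat = StringRegExpReplace(StringStripWS($sRaw, 3), "\R", @TAB)

ConsoleWrite(StringInStr($sFlat, @CRLF) & @CRLF) ; prints 0: no @CRLF remains
```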


OMG
I dreaded something like this.
Where does this text come from? A web page? If so, there is certainly a better / easier / more reliable way to go.


Edit
OK, the problem was in the file "FormTitles.txt", with some titles containing either special characters or typos.
Please use the one below, as is, with this code:

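#include <Array.au3>
#include <File.au3>

; Read the form titles, one per line; element [0] holds the line count
Local $formTitlesPath = @ScriptDir & "\FormTitles.txt"
Local $formTitles
If Not _FileReadToArray($formTitlesPath, $formTitles) Then
  MsgBox(0, "Oops, there's an error type " & @error, "Can't open the 'FormTitles.txt' file.")
  Terminate()
EndIf

; Build the alternation, quoting each title with \Q...\E so that
; special characters in a title are matched literally
Local $titles = ""
For $i = 1 To $formTitles[0]
   $titles &= "\Q" & $formTitles[$i] & "\E|"
Next
$titles = StringTrimRight($titles, 1) ; drop the trailing "|"
; MsgBox(0, "1", $titles)

Local $testFile = FileOpen(@ScriptDir & "\test.txt")
If $testFile = -1 Then
  MsgBox(0, "Oops, there's an error", "Can't open test file")
  Terminate()
EndIf

Local $all = FileRead($testFile)
FileClose($testFile)

; Existing line breaks become tabs, so @CRLF can serve as the output delimiter
$all = StringReplace(StringStripWS($all, 3), @CRLF, @TAB)
; Insert @CRLF before each title, and after each title (plus optional spaces and colon)
Local $fAll = StringRegExpReplace($all, '(?i)(?<!^|\w)(?=' & $titles & ')|(?<=' & $titles & ')\h*:?', @CRLF)
; MsgBox(0, "1", $fAll)

; Flag 3 = split on the whole delimiter string, no count in element [0]
Local $res = StringSplit($fAll, @CRLF, 3)
; _ArrayDisplay($res)

; Pair up the pieces: column 0 = title, column 1 = value.
; Int() and the "- 2" bound skip a trailing element that has no partner.
Local $array[Int(UBound($res) / 2)][2]
For $i = 0 To UBound($res) - 2 Step 2
    $array[$i/2][0] = $res[$i]
    $array[$i/2][1] = StringStripWS($res[$i+1], 3)
Next
_ArrayDisplay($array)


Func Terminate()
  Exit
EndFunc   ;==>Terminate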

FormTitles.txt

Edited by mikell

Nope, it comes from IBM Notes, a collaboration platform. I looked for a COM interface or any other way to gather the data, but I didn't succeed.

Thanks a lot!

The FormTitles.txt you gave me is the same as the one I had, so maybe it is the wrong one? Because I still get the same error as before.

EDIT: Oh, my bad, I forgot to change the parameters of _FileReadToArray. This is working perfectly! Thanks a lot, I don't know what I would have done without your help.

 

Edited by GimK

Glad I could help  :)   (© M23)

BTW FormTitles.txt looks the same but is not exactly the same.
Example: there was a missing space in "Drawing Title : (Match Drawing Title)", and as regex requires perfect accuracy, such a typo is enough to make the whole thing fail...
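
The special characters matter for the same reason. A short sketch of why each title is wrapped in \Q...\E: the title is the one quoted above, while the trailing value "A-101" and the variable names are invented for illustration.

```autoit
; Title from the thread; it contains the regex metacharacters ( and ).
Local $sTitle = "Drawing Title : (Match Drawing Title)"
; Made-up line of form text: the title followed by a hypothetical value.
Local $sText = "Drawing Title : (Match Drawing Title) A-101"

; Unquoted, the parentheses form a group, so the literal "(" in the text
; is never matched and the title is not found.
ConsoleWrite(StringRegExp($sText, $sTitle) & @CRLF) ; prints 0

; Quoted with \Q...\E, every character of the title is taken literally.
ConsoleWrite(StringRegExp($sText, "\Q" & $sTitle & "\E") & @CRLF) ; prints 1
```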
