leuce Posted March 19, 2021 (edited)

Hello everyone

I have a text file with lines of text, and some lines occur more than once. I would like to delete all lines that occur more than once (i.e. including all instances of that particular line). The lines are already sorted alphabetically. The problem is that I can't quite figure out how to write the regular expression.

    #include <Array.au3>

    $fo = FileOpen ("testfile.txt", 128)
    $fwo = FileOpen ($fo & "_output.txt", 129)
    $fr = FileRead ($fo)
    $freg = StringRegExpReplace ($fr, '(.+?\R){2,}', '')
    ; I also tried:
    ; $freg = StringRegExpReplace ($fr, '(.+?\R)+', '')
    ; $freg = StringRegExpReplace ($fr, '(.+?' & @CRLF & '){2,}', '')
    ; $freg = StringRegExpReplace ($fr, '^(.+?\R){2,}', '')
    ; $freg = StringRegExpReplace ($fr, '(?m)^(.+?\R){2,}', '')
    FileWrite ($fwo, $freg)

In the test file attached, only one line ("The quick brown fox.") should remain in the file.

Thanks
Samuel

PS. The alternative approach is to use an array and compare array items with each other in a series of loops, but I'm hoping the regex solution is viable.

Attachment: testfile.txt

TheXman Posted March 19, 2021 (edited)

<snip> Sorry, I misread the original request.

TheXman Posted March 19, 2021 (edited)

<snip> Sorry, I misread the original request.

AspirinJunkie Posted March 19, 2021 (edited)

Maybe a slightly simpler pattern:

    $sNew = StringRegExpReplace(FileRead("testfile.txt"), '(?ms)^(\V+)$.*\1\R', '')
    ConsoleWrite($sNew)

But it only works if the rows are already sorted - as you wrote.

TheXman Posted March 19, 2021 (edited)

<snip> Sorry, I read the original request wrong.

AspirinJunkie Posted March 19, 2021

> Just now, TheXman said:
> Is that not what you get?

Sure - what's the problem with this? That's what leuce wanted:

> 34 minutes ago, leuce said:
> In the test file attached, only one line ("The quick brown fox.") should remain in the file.

But your result could also be achieved with a slightly shorter pattern:

    $sNew = StringRegExpReplace(FileRead("testfile.txt"), '(?ms)^(\V+)$\R(?=.*\1)', '')
    ConsoleWrite($sNew)

leuce Posted March 19, 2021 (edited)

> 44 minutes ago, AspirinJunkie said:
> Maybe a slightly simpler pattern:
>
>     $sNew = StringRegExpReplace(FileRead("testfile.txt"), '(?ms)^(\V+)$.*\1\R', '')
>     ConsoleWrite($sNew)
>
> But it only works if the rows are already sorted - as you wrote.

Thanks, that regex works for the short test file and for slightly longer test files too, but on one specific longer test file it fails near the middle of the file for no immediately apparent reason (larger test file attached). By "fail" I mean it deletes about 60 lines that are not duplicates. (Added: I see that both the first and the last line in the group of erroneously deleted lines end with the same word.)

Ideally the script should delete too few lines rather than too many (i.e. if there are any lines that are not properly sorted, then those lines should just be ignored).

I'll see if I can figure out what happens with the long test file. But thanks again for the regex help -- it definitely is not my strong point.

Samuel

Attachment: testfile.txt

leuce Posted March 19, 2021 (edited)

In the meantime, I figured out how to do this using array item comparisons and loops instead of regular expressions. I know this is not relevant to regex, but I thought I'd post it since this was my second option for solving my overall problem.

    #include <Array.au3>

    $j = 0
    $fo = FileOpen ("testfile.txt", 128)
    $fws = FileOpen ($fo & "_sorted.txt", 129)
    $fwo = FileOpen ($fo & "_output.txt", 129) ; one can then compare these two files in e.g. WinMerge
    $fr = FileRead ($fo)
    $farr = StringSplit ($fr, @CRLF, 1)
    $count = $farr[0]
    _ArraySort($farr, 0, 0, 0, 0, 1)
    $fstr = _ArrayToString ($farr, @CRLF)
    FileWrite ($fws, $fstr)

    ; Overwrite the lines of each run of identical neighbours with the placeholder "x"
    While $j < $count
        If $farr[$j] = $farr[$j+1] Then
            $farr[$j] = "x"
            $j = $j + 1
            If $j <> $count Then
                If $farr[$j] <> $farr[$j+1] Then
                    $farr[$j] = "x"
                EndIf
            EndIf
        Else
            $j = $j + 1
        EndIf
    WEnd

    $fstr2 = _ArrayToString ($farr, @CRLF)
    FileWrite ($fwo, $fstr2)

Factfinder Posted March 19, 2021

How about this one:

    $freg = StringRegExpReplace (FileRead("testfile.txt"), '(?s)(.+?\R)\1+', '')

leuce Posted March 19, 2021

> 10 minutes ago, Factfinder said:
> How about this one:
>
>     $freg = StringRegExpReplace (FileRead("testfile.txt"), '(?s)(.+?\R)\1+', '')

I tested this one on two files... but I really must go to bed now... and it seems to work, thanks!

pseakins Posted March 19, 2021

My philosophy is to avoid RegEx at all costs. Probably unfounded. This stems from a bitter disagreement with colleagues some years ago. Anyway, this is how I would solve this problem. It works, and it gets rid of the blank line.

    #include <Array.au3>
    #include <File.au3>

    Dim $aText
    _FileReadToArray("C:\Users\user\Downloads\testfile.txt", $aText)
    _ArrayDisplay($aText)
    _ArraySort($aText, Default, 1) ; optional, as apparently the text is already sorted
    _ArrayDisplay($aText)
    $i = 1
    While $i <= $aText[0] - 1
        If $aText[$i] = $aText[$i + 1] Or $aText[$i] = "" Then
            _ArrayDelete($aText, $i)
            $aText[0] -= 1
        Else
            $i += 1
        EndIf
    WEnd
    _ArrayDisplay($aText)
    _FileWriteFromArray("C:\Users\user\Downloads\testfileout.txt", $aText, 1)

Phil Seakins

Musashi Posted March 20, 2021

@leuce ! If you prefer arrays, then you could also consider _ArrayUnique :

    #include <File.au3>
    #include <Array.au3>

    Global $aArrSource, $aArrUnique
    If Not _FileReadToArray(@ScriptDir & "\testfile.txt", $aArrSource, $FRTA_NOCOUNT) Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : _FileReadToArray")
    _ArraySort($aArrSource, Default, 0) ; optional : sort array
    $aArrUnique = _ArrayUnique($aArrSource, 0, 0, 1, $ARRAYUNIQUE_NOCOUNT)
    _FileWriteFromArray(@ScriptDir & "\testfile_unique.txt", $aArrUnique)

AspirinJunkie Posted March 20, 2021 (edited)

> 9 hours ago, leuce said:
> Thanks, that regex works for the short test file and it works for slightly longer test files too, but then on one specific longer test file it fails near the middle of the file for no immediately apparent reason (larger test file attached). By "fail" I mean it deletes about 60 lines that are not duplicates.

Yes, this is because the text of that line appears again later as part of another line. You can fix this by writing a ^ in front of the \1 in the pattern.

However, the following pattern is much more efficient (Factfinder's pattern also gives the right result, but it is expensive):

    $sNew = StringRegExpReplace(FileRead("testfile.txt"), "(?m)^(.+\R)\1+", '')
    ConsoleWrite($sNew)

@pseakins, @Musashi He wanted to completely delete the lines that appear several times, not only the extra copies. In your solutions only the duplicates are removed; one copy of each such line still remains.

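The over-deletion described in this post can be reproduced on a tiny string. The following is only a sketch: the sample lines are an assumption (they are not taken from the attached test file), and @LF is used as the line break to keep the demonstration independent of the newline convention. None of the four lines is a duplicate, but the text of the first line reappears at the end of the third.

```autoit
; Sample data (an assumption, not from the attached file): four distinct,
; sorted lines; the first line's text reappears at the end of the third line.
Local $sText = "fox" & @LF & "gamma" & @LF & "quick fox" & @LF & "zebra" & @LF

; Earlier pattern: the backreference may match at the end of any later line,
; so everything from "fox" through "quick fox" is removed.
Local $sLoose = StringRegExpReplace($sText, '(?ms)^(\V+)$.*\1\R', '')

; Later pattern: the repeat must follow immediately, as a complete line.
Local $sStrict = StringRegExpReplace($sText, '(?m)^(.+\R)\1+', '')

; Show line breaks as "|" so the results fit on one console line each.
ConsoleWrite("Loose:  " & StringReplace($sLoose, @LF, "|") & @CRLF)  ; Loose:  zebra|
ConsoleWrite("Strict: " & StringReplace($sStrict, @LF, "|") & @CRLF) ; Strict: fox|gamma|quick fox|zebra|
```

Because the second pattern only accepts a copy that starts right after the captured line's own line break, it never reaches across unrelated lines, which is also why it errs on the side of deleting too few lines when the input is not perfectly sorted.
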
Musashi Posted March 20, 2021

> 36 minutes ago, AspirinJunkie said:
> He wanted to completely delete the lines that appear several times - not only the doubles. In your solutions only the duplicates are removed - one line of them still remain.

Regarding this point, according to his description(s), I was not quite sure about it anyway. The title reads "... remove all duplicate lines...". The post states "I would like to delete all lines that occur more than once...". You're probably right, though. This would therefore mean that :

    Element A
    Element B
    Element B
    Element C

becomes :

    Element A
    Element C

and not :

    Element A
    Element B
    Element C

pseakins Posted March 20, 2021 (edited)

> 2 hours ago, AspirinJunkie said:
> He wanted to completely delete the lines that appear several times - not only the doubles.

Yes, I totally missed that requirement. Here's my second attempt.

*** It doesn't work, please ignore ***

    #include <Array.au3>
    #include <File.au3>

    Dim $aText
    $sPrevLine = "xyzplugh"
    _FileReadToArray("C:\Users\user\Downloads\testfile.txt", $aText)
    _ArrayDisplay($aText)
    _ArraySort($aText, Default, 1)
    _ArrayDisplay($aText)
    $i = 1
    While $i <= $aText[0] - 1
        ; enable one or the other of the next two lines depending if you want to delete null lines
        ; If $aText[$i] = $aText[$i + 1] Or $aText[$i] = $sPrevLine Or $aText[$i] = "" Then
        If $aText[$i] = $aText[$i + 1] Or $aText[$i] = $sPrevLine Then
            $sPrevLine = $aText[$i]
            _ArrayDelete($aText, $i)
            $aText[0] -= 1
        Else
            $i += 1
        EndIf
    WEnd
    _ArrayDisplay($aText)
    _FileWriteFromArray("C:\Users\user\Downloads\testfileout.txt", $aText, 1)

Ignore this last post of mine - it is rubbish - my code does not work correctly.

Phil Seakins

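For readers who still prefer an array over a regex, one possible way to drop every instance of a repeated line is to keep a line only when it differs from both of its neighbours. This is a minimal sketch, not a fix of the code above: the helper name _KeepSingletons is hypothetical, it assumes a sorted, 0-based array without a count element, and it compares with == (case-sensitive), whereas the scripts in this thread use = (case-insensitive).

```autoit
; Hypothetical helper (not part of any UDF): returns, as one @CRLF-separated
; string, the lines of a sorted 0-based array that occur exactly once.
Func _KeepSingletons(Const ByRef $aLines)
    Local $iLast = UBound($aLines) - 1, $sOut = "", $bDup
    For $i = 0 To $iLast
        $bDup = False
        ; Compare with the previous line, if there is one.
        If $i > 0 Then
            If $aLines[$i] == $aLines[$i - 1] Then $bDup = True
        EndIf
        ; Compare with the next line, if there is one.
        If $i < $iLast Then
            If $aLines[$i] == $aLines[$i + 1] Then $bDup = True
        EndIf
        ; Keep the line only if neither neighbour is identical.
        If Not $bDup Then $sOut &= $aLines[$i] & @CRLF
    Next
    Return $sOut
EndFunc

; Demo with the example from Musashi's post: prints "Element A" and "Element C".
Local $aDemo[4] = ["Element A", "Element B", "Element B", "Element C"]
ConsoleWrite(_KeepSingletons($aDemo))
```

Building an output string instead of calling _ArrayDelete inside the loop avoids shifting indexes while the neighbours are still being compared.
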
Musashi Posted March 20, 2021

@leuce : If you want to remove all lines that appear multiple times, then a regular expression (see examples from @Factfinder or @AspirinJunkie) is probably most suitable :

    #include <WinAPIFiles.au3>
    #include <FileConstants.au3>

    Global $hSourceFile = FileOpen(@ScriptDir & "\testfile.txt", BitOR($FO_READ, $FO_UTF8))
    If $hSourceFile = -1 Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : reading the file")

    Global $hTargetFile = FileOpen(@ScriptDir & "\testfile_target.txt", BitOR($FO_OVERWRITE, $FO_UTF8))
    If $hTargetFile = -1 Then Exit MsgBox(BitOR(4096, 16), "Message : ", "Error : writing the file")

    FileWrite($hTargetFile, StringRegExpReplace(FileRead($hSourceFile), "(?m)^(.+\R)\1+", ''))

    FileClose($hTargetFile)
    FileClose($hSourceFile)

Nine Posted March 20, 2021

Just be careful with the last line. If it does not have a \R (newline sequence) at the end, it will be kept even if there are multiple occurrences of that line before it...

Factfinder Posted March 20, 2021

> 28 minutes ago, Nine said:
> Just be careful with the last line. If it does not have any \R (newline sequence) at the end, it will be included even if there is multiple occurrences of that line before...

Good point. This should take care of that:

    $freg = StringRegExpReplace (FileRead("testfile1.txt"), '(?s)(.+?)\R(\1(\R|$))+', '')

Nine Posted March 20, 2021

Yes it does. Tested speed with:

    #include <Constants.au3>

    $hTimer = TimerInit()
    $freg = StringRegExpReplace (FileRead("testfile.txt"), '(?s)(.+?)\R(\1(\R|$))+', '')
    ConsoleWrite (TimerDiff($hTimer) & @CRLF)

    $hTimer = TimerInit()
    $sNew = StringRegExpReplace(FileRead("testfile.txt"), "(?m)^(.+?)\R(\1(\R|$))+", '')
    ConsoleWrite (TimerDiff($hTimer) & @CRLF)

    MsgBox ($MB_SYSTEMMODAL, "", $freg = $sNew)

Console output:

    +>Setting Hotkeys...--> Press Ctrl+Alt+Break to Restart or Ctrl+BREAK to Stop.
    833.19996119682
    1.2710359091648
    +>08:07:28 AutoIt3.exe ended.rc:0

Both provide the same result.

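The last-line caveat and the effect of the (?m)^ anchor can both be checked on short strings. This is only a sketch under stated assumptions: the sample lines are invented for illustration, and @LF is used as the line break. The first sample ends without a trailing newline; in the second, the line "b" is also the tail of the line "ab".

```autoit
; Sample 1 (an assumption): sorted lines, the last one has no trailing newline.
Local $sText = "apple" & @LF & "apple" & @LF & "banana" & @LF & "cherry" & @LF & "cherry"

; Requires a newline after every copy, so the final "cherry" pair survives.
Local $sA = StringRegExpReplace($sText, '(?m)^(.+\R)\1+', '')
; Also accepts the end of the text after the last copy.
Local $sB = StringRegExpReplace($sText, '(?m)^(.+?)\R(\1(\R|$))+', '')

; Line breaks are shown as "|".
ConsoleWrite(StringReplace($sA, @LF, "|") & @CRLF) ; banana|cherry|cherry
ConsoleWrite(StringReplace($sB, @LF, "|") & @CRLF) ; banana|

; Sample 2 (also an assumption): no duplicates, but "b" is the tail of "ab".
Local $sTail = "ab" & @LF & "b" & @LF
; Unanchored: the match may start in the middle of a line.
ConsoleWrite(StringReplace(StringRegExpReplace($sTail, '(?s)(.+?)\R(\1(\R|$))+', ''), @LF, "|") & @CRLF)   ; a
; Anchored: only whole lines are compared, so nothing is removed.
ConsoleWrite(StringReplace(StringRegExpReplace($sTail, '(?m)^(.+?)\R(\1(\R|$))+', ''), @LF, "|") & @CRLF) ; ab|b|
```

On this second sample the anchored variant appears to be the safer choice as well as the faster one, since the unanchored pattern treats the tail of one line and the following line as a duplicate pair. The two patterns giving the same result on the thread's test file does not rule this case out for other inputs.
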
Factfinder Posted March 20, 2021

Thanks for the confirmation and for testing the speed. I was curious about the speed.