kosamja, posted August 23, 2017 (edited September 4, 2017 by kosamja)

Hi, I hope someone can help me with my problem. I am trying to:

1) read the content of a text file
2) fix the formatting, which means:
   a) remove empty lines at the beginning of the file and spaces at the beginning of lines
   b) replace multiple empty lines between paragraphs with one empty line, and multiple spaces inside a line with one space
   c) if, after that, the first character of a line is lowercase and the previous line is not empty, merge the line with the previous line; otherwise keep the line unchanged
3) convert letters (between Cyrillic and Latin)
4) remove duplicate lines and write the result to RTF

But it is currently slow for bigger text files (1 MB+). Is there any chance to make it faster with RegExp? Thanks.

Example: this

```autoit
$Cyrillic = 'љ|Љ|њ|Њ|џ|Џ| a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к| К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С| т|Т|ћ|Ћ |у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'
```

should be changed to

```autoit
$Cyrillic = 'љ|Љ|њ|Њ|џ|Џ|a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к| К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С| т|Т|ћ|Ћ |у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'
```

My current script:

```autoit
#NoTrayIcon
#RequireAdmin
#include <File.au3>
#include <Constants.au3>
#include <GUIConstants.au3>
#include <WinAPI.au3>
#include <Array.au3>

Opt("WinWaitDelay", 0)
Opt("MouseClickDelay", 0)
Opt("MouseClickDownDelay", 0)
Opt("MouseClickDragDelay", 0)
Opt("SendKeyDelay", 0)
Opt("SendKeyDownDelay", 0)
Opt("WinTitleMatchMode", 3)

FileChangeDir(StringRegExpReplace(@ScriptDir, '\\+$', ''))

; target script of the conversion
Global $Convert = 'Cyrillic'
;$Convert = 'Latin'

; the two lists are parallel: entry N of one maps to entry N of the other
Global $Cyrillic = 'љ|Љ|њ|Њ|џ|Џ|a|A|б|Б|в|В|г|Г|д|Д|ђ|Ђ|е|Е|ж|Ж|з|З|и|И|ј|Ј|к|К|л|Л|м|М|н|Н|о|О|п|П|р|Р|с|С|т|Т|ћ|Ћ|у|У|ф|Ф|х|Х|ц|Ц|ч|Ч|ш|Ш'
Global $Latin = 'lj|Lj|nj|Nj|dž|Dž|a|A|b|B|v|V|g|G|d|D|đ|Đ|e|E|ž|Ž|z|Z|i|I|j|J|k|K|l|L|m|M|n|N|o|O|p|P|r|R|s|S|t|T|ć|Ć|u|U|f|F|h|H|c|C|č|Č|š|Š'
Global $CyrillicCharList = StringSplit($Cyrillic, '|')
Global $LatinCharList = StringSplit($Latin, '|')

; the path of the txt file is passed as the first command-line argument
_Convert($CmdLine[1])

Func _Convert($sPath)
    Local $sConvertedText = _FormattingFix(FileRead($sPath))
    $sConvertedText = _Transliterate($sConvertedText, $Convert)
    Return $sConvertedText
EndFunc

Func _Transliterate($sText, $sConversion = 'Latin')
    ; normalize the digraphs before the per-character pass
    $sText = StringReplace($sText, 'dz', 'dž', 0, $STR_CASESENSE)
    $sText = StringReplace($sText, 'Dz', 'Dž', 0, $STR_CASESENSE)
    $sText = StringReplace($sText, 'DZ', 'Dž', 0, $STR_CASESENSE)
    $sText = StringReplace($sText, 'DŽ', 'Dž', 0, $STR_CASESENSE)
    $sText = StringReplace($sText, 'LJ', 'Lj', 0, $STR_CASESENSE)
    $sText = StringReplace($sText, 'NJ', 'Nj', 0, $STR_CASESENSE)
    ; element [0] of a StringSplit result holds the number of entries (60 here)
    For $i = 1 To $CyrillicCharList[0]
        If $sConversion = 'Latin' Then
            $sText = StringReplace($sText, $CyrillicCharList[$i], $LatinCharList[$i], 0, $STR_CASESENSE)
        Else
            $sText = StringReplace($sText, $LatinCharList[$i], $CyrillicCharList[$i], 0, $STR_CASESENSE)
        EndIf
    Next
    Return $sText
EndFunc

Func _FormattingFix($sText)
    Local $sFixedText = ''
    Local $IsFirstNonWhitespaceLineFound = False
    Local $sLines = StringSplit($sText, @LF)
    Local $sString, $sFirstChar
    For $i = 1 To $sLines[0]
        ; trim the line and collapse runs of spaces inside it
        $sString = StringStripWS(StringStripCR($sLines[$i]), $STR_STRIPLEADING + $STR_STRIPTRAILING + $STR_STRIPSPACES)
        $sFirstChar = StringLeft($sString, 1)
        Select
            Case $IsFirstNonWhitespaceLineFound = False And Not StringIsSpace($sFirstChar)
                $sFixedText = $sString
                $IsFirstNonWhitespaceLineFound = True
            Case StringIsUpper($sFirstChar)
                ; one line break if the previous line also starts with an uppercase letter, otherwise two
                If StringIsUpper(StringLeft(StringStripWS(StringStripCR($sLines[$i - 1]), $STR_STRIPLEADING + $STR_STRIPTRAILING), 1)) Then
                    $sFixedText = $sFixedText & @CRLF & $sString
                Else
                    $sFixedText = $sFixedText & @CRLF & @CRLF & $sString
                EndIf
            Case StringIsLower($sFirstChar)
                ; a lowercase first character means the line continues the previous one
                $sFixedText = $sFixedText & ' ' & $sString
            Case StringIsSpace($sFirstChar)
                ;ignore empty lines
            Case Else
                $sFixedText = $sFixedText & @CRLF & $sString
        EndSelect
    Next
    Local $sAppendAtEnd = @CRLF
    If StringIsSpace(StringStripCR($sLines[$sLines[0]])) Then $sAppendAtEnd = @CRLF & @CRLF
    Return $sFixedText & $sAppendAtEnd
EndFunc
```
jguinch, posted August 23, 2017

For 2), here is one way:

```autoit
; remove empty lines at beginning of file and spaces at beginning of lines
$newString = StringRegExpReplace($string, "^\R+\h*|\R\K\h+", "")

; replace multiple empty lines between paragraphs with one line and multiple spaces inside of line with one space
$newString = StringRegExpReplace($newString, "\R{2}\K\R+|\h\K\h+", "")

; if after that first character in line is lowercase and previous line is not empty then merge it with previous line, otherwise keep line unchanged
$newString = StringRegExpReplace($newString, "\V+\K\R(?=[[:lower:]])", "")
```
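To try the three replacements above on a small input, they can be wrapped in one function. This is only a sketch: the name `_FormattingFixRegExp` and the sample text are made up for illustration, while the patterns are copied unchanged from the post above.

```autoit
; Hypothetical wrapper (the function name is an assumption, not from the thread).
Func _FormattingFixRegExp($sText)
    ; a) drop line breaks at the very start, and leading horizontal whitespace after each line break
    $sText = StringRegExpReplace($sText, "^\R+\h*|\R\K\h+", "")
    ; b) keep at most two consecutive line breaks, and at most one consecutive space
    $sText = StringRegExpReplace($sText, "\R{2}\K\R+|\h\K\h+", "")
    ; c) remove the line break between a non-empty line and a line starting with a lowercase letter
    $sText = StringRegExpReplace($sText, "\V+\K\R(?=[[:lower:]])", "")
    Return $sText
EndFunc

; Made-up sample: leading empty lines, repeated spaces, an indented line,
; several empty lines in a row and a continuation line starting in lowercase.
Local $sSample = @CRLF & @CRLF & "First   line" & @CRLF & "   Second line" & @CRLF & _
        @CRLF & @CRLF & @CRLF & "Third line" & @CRLF & "continues here"
ConsoleWrite(_FormattingFixRegExp($sSample) & @CRLF)
```

Two things may be worth checking against real files. The loop in the first post joins a continuation line with a space (`' ' & $sString`), so using `" "` instead of `""` as the replacement in step c) would mirror that behaviour more closely. Also, as far as I know `[[:lower:]]` only covers ASCII letters unless the pattern starts with `(*UCP)`, which could matter for Cyrillic text.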
kosamja (author), posted August 23, 2017 (edited August 23, 2017 by kosamja)

Hi jguinch, thanks for answering, it works perfectly. I have two more questions:

1) To remove duplicate lines with AutoIt, do I need to use _ArrayUnique?
2) What would be the RegExp version of this?

```autoit
$sLines = StringSplit($sText, @LF)
$sFixedText = ''
For $i = 1 to $sLines[0]
    If not StringIsSpace(StringStripCR($sLines[$i])) Then
        $sAppendBetween = ''
        If StringIsSpace(StringStripCR($sLines[$i-1])) Then $sAppendBetween = '\line '
        $sFixedText = $sFixedText & '{' & $sAppendBetween & '\pard \fs24 \ql \f0 \li0 \fi0 ' & StringStripCR($sLines[$i]) & '\par}' & @CRLF
    EndIf
Next
```

a) if a line is not empty, replace it with {\pard \fs24 \ql \f0 \li0 \fi0 (content of line) \par}, that is, add {\pard \fs24 \ql \f0 \li0 \fi0 at the beginning of each non-empty line and \par} at the end of each non-empty line
b) if a line is empty, replace it with {\line \pard \fs24 \ql \f0 \li0 \fi0 \par}
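On question 1, `_ArrayUnique` from Array.au3 is one option. Another approach that is often used for large inputs is a `Scripting.Dictionary` COM object acting as a set of lines already seen. The sketch below assumes that approach; the helper name `_RemoveDuplicateLines` is hypothetical, and keeping empty lines untouched is a design choice made here so that paragraph separators survive.

```autoit
; Hypothetical helper (not from the thread): keeps the first occurrence of each
; non-empty line and drops later repeats. Empty lines are always kept.
Func _RemoveDuplicateLines($sText)
    ; Scripting.Dictionary is a Windows COM object, used here as a set of seen lines
    Local $oSeen = ObjCreate("Scripting.Dictionary")
    Local $aLines = StringSplit(StringStripCR($sText), @LF)
    Local $sOut = ""
    For $i = 1 To $aLines[0]
        If StringLen($aLines[$i]) > 0 Then
            ; skip a line that was already written once
            If $oSeen.Exists($aLines[$i]) Then ContinueLoop
            $oSeen.Add($aLines[$i], True)
        EndIf
        ; every kept line is written with a trailing @CRLF, including the last one
        $sOut &= $aLines[$i] & @CRLF
    Next
    Return $sOut
EndFunc

; Made-up sample: the second "alpha" is the only duplicate
Local $sSample = "alpha" & @CRLF & "beta" & @CRLF & @CRLF & "alpha" & @CRLF & "gamma"
ConsoleWrite(_RemoveDuplicateLines($sSample))
```

Removing a duplicate that sits between two empty lines leaves those empty lines next to each other, so it may make sense to run the deduplication before the formatting fix rather than after it.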
kosamja (author), posted August 24, 2017 (edited August 24, 2017 by kosamja)

Is this the correct way to do it?

```autoit
;insert at begin of non empty lines
$newString = StringRegExpReplace($newString, "(?m)(.+)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \0")

;insert at end of non empty lines
$newString = StringRegExpReplace($newString, "(?m)(\R+)"," \\par}\0")

;insert at empty lines
$newString = StringRegExpReplace($newString, "(?m)(^\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\par}\0")
```

One more question: how do I remove spaces from the end of each line with RegExp? Is this the correct way to do it?

```autoit
$newString = StringRegExpReplace($newString, "(?m)^[ \t]+|[ \t]+(\R)","\1")
```
jguinch, posted August 24, 2017

```autoit
;remove spaces from end of each line
$newString = StringRegExpReplace($newString, "\h+(?=\R)","")

;insert at begin of non empty lines
$newString = StringRegExpReplace($newString, "(?:^|\R)\K(?!\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\0")

;insert at end of non empty lines
$newString = StringRegExpReplace($newString, "\N+\K"," \\par}\\0")

;insert at empty lines
$newString = StringRegExpReplace($newString, "(?:^|\R)\K(?=\R)","{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 \\par}\\0")

ConsoleWrite($newString)
```
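The four replacements above can be collected into one function for testing on a short string. This is a sketch under stated assumptions: the name `_LinesToRtfParagraphs` and the sample text are made up, and the trailing `\\0` of the original replacement strings is left out, on the assumption that a doubled backslash in a replacement string yields a literal backslash, so `\\0` would add the two characters `\0` to the output rather than a back-reference.

```autoit
; Hypothetical helper (not from the thread) applying the four replacements in order.
Func _LinesToRtfParagraphs($sText)
    ; "\\" in a replacement string is assumed to produce one literal backslash
    Local Const $sOpen = "{\\pard \\fs24 \\ql \\f0 \\li0 \\fi0 "
    ; remove spaces from the end of each line
    $sText = StringRegExpReplace($sText, "\h+(?=\R)", "")
    ; insert the opening group at the beginning of non-empty lines
    $sText = StringRegExpReplace($sText, "(?:^|\R)\K(?!\R)", $sOpen)
    ; insert the closing \par} at the end of non-empty lines
    $sText = StringRegExpReplace($sText, "\N+\K", " \\par}")
    ; turn each empty line into an empty paragraph group
    $sText = StringRegExpReplace($sText, "(?:^|\R)\K(?=\R)", $sOpen & "\\par}")
    Return $sText
EndFunc

; Made-up sample: two paragraphs separated by one empty line, no trailing line break
Local $sSample = "First paragraph  " & @CRLF & @CRLF & "Second paragraph"
ConsoleWrite(_LinesToRtfParagraphs($sSample) & @CRLF)
```

The function only produces the paragraph groups. A complete RTF file would still need its header and closing brace around this output, and text ending in a line break may get one extra opening group at the very end, which is worth checking on a real file.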