phatzilla, posted October 23, 2014

Hey y'all,

I'm in a bit of a pickle here. I have a text file, file1.txt, with a few hundred lines, and some of these lines are exact duplicates. What I'm trying to do is pick out the duplicate lines, sort them by number of occurrences, and write the 10 most frequently seen duplicate lines to another file. So the "top 10" duplicate lines from file1.txt would be output to file2.txt.

I've looked around at the _Array functions, but I only see a way to remove duplicates and make unique arrays, which isn't exactly what I need...

Any help?

Luigi, posted October 23, 2014 (edited October 23, 2014 by Detefon)

Hi @phatzilla, please post an example of the file. I have an idea.

Br, Detefon

phatzilla (Author), posted October 23, 2014

Here it is, a list of Twitter trends:

    #SheNeverLeft
    #1Dbigannouncement
    "Diwali in India"
    "Between Two Ferns"
    #OnTheRoadAgain1D
    "Jeanie Buss"
    #OttawaShooting
    #FastFoodSlogans
    "Darnell Coles"
    "Avengers 2"
    #1Dbigannouncement
    #MostClutch
    "Between Two Ferns"
    #OnTheRoadAgainTour
    "Happy Mole Day"
    #OTRATour
    #AYASummit
    "Jeanie Buss"
    "Avengers 2"
    "Lisa Ann"
    #StartSitESPN
    "Happy Mole Day"
    #OttawaShooting
    "Between Two Ferns"
    "Jeanie Buss"
    #BryantAndNashNewVideo
    #LameApocalypses
    "Kevin Vickers"
    #poptech
    "Avengers 2"
    "Happy Mole Day"
    #LameApocalypses
    #OttawaShooting
    #AvengersAgeOfUltron
    #AYASummit
    #Engage2014
    Halloween
    Canada
    Christmas
    Scorpio
    "Happy Mole Day"
    #AvengersAgeOfUltron
    #OttawaShooting
    #LameApocalypses
    #indysm
    #HappyBirthdayGrandpaGrande
    "Avengers 2"
    Halloween
    Canada
    Christmas
    #HappyBirthdayGrandpaGrande
    "Happy Mole Day"
    #LameApocalypses
    #OttawaShooting
    #AvengersAgeOfUltron
    #StealMyGIF
    "Avengers 2"
    Canada
    Halloween
    Christmas
    #LameApocalypses
    "Happy Mole Day"
    #AvengersAgeOfUltron
    #ICryAtRavesWhen
    #OttawaShooting
    #AgeofUltron
    "Avengers 2"
    Canada
    Halloween
    "White House"
    #ICryAtRavesWhen
    #LameApocalypses
    #AvengersAgeOfUltron
    #PandaFunkFamily
    #OttawaShooting
    "Happy Diwali"
    "Avengers 2"
    Halloween
    "Frank Ocean"
    Canada
    #ICryAtRavesWhen
    #PandaFunkFamily
    #LameApocalypses
    #AvengersAgeOfUltron
    #ZachGrandtourage
    "Happy Diwali"
    Ottawa
    "Kim Possible"
    "Lizzie McGuire"
    Halloween
    #ICryAtRavesWhen
    #LameApocalypses
    #AvengersAgeOfUltron
    #AgeofUltron
    "Happy Diwali"
    #Paperwork
    "Even Stevens"
    Halloween
    "Jessica Lange"
    "That's So Raven"
    #ICryAtRavesWhen
    #LameApocalypses
    #AvengersAgeOfUltron
    #AgeofUltron
    "Thinking About You - Frank Ocean"
    "Gods & Monsters"
    "Edward Mordrake"
    "Happy Diwali"
    Halloween
    "Jessica Lange"
    #LameApocalypses
    #AvengersAgeOfUltron
    #ICryAtRavesWhen
    "Lurie Poston"
    #thankyouvessel
    Viscant
    #OttawaShooting
    "Happy Diwali"
    "Lizzie McGuire"
    "Gods and Monsters"
    #LameApocalypses
    #thankyouvessel
    #AvengersAgeOfUltron
    #AgeofUltron
    #WorldSeriesGame2
    "Gods & Monsters"
    "S Club 7"
    "Mark Jackson"
    "Happy Diwali"
    "Legally Blonde"
    #LameApocalypses
    #WorldSeriesGame2
    #AvengersAgeOfUltron
    #thankyouvessel
    #DontAskBeau
    "Nick Swisher"
    "Zach Mettenberger"
    "Teaser Trail"
    PrincAss
    "Edward Mordrake"
    #WorldSeriesGame2
    #AskBeau
    Strickland
    #VoightsRage
    #tiannaQA
    #AgeofUltron
    Dora
    Patti
    "Zach Mettenberger"
    "Teaser Trail"
    #ReplaceAnAnimeTitleWithAss
    "One in 5,000"
    #CrawfordsNewVideo
    #100Things
    #AvengersAgeOfUltron
    "My Cinnamon Twist"
    #WorldSeriesGame2
    "Thinking About You - Frank Ocean"
    Kunitz
    "Paranormal Activity 3"
    #ReplaceAnAnimeTitleWithAss
    #BabyDaddyChat
    #WorldSeriesGame2
    Ultron
    #OttawaShooting
    #NYGovDebate
    "Joe Torre"
    "Key & Peele"
    "Watching Casper"
    "Oliver and Thea"
    #willmakesushappy
    #ReplaceAnAnimeTitleWithAss
    #AvengersAgeOfUltron
    #ignitethegrind
    #AskSierraDallas
    "James Spader"
    "White House"
    Drumline
    Canada
    "Young Thug"
    #BryantAndNashNewVideo
    #ASKLOHANTHONY
    #OttawaShooting
    #Z100Rules
    "Nathan Cirillo"
    "Michael Zehaf-Bibeau"
    Canada
    Halloween
    Drumline
    "Jersey Shore"
    #BryantAndNashNewVideo
    #ListenToGhostOnYouTube
    #OttawaShooting
    #5SOSAmnesiaLyrics
    #BigTimeLyrics
    "Nathan Cirillo"
    "Michael Zehaf-Bibeau"
    "Ben Bradlee"
    Makonnen
    Inbox
    #5SOSAmnesiaLyrics
    #ANDvAFC
    #OttawaShooting
    #yesboo
    #SELFIEFORSEB
    "Liverpool 0-3 Real Madrid"
    Poldi
    Podolski
    "WHY IS FOOD SO GOOD"
    Olympiacos
    #AskZachAttack
    #LiverpoolVsRealMadrid
    #OttawaShooting
    #BryantAndNashNewVideo
    #IfICouldTimeTravel
    Reus
    "Google Inbox"
    Coutinho
    "Ben Bradlee"
    Mignolet
    #AskZachAttack
    #LiverpoolVsRealMadrid
    #OttawaShooting
    "You'll Never Walk Alone"
    #IfICouldTimeTravel
    #BryantAndNashNewVideo
    "David J. Stern Sports Scholarship"
    "Google Inbox"
    Reds
    Parliament
    #PrayForOttawa
    #OttawaShooting
    #IfICouldTimeTravel
    #StaySafeOttawa
    "Canadian Parliament"
    #twitterflight
    "Friday After Next"
    "S Club 7"
    "Google Inbox"
    Crowder

MikahS, posted October 23, 2014 (edited October 23, 2014)

How about putting each line into an array element, then making a copy of that array and using _ArrayUnique on the copy to filter out the duplicates? Then, in a For loop, use StringReplace to search the original array (converted with _ArrayToString and a delimiter) for each unique line; the number of exact matches it found is available in the @extended macro.

You can make a 2D array and put the string you searched for in the first column and the @extended count in the second, like so: Local $array[1][2] = [['sample string', 5]]

Please ask if you have any questions. Give it a try and post your script, and I'm sure all of us would be happy to help.

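The approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the file name file1.txt comes from the first post, while the variable names and the @LF wrapping are not from the thread. The wrapping is an added precaution so that StringReplace only counts whole lines.

```autoit
#include <Array.au3>

; Sketch only: the file name and all variable names are assumptions.
Local $aLines = FileReadToArray("file1.txt")
If @error Then Exit MsgBox(16, "Error", "Could not read file1.txt")

; Join the lines so that every line sits between its own pair of @LF
; characters. Searching for @LF & line & @LF then matches whole lines only
; (a bare "Ottawa" would otherwise also be found inside "#OttawaShooting").
Local $sAll = @LF & _ArrayToString($aLines, @LF & @LF) & @LF

; Unique lines: column 0, base 0, case-sensitive (1), no count element (0).
Local $aUnique = _ArrayUnique($aLines, 0, 0, 1, 0)

; Column 0 holds the line, column 1 its number of occurrences.
Local $aCounts[UBound($aUnique)][2]
For $i = 0 To UBound($aUnique) - 1
    $aCounts[$i][0] = $aUnique[$i]
    $aCounts[$i][1] = 0
    If $aUnique[$i] == "" Then ContinueLoop ; skip blank lines
    ; The return value is not needed; only @extended (replacement count) is.
    StringReplace($sAll, @LF & $aUnique[$i] & @LF, "", 0, 1)
    $aCounts[$i][1] = @extended
Next

_ArrayDisplay($aCounts, "Occurrences per line")
```

Each line gets its own pair of delimiters because StringReplace does not count overlapping matches; with a single shared @LF between two identical neighbouring lines, only one of them would be counted.
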
phatzilla (Author), posted October 23, 2014 (edited October 23, 2014)

MikahS,

Thanks for the help. I followed what you suggested and I am getting the proper info from my For loop. Right now I'm just stuck on making the 2D array and then taking only the top 10 most seen occurrences...

MikahS, posted October 23, 2014

Show me what you have accomplished and I'd be happy to take a look.

phatzilla (Author), posted October 23, 2014 (edited October 23, 2014)

To keep it short, what I have is basically:

    for $l = 1 to ubound($unique_array) - 1
        global $replace = StringReplace($trend_array_string,$unique_array[$l],"ReplacementString")
        global $numreplacements = @extended
        ConsoleWrite("The number of replacements done was : " & $unique_array[$l] & " : " & $numreplacements & @CRLF)
    Next

And the output:

    The number of replacements done was : #SheNeverLeft : 2
    The number of replacements done was : #LL2014 : 1
    The number of replacements done was : "Between Two Ferns" : 4
    The number of replacements done was : #askgrizfolk : 1
    The number of replacements done was : #LameApocalypses : 13
    The number of replacements done was : "Gold Glove" : 1
    The number of replacements done was : "Diwali in India" : 2
    The number of replacements done was : "Jeanie Buss" : 4
    The number of replacements done was : #1Dbigannouncement : 3
    The number of replacements done was : "Avengers 2" : 8
    The number of replacements done was : #OnTheRoadAgain1D : 1
    The number of replacements done was : #OttawaShooting : 14
    The number of replacements done was : #FastFoodSlogans : 1
    The number of replacements done was : "Darnell Coles" : 1
    The number of replacements done was : #MostClutch : 1
    The number of replacements done was : #OnTheRoadAgainTour : 1
    The number of replacements done was : "Happy Mole Day" : 6
    The number of replacements done was : #OTRATour : 1
    The number of replacements done was : #AYASummit : 2
    The number of replacements done was : "Lisa Ann" : 1
    The number of replacements done was : #StartSitESPN : 1
    The number of replacements done was : #BryantAndNashNewVideo : 5
    The number of replacements done was : "Kevin Vickers" : 1
    The number of replacements done was : #poptech : 1
    The number of replacements done was : #AvengersAgeOfUltron : 13
    The number of replacements done was : #Engage2014 : 1
    The number of replacements done was : Halloween : 9
    The number of replacements done was : Canada : 7
    The number of replacements done was : Christmas : 3
    The number of replacements done was : Scorpio : 1
    The number of replacements done was : #indysm : 1
    The number of replacements done was : #HappyBirthdayGrandpaGrande : 2
    The number of replacements done was : #StealMyGIF : 1
    The number of replacements done was : #ICryAtRavesWhen : 6
    The number of replacements done was : #AgeofUltron : 5
    The number of replacements done was : "White House" : 2
    The number of replacements done was : #PandaFunkFamily : 2
    The number of replacements done was : "Happy Diwali" : 6
    The number of replacements done was : "Frank Ocean" : 1
    The number of replacements done was : #ZachGrandtourage : 1
    The number of replacements done was : Ottawa : 15
    The number of replacements done was : "Kim Possible" : 1
    The number of replacements done was : "Lizzie McGuire" : 2
    The number of replacements done was : #Paperwork : 1
    The number of replacements done was : "Even Stevens" : 1
    The number of replacements done was : "Jessica Lange" : 2
    The number of replacements done was : "That's So Raven" : 1
    The number of replacements done was : "Thinking About You - Frank Ocean" : 2
    The number of replacements done was : "Gods & Monsters" : 2
    The number of replacements done was : "Edward Mordrake" : 2
    The number of replacements done was : "Lurie Poston" : 1
    The number of replacements done was : #thankyouvessel : 3
    The number of replacements done was : Viscant : 1
    The number of replacements done was : "Gods and Monsters" : 1
    The number of replacements done was : #WorldSeriesGame2 : 5
    The number of replacements done was : "S Club 7" : 1
    The number of replacements done was : "Mark Jackson" : 1
    The number of replacements done was : "Legally Blonde" : 1
    The number of replacements done was : #DontAskBeau : 1
    The number of replacements done was : "Nick Swisher" : 1
    The number of replacements done was : "Zach Mettenberger" : 2
    The number of replacements done was : "Teaser Trail" : 2
    The number of replacements done was : PrincAss : 1
    The number of replacements done was : #AskBeau : 1
    The number of replacements done was : Strickland : 1
    The number of replacements done was : #VoightsRage : 1
    The number of replacements done was : #tiannaQA : 1
    The number of replacements done was : Dora : 1
    The number of replacements done was : Patti : 1
    The number of replacements done was : #ReplaceAnAnimeTitleWithAss : 3
    The number of replacements done was : "One in 5,000" : 1
    The number of replacements done was : #CrawfordsNewVideo : 1
    The number of replacements done was : #100Things : 1
    The number of replacements done was : "My Cinnamon Twist" : 1
    The number of replacements done was : Kunitz : 1
    The number of replacements done was : "Paranormal Activity 3" : 1
    The number of replacements done was : #BabyDaddyChat : 1
    The number of replacements done was : Ultron : 19
    The number of replacements done was : #NYGovDebate : 1
    The number of replacements done was : "Joe Torre" : 1
    The number of replacements done was : "Key & Peele" : 1
    The number of replacements done was : "Watching Casper" : 1
    The number of replacements done was : "Oliver and Thea" : 1
    The number of replacements done was : #willmakesushappy : 1
    The number of replacements done was : #ignitethegrind : 1
    The number of replacements done was : #AskSierraDallas : 1
    The number of replacements done was : "James Spader" : 1
    The number of replacements done was : Drumline : 2
    The number of replacements done was : "Young Thug" : 1
    The number of replacements done was : #ASKLOHANTHONY : 1
    The number of replacements done was : #Z100Rules : 1
    The number of replacements done was : "Nathan Cirillo" : 2
    The number of replacements done was : "Michael Zehaf-Bibeau" : 2
    The number of replacements done was : "Jersey Shore" : 1
    The number of replacements done was : #ListenToGhostOnYouTube : 1
    The number of replacements done was : #5SOSAmnesiaLyrics : 2
    The number of replacements done was : #BigTimeLyrics : 1
    The number of replacements done was : "Ben Bradlee" : 2
    The number of replacements done was : Makonnen : 1
    The number of replacements done was : Inbox : 3
    The number of replacements done was : #ANDvAFC : 1
    The number of replacements done was : #yesboo : 1
    The number of replacements done was : #SELFIEFORSEB : 1
    The number of replacements done was : "Liverpool 0-3 Real Madrid" : 1
    The number of replacements done was : Poldi : 1
    The number of replacements done was : Podolski : 1
    The number of replacements done was : "WHY IS FOOD SO GOOD" : 1
    The number of replacements done was : Olympiacos : 1
    The number of replacements done was : #AskZachAttack : 2
    The number of replacements done was : #LiverpoolVsRealMadrid : 2
    The number of replacements done was : #IfICouldTimeTravel : 2
    The number of replacements done was : Reus : 1
    The number of replacements done was : "Google Inbox" : 2
    The number of replacements done was : Coutinho : 1
    The number of replacements done was : Mignolet : 1
    The number of replacements done was : "You'll Never Walk Alone" : 1
    The number of replacements done was : "David J. Stern Sports Scholarship" : 1
    The number of replacements done was : Reds : 1
    The number of replacements done was : Parliament : 1

So now I have the unique list, with the corresponding number of occurrences. How would I extract the top X lines?

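One possible way to take the top X rows, shown here only as a sketch: put the pairs into a 2D array, sort it in descending order on the count column, and write the first X lines to file2.txt. The array name $aCounts, the variable $iTop, and the five sample rows (copied from the output above) are assumptions for illustration, not part of anyone's posted script.

```autoit
#include <Array.au3>

; Hypothetical sample standing in for the [line, count] pairs built earlier.
Local $aCounts[5][2] = [["#OttawaShooting", 14], ["#LL2014", 1], ["Halloween", 9], _
        ["#LameApocalypses", 13], ["Canada", 7]]
Local $iTop = 3 ; use 10 for the "top 10"

; Sort descending (1) over the whole array (0, 0) on column 1, the counts.
_ArraySort($aCounts, 1, 0, 0, 1)

; Mode 2 opens the file for writing and erases any previous content.
Local $hFile = FileOpen("file2.txt", 2)
If $hFile = -1 Then Exit MsgBox(16, "Error", "Could not open file2.txt")

For $i = 0 To UBound($aCounts) - 1
    If $i >= $iTop Then ExitLoop            ; already wrote the top X
    If $aCounts[$i][1] < 2 Then ExitLoop    ; remaining rows are not duplicates
    FileWriteLine($hFile, $aCounts[$i][0])
Next
FileClose($hFile)
```

With this sample, file2.txt would contain #OttawaShooting, #LameApocalypses and Halloween, one per line. Stopping at counts below 2 keeps lines that occur only once out of the result even when there are fewer than X real duplicates.
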
kylomas, posted October 23, 2014 (edited October 23, 2014)

phatzilla,

You wrote: "To keep it short"

No need to keep it short; show your whole script. The script that you posted cannot work. The solution that MikahS posted is about 13 lines long...

kylomas

edit: comment struck out

edit2: About the code you posted: you shouldn't declare variables in a loop, and it is not necessary to assign the result to a variable to get StringReplace to set @extended.

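Those two remarks could look like this when applied to the loop in question. It is a minimal sketch: the variable names mirror the posted snippet, but the two-item sample data is made up so that the fragment runs on its own.

```autoit
; Hypothetical sample data so the fragment is self-contained.
Local $trend_array_string = "Canada" & @CRLF & "Halloween" & @CRLF & "Canada"
; Element 0 holds the item count, which is why the loop starts at 1.
Local $unique_array[3] = [2, "Canada", "Halloween"]

; Declare the variable once, before the loop.
Local $numreplacements

For $l = 1 To UBound($unique_array) - 1
    ; The return value can be discarded; @extended is set either way.
    StringReplace($trend_array_string, $unique_array[$l], "")
    $numreplacements = @extended
    ConsoleWrite($unique_array[$l] & " : " & $numreplacements & @CRLF)
Next
```

This prints "Canada : 2" and "Halloween : 1" to the console.
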
Spider001, posted October 23, 2014

test.txt is a file with your data from post #3.

    #include <array.au3>

    $data = FileReadToArray('test.txt')
    _ArraySort($data)
    _ArrayDisplay($data)

    $search = ''
    Global $dup[0]

    ; walk the sorted array backwards; repeated neighbours are moved to $dup
    For $i = UBound($data) - 1 To 0 Step -1
        If $data[$i] <> $search Then
            $search = $data[$i]
        Else
            _ArrayAdd($dup, $data[$i])
            _ArrayDelete($data, $i)
        EndIf
    Next

    _ArraySort($dup)
    _ArrayDisplay($dup)
    _ArrayDisplay($data)

kylomas, posted October 23, 2014

Spider001,

How does that answer the OP?

jguinch, posted October 23, 2014 (edited October 24, 2014)

Here is a way:

    #Include <Array.au3>

    Local $iCount
    Local $sData = FileRead("data.txt")

    ; Uniq lines
    Local $aDuplicates = StringRegExp($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)
    Local $aResult[ UBound($aDuplicates)][2]

    For $i = 0 To UBound($aDuplicates) - 1
        $iCount = UBound( StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?=\R|\Z)", 3) )
        $aResult[$i][0] = $aDuplicates[$i]
        $aResult[$i][1] = $iCount
    Next

    _ArraySort($aResult, 1, 0, 0, 1)

    ; Delete uniq rows ########################
    For $i = UBound($aResult) - 1 To 0 Step -1
        If $aResult[$i][1] > 1 Then ExitLoop
    Next
    Redim $aResult[$i + 1][2]
    ; #########################################

    _ArrayDisplay($aResult)

Gianni, posted October 23, 2014 (edited October 24, 2014 by Chimp)

Another way:

    #include <array.au3>

    Global $aData = FileReadToArray('File1.txt'), $aDuplicates[0][2], $iIndex = 0
    _ArrayInsert($aData, 0) ; make array 1-based
    _ArraySort($aData) ; sort data in input
    For $i = 1 To UBound($aData) - 1 ; loop all elements of sorted data
        $aCount = _ArrayFindAll($aData, $aData[$i]) ; for each element count how many there are
        $nCount = UBound($aCount)
        If $nCount > 1 Then ; if there are more than 1
            ReDim $aDuplicates[UBound($aDuplicates) + 1][2] ; make room in output array
            $aDuplicates[$iIndex][0] = $aData[$i] ; insert its value in output array
            $aDuplicates[$iIndex][1] = $nCount ; and how many there are
            $iIndex += 1 ; point to next free output element
            $i += $nCount - 1 ; skip the remaining same elements
        EndIf
    Next
    If UBound($aDuplicates) Then
        _ArraySort($aDuplicates, 1, 0, 0, 1)
        _ArrayDisplay($aDuplicates)
    Else
        MsgBox(0, "Result", "There are no duplicates")
    EndIf

edit: removed the previous listing, which had a bug; added comments.

Gianni, posted October 23, 2014

jguinch wrote:

    Here is a way:

    #Include <Array.au3>

    Local $iCount
    Local $sData = FileRead("data.txt")

    ; Uniq lines
    Local $aDuplicates = StringRegExp($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)
    Local $aResult[ UBound($aDuplicates)][2]

    For $i = 0 To UBound($aDuplicates) - 1
        $iCount = UBound( StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?:\R|\Z)", 3) )
        $aResult[$i][0] = $aDuplicates[$i]
        $aResult[$i][1] = $iCount
    Next

    _ArraySort($aResult, 1, 0, 0, 1)

    ; Delete uniq rows ########################
    For $i = UBound($aResult) - 1 To 0 Step -1
        If $aResult[$i][1] > 1 Then ExitLoop
    Next
    Redim $aResult[$i + 1][2]
    ; #########################################

    _ArrayDisplay($aResult)

If the input file contains only the same value repeated several times, your function fails. Try it with an input file like this, for example:

    one
    one

or like this:

    123
    123
    123
    123

kylomas, posted October 24, 2014 (edited October 24, 2014)

This seems to work...

    #include <array.au3>

    local $str = fileread(@desktopdir & '\test2.txt') ; create string var from file
    $str = stringregexpreplace($str,'(.+)(\R|$)','`\1`' & @crlf) ; delimit string for stringreplace (else "aa" would match "aaa" and "aaaa")
    local $aStr1 = stringsplit($str,@crlf,3) ; create array from string var
    $aStr1 = _arrayunique($aStr1,0,0,0,0) ; eliminate duplicate entries
    local $aStr2[ubound($aStr1)][2] ; create 2D array sized to 1st array
    for $1 = 0 to ubound($aStr1) - 1 ; loop thru array
        stringreplace($str,$aStr1[$1],'') ; get # of occurrences from string
        $aStr2[$1][1] = @extended ; populate count
        $aStr2[$1][0] = stringregexpreplace($aStr1[$1],'`(.*)`','\1') ; populate string
    Next
    _arraysort($aStr2,1,0,0,1) ; sort on count column
    redim $aStr2[10][2] ; cut array down to 10 entries
    _arraydisplay($aStr2) ; voilà

@jguinch - I think I'm going to love your use of regexp, if I ever figure it out...

jguinch, posted October 24, 2014

You are right, Chimp. I edited my code: I replaced (?: with (?= in the 2nd regex (it was an oversight).

Thanks, kylomas.

Here is similar code, but with the non-duplicate lines removed at the beginning:

    #Include <Array.au3>

    Local $iCount
    Local $sData = FileRead("data.txt")

    ; Eliminate non Duplicates
    Local $sDuplicates = StringRegExpReplace($sData, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\R\1)", "")

    ; Duplicates in an Array (uniq rows)
    Local $aDuplicates = StringRegExp($sDuplicates, "(?s)(?:\A|\R)(\N+)(?=\R|\Z)(?!.*\1)", 3)
    Local $aResult[ UBound($aDuplicates)][2]

    For $i = 0 To UBound($aDuplicates) - 1
        $aResult[$i][0] = $aDuplicates[$i]
        $aResult[$i][1] = UBound( StringRegExp($sData, "(?:\A|\R)\Q" & $aDuplicates[$i] & "\E(?=\R|\Z)", 3) )
    Next

    _ArraySort($aResult, 1, 0, 0, 1)
    _ArrayDisplay($aResult)

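Why a lookahead matters in the counting regex can be seen with the two-line "one / one" input reported earlier in the thread. The fragment below is a small sketch for comparison; the variable names are assumptions. A consuming group (?:\R|\Z) swallows the line break after the first match, so the second line no longer has a preceding \R to match against, whereas the lookahead (?=\R|\Z) leaves the line break in place.

```autoit
; Two identical lines, as in the reported failing input.
Local $sData = "one" & @CRLF & "one"

; Consuming group at the end: the line break becomes part of the first match.
Local $aConsuming = StringRegExp($sData, "(?:\A|\R)\Qone\E(?:\R|\Z)", 3)

; Lookahead at the end: the line break stays available for the next match.
Local $aLookahead = StringRegExp($sData, "(?:\A|\R)\Qone\E(?=\R|\Z)", 3)

ConsoleWrite("consuming group: " & UBound($aConsuming) & @CRLF)
ConsoleWrite("lookahead:       " & UBound($aLookahead) & @CRLF)
```

The first count comes out as 1 and the second as 2, which is why the earlier version treated a file of identical lines as containing no duplicates.
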