jcpetu Posted September 8, 2020

Hi people,

I'm trying to get some URLs from a web site, and I do it in two phases. First (thanks to TheXman), I extract all URLs between href=" ". Then I have to loop over each array element returned by StringRegExp in order to reject the URLs I don't need.

I wonder if there is any way to speed this up by using StringRegExp so that those URLs are not returned in the first step. For instance, if possible, in the first phase I would like to get all URLs except those with .png, .ico, .jpeg, .jpg and .css. I was trying to understand how to do it, but StringRegExp is a language in itself. And, if possible, I would also like to use StringRegExp to filter the URLs that reference the same domain, so that I can reduce the If clause.

Thanks a lot in advance.

$Host="mesi.com"
$site = 'class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-suma-cuatro-goles-en-tres-partidos-ante-el-bayern/">SEA SUMA CUATRO GOLES EN TRES PARTIDOS ANTE EL BAYERN</a></div><div class="desc_noticies"><p>Sea Jcpe suma cuatro goles en tres enfrentamientos contra el Bayern de Múnich en la Liga de Campeones: dos en […]</p></div></div></div><div class="post_grid_noticies jcpe_noti_4"><div class="contenidor-zoom-out"><a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-scaled.jpg?v=1596923563 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-300x300.jpg?v=1596923563 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1024x1024.jpg?v=1596923563 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-150x150.jpg?v=1596923563 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-768x768.jpg?v=1596923563 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-1536x1536.jpg?v=1596923563 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-2048x2048.jpg?v=1596923563 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Crónica-Napoli-75x75.jpg?v=1596923563 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/sea-marca-en-la-eliminacion-del-napoli/">SEA JCPE MARCA EN LA CLASIFICACIÓN CONTRA EL NAPOLI</a></div><div class="desc_noticies"><p>Sea Jcpe ha marcado un gol en la victoria del Equipo ante el Napoli por 3-1, que supone la clasificación […]</p></div></div></div><div class="post_grid_noticies jcpe_noti_5"><div class="contenidor-zoom-out"><a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/"><img width="2560" height="2560" src="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556" class="img_grid_notis wp-post-image" alt="" srcset="https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-scaled.jpg?v=1596709556 2560w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-300x300.jpg?v=1596709556 300w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1024x1024.jpg?v=1596709556 1024w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-150x150.jpg?v=1596709556 150w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-768x768.jpg?v=1596709556 768w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-1536x1536.jpg?v=1596709556 1536w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-2048x2048.jpg?v=1596709556 2048w, https://static.jcpe.com/wp-content/uploads/2020/08/Previa-Champions-75x75.jpg?v=1596709556 75w" sizes="(max-width: 2560px) 100vw, 2560px" /></a></div><div class="contingut_noticies"><div class="tags_noticies"></div><div class="titol_noticies"> <a href="https://jcpe.com/el-equipo-a-por-los-cuartos-de-final-de-la-champions/">EL EQUIPO, A POR LOS CUARTOS DE FINAL DE LA CHAMPION...</a></div><div class="desc_noticies"><p>El Equipo buscará este sábado en el Camp Nou la clasificación para los cuartos de final de la Liga de […]</p></div></div></div></div></div></div><div class="mas-noticias mes-noticies"> <a href="noticias">Más noticias'
$aUrl = StringRegExpReplace($site, "(?i)href=[""'](.*?)[""']|\z;", 3)
For $i = 1 To UBound($aUrl) - 1
    If (StringInStr($aUrl[$i], ".com") Or StringInStr($aUrl[$i], "www.")) And _
            (Not StringInStr($aUrl[$i], $Host)) Then ;filter external domains different from $Host=site.com
    ElseIf StringInStr($aUrl[$i], "http") And Not StringInStr($aUrl[$i], $Host) Then ;filter external domains different from $Host=site.com
    ElseIf StringInStr($aUrl[$i], ".png") Or StringInStr($aUrl[$i], ".ico") Or _
            StringInStr($aUrl[$i], ".jpg") Or StringInStr($aUrl[$i], ".jpeg") Or _
            StringInStr($aUrl[$i], ".css") Then ;filter undesired elements
    ElseIf $aUrl[$i] = "" Or _
            $aUrl[$i] = "/" Or _
            $aUrl[$i] = $Host Or _
            $aUrl[$i] = "http://" & $Host Or _
            $aUrl[$i] = "https://" & $Host Or _
            $aUrl[$i] = "http://www." & $Host Or _
            $aUrl[$i] = "https://www." & $Host Or _
            $aUrl[$i] = $Host & "/" Or _
            $aUrl[$i] = "http://" & $Host & "/" Or _
            $aUrl[$i] = "https://" & $Host & "/" Or _
            $aUrl[$i] = "http://www." & $Host & "/" Or _
            $aUrl[$i] = "https://www." & $Host & "/" Then ;filter same domain
    EndIf
Next
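The long chain of equality tests against the bare host in the last ElseIf could probably be collapsed into one pattern. This is only a sketch under assumptions: `$sBareHost` and the `$aTest` values are made up for illustration, and the pattern is slightly broader than the original list because it also accepts `www.` without a scheme.

```autoit
Local $Host = "site.com" ; placeholder host, as in the comments of the original loop

; Optional scheme, optional www., the host itself, optional trailing slash.
; \Q...\E keeps the dots of the host name literal; (?i) mirrors the
; case-insensitive "=" comparison of the original code.
Local $sBareHost = '(?i)^(?:https?://)?(?:www\.)?\Q' & $Host & '\E/?$'

; Hypothetical test values
Local $aTest[5] = ["", "/", "https://www.site.com/", "site.com", "https://site.com/news/"]

For $i = 0 To UBound($aTest) - 1
    If $aTest[$i] = "" Or $aTest[$i] = "/" Or StringRegExp($aTest[$i], $sBareHost) Then
        ConsoleWrite("skip: " & $aTest[$i] & @CRLF)
    Else
        ConsoleWrite("keep: " & $aTest[$i] & @CRLF)
    EndIf
Next
```

With these values only the last entry, which has a path after the host, should be reported as kept.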
Danp2 Posted September 8, 2020

Your code uses StringRegExpReplace. Shouldn't that be StringRegExp instead? Also, none of the matching links contain the extensions that you are checking for. Perhaps not the best example?
jcpetu Posted September 8, 2020 (Author)

Hi Danp2, yes, I'm sorry, it was a cut-and-paste error. The correct code is:

#include <array.au3>
#include <Debug.au3>
#include <String.au3>
#include "WinHttp.au3"

Local $hOpen = _WinHttpOpen()
If @error Then
    MsgBox(48, "Error", "Error initializing the usage of WinHTTP functions.")
    Exit
EndIf

Local $Host = "messi.com"
Local $hConnect = _WinHttpConnect($hOpen, $Host) ; <- yours here
If @error Then
    MsgBox(48, "Error", "Error specifying the initial target server of an HTTP request.")
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $req = _WinHttpOpenRequest($hConnect)
If @error Then
    MsgBox(48, "Error", "Error creating an HTTP request handle.")
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpSendRequest($req)
If @error Then
    MsgBox(48, "Error", "Error sending specified request.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpReceiveResponse($req) ;------------------------ Wait for the response
If @error Then
    MsgBox(48, "Error", "Error waiting for the response from the server.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $sChunk, $gsHTML
If _WinHttpQueryDataAvailable($req) Then ;------------- See if there is data to read
    While 1
        $sChunk = _WinHttpReadData($req)
        If @error Then ExitLoop
        $gsHTML &= $sChunk
    WEnd
    ConsoleWrite($gsHTML & @CRLF) ; print to console
    $aUrl = _ArrayUnique(StringRegExp($gsHTML, 'href=(?:"|'')([^"'']+)', 3))
    For $i = 1 To UBound($aUrl) - 1
        If (StringInStr($aUrl[$i], ".com") Or StringInStr($aUrl[$i], "www.")) And _
                (Not StringInStr($aUrl[$i], $Host)) Then ;filter external domains different from $Host=site.com
        ElseIf StringInStr($aUrl[$i], "http") And Not StringInStr($aUrl[$i], $Host) Then ;filter external domains different from $Host=site.com
        ElseIf StringInStr($aUrl[$i], ".png") Or StringInStr($aUrl[$i], ".ico") Or _
                StringInStr($aUrl[$i], ".jpg") Or StringInStr($aUrl[$i], ".jpeg") Or _
                StringInStr($aUrl[$i], ".css") Then ;filter undesired elements
        ElseIf $aUrl[$i] = "" Or _
                $aUrl[$i] = "/" Or _
                $aUrl[$i] = $Host Or _
                $aUrl[$i] = "http://" & $Host Or _
                $aUrl[$i] = "https://" & $Host Or _
                $aUrl[$i] = "http://www." & $Host Or _
                $aUrl[$i] = "https://www." & $Host Or _
                $aUrl[$i] = $Host & "/" Or _
                $aUrl[$i] = "http://" & $Host & "/" Or _
                $aUrl[$i] = "https://" & $Host & "/" Or _
                $aUrl[$i] = "http://www." & $Host & "/" Or _
                $aUrl[$i] = "https://www." & $Host & "/" Then ;filter same domain
        EndIf
    Next
Else
    MsgBox(48, "Error", "Site is experiencing problems.")
EndIf

_WinHttpCloseHandle($req)
_WinHttpCloseHandle($hConnect)
_WinHttpCloseHandle($hOpen)

Edited September 9, 2020 by jcpetu
Danp2 Posted September 9, 2020

Is there a reason that you aren't using _INetGetSource? Also, your regex is only going to return the URLs associated with links. Is that by design, because links don't generally have the extensions you are looking to exclude?
jcpetu Posted September 9, 2020 (Author)

I'm using WinHttp because my whole program runs on it. First I fetch all the site content, as with _INetGetSource, and then I use RegExp to extract only the links:

_ArrayUnique(StringRegExp($sresp, 'href=(?:"|'')([^"'']+)', 3))

For instance:
[https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--192x192.png], _
[https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--180x180.png],
and I would like to avoid returning these links, as well as any png, ico, jpeg, jpg and css ones.

Also, if possible, I would like to return only the folder part of the link. For example, instead of:
[https://static.messi.com/wp-content/uploads/2019/10/cropped-logo--192x192.png]
return only:
[https://static.messi.com/wp-content/uploads/2019/10/]
that is, everything up to the last /.

Linksfile.txt

Edited September 9, 2020 by jcpetu
jcpetu Posted September 9, 2020 (Author)

I mean, I would like, if possible, to get rid of the links that reference .png, .ico, etc. elements, if StringRegExp allows it. That way I can avoid the If clause: StringInStr($aUrl[$i], ".png")
mikell Posted September 9, 2020

Not sure I understood the wanted result correctly. Maybe this? (tested with the "$site" string from post #1)

#Include <Array.au3>

$s = FileRead("test.txt")
$res = StringRegExp($s, 'https://[^",]+(?|png|jpg|ico|css)(*SKIP)(*F)|https://[^",]+', 3)
$res = _ArrayUnique($res)
_ArrayDisplay($res)
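For readers unfamiliar with the `(*SKIP)(*F)` idiom, here is a minimal sketch on a made-up string. `$sSample` and its URLs are assumptions for illustration only, and the pattern is a small variation on the one above: it adds a literal dot, `jpeg`, and a word boundary. The first alternative consumes a URL up to one of the unwanted extensions and then forces a failure, so only URLs matched by the second alternative end up in the array.

```autoit
#include <Array.au3>

; Hypothetical sample text, not taken from the real page
Local $sSample = 'href="https://site.com/a/logo.png" href="https://site.com/news/" href="https://site.com/style.css"'

; Alternative 1: URL containing .png/.jpg/.jpeg/.ico/.css -> consumed, then discarded by (*SKIP)(*F)
; Alternative 2: any other URL -> returned
Local $aRes = StringRegExp($sSample, 'https://[^",]+\.(?:png|jpe?g|ico|css)\b(*SKIP)(*F)|https://[^",]+', 3)
If @error Then
    ConsoleWrite("No match" & @CRLF)
    Exit
EndIf

_ArrayDisplay($aRes)
```

On this sample the array should contain only https://site.com/news/. Because (*SKIP) moves the next match attempt past the extension, a trailing query string such as ?v=123 after .jpg is not picked up as a separate URL.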
jcpetu Posted September 9, 2020 (Author)

Hi mikell, thanks a lot. For me RegExp is double Dutch; it seems your magic does the trick of skipping the lines that contain png, jpg, ico and css, right? That's part of the idea.

The goal is to get every line with href= followed by " or ', up to the last /. For instance:

1) href="https://site.com/sub1/sub2/bla bla bla......." should return https://site.com/sub1/sub2/
2) href='site.com/sub1/bla bla bla.......' should return site.com/sub1/
3) href="https://site.com/sub1/sub2/bla bla bla.png" skip line

And ideally, don't repeat lines with the same value. For instance, the following line should be skipped because it gives the same result as line 1), only with different text after the last / (text text text instead of bla bla bla):

4) href="https://site.com/sub1/sub2/text text text......."

I hope it's clear, and thanks again.
mikell Posted September 9, 2020

These requirements are impossible to test with the text from post #1. Please provide a text we can work with.
jcpetu Posted September 9, 2020 (Author)

You can test with this code, thanks a lot:

#include <array.au3>
#include <Debug.au3>
#include <String.au3>
#include "WinHttp.au3"

Local $hOpen = _WinHttpOpen()
If @error Then
    MsgBox(48, "Error", "Error initializing the usage of WinHTTP functions.")
    Exit
EndIf

Local $Host = "messi.com"
Local $hConnect = _WinHttpConnect($hOpen, $Host) ; <- yours here
If @error Then
    MsgBox(48, "Error", "Error specifying the initial target server of an HTTP request.")
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $req = _WinHttpOpenRequest($hConnect)
If @error Then
    MsgBox(48, "Error", "Error creating an HTTP request handle.")
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpSendRequest($req)
If @error Then
    MsgBox(48, "Error", "Error sending specified request.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

_WinHttpReceiveResponse($req) ;------------------------ Wait for the response
If @error Then
    MsgBox(48, "Error", "Error waiting for the response from the server.")
    _WinHttpCloseHandle($req)
    _WinHttpCloseHandle($hConnect)
    _WinHttpCloseHandle($hOpen)
    Exit
EndIf

Local $sChunk, $gsHTML
If _WinHttpQueryDataAvailable($req) Then ;------------- See if there is data to read
    While 1
        $sChunk = _WinHttpReadData($req)
        If @error Then ExitLoop
        $gsHTML &= $sChunk
    WEnd
    ConsoleWrite($gsHTML & @CRLF) ; print to console
    $aUrl = _ArrayUnique(StringRegExp($gsHTML, 'href=(?:"|'')([^"'']+)', 3))
    For $i = 1 To UBound($aUrl) - 1
        If (StringInStr($aUrl[$i], ".com") Or StringInStr($aUrl[$i], "www.")) And _
                (Not StringInStr($aUrl[$i], $Host)) Then ;filter external domains different from $Host=site.com
        ElseIf StringInStr($aUrl[$i], "http") And Not StringInStr($aUrl[$i], $Host) Then ;filter external domains different from $Host=site.com
        ElseIf StringInStr($aUrl[$i], ".png") Or StringInStr($aUrl[$i], ".ico") Or _
                StringInStr($aUrl[$i], ".jpg") Or StringInStr($aUrl[$i], ".jpeg") Or _
                StringInStr($aUrl[$i], ".css") Then ;filter undesired elements
        ElseIf $aUrl[$i] = "" Or _
                $aUrl[$i] = "/" Or _
                $aUrl[$i] = $Host Or _
                $aUrl[$i] = "http://" & $Host Or _
                $aUrl[$i] = "https://" & $Host Or _
                $aUrl[$i] = "http://www." & $Host Or _
                $aUrl[$i] = "https://www." & $Host Or _
                $aUrl[$i] = $Host & "/" Or _
                $aUrl[$i] = "http://" & $Host & "/" Or _
                $aUrl[$i] = "https://" & $Host & "/" Or _
                $aUrl[$i] = "http://www." & $Host & "/" Or _
                $aUrl[$i] = "https://www." & $Host & "/" Then ;filter same domain
        EndIf
    Next
Else
    MsgBox(48, "Error", "Site is experiencing problems.")
EndIf

_WinHttpCloseHandle($req)
_WinHttpCloseHandle($hConnect)
_WinHttpCloseHandle($hOpen)
mikell Posted September 9, 2020

Hmmm, no. I meant: provide the whole string, like the "$site" one in the code in post #1.

Edit: For now my pillow is waiting for me. This might give something:

$res = StringRegExp($text, 'href=(?|"((?:[^"]+?/)+)|''((?:[^'']+?/)+))', 3)
$res = _ArrayUnique($res)
_ArrayDisplay($res)

Edited September 9, 2020 by mikell
jcpetu Posted September 9, 2020 (Author)

Yes, here: Test-File.txt

jcpetu Posted September 9, 2020 (Author)

Thanks mikell, I'm reviewing it and I'll give you feedback.
jcpetu Posted September 9, 2020 (Author)

mikell, I realized that in some cases the folder doesn't end with a slash but with the same symbol that opened the value (I'm sorry). For instance:

1) href="https://site.com/sub1/sub2" --> should return https://site.com/sub1/sub2. The opening symbol is ".
2) href='site.com/sub1' --> should return site.com/sub1. The opening symbol is '.
3) href='site.com' --> should return site.com. It doesn't have any folders, it's only the site, and the opening symbol is '.

For cases beginning with href=" your solution is fine:

4) href="https://site.com/sub1/sub2/bla bla bla.png" --> It returns https://site.com/sub1/sub2/ which is OK.

But not for cases beginning with href=' :

5) href='https://site.com/sub1/sub2/bla bla bla.png' --> It should return https://site.com/sub1/sub2/ but it returns nothing.

To summarize: the expression should return anything beginning with href=" or href=' and ending with the last slash before the closing symbol " or ' (if there is one), as in example 4), or, if there is no slash, the text up to the closing symbol " or ', as in examples 1), 2) and 3).
jcpetu Posted September 9, 2020 (Author)

Thanks a lot for your time.
Deye Posted September 10, 2020

You may as well try:

$patt = "\w+(?::)[\w./-]+\w+\/"
jcpetu Posted September 10, 2020 (Author)

Hi Deye, this returns all http and https URLs, regardless of whether they are part of an href= or not. The only references it doesn't return are those that don't end with a slash. But I can use it either way, accepting that I'll lose some references until I find the silver bullet.

Just another question: some references begin with https:\/. What should I change in your RegExp to catch those as well? Or should I use another RegExp?

Edited September 10, 2020 by jcpetu
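One possible way to handle the https:\/ spelling, offered only as a sketch and not as something proposed in the thread, is to normalize the text before matching: replace every \/ with / and then apply the same pattern. `$sSample` and `$sNormalized` are hypothetical names, and the sample string is invented.

```autoit
; Hypothetical sample: one JSON-style escaped URL and one ordinary link
Local $sSample = '{"url":"https:\/\/site.com\/sub1\/page"} <a href="https://site.com/sub2/">'

; AutoIt strings have no backslash escapes, so "\/" is literally backslash + slash
Local $sNormalized = StringReplace($sSample, "\/", "/")

; Deye's pattern, unchanged, now sees both URLs in the same form
Local $aRes = StringRegExp($sNormalized, "\w+(?::)[\w./-]+\w+\/", 3)
If Not @error Then
    For $i = 0 To UBound($aRes) - 1
        ConsoleWrite($aRes[$i] & @CRLF)
    Next
EndIf
```

The trade-off is that the returned URLs are in normalized form, so they no longer match the escaped text in the original source character for character.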
mikell Posted September 10, 2020

jcpetu,
Regex is not magic, it's logic. So I fear that your multiple requirements are too demanding for this logic.

Just an example:

1) href="https://site.com/sub1/bla bla bla" --> should return https://site.com/sub1/
2) href="https://site.com/sub1/sub2" --> should return https://site.com/sub1/sub2

Here "sub2" and "bla bla bla" can be anything, so how do you expect a regex to be able to tell the difference? This will need to be handled manually.
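As a rough illustration of what handling it manually could look like, the sketch below captures every href value first and then filters, trims and de-duplicates in a loop. Everything in it (`$sSample`, `$sSeen`, the sample URLs) is an assumption for illustration. It does not resolve the ambiguity described above: a last segment without a trailing slash is always treated as a page, so https://site.com/sub1/sub2 would be trimmed to https://site.com/sub1/.

```autoit
; Hypothetical sample; real input would be the downloaded HTML
Local $sSample = '<a href="https://site.com/sub1/sub2/bla.html"> <a href=''site.com/sub1/text''> ' & _
        '<a href="https://site.com/sub1/sub2/pic.png"> <a href="https://site.com/sub1/sub2/other.html"> <a href=''site.com''>'

; Step 1: capture every href value; the branch reset (?|...) puts both quote styles in group 1
Local $aHref = StringRegExp($sSample, '(?i)href=(?|"([^"]*)"|''([^'']*)'')', 3)
If @error Then
    ConsoleWrite("No href found" & @CRLF)
    Exit
EndIf

Local $sSeen = @LF, $sUrl
For $i = 0 To UBound($aHref) - 1
    $sUrl = $aHref[$i]

    ; Step 2: skip unwanted file types (an optional ?query or #fragment may follow)
    If StringRegExp($sUrl, '(?i)\.(?:png|ico|jpe?g|css)(?:[?#].*)?$') Then ContinueLoop

    ; Step 3: cut everything after the last slash, leaving the // of the scheme alone
    $sUrl = StringRegExpReplace($sUrl, '(?<![:/])/[^/]*$', '/')

    ; Step 4: print each trimmed value only once
    If StringInStr($sSeen, @LF & $sUrl & @LF) Then ContinueLoop
    $sSeen &= $sUrl & @LF
    ConsoleWrite($sUrl & @CRLF)
Next
```

On this sample the loop should print https://site.com/sub1/sub2/, site.com/sub1/ and site.com, each once.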
jcpetu Posted September 10, 2020 (Author)

mikell, I apologize, you're right. After so much trial and error with RegExp I'm lost. For now I'll use Deye's solution and polish what I need manually. I appreciate your time a lot. Thanks a lot.
jcpetu Posted September 13, 2020 (Author)

mikell, at last, after a lot of trial and error, I found the expression that does exactly what I want:

$gaResult = StringRegExp($gsHTML, "https:\\/\\/[\w.\\/-]+\\/[""|\w.??==-]+|href=['|""|:/|#|\w./-]+[""|'|\w.??==-]+", 3)