yyywww Posted March 14, 2019 (edited)

#include <Inet.au3>
#include <Array.au3>

$sUrl = "https://deadline.com/"
$sRegEx = '(?<=(?:post-title">))((\n)|.)*?(?=(?:<p class="post-author-time))'
$sHTML = _INetGetSource($sUrl)
;~ MsgBox(0,"",$sHTML)
$aArticles = StringRegExp($sHTML,$sRegEx,3) ; get articles
_ArrayDisplay($aArticles)
;~ ConsoleWrite($aArticles[0] & @CRLF)

I want to do a simple fetch of the HTML text for each article on this news site. I know the site has 12 articles on its front page, and after I apply the regex to split the articles into an array, I can see that the array has 12 elements as well, but they are empty. I assume it has something to do with the line breaks, because when I do the same for single lines only, the elements in the array are no longer empty. How do I fix this so that the elements contain the article info instead of being empty?
FrancescoDiMuro Posted March 14, 2019

@yyywww So you want to get everything between <p class="post-author-time"> and the closing </p>?
yyywww Posted March 14, 2019 (edited)

@FrancescoDiMuro Edit: No, it's actually everything in between post-title"> and <p class="post-author-time. But exactly what gets extracted is not very important; it could be anything from this site, as long as it spans multiple lines at once (when I extract single lines, it works). I'm more interested in why the array contains empty elements when I do it with the code above, or what I need to change so that the array contains the HTML between those tags instead of empty elements.
FrancescoDiMuro Posted March 14, 2019 (edited)

@yyywww Something like this?

#include <Array.au3>
#include <Inet.au3>
#include <StringConstants.au3>

Global $strUrl = "https://deadline.com/", _
       $strHTML = "", _
       $arrResult

$strHTML = _INetGetSource($strURL, True)
$arrResult = StringRegExp($strHTML, '(?s)<h2 class="post-title">(.*?)<p class="post-author-time">', $STR_REGEXPARRAYGLOBALMATCH)
_ArrayDisplay($arrResult)
yyywww Posted March 14, 2019

@FrancescoDiMuro With the help of your script I was able to narrow down the issue: in my faulty script the capturing group sat inside the quantifier, as in (.)*?, but I should have put the quantifier inside the group, as in (.*?). I also learned about the usage of (?s), which was very helpful. Thanks.
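
The difference between the two forms can be reproduced offline with a short sketch. The sample string below is hypothetical and only stands in for the page source; the headline text and the variable names are assumptions, not taken from the site. With (.)*? the group is overwritten on every repetition, so only the last character before the closing tag is returned (in this sample a line feed, which looks blank in _ArrayDisplay). With (.*?) the whole span is captured. The third call drops (?s) to show that, by default, the dot does not match a line break, so a multi-line span is not matched at all.

```autoit
#include <StringConstants.au3>

; Hypothetical sample standing in for the downloaded HTML (not real site content).
Local $sSample = '<h2 class="post-title">' & @LF & 'First headline' & @LF & '<p class="post-author-time">'

; Group inside the quantifier: each repetition overwrites the capture,
; so only the last matched character is kept.
Local $aLast = StringRegExp($sSample, '(?s)post-title">(.)*?<p class="post-author-time', $STR_REGEXPARRAYGLOBALMATCH)

; Quantifier inside the group: the whole span between the tags is captured.
Local $aAll = StringRegExp($sSample, '(?s)post-title">(.*?)<p class="post-author-time', $STR_REGEXPARRAYGLOBALMATCH)

; Same pattern without (?s): the dot stops at line breaks, so no match is expected.
Local $aNoDotAll = StringRegExp($sSample, 'post-title">(.*?)<p class="post-author-time', $STR_REGEXPARRAYGLOBALMATCH)
Local $iError = @error ; non-zero when StringRegExp found no match

If IsArray($aLast) Then ConsoleWrite('(.)*? captured ' & StringLen($aLast[0]) & ' character(s)' & @CRLF)
If IsArray($aAll) Then ConsoleWrite('(.*?) captured ' & StringLen($aAll[0]) & ' character(s)' & @CRLF)
ConsoleWrite('Without (?s): @error = ' & $iError & @CRLF)
```

If the first call reports a single character while the second reports the full length of the span, that matches the explanation given in the post above.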
FrancescoDiMuro Posted March 14, 2019

@yyywww Happy to have helped, and you're welcome.