frank10 Posted February 8, 2020
I'm trying to collect some info from some Amazon pages. I'm using IE.au3 as in this example:

    #include <IE.au3>
    Local $url = "www.amazon.com/Very-Stable-Genius-Testing-America-ebook/dp/B07WQQRMGP/ref=zg_bs_157325011_5?_encoding=UTF8&psc=1&refRID=R5Z6WM52CT1EK1QA91YR"
    Local $oIE = _IECreate('', 0, 0)
    _IENavigate($oIE, $url)
    $sHTML = _IEBodyReadHTML($oIE)

So, I tried _INetGetSource, but I get no readable characters in the response... I tried a WinHttp GET and I get this answer in the page BODY:

    <!-- To discuss automated access to Amazon data please contact api-services-support@amazon.com. For information about migrating to our APIs refer to our Marketplace APIs at ....... -->

So, what could I do to load the page faster? I also tried disabling image loading with a registry key, and that improves things, but I would like it faster. At the moment I get the HTML code in 2 - 4 sec...

Gianni Posted February 9, 2020
Try to specify the UserAgent of a browser, so the Amazon website believes that you are browsing with a browser and not with a script:

    #include <InetConstants.au3>
    HttpSetUserAgent('Mozilla / 5.0')
    Local $url = "https://www.amazon.com/Very-Stable-Genius-Testing-America-ebook/dp/B07WQQRMGP/ref=zg_bs_157325011_5?_encoding=UTF8&psc=1&refRID=R5Z6WM52CT1EK1QA91YR"
    $sHTML = InetRead($url, $INET_FORCERELOAD + $INET_IGNORESSL)
    ConsoleWrite(BinaryToString($sHTML) & @CRLF)

Bert Posted February 9, 2020
You could also use a filter to screen out anything that you don't want. For example, you could block JavaScript.

frank10 (Author) Posted February 10, 2020
11 hours ago, Chimp said:
    try to specify the UserAgent of a browser so the amazon website believes that you are browsing with a browser and not with a script

    #include <InetConstants.au3>
    HttpSetUserAgent('Mozilla / 5.0')
    Local $url = "https://www.amazon.com/Very-Stable-Genius-Testing-America-ebook/dp/B07WQQRMGP/ref=zg_bs_157325011_5?_encoding=UTF8&psc=1&refRID=R5Z6WM52CT1EK1QA91YR"
    $sHTML = InetRead($url, $INET_FORCERELOAD + $INET_IGNORESSL)
    ConsoleWrite(BinaryToString($sHTML) & @CRLF)

Thank you Chimp, that works, but it's not faster than IE. Probably, as Bert said, a filter could be created to speed it up. Instead, I found a way to make IE faster:

    _IENavigate($oIE, $url, 0)

This way it does not wait until the page is loaded, and I can check whether what I want has loaded with $oIE.document.body.innerHTML

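The non-blocking navigation described above could be combined with a polling loop, roughly as in the sketch below. It is only an illustration: the marker string `$sMarker` and the timeout `$iTimeoutMs` are hypothetical values that are not part of the thread, and the sketch reads the body with `_IEBodyReadHTML` rather than accessing `innerHTML` directly, so that a not-yet-available document is reported through `@error`.

```autoit
#include <IE.au3>

; Hypothetical marker: some text you expect inside the part of the page you need.
Local Const $sMarker = 'id="productTitle"'
; Assumed upper bound for the wait, in milliseconds; tune as needed.
Local Const $iTimeoutMs = 10000

Local $sUrl = "https://www.amazon.com/Very-Stable-Genius-Testing-America-ebook/dp/B07WQQRMGP/ref=zg_bs_157325011_5?_encoding=UTF8&psc=1&refRID=R5Z6WM52CT1EK1QA91YR"

; Create a hidden IE instance (third parameter 0 = not visible).
Local $oIE = _IECreate("about:blank", 0, 0)

; Third parameter 0 = return immediately instead of waiting for the full page load.
_IENavigate($oIE, $sUrl, 0)

Local $sHTML = ""
Local $bFound = False
Local $hTimer = TimerInit()

; Poll the body until the marker shows up or the timeout expires.
While TimerDiff($hTimer) < $iTimeoutMs
    $sHTML = _IEBodyReadHTML($oIE)
    If Not @error And StringInStr($sHTML, $sMarker) Then
        $bFound = True
        ExitLoop
    EndIf
    Sleep(100) ; short pause so the loop does not spin the CPU
WEnd

If $bFound Then
    ConsoleWrite("Marker found after " & Round(TimerDiff($hTimer)) & " ms" & @CRLF)
Else
    ConsoleWrite("Timed out waiting for the marker" & @CRLF)
EndIf

_IEQuit($oIE)
```

The trade-off is that the script stops waiting as soon as the wanted fragment is present, but it has to pick a marker that reliably appears only once that fragment has loaded.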
Bert Posted February 10, 2020 (edited)
One of the things I've noticed with many web pages today is the sequence in which a page loads. You will see the ads load first, and many times you will also see a "waiting on..." message while the page loads. Meanwhile, you are staring at the ad, waiting on the page to load. I firmly believe the wait times are deliberate, just to make you look at the ad. It's why I use an ad blocker. I would not have nearly as much of an issue with ads if it wasn't for this "feature". When ads are blocked, the page loads fast. With ads, it loads slowly. My ad blocker of choice is a DNS server. That way all of your devices are covered.

Danp2 Posted February 10, 2020
Pi-hole FTW!

frank10 (Author) Posted February 10, 2020 (edited)
Mmm, after some working tests, from time to time I again get:

robot.txt

    <p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>

It's a captcha... so I can't do it this way. Patience...

frank10 (Author) Posted February 10, 2020 (edited)
I have used many Chrome extensions for Amazon and they work flawlessly in parsing Amazon pages... Do you know how Chrome extensions for Amazon can parse all Amazon pages without these annoying captchas? Is there a way to make AutoIt + IE behave like a Chrome extension to parse data?

Danp2 Posted February 10, 2020
You didn't provide any examples of Chrome extensions, so I can only assume that they are reading the HTML from a loaded page in the browser. If that's correct, then IMO it's no different from what you are doing with the _IE commands. You could use the Webdriver UDF to do the same thing, but it may not be any faster than what you already have using IE. To avoid the "robot check", you may need to throw some random pauses into your script. However, you may want to review their TOS, as I suspect you may be running afoul of it.

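The "random pauses" suggestion could look something like the sketch below. The URL list and the 2 to 6 second range are assumptions made up for illustration, and nothing in the thread confirms that pauses of any length actually avoid the robot check.

```autoit
; Hypothetical list of pages to visit; replace with the real URLs.
Local $aUrls[2] = ["https://www.example.com/page1", "https://www.example.com/page2"]

For $i = 0 To UBound($aUrls) - 1
    ; ... load and parse $aUrls[$i] here, e.g. with the _IE functions ...
    ConsoleWrite("Fetching " & $aUrls[$i] & @CRLF)

    ; Pause for a random whole number of milliseconds between 2000 and 6000
    ; (assumed range) before moving on to the next page.
    Sleep(Random(2000, 6000, 1))
Next
```

The third argument of `Random` set to 1 makes it return an integer, which is what `Sleep` expects as a millisecond delay.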
frank10 (Author) Posted February 10, 2020
For example, there is the extension KDspy, which you start from an Amazon page that contains 20-50 books. The extension loads each book's page URL in the background to get some data from it, and displays that as a summary in its window. It takes about 1'' per book. With my IE method I got a similar speed, but now this captcha thing has started... and the captchas appear even if I leave the LoadWait in, which slows it down to 3-4'' per book. I would like to reproduce the KDspy behaviour if possible. Random video just to see it in action: [embedded video]

But, yes, after searching a bit, it seems one must get the data through their API, in AWS, instead of scraping normal HTML pages...

Earthshine Posted February 10, 2020 (edited)
Inches per book? What?

frank10 (Author) Posted February 11, 2020 (edited)
11 hours ago, Earthshine said:
    Inches per book? What?

1'' means 1 second. '' is the abbreviation for seconds. It can also mean inches... (' means minutes)