stefionesco Posted December 7, 2018

Hi guys. Is there any way to extract the IMDb Top 250 (https://www.imdb.com/chart/top) into a CSV file? Or better, the Top Rated Indian Movies: https://www.imdb.com/india/top-rated-indian-movies/

Thanks in advance.

Subz Posted December 7, 2018

I'm bored, so here is a basic example:

    #include <Array.au3>
    #include <IE.au3>

    Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies", 1)
    Sleep(4000) ;~ Give the page time to finish rendering
    Local $oMovies = _IETableGetCollection($oIE)
    For $oMovie In $oMovies
        ;~ The chart is the table with this class name
        If $oMovie.ClassName = "chart full-width" Then
            $aMovies = _IETableWriteToArray($oMovie, True)
            ExitLoop
        EndIf
    Next
    Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
    Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
    ;~ Write columns 1 to 2 only, each field wrapped in double quotes
    Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"', 1, 2) & '"' & @CRLF)
    FileClose($hMovies)
    _ArrayDisplay($aMovies)

stefionesco Posted December 8, 2018 (Author)

Oh, something like this, but I need more information from IMDb, not only the title, year and rating. I want the movie's IMDb ID to be included, and I can't figure out how to get it.

@Subz, if you said you're bored, please help me.

FrancescoDiMuro Posted December 8, 2018

37 minutes ago, stefionesco said:
@Subz, If you said you're bored, please help me.

I think that a little bit of effort from your side should be shown. No one is here to code for you, as stated in the Forum etiquette.

stefionesco Posted December 8, 2018 (Author)

17 minutes ago, FrancescoDiMuro said:
I think that a little bit of effort from your side should be showed. No-one is here to code for you, as it is stated in the Forum etiquette.

I am trying, man. I don't want to be lazy, but I didn't get anywhere. That's why I am trying to get some help from you. I do not want to be rude. Sorry if you understood it that way.

Subz Posted December 8, 2018

I'm not 100% sure what the IMDb ID is. I believe it's the code after "title" in the URL, so here is how I'd get it.

    #include <Array.au3>
    #include <IE.au3>

    Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies", 1)
    Sleep(4000)
    Local $aMovies[0][4]
    Local $oMovies = _IETableGetCollection($oIE)
    If IsObj($oMovies) Then
        For $oMovie In $oMovies
            If $oMovie.ClassName = "chart full-width" Then
                $oRows = _IETagNameGetCollection($oMovie, "tr")
                If IsObj($oRows) Then
                    For $oRow In $oRows
                        ;~ One array row per table row
                        ReDim $aMovies[UBound($aMovies) + 1][4]
                        $iMovies = UBound($aMovies) - 1
                        $oCells = _IETagNameGetCollection($oRow, "td")
                        For $oCell In $oCells
                            If $oCell.ClassName = "titleColumn" Then
                                $aMovies[$iMovies][0] = $oCell.InnerText
                                $oLinks = _IETagNameGetCollection($oCell, "a")
                                If IsObj($oLinks) Then
                                    For $olink In $oLinks
                                        ;~ Keep the url up to the 5th "/", then strip the prefix to leave the id
                                        $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                                        $aMovies[$iMovies][2] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/", ""), 1)
                                    Next
                                EndIf
                            EndIf
                            If $oCell.ClassName = "ratingColumn imdbRating" Then $aMovies[$iMovies][1] = $oCell.InnerText
                        Next
                    Next
                EndIf
                ExitLoop
            EndIf
        Next
    EndIf
    ;~ Row 0 comes from the table's heading row, so reuse it for the column names
    $aMovies[0][0] = "Title"
    $aMovies[0][1] = "IMDb Rating"
    $aMovies[0][2] = "IMDb Id"
    $aMovies[0][3] = "IMDb Url"
    Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
    Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
    Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"') & '"' & @CRLF)
    FileClose($hMovies)
    _ArrayDisplay($aMovies)

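A side note on the CSV line in the scripts above: wrapping every field in double quotes works as long as no field itself contains a double quote. A minimal sketch of one way to handle that case, assuming the usual CSV convention of doubling embedded quotes; `_CsvField` and `_CsvRow` are hypothetical helper names, not part of Array.au3, and the sample data is made up:

```autoit
; Hypothetical helper: quote one CSV field, doubling any embedded double quotes.
Func _CsvField($sValue)
    Return '"' & StringReplace($sValue, '"', '""') & '"'
EndFunc

; Hypothetical helper: join one row of a 2D array into a single CSV line.
Func _CsvRow(Const ByRef $aData, $iRow)
    Local $sLine = ""
    For $iCol = 0 To UBound($aData, 2) - 1
        If $iCol > 0 Then $sLine &= ","
        $sLine &= _CsvField($aData[$iRow][$iCol])
    Next
    Return $sLine
EndFunc

; Sample data invented for illustration only.
Local $aSample[2][2] = [["Title", "IMDb Id"], ['A "Quoted" Title', "tt0000001"]]
For $i = 0 To UBound($aSample) - 1
    ConsoleWrite(_CsvRow($aSample, $i) & @CRLF)
Next
```

Writing row by row like this is slower than a single `_ArrayToString` call, but it keeps the quoting rule in one place.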
stefionesco Posted December 8, 2018 (Author)

10 minutes ago, Subz said:
Not 100% sure what the IMDb id is, I believe it's the code after the "title" so here is how I'd get it.

Thanks. This seems to be what I'm looking for. Now I can try to go ahead with my coding.

PS. And yes, the IMDb ID is that number after the title (ttxxxxxxx). Thanks again.

PS2. After I finish what I have in mind I will post it here.

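Since the ID has the fixed shape ttxxxxxxx, it could also be pulled out of a link with a regular expression instead of counting slashes. A small sketch under that assumption; `_ImdbIdFromUrl` is a hypothetical helper name and the URL is a placeholder:

```autoit
; Hypothetical helper: return the "tt" identifier from a title URL,
; or "" with @error set when the URL does not contain one.
Func _ImdbIdFromUrl($sUrl)
    ; Flag 1 returns an array holding the captured group(s) of the first match.
    Local $aMatch = StringRegExp($sUrl, "/title/(tt\d+)", 1)
    If @error Then Return SetError(1, 0, "")
    Return $aMatch[0]
EndFunc

; Placeholder URL for illustration only.
ConsoleWrite(_ImdbIdFromUrl("https://www.imdb.com/title/tt0000001/?ref_=example") & @CRLF)
```

This does not depend on how many path segments precede the ID, so it should keep working if the link gains a query string or a different prefix.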
stefionesco Posted December 8, 2018 (Author)

Why doesn't it work with other links, such as an IMDb list? Sample link: https://www.imdb.com/list/ls045397191/

With the Top 250 list, for instance, it works fine.

Somerset Posted December 8, 2018

Lazy coder. Help me, I have a broken will to code. Help me, help me. I have no effort of my own, nor code to show for myself. Boo hoo.

Subz Posted December 8, 2018

You're scraping web data, so you need some basic HTML knowledge. I normally use Chrome and inspect each element I need to capture. You need to identify unique information: for example, <div id="xyz"> is better than <div class="xyz">, since an id should only be used once per page (if coded correctly). Class names are normally reused throughout the document. In most cases, though, pages use class names as in the example I posted above, where every title cell has the class name "titleColumn", which makes it easy to identify.

If you look at the link you posted and inspect the elements of the page, you'll notice it doesn't use tables but divs. Each title sits in a div with the class name "lister-item-content", and the "h3" heading inside it is the title and holds the url. So start with:

    $oDivs = _IETagNameGetCollection($oIE, "div")

Then:

- Loop over $oDivs and look for $oDiv.ClassName = "lister-item-content"
- Call _IETagNameGetCollection($oDiv, "h3")
- $oH3.InnerText will be your title
- Use the code I posted above to get the links

If you encounter any issues, post your code and we can assist.

stefionesco Posted December 9, 2018 (Author, edited)

1. First, I want to say I'm no coder, so if Somerset says I'm a lazy coder, I take it as a compliment. I just learned some basic GUI functions here on this forum and from some YouTube tutorials, and that's it. I have always tried to adapt the code I found here to my needs. This time I didn't get the result, and that's why I am asking for help. I'm not lazy. I just do not have the knowledge to understand and build what I have in mind. Sorry if I offended anybody.

2. For those who still want to help me, especially Subz, who tried to explain how to get the div class: I tried. I found the class, but it didn't work. I am surely doing something wrong. To keep it simple, here is your code as I tried to modify it to fit my needs:

    #include <Array.au3>
    #include <IE.au3>

    Local $oIE = _IECreate("https://www.imdb.com/india/top-rated-indian-movies/", 1)
    SplashTextOn("Working", "Please wait...", 600, 50)
    Sleep(4000)
    Local $aMovies[0][4]
    Local $oMovies = _IETableGetCollection($oIE)
    If IsObj($oMovies) Then
        For $oMovie In $oMovies
            If $oMovie.ClassName = "chart full-width" Then
                $oRows = _IETagNameGetCollection($oMovie, "tr")
                If IsObj($oRows) Then
                    For $oRow In $oRows
                        ReDim $aMovies[UBound($aMovies) + 1][4]
                        $iMovies = UBound($aMovies) - 1
                        $oCells = _IETagNameGetCollection($oRow, "td")
                        For $oCell In $oCells
                            If $oCell.ClassName = "titleColumn" Then
                                $aMovies[$iMovies][0] = $oCell.InnerText
                                $oLinks = _IETagNameGetCollection($oCell, "a")
                                If IsObj($oLinks) Then
                                    For $olink In $oLinks
                                        $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                                        $aMovies[$iMovies][1] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/tt", ""), 1)
                                    Next
                                EndIf
                            EndIf
                            If $oCell.ClassName = "ratingColumn seen-widget rated inline rating " Then $aMovies[$iMovies][2] = $oCell.InnerText
                        Next
                    Next
                EndIf
                ExitLoop
            EndIf
        Next
    EndIf
    SplashOff()
    $aMovies[0][0] = "Title"
    $aMovies[0][1] = "IMDb ID"
    $aMovies[0][2] = "My Rating"
    Local $sMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.ini"
    Local $hMovies = FileOpen($sMovies, 10) ;~ Create directory + overwrite contents
    Filewrite($hMovies, '' & _ArrayToString($aMovies, '= ', 1, -1, '' & @CRLF))
    FileClose($hMovies)
    _ArrayDisplay($aMovies)

Problems:

1. I changed the CSV file into an INI file, because later I prefer to have it in INI format. This seems to be no problem, but I think an INI file needs to have [sections].

2. (This is the tough one.) I'm not interested in the IMDb rating; I need my own ratings instead. I found the class for "my rating", but it didn't work for me: the result is an empty column. In any case, I need to exclude from $aMovies the titles I have already rated. Something like: if my rating is empty, write the line to the file; otherwise (there is already a rating) ignore the line. I know I can do it afterwards in Excel with the CSV file, but it would be easier to have the INI file without the movies I've seen.

3. The final INI file needs only 2 columns of the array (Title = IMDb ID). In the code above (which has 4 columns) I can't work out how to change that. I know where, in FileWrite, but I can't find the right expression.

Thank you.

PS. Even if nobody helps me, thanks anyway for all the things I've learned on this forum.

FrancescoDiMuro Posted December 9, 2018 (edited)

@stefionesco Let @Somerset go! He was joking, as he does with a lot of people around here, so don't mind him!

For your requests, a database seems more appropriate, since you could query it and do almost everything there instead of in your script (for example, you could extract only films of a particular genre, or those with a rating above a given value, and so on).

By the way, if you still want to use INI files, take a look at the Ini* functions in the Help file instead of using the File* functions to write your file.

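To illustrate the Ini* suggestion, here is a minimal sketch using `IniWrite` and `IniRead`. The file name, the section name (an ID used as the section) and the values are assumptions made up for the example; the layout is only one possible choice:

```autoit
; Temporary file used only for this illustration.
Local $sIni = @TempDir & "\movies-example.ini"

; One [section] per movie, keyed by a made-up ID; keys hold the details.
IniWrite($sIni, "tt0000001", "Title", "Example Movie")
IniWrite($sIni, "tt0000001", "My Rating", "")

; IniRead returns the default (4th parameter) when the key is missing.
Local $sTitle = IniRead($sIni, "tt0000001", "Title", "(missing)")
Local $sRating = IniRead($sIni, "tt0000001", "My Rating", "")
ConsoleWrite($sTitle & " rated: " & ($sRating = "" ? "no" : "yes") & @CRLF)

FileDelete($sIni) ; Clean up the example file
```

Compared with writing the file through FileWrite, the Ini* functions take care of the [section] and key=value syntax themselves.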
Subz Posted December 10, 2018

Here is an example of how to read the list page (your second url) and also add the results to an INI file. In your code above, the class name you should be looking for is "ratingColumn"; the one you posted belongs to a div, not to the cell, i.e. the "td".

    #include <Array.au3>
    #include <IE.au3>

    Local $oIE = _IECreate("https://www.imdb.com/list/ls045397191", 1)
    Sleep(4000)
    Local $aMovies[0][4]
    Local $oDivs = _IETagNameGetCollection($oIE, "div")
    If IsObj($oDivs) Then
        For $oDiv In $oDivs
            If $oDiv.ClassName = "lister-item-content" Then
                ReDim $aMovies[UBound($aMovies) + 1][4]
                $iMovies = UBound($aMovies) - 1
                ;~ The h3 heading holds the title text and the link
                $oHeading3s = _IETagNameGetCollection($oDiv, "h3")
                If IsObj($oHeading3s) Then
                    For $oHeading3 In $oHeading3s
                        $aMovies[$iMovies][0] = $oHeading3.InnerText
                        SplashTextOn("IMDb Extractor", $aMovies[$iMovies][0], 400, 50)
                        $oLinks = _IETagNameGetCollection($oHeading3, "a")
                        If IsObj($oLinks) Then
                            For $olink In $oLinks
                                $aMovies[$iMovies][3] = StringLeft($olink.href, StringInStr($olink.href, "/", 0, 5))
                                $aMovies[$iMovies][2] = StringTrimRight(StringReplace($aMovies[$iMovies][3], "https://www.imdb.com/title/", ""), 1)
                            Next
                        EndIf
                    Next
                EndIf
                ;~ The rating text sits in a label inside the same div
                $oLabels = _IETagNameGetCollection($oDiv, "label")
                If IsObj($oLabels) Then
                    For $oLabel In $oLabels
                        If $oLabel.ClassName = "ipl-rating-interactive__star-container" Then
                            $aMovies[$iMovies][1] = StringStripWS($oLabel.InnerText, 3)
                        EndIf
                    Next
                EndIf
            EndIf
        Next
    EndIf
    SplashOff()
    _ArrayInsert($aMovies, 0, "Title|IMDb Rating|IMDb Id|IMDb Url")
    Local $sCsvMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.csv"
    Local $sIniMovies = @ScriptDir & "\" & @YEAR & "-" & @MON & "-" & @MDAY & "_Top-Rated-Indian-Movies.ini"
    Local $hMovies = FileOpen($sCsvMovies, 10) ;~ Create directory + overwrite contents
    Filewrite($hMovies, '"' & _ArrayToString($aMovies, '","', -1, -1, '"' & @CRLF & '"') & '"' & @CRLF)
    FileClose($hMovies)
    For $i = 1 To UBound($aMovies) - 1
        ;~ Check to see if the movie has already been rated, if not continue.
        If IniRead($sIniMovies, $aMovies[$i][2], "My Rating", "") = "" Then
            IniWrite($sIniMovies, $aMovies[$i][2], "Title", $aMovies[$i][0])
            IniWrite($sIniMovies, $aMovies[$i][2], "My Rating", $aMovies[$i][1])
        EndIf
    Next
    _ArrayDisplay($aMovies)

stefionesco Posted December 10, 2018 (Author)

I'm not getting anywhere. I can't build the INI file the way I want. Here is the example: Top-Rated-Indian-Movies.ini

And I haven't figured out how to exclude the rated movies and keep only the unseen ones. I have modified the code in every way except the correct one. I'm

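The filtering step described above (keep only unrated titles, write Title = IMDb ID) can be sketched on an array that is already in memory. This is only an illustration under stated assumptions: the three-column layout (title, ID, my rating), the section name "Unseen", the file name and the sample rows are all made up, and an empty rating string is assumed to mean "not rated yet":

```autoit
; Sample rows invented for illustration: title, IMDb ID, my rating ("" = not rated).
Local $aMovies[3][3] = [["Example One", "tt0000001", ""], _
                        ["Example Two", "tt0000002", "8"], _
                        ["Example Three", "tt0000003", ""]]

; Collect key/value pairs (title, ID) for the unrated rows only.
Local $aUnseen[UBound($aMovies)][2]
Local $iCount = 0
For $i = 0 To UBound($aMovies) - 1
    If $aMovies[$i][2] <> "" Then ContinueLoop ; Already rated, skip it
    $aUnseen[$iCount][0] = $aMovies[$i][0]
    $aUnseen[$iCount][1] = $aMovies[$i][1]
    $iCount += 1
Next

Local $sIni = @TempDir & "\unseen-example.ini"
If $iCount > 0 Then
    ReDim $aUnseen[$iCount][2] ; Trim the unused rows
    ; Index 0 tells IniWriteSection to start writing from the first array row.
    IniWriteSection($sIni, "Unseen", $aUnseen, 0)
EndIf
```

One caveat with using the title as the INI key: a title containing "=" would be split at the wrong place when read back, so using the ID as the key and the title as the value may be the safer layout.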
JLogan3o13 (Moderator) Posted December 10, 2018

@stefionesco While your project seems fairly innocuous, it has been pointed out that IMDb's Conditions of Use page states very clearly:

Quote
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.

As I am guessing that you do not possess this in writing, I am locking this thread based on our forum rules. Please read these and familiarize yourself with them before posting again.
