Extract text from HTML


Hello,

I found the code below in the forum, and it works, but only to extract the page title:

#Include <String.au3>
#Include <INET.au3>

$html = _StringBetween(_INetGetSource('http://somdcomputerguy.com'), '<title>', '</title>')
MsgBox(0, "title", $html[0])

 

When I use this instead:

$html = _StringBetween(_INetGetSource('http://www.google.com/search?q=autoit', '<h3 ….>', '</h3>')
MsgBox(0, "title", $html[0])

it doesn't work. Is that maybe because it finds many </h3> tags? Can you please point me in the right direction?


First of all, I'd like to point out that it is somewhat impolite to crosspost questions (posting them multiple times).
As to your question:

1: The posted line has a syntax error (the closing parenthesis of _INetGetSource() is missing), so it will not run:

$html = _StringBetween(_INetGetSource('http://www.google.com/search?q=autoit', '<h3 ….>', '</h3>') 
; --- should be
$html = _StringBetween(_INetGetSource('http://www.google.com/search?q=autoit'), '<h3 ….>', '</h3>') 
MsgBox(0, "title", $html[0])

2: When you run this code it tells you what is wrong with the _StringBetween():

#Include <String.au3>
#Include <INET.au3>
$html = _StringBetween(_INetGetSource('http://www.google.com/search?q=autoit'), '<h3 ….>', '</h3>')
ConsoleWrite('@@ Debug(' & @ScriptLineNumber & ') >Error code: ' & @error & @CRLF) ;### Debug Console

That returns "Error code: 1", and the help file tells you what that means: "@error: 1 - No strings found."

So what exactly were you expecting this start parameter to find: '<h3 ….>' ?

Jos


@Nina
Always check for errors in your code. You were missing a closing parenthesis in the _INetGetSource() call, and the three dots in the search string were definitely not helping the search.
The code below shows how to check for errors returned by the functions you are using. By the way, the string <h3> is not in the HTML source code, so it will always return error 1:

#include <Inet.au3>
#include <String.au3>

; Download the raw page source and stop if the request fails.
Global $strHTML = _INetGetSource('http://www.google.com/search?q=autoit')
If @error Then
    ConsoleWrite("_INetGetSource ERR: " & @error & @CRLF)
    Exit
Else
    ConsoleWrite($strHTML & @CRLF)

    ; _StringBetween returns an array of matches, or sets @error if none are found.
    Global $aMatches = _StringBetween($strHTML, '<h3>', '</h3>')
    If @error Then
        ConsoleWrite("_StringBetween ERR: " & @error & @CRLF)
    Else
        MsgBox(0, "", $aMatches[0])
    EndIf
EndIf
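
On the original question of whether several </h3> tags are a problem: as far as I know they are not, because _StringBetween returns every match in a 0-based array. Here is a minimal sketch of that, using a made-up in-memory string instead of a downloaded page. The sample markup and the variable names are assumptions for illustration only, not what Google actually serves.

```autoit
#include <String.au3>

; Hypothetical sample markup standing in for a downloaded page.
Local $sSample = '<h3 class="a">First</h3><p>x</p><h3 class="a">Second</h3>'

; The start delimiter must match the page text exactly, attributes included.
Local $aTitles = _StringBetween($sSample, '<h3 class="a">', '</h3>')
If @error Then
    ConsoleWrite("No strings found, @error = " & @error & @CRLF)
Else
    ; Walk over every match, not just element 0.
    For $i = 0 To UBound($aTitles) - 1
        ConsoleWrite($i & ": " & $aTitles[$i] & @CRLF)
    Next
EndIf
```

If the start delimiter is changed to something that is not in the string, such as '<h3 ….>', the @error branch runs instead.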

 

Edited by FrancescoDiMuro


Jos, I'm sorry, I didn't mean to be impolite. After posting the question in the existing topic, I noticed that it was very old and thought that maybe it was wrong to post it there, so I opened a new topic.

As for my question, I must have deleted the parenthesis by mistake while typing it. Anyway, it's not working (even with the parenthesis).

1 hour ago, Nina said:

I think my post was not very clear. The '<h3 ….>' is just an example; the text I would like to extract is between <h3 class="LC20lb DKV0Md"> and </h3>

 

Don't provide "just an example" which doesn't make any sense, but rather provide an actual case that isn't working so we can help you.  ;) 

Edited by Jos


What I'm trying to do is search for a word in Google, then return the titles of the first 3 links that were found. I checked the HTML code of the Google search results page, and each title is between <h3 class="LC20lb DKV0Md"> and </h3>.

I have updated my code, and for now I have it partially working: it returns the first title, but then it shows the following error:

Quote

--> IE.au3 T3.0-2 Error from function _IETagNameGetCollection, $_IESTATUS_InvalidDataType
"C:\Users\IEUser\Desktop\AutoIT\testIE.au3" (76) : ==> Variable must be of type "Object".:

 

_IETagNameGetCollection($findH3, "h3")

 

 

Edited by Nina

https://www.autoitscript.com/autoit3/docs/libfunctions/_IETagNameGetCollection.htm

If you saw the help file's example for the _IETagNameGetCollection function, then you must have seen that it uses $oIE as the first parameter.  Being new to AutoIt, why did you choose to use some nonexistent object instead of trying to follow the example?

In your _IETagNameGetCollection function, change $findH3 to $oIE (like the help file's example) and see if that works.
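
To illustrate the point about the first parameter, here is a rough sketch that checks each IE.au3 call before using its result, so a failure is reported instead of ending in the "Variable must be of type Object" error. The $iErr variable and the console messages are my own additions for illustration; the page content you get back may differ.

```autoit
#include <IE.au3>

Local $oIE = _IECreate("https://www.google.com/search?q=autoit")
Local $iErr = @error ; capture immediately, before another call resets it
If $iErr Or Not IsObj($oIE) Then
    ConsoleWrite("_IECreate failed, @error = " & $iErr & @CRLF)
    Exit
EndIf

; Pass the browser object itself as the first parameter.
Local $oItems = _IETagNameGetCollection($oIE, "h3")
$iErr = @error
If $iErr Or Not IsObj($oItems) Then
    ConsoleWrite("_IETagNameGetCollection failed, @error = " & $iErr & @CRLF)
Else
    For $oItem In $oItems
        ConsoleWrite($oItem.InnerText & @CRLF)
    Next
EndIf

_IEQuit($oIE)
```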


Works fine:

 

#include <IE.au3>
#include <Array.au3>

Local $oIE = _IECreate("https://www.google.com/search?q=autoit") 

Local $oItems = _IETagNameGetCollection($oIE, "h3")
Local $aTitle_Table[3]
Local $iUbound = UBound($aTitle_Table)
Local $iCount = 0

For $oItem In $oItems
    If StringLeft($oItem.ClassName, 2) <> 'LC' Then
        ContinueLoop
    EndIf
    
    $aTitle_Table[$iCount] = $oItem.InnerText
    $iCount += 1
    
    If $iCount >= $iUbound Then
        ExitLoop
    EndIf
Next

_IEQuit($oIE)
_ArrayDisplay($aTitle_Table)

 

Edited by MrCreatoR

 


6 minutes ago, TheXman said:

https://www.autoitscript.com/autoit3/docs/libfunctions/_IETagNameGetCollection.htm

If you saw the help file's example for the _IETagNameGetCollection function, then you must have seen that it uses $oIE as the first parameter.  Being new to AutoIt, why did you choose to use some nonexistent object instead of trying to follow the example?

In your _IETagNameGetCollection function, change $findH3 to $oIE (like the help file's example) and see if that works.

Thank you very much! 


Curiously, the tag <h3 class="LC20lb DKV0Md"> does exist, but when I use something like this:

#Include <INET.au3>
$source = _INetGetSource('http://www.google.com/search?q=autoit')
ConsoleWrite($source &@CRLF)

and then search the text for those words, they're not in it.

Shouldn't they be retrieved?


2 hours ago, careca said:

then search the text for those words, they're not in it.

Because the browser handles other things while loading the page, whereas with InetRead you load only the raw page data.

This is how we can get titles in this case:

#Include <Array.au3>
#Include <INET.au3>

$sSource = _INetGetSource('http://www.google.com/search?q=autoit')
$aTitles = StringRegExp($sSource, '<a href="/url\?q=.+?><div class=".*?"><span dir=".*?">(.*?)</span>', 3)
_ArrayDisplay($aTitles)

 

 


I get nothing at all; I mean, not even the _ArrayDisplay window. But since this is not my thread, and not my issue, let's leave it at that. Thanks.
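
One possible explanation, which nobody in the thread confirmed: when StringRegExp with flag 3 finds no match, it sets @error and does not return an array, and _ArrayDisplay shows no window for a non-array. A small sketch of guarding against that follows. The sample string and the simplified pattern are assumptions for illustration, since the markup a given request receives may differ.

```autoit
#include <Array.au3>

; Hypothetical sample; the real page markup may differ.
Local $sSource = '<span dir="ltr">AutoIt Home</span><span dir="ltr">AutoIt Forum</span>'

; Flag 3 returns an array of all global matches, or sets @error if there are none.
Local $aTitles = StringRegExp($sSource, '<span dir=".*?">(.*?)</span>', 3)
If @error Then
    ConsoleWrite("No matches, @error = " & @error & @CRLF)
Else
    _ArrayDisplay($aTitles)
EndIf
```

With a check like this, an empty result at least prints a message instead of silently doing nothing.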
