
Decode HTML entities


Seminko

I'm doing a GET request and the data returned is HTML encoded.

I need it decoded to a readable string.

I found dozens of topics but none that would work well for all the chars.

Example of what I found:

Func DecodeHTMLChars($s)
    $t = Execute("'" & StringRegExpReplace($s, "(&#)(\d+)(;)", "' & ChrW($2) & '") & "'")
    Return $t
EndFunc

This only works for decimal numeric entities of the form &#NNN;.

However, I have entities like these:

Spoiler
’
é
—
á
í
Č
ä
–
ï
…
š
č
ô
ý
Ú
ě
ž
ů
 
“
”
ñ
Ş
è
¡
²
ı
̇
Ç
ü
ó
ö
‘
ã


​
¿
ğ
İ
ş
반
드
시
잡
는
다
-
ê
±
â
€
™
¯
́
&
Ł
ł
ś
ę
õ

 

Have I missed a UDF? Can anyone point me in the right direction?


37 minutes ago, Danp2 said:

These appear to be unicode characters. Maybe this will help --

 

I've checked this one. It only works for decoding URL-encoded chars like 'ka%C5%A1tan'.
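To make the difference between the two encodings concrete, here is a small sketch in Python rather than AutoIt, chosen only because both decoders ship in its standard library. `urllib.parse.unquote` undoes percent-encoding and `html.unescape` undoes HTML entities; neither touches the other's format. The string `ka&#353;tan` is a made-up HTML-entity counterpart of the URL-encoded example.

```python
import html
from urllib.parse import unquote

url_encoded = "ka%C5%A1tan"    # percent-encoded UTF-8 bytes, as in the post
html_encoded = "ka&#353;tan"   # hypothetical: the same word as a numeric HTML entity

# Each decoder only understands its own escape format.
decoded_url = unquote(url_encoded)
decoded_html = html.unescape(html_encoded)
print(decoded_url == decoded_html)                 # both give the same word

# Feeding a string to the wrong decoder leaves it unchanged.
print(unquote(html_encoded) == html_encoded)
print(html.unescape(url_encoded) == url_encoded)
```

All three comparisons should print True, which is why a URL-decoding UDF cannot help with entity-encoded text.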

 

39 minutes ago, Nine said:

From what I see, each non-&#xYYYY; entity matches a single char. You simply need to make a select case for all the possibilities.

I was worried that was the only solution. TBH it surprises me, though, since this operation is more than common.

 

Thank you both


3 minutes ago, Seminko said:

since this operation is more than common

Agreed. On the other hand, it is quite an easy (but tedious) task, so nobody felt it was worth a UDF. It would be nice of you to post the solution you are creating here or in the examples section.


Just a quick and wild test using the browser control as a decoder.

It seems to work and the returned string is OK, but why are @CR and @LF lost when that string is placed into the Edit control with GUICtrlSetData?

To test, paste an entity string into the upper edit box and click the button to decode it.

#include <GUIConstantsEx.au3>
#include <EditConstants.au3>

Global $oIE_Server

_Example()

Func _Example()
    Local $hW = GUICreate("Entity decoder", 470, 445, 230, 134)
    Local $Edit1 = GUICtrlCreateEdit("", 8, 8, 450, 200, BitOR($ES_MULTILINE, $ES_WANTRETURN, $ES_AUTOVSCROLL))
    GUICtrlSetData(-1, '&#x1D49C;&#x1D4CA;&#x1D4C9;&#x2134;&#x2110;&#x1D4C9;' & @CRLF & _
            '&#x260E;  &#x2640;  &#x2642;  &#x2660; &#x2663; &#x2665; &#x2666; ' & @CRLF & _
            '&#x1D49C;&#x1D4CA;&#x1D4C9;&#x2134;&#x2110;&#x1D4C9; &#x260E;  &#x2640;  &#x2642;  &#x2660; &#x2663; &#x2665; &#x2666;' & @CRLF & @CRLF)
    Local $Edit2 = GUICtrlCreateEdit("", 8, 238, 450, 200, BitOR($ES_MULTILINE, $ES_WANTRETURN, $ES_AUTOVSCROLL))
    GUICtrlSetFont(-1, 12)

    $Button1 = GUICtrlCreateButton("Decode Entity", 8, 210, 450, 25)
    GUISetState(@SW_SHOW)

    _SetupDecoder()
    Local $hDecoder = $oIE_Server.document.parentwindow.d3c0d3r

    While 1
        $nMsg = GUIGetMsg()
        Switch $nMsg
            Case $GUI_EVENT_CLOSE
                Exit
            Case $Button1
                $hDecoder.innerHTML = GUICtrlRead($Edit1)
                $sstr = $hDecoder.value
                MsgBox(0, 0, $sstr)
                ; ??? using GUICtrlSetData @cr are lost ???
                GUICtrlSetData($Edit2, $sstr); $hDecoder.value)
        EndSwitch
    WEnd
EndFunc   ;==>_Example

Func _SetupDecoder()
    $oIE_Server = ObjCreate("Shell.Explorer.2")
    GUICtrlCreateObj($oIE_Server, -10, -10, 5, 5)
    Sleep(3000)
    $oIE_Server.navigate("about:blank")
    Local $sHTML = _
            '<!DOCTYPE html>' & @CRLF & _
            '<html>' & @CRLF & _
            ' <head>' & @CRLF & _
            '  <meta http-equiv="X-UA-Compatible" content="IE=edge">' & @CRLF & _
            ' </head>' & @CRLF & _
            ' <body>' & @CRLF & _
            '<textarea id="d3c0d3r" cols="10" wrap="hard"> </textarea>' & @CRLF & _
            ' </body>' & @CRLF & _
            '</html>'
    $oIE_Server.document.Write($sHTML) ; inject the listing directly into the HTML document
    $oIE_Server.document.close() ; close the write stream
    $oIE_Server.document.execCommand("Refresh")
EndFunc   ;==>_SetupDecoder

 

 

Chimp

small minds discuss people average minds discuss events great minds discuss ideas.... and use AutoIt....


I actually started creating a function for this, only to find out that the 27k HTML entities I scraped were not enough to decode everything.

After a couple of hours, I decided to save myself the hassle and, more importantly, the time, which is in short supply now more than ever, and used Python. Solved with one line of code...

Thank you all for the ideas, though!
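
The post does not show the one-liner itself. A plausible candidate, assuming Python 3.4 or later, is `html.unescape` from the standard library, which decodes named, decimal and hexadecimal entities alike. The wrapper name `decode_entities` and the sample strings below are made up for illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Return text with named, decimal and hex HTML entities decoded."""
    return html.unescape(text)


if __name__ == "__main__":
    # Hypothetical samples: one named, one decimal, one hexadecimal entity.
    print(decode_entities("caf&eacute;") == "caf\u00e9")     # named
    print(decode_entities("ka&#353;tan") == "ka\u0161tan")   # decimal
    print(decode_entities("&#x1D49C;") == "\U0001D49C")      # hexadecimal
```

Because the entity table lives inside the standard library, no scraped lookup list is needed for this approach.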

