Posted

Thanks Nine, I'll open a new thread when the final script is ready. For now I got issues with the maximum number of characters Google allows for translation when using this type of url :

$sUrl = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=en&dt=t&q=" & $sMytext

What is the maximum length $sMytext can reach so that Google still accepts the string for translation (from a script or from a web browser)?

Actually my tests show this :

1) A string of 16332 ASCII characters (ASCII = 0-127), for example "aaa...aaa", is accepted, and a valid Json.txt file is returned by the Chrome browser. So far so good.

If you add just one ASCII character to this string (reaching 16333), nothing is translated (from the script or from the browser), and error 400 appears when the URL is launched from Chrome:

[Screenshot: Chrome error page for the over-long string]

If launched from an AutoIt script, the error has to be looked for inside the HTML code returned by the following line:

$sResponse = $oHTTP.ResponseText

which contains these infos :

...
<title>Error 411 (Length Required)!!1</title>
...
<p><b>411.</b> <ins>That’s an error.</ins>
  <p>POST requests require a <code>Content-length</code> header.  <ins>That’s all we know.</ins>

So, error 400 from the browser or error 411 from the script: there is (gladly) an error in both cases, which indicates that the problem is not specific to the AutoIt script.

2) Now let's try this: instead of a string like "aaa...aaa", let's replace each "a" with "%61", which is a syntax accepted by Google (0x61 is the UTF-8 code of "a", with the required % replacing 0x in the string).

If we do this, we're actually replacing one character "a" with "%61" (which has a length of 3), so the maximum number of characters in $sMytext is divided by 3 with this kind of string: "%61%61%61...%61%61%61"

16332 / 3 = 5444 ASCII characters allowed (in script or browser) when using the "%61" syntax (tested): add 1 character and you get the same errors (400 / 411) described above.

3) If all your characters are coded on 2 bytes, for example a long string of л (Russian), which is 0xD0BB in UTF-8 and needs to be prepared as "%D0%BB" in $sMytext, then this single Unicode л has a length of 6 in the $sMytext string, so the maximum decreases again to:

16332 / 6 = 2722 Unicode characters if they're all coded on 2 bytes: "%D0%BB%D0%BB%D0%BB...%D0%BB%D0%BB%D0%BB". Adding 1 ASCII/Unicode character to this string will generate the errors described above.

4) So it seems difficult (for me) to know in advance the maximum number of characters I should allow in the Edit field containing the original language.
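
The arithmetic in points 1) to 3) can be reproduced with a small sketch. It assumes that every UTF-8 byte is sent as "%XX", and it takes the 16332 limit from the tests above rather than from any documented value. The function name _PercentEncodedLen is made up for illustration.

```autoit
; Sketch: length of a string once every UTF-8 byte is written as "%XX".
; _PercentEncodedLen is a hypothetical helper name.
Func _PercentEncodedLen($sText)
    ; flag 4 = UTF-8; each byte then takes 3 characters ("%" + 2 hex digits)
    Return BinaryLen(StringToBinary($sText, 4)) * 3
EndFunc

Local Const $iMaxUrlText = 16332 ; limit observed in the tests above (GET request)

Local $iLenAscii = _PercentEncodedLen("a")            ; 1 byte  -> "%61"
Local $iLenCyr   = _PercentEncodedLen(ChrW(0x043B))   ; U+043B (л), 2 bytes -> "%D0%BB"

ConsoleWrite("1-byte char: " & $iLenAscii & " per char, max " & Int($iMaxUrlText / $iLenAscii) & " chars" & @CRLF)
ConsoleWrite("2-byte char: " & $iLenCyr & " per char, max " & Int($iMaxUrlText / $iLenCyr) & " chars" & @CRLF)
```

This gives 5444 and 2722, the same maxima as measured above.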

For the record, Google allows 5000 characters (no matter whether ASCII, ANSI or Unicode!) on their translation site, for example:

https://translate.google.com/?hl=en&sl=auto&tl=en&text=Wie geht's %3F&op=translate

[Screenshot: Google Translate page showing the character counter]

We see "12 / 5000" in this picture, so the user knows how many characters are still available. But well... it's Google's translation site and they can do whatever they want on it. I'm not sure it's easily doable with an AutoIt script. For the record, they introduced the 5000-character limit in December 2016 (wiki); the limit was probably much higher before that, and the whole world used it for looong translations.

Facing this "issue", I guess I'll have to rewrite (at least) the part of the script that prepares the string $sMytext, using plain ASCII characters wherever possible (and compatible with Google's syntax), so that they take a length of 1 in the string ("a") instead of 3 ("%61"). This should allow a longer original text to be translated. I just found some interesting functions to do this, in this thread

If you guys have any ideas on all this, don't hesitate to share your point of view.
Thanks for reading :)

Posted
14 hours ago, pixelsearch said:

For now I got issues with the maximum number of characters Google allows for translation

You can probably increase the maximum number of characters if you really use POST instead of a disguised GET.
The return limit here seems to be 65,535 characters.

As an example:

#include "Json.au3"
#include <String.au3>

; build long string
$sText = _StringRepeat("Wie geht`s? ", 2^16 / 12)
ConsoleWrite("string length: " & StringLen($sText) & @CRLF)

; translate the string
$sTranslated = _GoogleAPITranslate($sText , "de", "en")
ConsoleWrite("return length: " & StringLen($sTranslated) & @CRLF & @CRLF)
ConsoleWrite($sTranslated & @CRLF)


Func _GoogleAPITranslate($sMytext, $sFrom, $sTo)

    ; format and send request
    Local $sResponse
    With ObjCreate("winhttp.winhttprequest.5.1")
        .Open("POST", "https://translate.googleapis.com/translate_a/single", False)
        .SetRequestHeader("Content-Type", "application/x-www-form-urlencoded")
        ; encode the function parameter ($sMytext), not the global $sText
        .Send(StringFormat("client=gtx&sl=%s&tl=%s&dt=t&q=%s", $sFrom, $sTo, _URIEncode($sMytext)))
        $sResponse = .ResponseText
    EndWith
    Local $vResponse = _JSON_Parse($sResponse)

    ; process return
    Local $aData, $sOutput = ""
    If VarGetType($vResponse) = 'Array' Then
        $aData = $vResponse[0]
        If VarGetType($aData) = 'Array' Then
            For $i = 0 To UBound($aData) -1
                $sOutput &= ($aData[$i])[0]
            Next
        EndIf
    EndIf

    Return $sOutput
EndFunc


Func _URIEncode($sData)
    ; Prog@ndy
    Local $aData = StringSplit(BinaryToString(StringToBinary($sData, 4), 1), "")
    Local $nChar
    $sData = ""
    For $i = 1 To $aData[0]
        $nChar = Asc($aData[$i])
        Switch $nChar
            Case 45, 46, 48 To 57, 65 To 90, 95, 97 To 122, 126
                $sData &= $aData[$i]
            Case 32
                $sData &= "+"
            Case Else
                $sData &= "%" & Hex($nChar, 2)
        EndSwitch
    Next
    Return $sData
EndFunc   ;==>_URIEncode

 

Posted

First of all: Interesting topic #translation 😀 .


Just as a ⚠ hint @AspirinJunkie , using the _URIEncode() function by Prog@ndy could result in an invalid string (in some cases).
At least I recently had to create my own function for the specific Content-Type application/x-www-form-urlencoded.

In case you are interested, please see 🔗 this link as an example. The following JSON string is encoded differently by _URIEncode(), and the result is invalid.

; See the whole script in the link above
Local Const $sGuid = '33417ed0-31b0-4b58-9310-196886274926'
Local Const $sJson = 'updateIDs = [{ "uidInfo": "' & $sGuid & '", "updateID": "' & $sGuid & '", "size": 0 }]'

But with this encoding function, the POST will be okay.

; See the whole script in the link above
Func _UrlEncode($sText)
    Local Const $iUtf8Flag          = 4
    Local Const $iCaseSensitiveFlag = 1
    Local Const $sCharacters        = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!$&/()=?+*',;:.-_@"

    Local Const $sUtf8Binary = StringToBinary(StringReplace($sText, ' ', ''), $iUtf8Flag)
    Local Const $sUtf8String = StringReplace($sUtf8Binary, '0x', '', $iCaseSensitiveFlag)

    Local $sChar, $sEncodedText

    For $i = 1 To StringLen($sUtf8String) Step 2
        $sChar = StringMid($sUtf8String, $i, 2)

        If StringInStr($sCharacters, BinaryToString('0x' & $sChar, $iUtf8Flag)) Then
            $sEncodedText &= BinaryToString('0x' & $sChar)
        Else
            $sEncodedText &= '%' & $sChar
        EndIf
    Next

    Return $sEncodedText
EndFunc

💡 I am only talking about the linked example; I have not tested the translate API endpoint yet.
This is a bit off-topic I guess, sorry for that. But I wanted to bring this up, at least as a note 😇 .

Best regards
Sven


Posted

@AspirinJunkie thanks a lot, your last script helps to increase the maximum size allowed for the original text length.

So it seems that we can use a "Microsoft.XMLHTTP" object (as you did in your 1st script) or a "WinHTTP.WinHttpRequest.5.1" object (as you just did above). Though there are differences between these 2 objects, both seem to work the same way for their Open and Send methods etc...

I indicate below the changes you made for this to work :

; $oHTTP = ObjCreate("Microsoft.XMLHTTP") ; or we can choose the object below.
$oHTTP = ObjCreate("WinHTTP.WinHttpRequest.5.1")

; $sUrl  = "https://translate.googleapis.com/translate_a/single?client=gtx&sl=" & $sFrom & "&tl=" & $sTo & "&dt=t&q=" & _URIEncode($sText) ; bad
$sUrl  = "https://translate.googleapis.com/translate_a/single" ; good

$oHTTP.Open("POST", $sUrl, False)

$oHTTP.SetRequestHeader("Content-Type", "application/x-www-form-urlencoded") ; mandatory line

; $oHTTP.Send() ; bad
$oHTTP.Send(StringFormat("client=gtx&sl=%s&tl=%s&dt=t&q=%s", $sFrom, $sTo, _URIEncode($sText))) ; good

$sResponse = $oHTTP.ResponseText

 

22 hours ago, AspirinJunkie said:

The return limit here seems to be 65,535 characters.

I'm not really sure about that after the tests I just made: for example, I got correct returns of 75004 characters in the output Edit field. As long as the encoded original text has a size <= 65536, you should get a correct output result, no matter the output size.

But when the size of the encoded original text is > 65536, the output Edit field is filled with exactly 65535 encoded chars (thx Free Clipboard Viewer), for example:

[Screenshot: output field filled with the truncated encoded input text, no translation]

The original text of the picture above has been created like this :

; 10922 * 6 = 65532 ; as л will be coded "%D0%BB" (6 length)

$s = "0 "
For $i = 1 To 10922 ; but not 10923 => error and returns... the truncated encoded input string with 65535 chars "0+%D0%BB%D0%BB%D0%BB..."
    $s &= "л"
Next
$s &= " 1"
ClipPut($s)

Another test I did to confirm this, summarized briefly:
English sentence "I'm sick. " (10 length)
It will be translated to german by "Ich bin krank. " (15 length)

If you concatenate the input sentence 5000 times, it creates an input text of exactly 50000 chars (this won't exceed 65536 chars after being encoded). When you translate it, the returned text is exactly 75000 chars long (1.5 times the size of the original) without error, and it appears correctly in the output field. You can surround the input text with "0 " at the beginning and " 1" at the end to make sure that the beginning and the end appear correctly in the output field. Then you check the size of the content of the output field (copied to the clipboard) with Free Clipboard Viewer: 75000 chars (or 75004 if you added "0 " at the beginning and " 1" at the end).
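
The size claim in this test can be checked without sending anything. The sketch below assumes the same encoding rules as the _URIEncode function posted earlier (unreserved characters unchanged, space as "+", everything else as "%XX" per UTF-8 byte). _EncodedLen is a hypothetical helper, not part of any UDF.

```autoit
#include <String.au3>

; Hypothetical helper: length of $sText after _URIEncode-style encoding,
; computed without building the encoded string.
Func _EncodedLen($sText)
    ; count the characters that stay 1 character long (a space becomes "+")
    StringRegExpReplace($sText, "[A-Za-z0-9 ._~-]", "")
    Local $iSafe = @extended ; number of replacements = number of safe characters
    ; every UTF-8 byte costs 3 characters, except the safe ones, which cost 1
    Return 3 * BinaryLen(StringToBinary($sText, 4)) - 2 * $iSafe
EndFunc

Local $sInput = "0 " & _StringRepeat("I'm sick. ", 5000) & " 1"
ConsoleWrite("input length:   " & StringLen($sInput) & @CRLF)   ; 50004
ConsoleWrite("encoded length: " & _EncodedLen($sInput) & @CRLF) ; 60004
```

Only the apostrophe needs three characters ("%27"), so each 10-character sentence takes 12 characters once encoded, and the whole input stays below 65536.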

Now I understand better why 5000 characters should be the limit for the original text: imagine these 5000 characters are all Unicode characters coded on 4 bytes each. Each character will then have a length of 12 once encoded and sent, i.e. "%..%..%..%.." per character!

5000 * 12 = 60000, which is not far from the 65536 limit for the encoded input text. That's why I'll use 5000 as a limit for the original edit field (or 5400, as 5400 * 12 = 64800, still < 65536)
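
The same reasoning for each UTF-8 character width, as a short sketch. It assumes the 65536 limit observed in the tests above and the worst case where every byte is sent as "%XX".

```autoit
Local Const $iLimit = 65536 ; encoded-size limit observed in the tests above, not a documented value

For $iBytes = 1 To 4
    ; each UTF-8 byte is sent as "%XX", i.e. 3 characters
    ConsoleWrite($iBytes & " byte(s) per character: max " & Floor($iLimit / (3 * $iBytes)) & " characters" & @CRLF)
Next
```

This prints 21845, 10922, 7281 and 5461. The 2-byte value matches the 10922 "л" test above, and the 4-byte value shows that 5000 or 5400 both stay on the safe side.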

Have a great day :)

Posted
9 hours ago, SOLVE-SMART said:

, using the _URIEncode() function by Prog@ndy could result in an invalid string (in some cases).
At least I recently had to create my own function for the specific Content-Type application/x-www-form-urlencoded.

I haven't looked at the URL standard to see what would be exactly correct.
In your function, however, the spaces are consistently removed, which leads to failure in the use case here, for example, where you want to pass texts with spaces (except in cases where Google manages to interpret the text without spaces).

I have also compared the output of the functions with common URL encoders on the net and Progandy's function corresponds to this completely except for the small difference that it translates spaces as "+" instead of "%20". So you only have to use a StringReplace(_URIEncode($sJson), "+", "%20") and the function returns the correct URL-encoded string, while the output of your function is very different from their result.
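
A sketch of that suggestion: because _URIEncode writes a literal "+" as "%2B", every "+" left in its output stands for a space, so the replacement is unambiguous. The block repeats Prog@ndy's function from above so that it runs on its own; the sample string is made up.

```autoit
Local $sSample = "a b+c" ; made-up sample containing a space and a literal "+"

ConsoleWrite(_URIEncode($sSample) & @CRLF)                            ; a+b%2Bc
ConsoleWrite(StringReplace(_URIEncode($sSample), "+", "%20") & @CRLF) ; a%20b%2Bc

Func _URIEncode($sData)
    ; Prog@ndy (copied from the post above)
    Local $aData = StringSplit(BinaryToString(StringToBinary($sData, 4), 1), "")
    Local $nChar
    $sData = ""
    For $i = 1 To $aData[0]
        $nChar = Asc($aData[$i])
        Switch $nChar
            Case 45, 46, 48 To 57, 65 To 90, 95, 97 To 122, 126
                $sData &= $aData[$i] ; unreserved characters stay as they are
            Case 32
                $sData &= "+"        ; space
            Case Else
                $sData &= "%" & Hex($nChar, 2)
        EndSwitch
    Next
    Return $sData
EndFunc   ;==>_URIEncode
```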

39 minutes ago, pixelsearch said:

we can use a "Microsoft.XMLHTTP" object (as you did in your 1st script)

I think that was Musashi's script rather than mine?

42 minutes ago, pixelsearch said:

That's why I'll use 5000 as a limit for the original edit field

That would be a bit of a shame to limit yourself more than is necessary - wouldn't it?
You could code in the background during input and dynamically deduce whether the size has been exceeded or not.

Btw: DeepL is much better 😉

Posted

lol AspirinJunkie, while you were posting, I was preparing the following text... to modify my previous script, so I'll post it below :)

Edit: with a limit of 5000 characters in the original text, no translation error should be encountered, as the encoded original data will never exceed 65536 bytes, even in the "worst" case where all characters are Unicode characters requiring 4 bytes each.

What if we limit to 10000 (or 15000) characters instead of 5000?
This seems more useful for users having an original text composed mostly of letters & numbers, i.e. 90% of the characters being ASCII 0-127

In this case, an option could be to check the length of the encoded text just before it is sent.
* If <= 65536, no problem and it is sent.
* If > 65536, warn the user that his text is too long (maybe indicate the difference of size between the encoded text size and 65536, though this difference won't tell the user how many characters it represents in the input field, depending on Ascii or Unicode characters) and give him the option to translate the first 65536 encoded characters, or to shorten his original text. We'll see...
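
A minimal sketch of that check. It assumes the 65536 limit applies to the encoded q value, as the tests above suggest; _EncodedOverflow is a hypothetical helper name, and the test string reproduces the 10923 "л" case that failed above.

```autoit
#include <String.au3>

; Hypothetical helper: returns 0 if the encoded text fits,
; otherwise the number of encoded characters above the limit.
Func _EncodedOverflow($sEncoded, $iMax = 65536)
    Local $iOver = StringLen($sEncoded) - $iMax
    Return ($iOver > 0) ? $iOver : 0
EndFunc

; already-encoded text: "0 " + 10923 times "л" + " 1" => 65542 characters
Local $sEncoded = "0+" & _StringRepeat("%D0%BB", 10923) & "+1"

Local $iOver = _EncodedOverflow($sEncoded)
If $iOver Then
    ConsoleWrite("Text too long by " & $iOver & " encoded characters" & @CRLF)
Else
    ConsoleWrite("OK to send" & @CRLF)
EndIf
```

Here the overflow is 6 encoded characters, which corresponds to a single "л" in the input field; with plain ASCII text the same overflow would correspond to up to 6 characters.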

 

 

Posted
12 minutes ago, pixelsearch said:

In this case, an option could be to check the length of the encoded text just before it is sent.

Encoding should be quick.
Why not simply make an adlib every 1s, which encodes the input each time, checks the length and signals to the user whether he still has characters left by means of a small traffic light (red or green), for example?
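
A rough sketch of this idea, assuming the 65536 limit from the tests above and the encoding rules of the _URIEncode function posted earlier. The control layout, the colours and the helper names (_UpdateLight, _EncodedLen) are made up for illustration.

```autoit
#include <GUIConstantsEx.au3>

Global Const $MAX_ENCODED = 65536 ; limit observed in this thread, not a documented value

Global $hGui    = GUICreate("Length check", 420, 260)
Global $idEdit  = GUICtrlCreateEdit("", 10, 10, 400, 200)
Global $idLight = GUICtrlCreateLabel("", 10, 220, 400, 30)
GUISetState(@SW_SHOW, $hGui)

AdlibRegister("_UpdateLight", 1000) ; re-check the input once per second

While GUIGetMsg() <> $GUI_EVENT_CLOSE
WEnd
AdlibUnRegister("_UpdateLight")

; Adlib callback: show "used / max" and colour the label green or red
Func _UpdateLight()
    Local $iLen = _EncodedLen(GUICtrlRead($idEdit))
    GUICtrlSetData($idLight, $iLen & " / " & $MAX_ENCODED)
    GUICtrlSetBkColor($idLight, ($iLen <= $MAX_ENCODED) ? 0x00C000 : 0xFF0000)
EndFunc

; Encoded length without building the encoded string:
; safe characters (and spaces) cost 1, every other UTF-8 byte costs 3
Func _EncodedLen($sText)
    StringRegExpReplace($sText, "[A-Za-z0-9 ._~-]", "")
    Local $iSafe = @extended ; number of safe characters found
    Return 3 * BinaryLen(StringToBinary($sText, 4)) - 2 * $iSafe
EndFunc
```

Computing only the length avoids encoding the whole text every second; the real encoding is then needed just once, when the request is sent.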

  • 2 months later...
Posted

@AspirinJunkie Hello. if you remember, you wrote me this a few months ago :

"Enter the following as the translation string: ‘이것은 테스트입니다.’. This is Korean and is not translated successfully. Instead, the detected language is ‘en’ - which is obviously wrong."

I replied to you that I didn't have an issue with this sentence, because since day 1, my translation script showed 'ko' as detected language :

[Screenshot: Korean ('ko') shown as the detected language]

But from time to time, I have an "issue" with some Chinese translation strings, where the detected language is... English (instead of Chinese) so I searched a way to solve it. I added this test a few hours ago :

If $sLangUsed Then ; translation has succeeded and $sLangUsed always contains the code that Google used for its last translation
    If $sLangFrom = "auto" And $sLangUsed = "en" Then ; this test should display a more reliable $sLangUsed in some cases, we'll see...
        Local $sPattern = '(?U).+,\[\[\[".+","(zh)_en_.+"]'
        Local $aArray = StringRegExp($sResponse, $sPattern, 1) ; yes, 1
        If Not @error Then $sLangUsed = $aArray[0] & " => en" ; "zh => en" : chinese found (though $sLangUsed for translation was "en")
    EndIf
    ...

Here is an example where it solves the situation, at least we can read "zh" in the From language (zh means Chinese)

[Screenshot: Chinese found, "zh" shown in the From language]

There are examples in this link (Dialogue 2) where the "Chinese Simplified Characters" text is detected as "Chinese (Simplified)" by the script, but the "Chinese Traditional Characters" text is detected as "English" without the code above. With the additional test, it is detected as "zh => English" which looks better. We note that the German translated text is exactly the same in both cases (saved in 2 text files and compared with Beyond Compare : not a single byte differs).

I haven't updated the Translation scripts on the Forum yet because I want to check whether this test has side effects. If everything seems OK in a few weeks, I'll add the test to the scripts on the Forum (also displaying 'Chinese' instead of 'zh')

Thanks for reading and have a great week-end :bye:

Posted (edited)

@jchd hello

I noticed that my pattern from the last post takes a long time (several seconds on a slow PC) before it fails on a subject with a length of 7500 characters:

Local $sPattern = '(?U).+,\[\[\[".+","(zh)_en_.+"]'

If I change the first .+ to .* then it fails in a snap, which is much better :

Local $sPattern = '(?U).*,\[\[\[".+","(zh)_en_.+"]'

Have you already faced this situation, where a star quantifier should be preferred to a plus quantifier (if possible) for speed reasons?

I can provide the subject string if you need it.
Thanks

Posted (edited)

Yes, this is a common issue. It has to do with the way PCRE v1 works and with backtracking. The official doc exemplifies that silly (yet logical) behavior and shows ways to speed up failures.

You can also run the pattern on RegExp and use the debugging feature to watch how both patterns work in practice. Use a shorter string.

Sorry I don't have much time to look deeper.
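
For anyone who wants to compare the two patterns, here is a small timing sketch. The subject string is synthetic (an assumption, not the real Google response): about 7500 characters with many commas and no possible match. A likely explanation for the difference, going by the PCRE documentation on patterns that begin with .*, is that such a pattern is only tried at the start of the subject (or after a newline), while a pattern that begins with .+ is retried from every position.

```autoit
; Synthetic subject (assumption): about 7500 characters, no match possible
Local $sSubject = ""
For $i = 1 To 750
    $sSubject &= '["x","y"],'
Next

Local $aPatterns[2] = ['(?U).+,\[\[\[".+","(zh)_en_.+"]', _
                       '(?U).*,\[\[\[".+","(zh)_en_.+"]']
Local $hTimer, $iErr, $fMs

For $sPattern In $aPatterns
    $hTimer = TimerInit()
    StringRegExp($sSubject, $sPattern, 1)
    $iErr = @error               ; read @error before any other function call resets it
    $fMs = TimerDiff($hTimer)
    ConsoleWrite(StringFormat("%-36s @error=%d  %.1f ms\n", $sPattern, $iErr, $fMs))
Next
```

The timings depend on the machine and on the subject, so the real response text should be used for a meaningful comparison.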
