Google Translation API Hacking03
20

As part of its Google Cloud, Google offers the Google Translation API with a usage-based cost structure. In addition, there is an undocumented API that can be used without a key, but which denies service after only a few requests. When using the website translation function of Google Chrome, it is noticeable that pages can be translated here in very good quality without any noticeable limitation.

apparently, the advanced nmt model is already used here. but what API does Google Chrome use internally to translate the content and can this API be accessed directly - even on the server side? to analyze network traffic, tools like Wireshark or Telerik Fiddler are recommended, which can also analyze encrypted traffic. but Chrome even delivers the requests it sends during page translation free of charge: they can be easily viewed via the Chrome DevTools:

If you do a translation, then catch the decisive POST request https://translate.googleapis.com via "Copy > Copy as cURL (bash)" and execute it in a tool like Postman, for example, you can resend the request without any problems:

The meaning of the URL parameters are also largely obvious:

Key	Example Value	Meaning
anno	3	Annotation mode (affects the return format)
client	te_lib	Client information (varies, via the web interface of Google-Translate the value is "webapp"; affects the return format and the rate limiting)
format	html	String format (important for translating HTML tags)
v	1.0	Google Translate version number
key	AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw	API key (see below)
logld	vTE_20200210_00	Protocol version
sl	en	Source language
tl	en	Target language
sp	nmt	ML model
tc	1	unknown
sr	1	unknown
tk	709408.812158	Token (see below)
fashion	1	unknown

Some request headers are also set - but these can mostly be ignored. After manually deselecting all headers, including those from the user agent , an encoding problem is discovered when entering special characters (here when translating " Hello World "):

If you reactivate the user agent (which generally doesn't do any harm), the API delivers UTF-8 encoded characters:

Now that we have all the information we need to use this API outside of Google Chrome, if we change the string to be translated (data field q of the POST request) from, for example, "Hello World" to "Hello World", we get an error message:

We now retranslate this modified one within Google Chrome using the web page translation function and find that the parameter tk has changed as well as the parameter q (all other parameters have remained the same):

Obviously it is a string dependent token, whose structure is not easy to see, but when you start the web page translation, the following files are loaded:

1 CSS file: translateelement.css
4 graphics: translate_24dp.png (2x), gen204 (2x)
2 JS files: main_de.js , element_main.js

The two JavaScript files are obfuscated and minified. Tools like JS Nice and de4js are now helping us to make these files more readable. In order to debug them live, we recommend the Chrome Extension Requestly, which tunnels remote files locally on the fly:

Now we can debug the code ( CORS must be enabled on the local server) The relevant code section for generating the token seems to be hidden in the file element_main.js in this section:

b7739bf50b2edcf636c43a8f8910def9

Here, among other things, the text is hashed with the help of some bit shifts. But unfortunately we are still missing a piece of the puzzle: To the function Bp(), besides the argument a (which is the text to be translated), another argument b is passed - a kind of seed, which seems to change from time to time and which is also included in the hashing. But where does it come from? If we jump to the function call of Bp(), we find the following code section:

b7739bf50b2edcf636c43a8f8910def9

The function Hq is previously declared as follows:

b7739bf50b2edcf636c43a8f8910def9

Here the Deobfuscater left some rubbish; After we have replaced String.fromCharCode ('...') with the respective character strings, remove the obsolete a () and piece together the function calls [c (), c ()] , the result is:

b7739bf50b2edcf636c43a8f8910def9

Or even easier:

b7739bf50b2edcf636c43a8f8910def9

The function yq is previously defined as:

b7739bf50b2edcf636c43a8f8910def9

So the seed seems to be in the global object google.translate._const._ctkk, which is available at runtime. But where is it set? In the other, previously loaded JS-file main_en.js at least it is also available at the beginning. For this we add the following at the beginning:

b7739bf50b2edcf636c43a8f8910def9

In the console we now actually get the current seed:

This leaves Google Chrome itself, which apparently provides the seed, as the last option. Fortunately, its source code (Chromium, including the Translate component) is open source and therefore publicly available. We pull the repository locally and find the call to the TranslateScript :: GetTranslateScriptURL function in the translate_script.cc file in the components / translate / core / browser folder:

b7739bf50b2edcf636c43a8f8910def9

The variable with the URL is hard defined in the same file:

b7739bf50b2edcf636c43a8f8910def9

If we now examine the element.js file more closely (after deobfuscating again), we find the hard-set entry c._ctkk - the google.translate object is also set accordingly and the loading of all relevant assets (which we have already discovered earlier) is triggered:

b7739bf50b2edcf636c43a8f8910def9

Now the parameter key remains for consideration (with the value AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw). That seems to be a generic browser API key (which can also be found in some Google results ). It is set in Chromium in the file translate_url_util.cc in the folder components / translate / core / browser:

b7739bf50b2edcf636c43a8f8910def9

The key is generated in google_apis / google_api_keys.cc from a dummy value:

b7739bf50b2edcf636c43a8f8910def9

However, a test shows that the API calls work the same without this key parameter. If you experiment with the API, you will get the status code 200 back if you are successful. If you then run into a limit, you get the status code 411 back with the message " POST requests require a content-length header ". It is therefore advisable to include this header (which is automatically set as a temporary header in Postman).

The return format of the translated strings is unusual when there are several sentences in one request, and the individual sentences are enclosed by the i-/b-HTML tags:

Also, Google Chrome does not send all the HTML to the API, but saves attribute values such as href in the request (and sets indexes instead, so that the tags can be assigned again later on the client side):

If you change the value of the POST key client from te_lib (Google Chrome) on webapp ( Google Translation website ), you get the final translated string:

The problem is that you are much more likely to run into rate limiting than via te_lib (for comparison: with webapp this is reached after 40,000 chars, with te_lib there is no rate limiting). So we need to take a closer look at how Chrome parses the result. We'll find it here in element_main.js:

b7739bf50b2edcf636c43a8f8910def9

If you send the entire HTML code to the API, it leaves the attributes in the translated response. We therefore do not have to imitate the entire parse behavior, but only extract the final, translated string from the response. To do this, we build a small HTML tag parser that discards the outermost <i> tags including their content and removes the outermost <b> tags. With this knowledge we can now (after installing dependencies with composer require fzaninotto / faker vielhuber / stringhelper ) build a server-side version of the translation API:

b7739bf50b2edcf636c43a8f8910def9

The following are the results of an initial test that was carried out on five different systems with different bandwidths and IP addresses:

Character	Characters per request	Duration	Error rate	Cost via official API
13.064.662	~250	03: 36: 17h	0%	237,78€
24.530.510	~250	11: 09: 13h	0%	446,46€
49.060.211	~250	20:39:10h	0%	892,90€
99.074.487	~1000	61: 24: 37h	0%	1803,16€
99.072.896	~1000	62:22:20h	0%	1803,13€
Σ284.802.766	~ Ø550	Σ159:11:37h	0%	Σ € 5183.41

Note: This blog post, including all scripts, was written for testing purposes only; do not use the scripts for productive use, but work with the official Google Translation API instead.

Back

VielhuberDavid

Google Translation API Hacking0320

Google Translation API Hacking03
20