Publisher 2016 scripts
Samuel Murray, 2017

* publisher_extract.au3
* publisher_paster.au3

Yet to be written:

* publisher_FR.au3

Note: these scripts ensure that the right translation is in the right text box, but all or most formatting is lost!

You need to install AutoIt.

==

USAGE

Extract:

Run the extract script.  Then click each text box and press the shortcut.  The script will replace the text box's content with an ID number, and copy the ID number and the text to a file called "tmfile.txt".

The shortcut ` will copy only the current text box.  The shortcut Ctrl + ` is meant for tables, as it will attempt to move up to the previous cell each time.  For this reason, when using Ctrl + `, always put your cursor in the last cell of the table.

In fact, it sometimes works best if you start extracting from the bottom right of each page, otherwise earlier text boxes can end up overlapping later ones, making extraction difficult.

Whenever you've extracted text from a table, you might need to press Shift+Ctrl+Alt for a second or so, otherwise Windows might think you're still holding down the Ctrl key.  It's a Windows bug.

The tmfile.txt is a mini TM, which you can translate in your favourite CAT tool.  Do not alter the BOX IDs or the line breaks before and after them.  When you're ready to paste the translation, save the mini TM as a plaintext file, UTF-8 with BOM.
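To make the role of the BOX IDs and the BOM concrete, here is a rough Python sketch of how a mini TM could be read back into an ID-to-text lookup.  It is not taken from the AutoIt scripts.  The exact ID format is an assumption: this README only says that IDs contain "{{{BOX" and are surrounded by line breaks, so the pattern `{{{BOX<number>}}}` on a line of its own is hypothetical, as are the function names.

```python
import re

# ASSUMPTION: an ID sits alone on its line and looks like "{{{BOX12}}}".
# The real scripts may format IDs differently.
ID_LINE = re.compile(r"^\{\{\{BOX\d+\}\}\}$")


def parse_mini_tm(text):
    """Map each BOX ID to the text that follows it, up to the next ID."""
    entries = {}
    current_id = None
    lines = []
    for line in text.splitlines():
        if ID_LINE.match(line.strip()):
            # A new ID closes the previous entry.
            if current_id is not None:
                entries[current_id] = "\n".join(lines).strip("\n")
            current_id = line.strip()
            lines = []
        elif current_id is not None:
            lines.append(line)
    if current_id is not None:
        entries[current_id] = "\n".join(lines).strip("\n")
    return entries


def load_mini_tm(path):
    """Read a UTF-8 file with BOM and parse it."""
    # "utf-8-sig" removes the BOM, so it cannot end up glued to the first ID.
    with open(path, encoding="utf-8-sig") as handle:
        return parse_mini_tm(handle.read())
```

If an ID or the line breaks around it are altered during translation, a parser like this can no longer tell where one box ends and the next begins, which is why they must stay untouched.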

The script will ignore table cells that do not contain letters (edit the script if you don't want this behaviour).
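As an illustration of that rule, the check could look something like the Python sketch below.  This is not the AutoIt code itself.  Whether the real script tests for ASCII letters only or for letters of any alphabet is not stated here, so the Unicode-aware `str.isalpha` test and the function name are assumptions.

```python
def contains_letters(cell_text):
    """Return True if the cell holds at least one letter (any alphabet)."""
    return any(ch.isalpha() for ch in cell_text)


# Cells with only digits, punctuation or nothing at all would be skipped.
cells = ["Total", "12.50", "", "N/A", "-", "3 kg"]
to_extract = [cell for cell in cells if contains_letters(cell)]
```

With this check, purely numeric cells never reach the mini TM, so they stay as they are in the Publisher file.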

Paste:

Run the paste script and select the mini TM file.  Then click each text box and press the shortcut.  The script will copy the BOX ID number, look it up in the mini TM, and paste the translation in the text box.

There is no separate shortcut for tables.  When pasting in tables, simply press Tab yourself to move to the next cell.

From time to time you might need to press Shift+Ctrl+Alt for a second or so.

In the end, do a search in your Publisher file for "{{{BOX" to check if you've missed any.

==

RATIONALE

Publisher has a few cool features, but they're not cool for translators.  There is no way to import a translation.  You can extract individual pages to HTML, and you can create a PDF and use OCR on it to generate plain text, but that's about it.

Pressing Ctrl+A in a text box or a cell will select all text in it.

Character and line formatting does not survive a roundtrip via MS Word.  If you paste into a cell, the pasted text will take on the cell's formatting, and if you find/replace any text, the replaced text will take on the found text's formatting.  This excludes formatting within sentences, though (e.g. a single word in bold).

You can't find line breaks.  So if the author used soft line breaks to format a sentence, you can only "find" the portion of text between those breaks, and not the entire sentence.

I know of no keyboard shortcut that moves from one text box to another, but you can move from cell to cell with Tab and Shift+Tab.  Pressing Tab too many times creates new rows, but pressing Shift+Tab too many times is harmless.

==

WORKAROUNDS


Scenario 1: as long as the right translation is in the right box.

Use publisher_extract.au3 and publisher_paster.au3

Create a copy of the original file, named e.g. UPDATED.  Use the UPDATED file to extract tables and non-tables.  The extract script replaces table and non-table content with IDs, and the paste script replaces the IDs with the translated content.

All formatting will be lost.  The right content will be in the right box, but it won't be formatted correctly.  Useful if your client says "I'll take care of the formatting, just you make sure the right text is in the right box".


Scenario 2: try to retain formatting of text segments.

Use publisher_extract.au3, publisher_paster.au3, and publisher_FR.au3 (still to be written)

Create two copies of the original file, named e.g. TEMP and UPDATED.  Use the TEMP file to extract non-tables and tables with problematic formatting.  After extraction, delete the TEMP file.  Use the UPDATED file to extract tables (and non-tables with unproblematic formatting).  The extract script replaces extracted content in the UPDATED file with IDs, and the paste script replaces the IDs with the translated content.

The content that was extracted from the TEMP file (which is still untranslated in the UPDATED file) will be replaced using a complicated and silly find/replace operation that can find/replace only portions of text between line breaks.

Also, if a piece of text contains inline formatting, Publisher can find it, but it can't perform a replacement (even if you'd be happy with all formatting lost).

I'll write this later, when I get a client that requires it.  But essentially, you create a two-column file with the source text on the left and your translation on the right.  After translation, you sort the rows by the length of the source text (long to short), and then use Publisher's own find/replace operation to find each English chunk of text and replace it with its translation.  It has to be done from long to short because a short chunk may also occur inside a longer one; replacing the short chunk first would alter the longer chunk so that it can no longer be found.
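The ordering problem is easy to see on plain strings.  The Python sketch below is only a model of the idea, not the planned publisher_FR.au3 script: Publisher's find/replace is stood in for by `str.replace`, and the function name and sample pairs are made up for illustration.

```python
def replace_long_to_short(text, pairs):
    """Apply (source, translation) pairs, longest source chunk first."""
    ordered = sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)
    for source, translation in ordered:
        text = text.replace(source, translation)
    return text


# Hypothetical two-column data: source chunk left, translation right.
pairs = [("Open", "Ouvrir"), ("Open file", "Ouvrir le fichier")]
document = "Open file\nOpen"

translated = replace_long_to_short(document, pairs)
```

Applied in the listed order instead (short first), "Open" would be replaced inside "Open file" as well, leaving "Ouvrir file", and the longer chunk would never be found.  Sorting does not cover every case: if a translation happens to contain a shorter source chunk, a later replacement could still change it.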


Scenario 3: use MS Word temporarily.

Copy/pasting content from Publisher to MS Word retains quite a bit of formatting, so if I could create a separate Word file for each text box in the PUB file, I could theoretically retain a lot more formatting.  But that would be a slow workaround, suitable only for PUB files that have most of their content in single text boxes.  To do, much later.