Tesseract OCR Preprocessing/Text Extraction

codecog2578 · November 3, 2023

Good Morning,

I'm hoping that someone with more experience using Tesseract can assist me with some challenges I'm facing. I'm new to AutoIt and OCR in general, but I'm a programmer by trade, so I'm not new to the concepts involved. My current issue is that I'm struggling to get accurate text extraction out of a program that I'm building automation for. I'm locked to this AutoIt/OCR approach for a variety of boring reasons that aren't really relevant here.

Here is my current code - it's currently messy as I've been experimenting with a variety of approaches but this should shed some insight into what I've attempted thus far. Also sorry for the formatting, it went super wonky when I pasted it in here.

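#include <WinAPIProc.au3>
#include <WinAPIRes.au3>
#include <WinAPISys.au3>
#include <Array.au3>
#include <StringConstants.au3>
#include <ScreenCapture.au3>
#include <GDIPlus.au3>
#include <AutoItConstants.au3> ; $STDERR_CHILD
#include <MsgBoxConstants.au3> ; $MB_ICONERROR, $MB_ICONINFORMATION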

Global $oAutoIt = ObjCreate("AutoItX3.Control")
Global $titleToFind = "C:\Windows\system32\cmd.exe"
Global $outputFile = @ScriptDir & "\output.txt"
Global $tesseractPath = "C:\Program Files\Tesseract-OCR\tesseract.exe"
Global $errorLogFile = @ScriptDir & "\error_log.txt"
Global $screenshotPath = @ScriptDir & "\screenshot.png"


Func ConvertToGrayscale($inputFilePath, $outputFilePath)
    _GDIPlus_Startup()
    Local $hBitmap = _GDIPlus_BitmapCreateFromFile($inputFilePath)
    
    If @error Then
        _GDIPlus_Shutdown()
        Return False
    EndIf

    Local $hImageAttributes = _GDIPlus_ImageAttributesCreate()
    Local $tColorMatrix = _GDIPlus_ColorMatrixCreateGrayScale()
    
    _GDIPlus_ImageAttributesSetColorMatrix($hImageAttributes, 0, True, $tColorMatrix)

    Local $iWidth = _GDIPlus_ImageGetWidth($hBitmap)
    Local $iHeight = _GDIPlus_ImageGetHeight($hBitmap)
    Local $hBitmapGray = _GDIPlus_BitmapCreateFromScan0($iWidth, $iHeight)
    Local $hGraphic = _GDIPlus_ImageGetGraphicsContext($hBitmapGray)

    _GDIPlus_GraphicsDrawImageRectRect($hGraphic, $hBitmap, 0, 0, $iWidth, $iHeight, 0, 0, $iWidth, $iHeight, $hImageAttributes)
    _GDIPlus_ImageSaveToFile($hBitmapGray, $outputFilePath)
    
    _GDIPlus_BitmapDispose($hBitmap)
    _GDIPlus_BitmapDispose($hBitmapGray)
    _GDIPlus_GraphicsDispose($hGraphic)
    _GDIPlus_ImageAttributesDispose($hImageAttributes)
    _GDIPlus_Shutdown()
    Return True
EndFunc

Func BinarizeImage($inputFilePath, $outputFilePath)
    _GDIPlus_Startup()

    Local $dpi = 300
    Local $hBitmap = _GDIPlus_BitmapCreateFromFile($inputFilePath)

    If @error Then
        _GDIPlus_Shutdown()
        Return False
    EndIf

    Local $iWidth = _GDIPlus_ImageGetWidth($hBitmap)
    Local $iHeight = _GDIPlus_ImageGetHeight($hBitmap)

    ; GDI+ cannot create a graphics context on an indexed (1 bpp) bitmap, so
    ; drawing into one produces nothing. Cloning the source with a 1 bpp pixel
    ; format lets GDI+ perform the conversion instead.
    Local $hBitmapBW = _GDIPlus_BitmapCloneArea($hBitmap, 0, 0, $iWidth, $iHeight, $GDIP_PXF01INDEXED)
    If @error Then
        _GDIPlus_BitmapDispose($hBitmap)
        _GDIPlus_Shutdown()
        Return False
    EndIf

    ; Tag the image that actually gets saved with the intended resolution.
    _GDIPlus_BitmapSetResolution($hBitmapBW, $dpi, $dpi)
    _GDIPlus_ImageSaveToFile($hBitmapBW, $outputFilePath)

    _GDIPlus_BitmapDispose($hBitmapBW)
    _GDIPlus_BitmapDispose($hBitmap)
    _GDIPlus_Shutdown()
    Return True
EndFunc

Func FindActiveCmdWindow()
    Local $aWindowsList = WinList()
    For $i = 1 To $aWindowsList[0][0]
        If StringInStr($aWindowsList[$i][0], $titleToFind) Then
            If _WinAPI_IsWindowVisible($aWindowsList[$i][1]) Then
                Return $aWindowsList[$i][1]
            EndIf
        EndIf
    Next

    Return 0
EndFunc

Func ScrapeCmdWindow()
    Local $hCmdWindow = FindActiveCmdWindow()

    If $hCmdWindow Then
        Local $aWinPos = WinGetPos($hCmdWindow)
        Local $x = $aWinPos[0]
        Local $y = $aWinPos[1]
        Local $width = $aWinPos[2]
        Local $height = $aWinPos[3]

        Local $screenshotPath = @ScriptDir & "\screenshot.png"
        _ScreenCapture_Capture($screenshotPath, $x, $y, $x + $width, $y + $height)

        Local $grayscalePath = @ScriptDir & "\grayscale_screenshot.png"
        If ConvertToGrayscale($screenshotPath, $grayscalePath) Then
            ConsoleWrite("Grayscale screenshot saved to: " & $grayscalePath & @CRLF)
            
            ; Enhance the grayscale image in place (input and output are the same file).
            Local $imagemagickCommand = '"' & "C:\Program Files\ImageMagick-7.1.1-Q16-HDRI\convert" & '" "' & $grayscalePath & '" -density 300 -contrast -normalize -resize 300% "' & @ScriptDir & '\grayscale_screenshot.png"'

            ConsoleWrite("Running ImageMagick with command: " & $imagemagickCommand & @CRLF)
            ; Use Run() with $STDERR_CHILD rather than RunWait() so that stderr can be read through the PID.
            Local $imagemagickPID = Run($imagemagickCommand, @ScriptDir, @SW_HIDE, $STDERR_CHILD)
            ProcessWaitClose($imagemagickPID)
            Local $imagemagickExitCode = @extended ; exit code set by ProcessWaitClose

            If $imagemagickPID And $imagemagickExitCode = 0 Then
                ConsoleWrite("ImageMagick processing completed." & @CRLF)
            Else
                WriteErrorLog("ImageMagick processing encountered an error. Exit code: " & $imagemagickExitCode)
                Local $imagemagickStdErr = StderrRead($imagemagickPID)
                WriteErrorLog("ImageMagick Error Output: " & $imagemagickStdErr)
                MsgBox($MB_ICONERROR, "Error", "ImageMagick processing failed.")
            EndIf
            ; End ImageMagick processing

            ; Continue with Tesseract OCR as before
            Local $binarizedPath = @ScriptDir & "\binarized_screenshot.png"
            If BinarizeImage($grayscalePath, $binarizedPath) Then
                ConsoleWrite("Binarized screenshot saved to: " & $binarizedPath & @CRLF)

                Local $ocrOutput = ""
                ; Note: OCR runs on the enhanced grayscale image, not the binarized one.
                Local $tesseractCommand = '"' & $tesseractPath & '" "' & $grayscalePath & '" "' & @ScriptDir & '\ocr_output" --psm 4'

                ConsoleWrite("Running Tesseract with command: " & $tesseractCommand & @CRLF)
                ; Use Run() with $STDERR_CHILD rather than RunWait() so that stderr can be read through the PID.
                Local $tesseractPID = Run($tesseractCommand, @ScriptDir, @SW_HIDE, $STDERR_CHILD)
                ProcessWaitClose($tesseractPID)
                Local $exitCode = @extended ; exit code set by ProcessWaitClose

                If $tesseractPID And $exitCode = 0 Then
                    If FileExists(@ScriptDir & "\ocr_output.txt") Then
                        $ocrOutput = FileRead(@ScriptDir & "\ocr_output.txt")
                        FileDelete(@ScriptDir & "\ocr_output.txt")
                    EndIf
                Else
                    WriteErrorLog("Tesseract OCR encountered an error. Exit code: " & $exitCode)
                    Local $tesseractStdErr = StderrRead($tesseractPID)
                    WriteErrorLog("Tesseract Error Output: " & $tesseractStdErr)
                EndIf

                ; Clean up and delete the original, grayscale, and binarized screenshot files
                If FileExists($screenshotPath) Then
                    FileDelete($screenshotPath)
                EndIf
                If FileExists($grayscalePath) Then
                    FileDelete($grayscalePath)
                EndIf
                If FileExists($binarizedPath) Then
                    FileDelete($binarizedPath)
                EndIf

                If FileExists($outputFile) Then
                    FileDelete($outputFile)
                EndIf

                FileWrite($outputFile, $ocrOutput)

                MsgBox($MB_ICONINFORMATION, "Success", "Content saved to " & $outputFile)
            Else
                WriteErrorLog("Binarization failed.")
                MsgBox($MB_ICONERROR, "Error", "Binarization failed.")
            EndIf
        Else
            WriteErrorLog("Grayscale conversion failed.")
            MsgBox($MB_ICONERROR, "Error", "Grayscale conversion failed.")
        EndIf
    Else
        WriteErrorLog("No active cmd window found.")
        MsgBox($MB_ICONERROR, "Error", "No active cmd window found.")
    EndIf
EndFunc

Func WriteErrorLog($message)
    Local $errorFile = FileOpen($errorLogFile, 1)
    If $errorFile = -1 Then
        MsgBox($MB_ICONERROR, "Error", "Unable to open the error log file.")
        Exit
    EndIf
    FileWriteLine($errorFile, @YEAR & "/" & @MON & "/" & @MDAY & " " & @HOUR & ":" & @MIN & ":" & @SEC & " - " & $message)
    FileClose($errorFile)
EndFunc

ScrapeCmdWindow()

I've tried a variety of preprocessing techniques, but I think a big part of the issue is the original color scheme of the window I'm trying to extract text from. Please see the attached examples of the original window and of the same window after I convert it to grayscale. Ignore the blocked-out bits; that's just to protect private data. In the grayscale image, the green arrow indicates the body text, which I can extract reasonably well, but the red arrows indicate headers that Tesseract absolutely butchers. I can't binarize the image or I lose the headers entirely, as the header text is the same color as the background. Additionally, the vertical and horizontal lines are often interpreted by Tesseract as random characters, and my attempts to remove them have so far been unsuccessful.

So, the heart of my question is this...for all you Tesseract experts, how would YOU go about cleaning up the original horrid blue UI to a format that Tesseract can parse reasonably accurately? It's not critical that it's 100% accurate - I mostly need it to reliably extract the headers so my automations for this program know which sub menu the end user is sitting on. I'm locked to using an OCR due to this being a remote application running inside of a cmd window within another remote environment. It's a messy setup and thus far I'm sitting at probably 50% accuracy on pulling things out. I've been playing with ImageMagick as well to work on preprocessing the image to a point where it's easy for Tesseract to deal with but thus far no luck increasing the accuracy in any notable way.

If someone out there could assist me that'd be appreciated! I don't expect code but I'm a bit baffled on how to go about processing the original image so Tesseract can extract everything I need; so I'm looking for some insight and direction.

Thank you!

BigDaddyO · November 3, 2023

I've had luck in the past with identifying the background color then using GDI+ to remove that color from the entire image... BUT! it looks like your background color is the same color as your actual text you are looking for so that's not going to help.

I know you said you have to use OCR, but if all you're looking for is text off the screen somewhere, have you tried using the cmd options to select all/copy and then looking in the clipboard text for the header text you want? Or see if the Au3Info object spy can read the text from the window, so you can use WinGetText("C:\Windows\system32\cmd.exe", "").
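A minimal sketch of that idea, assuming the window exposes its text to WinGetText or that the text has already been copied to the clipboard. The header string "MAIN MENU" is a hypothetical placeholder, not something taken from the screenshots; console windows frequently return nothing from WinGetText, so this may well come back empty.

```autoit
; Hypothetical header text, used only for illustration.
Local $sHeader = "MAIN MENU"
Local $sTitle = "C:\Windows\system32\cmd.exe"

; Option 1: ask AutoIt for whatever text the window exposes.
Local $sText = WinGetText($sTitle, "")

; Option 2: fall back to the clipboard contents
; (after a manual or scripted Select All / Copy in the console).
If $sText = "" Then $sText = ClipGet()

If StringInStr($sText, $sHeader) Then
    ConsoleWrite("Header found: " & $sHeader & @CRLF)
Else
    ConsoleWrite("Header not found in window text or clipboard." & @CRLF)
EndIf
```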

ioa747 · November 4, 2023

First of all, I agree with BigDaddyO: if you can select all, copy the text to the clipboard, and then search it for what you want, the result will be more accurate (especially when dealing with numbers).

If not, then:

First, make a copy of the cmd shortcut in your @ScriptDir.

Then right-click it and select Properties.

In the Shortcut tab, adjust the target to call your program.

Then, in the Font tab, change the font to Lucida Console (where the zeros have no slash through them).

Then, in the Colors tab, adjust the colors (open Colors and choose Screen Background).

Once all of that is working, run some tests.

It may be better, instead of reading everything at once and then extracting the result, to let Tesseract read the fields one by one.
E.g. you don't need to read the word DISALLOWED; just give it the coordinates of the area where it should read the value.

good luck
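A rough sketch of the "one field at a time" idea, assuming Tesseract is installed at the path used in the original script. ReadRegion is a hypothetical helper and the coordinates in the usage line are made up; `--psm 7` asks Tesseract to treat the image as a single line of text.

```autoit
#include <ScreenCapture.au3>

; Hypothetical helper: OCR one rectangle of the screen and return its text.
; $iLeft/$iTop/$iRight/$iBottom are screen coordinates of the field to read.
Func ReadRegion($sTesseract, $iLeft, $iTop, $iRight, $iBottom)
    Local $sImage = @TempDir & "\region.png"
    Local $sOutBase = @TempDir & "\region_ocr"

    ; Capture only the rectangle; False = leave the mouse cursor out.
    _ScreenCapture_Capture($sImage, $iLeft, $iTop, $iRight, $iBottom, False)
    If @error Then Return SetError(1, 0, "")

    ; Tesseract appends ".txt" to the output base name.
    Local $sCmd = '"' & $sTesseract & '" "' & $sImage & '" "' & $sOutBase & '" --psm 7'
    Local $iExit = RunWait($sCmd, @TempDir, @SW_HIDE)
    If @error Or $iExit <> 0 Then Return SetError(2, $iExit, "")

    Local $sText = FileRead($sOutBase & ".txt")
    FileDelete($sOutBase & ".txt")
    FileDelete($sImage)
    Return StringStripWS($sText, 3) ; 3 = strip leading and trailing whitespace
EndFunc

; Usage (coordinates are placeholders for wherever the header sits on screen):
Local $sHeaderText = ReadRegion("C:\Program Files\Tesseract-OCR\tesseract.exe", 100, 50, 500, 80)
ConsoleWrite("Header region reads: " & $sHeaderText & @CRLF)
```

Cropping to a single field also keeps the vertical and horizontal frame lines out of the image, provided the rectangle is drawn inside them.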

codecog2578 · November 6, 2023

Hey guys!

Thanks for the replies thus far. I should have mentioned this in the original post, but part of what makes this intensely frustrating is that what you see in the GUI I posted is actually NOT just a standard cmd window. It's an entire remote machine with the view truncated to show JUST the program I showed screencaps of, and the kicker is that I can't interface with the remote environment where the program lives beyond what I'm seeing in the pseudo cmd window. So there's no way I've discovered to scrape information out of the GUI with any sort of copying or reading like I'd usually do, hence the struggle with the OCR approach. And as @BigDaddyO pointed out, the color modification is troublesome too, because the header text is the same color as the background.

So from here I'm uncertain where to go, really. The text extraction with any of the pre processing I've done is just far too erratic to be useful and the third party that owns this program I'm attempting to automate won't provide me with any meaningful access to read things in a way that, you know, would make sense.

YGYL0 · November 7, 2023

In Photoshop (PS) you can use Threshold or Invert, like this:
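For a scripted equivalent of the Invert step, here is a sketch that reuses the same GDI+ calls as the ConvertToGrayscale function in the first post and only swaps the grayscale color matrix for a negative one. InvertImage is a hypothetical name, and whether inversion actually helps with this particular blue UI is untested.

```autoit
#include <GDIPlus.au3>

; Hypothetical helper: write a color-inverted copy of an image.
Func InvertImage($sInput, $sOutput)
    _GDIPlus_Startup()
    Local $hBitmap = _GDIPlus_BitmapCreateFromFile($sInput)
    If @error Then
        _GDIPlus_Shutdown()
        Return False
    EndIf

    Local $iWidth = _GDIPlus_ImageGetWidth($hBitmap)
    Local $iHeight = _GDIPlus_ImageGetHeight($hBitmap)

    ; A negative color matrix inverts each color channel.
    Local $hAttributes = _GDIPlus_ImageAttributesCreate()
    Local $tMatrix = _GDIPlus_ColorMatrixCreateNegative()
    _GDIPlus_ImageAttributesSetColorMatrix($hAttributes, 0, True, $tMatrix)

    ; Draw the source into a fresh 32 bpp bitmap through the matrix.
    Local $hInverted = _GDIPlus_BitmapCreateFromScan0($iWidth, $iHeight)
    Local $hGraphics = _GDIPlus_ImageGetGraphicsContext($hInverted)
    _GDIPlus_GraphicsDrawImageRectRect($hGraphics, $hBitmap, 0, 0, $iWidth, $iHeight, 0, 0, $iWidth, $iHeight, $hAttributes)
    Local $bSaved = _GDIPlus_ImageSaveToFile($hInverted, $sOutput)

    _GDIPlus_GraphicsDispose($hGraphics)
    _GDIPlus_ImageAttributesDispose($hAttributes)
    _GDIPlus_BitmapDispose($hInverted)
    _GDIPlus_BitmapDispose($hBitmap)
    _GDIPlus_Shutdown()
    Return $bSaved
EndFunc
```

ImageMagick, which the original script already calls, also has `-negate` and `-threshold` options for the same two operations, so the step could be added to the existing command line instead.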

robertocm · November 7, 2023

Perhaps this could help:
