
Filtering out control characters from copied text


Guy_

I often copy text from a website or pdf into a variable and once in a while pasting it back into WordPad gives weird results.

It used to happen more often with longer Facebook texts or YouTube comments.

One example from a PDF: bullets were changed into a corner-like character.

I assume many of these could be control characters?

What is the best way to filter them out, please?

From reading in the manual, my only guess was something like the following, but it seems to do nothing (not sure though, and less easy to test for me...).

$text = StringRegExpReplace ( $text, '[[:cntrl:]]', "" )

Or is it something with [:print:] ?  (meaning, "give me only the characters that would normally print?")

I don't mind if your solution removes Returns too (though ideally not), because I usually remove those myself.

Thank You for any pointers! :)
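
A minimal sketch for testing the `[[:cntrl:]]` idea on a string whose contents are known. The test string and variable names are made up for illustration. It assumes the default (non-UCP) behaviour, where this class covers only the ASCII control characters, so a bullet or "corner" glyph would not be touched by it. The second call uses a negative lookahead to leave TAB, CR and LF in place.

```autoit
; Test string with two ASCII control characters (0x01, 0x07) plus TAB and CRLF.
Local $sText = "Hello" & Chr(1) & " wor" & Chr(7) & "ld" & @TAB & "!" & @CRLF

; Removes every ASCII control character, including TAB, CR and LF.
Local $sNoCtrl = StringRegExpReplace($sText, "[[:cntrl:]]", "")

; Same class, but the negative lookahead skips TAB, CR and LF so line breaks survive.
Local $sKeepBreaks = StringRegExpReplace($sText, "(?![\t\r\n])[[:cntrl:]]", "")

; Compare the lengths to see how many characters each variant removed.
ConsoleWrite(StringLen($sText) & " -> " & StringLen($sNoCtrl) & " / " & StringLen($sKeepBreaks) & @CRLF)
```

As far as I can tell, the inverse form `[^[:print:]]` is ASCII-oriented by default too, so it would likely strip accented letters as well; treat that as an assumption to test on your own text.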


This can't be the easiest way, but you can read one character at a time and delete it if it doesn't match a list of characters you know you want to keep. Probably a lot of coding work, though.

 

Not necessarily. You can do that sort of thing with StringRegExpReplace probably.

For example, to replace everything that is NOT a-z, A-Z or 0-9 in your text with "" ...

$text = StringRegExpReplace ( $text,  '[^[:alnum:]]', "" )

And then you can add other characters to it that you are still missing, but may need a lot of escape characters and will look a mess...

I would be afraid to miss out on a few characters too, so I am hoping the other way round exists too and is neater code (and/or faster).
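
A small sketch of how the whitelist above could be extended. The sample text and the choice of extra characters are only an illustration. Inside a character class most punctuation is literal, so fewer escapes are needed than one might fear; the hyphen is placed last so it is not read as a range. By default `[:alnum:]` should only cover ASCII letters and digits, which is an assumption worth checking if the text contains accented letters.

```autoit
Local $sText = "Hello, world! (test #1) ~ 50% off?"

; Keep ASCII letters, digits, space and the punctuation listed inside the class.
; Everything else (here: #, ~ and %) is replaced with an empty string.
Local $sOut = StringRegExpReplace($sText, "[^[:alnum:] .,;:!?'()-]", "")

ConsoleWrite($sOut & @CRLF)
```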


There are several options open but there is something unclear: "One example from a pdf is where bullets were changed into a corner like character"

That seems to mean that text in one ANSI codepage (XYZ) was blindly transferred to another ANSI codepage (ABC).

Neither bullets nor framing symbols are control characters.

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. Or post the offending PDF if it's publicly available.
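
When the offending character will not display in a post, one way to identify it is to dump the numeric value of every character currently on the clipboard. This is only a sketch; the variable names are arbitrary, and `AscW` reports UTF-16 code units, which equal the code point for most everyday characters.

```autoit
; Read the current clipboard text; @error is set when it is empty or not text.
Local $sClip = ClipGet()
If @error Then
    ConsoleWrite("Clipboard is empty or does not contain text." & @CRLF)
Else
    ; Print position, hexadecimal value and the character itself, one per line.
    For $i = 1 To StringLen($sClip)
        Local $sChar = StringMid($sClip, $i, 1)
        ConsoleWrite($i & @TAB & "U+" & Hex(AscW($sChar), 4) & @TAB & $sChar & @CRLF)
    Next
EndIf
```

Looking up an unexpected value (for example in a Unicode chart) tells you whether it is a control character, a special space or a symbol from another codepage.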

This wonderful site allows debugging and testing regular expressions (many flavors available). An absolute must have in your bookmarks.
Another excellent RegExp tutorial. Don't forget downloading your copy of up-to-date pcretest.exe and pcregrep.exe here
RegExp tutorial: enough to get started
PCRE v8.33 regexp documentation latest available release and currently implemented in AutoIt beta.

SQLitespeed is another feature-rich premier SQLite manager (includes import/export). Well worth a try.
SQLite Expert (freeware Personal Edition or payware Pro version) is a very useful SQLite database manager.
An excellent eBook covering almost every aspect of SQLite3: a must-read for anyone doing serious work.
SQL tutorial (covers "generic" SQL, but most of it applies to SQLite as well)
A work-in-progress SQLite3 tutorial. Don't miss other LxyzTHW pages!
SQLite official website with full documentation (may be newer than the SQLite library that comes standard with AutoIt)


Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. Or post the offending PDF if it's publicly available.

I've tried that in the first message, but the "corner" character wouldn't display.

I was prepared for something like your explanation anyway and it's the lesser of my worries.

Weird stuff can happen or be manipulated with pdf files it seems.

I think I even have a pdf that displays normal readable text, but if you copy from it it's a garbled mess of characters, probably on purpose.

-

Since I believe I usually have horizontal spacing problems in my output, for now I've put in these lines and I'll see how that goes...

$text = StringRegExpReplace( $text, '\h', " " )
$text = StringRegExpReplace( $text, '[ ]{2,}', " " )

I'm hoping that should make any amount of horizontal spacing into one space, which I'm very ok with.
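
The two replacements above can probably be folded into one by quantifying `\h`. A sketch with a made-up sample string; it assumes that `\h` also matches the no-break space (U+00A0), which is what the PCRE documentation describes.

```autoit
; Sample with tabs, a no-break space (U+00A0) and a run of ordinary spaces.
Local $sText = "one" & @TAB & @TAB & "two" & ChrW(0xA0) & "  three"

; \h+ matches one or more horizontal whitespace characters in a row,
; so each run collapses to a single space in one pass.
Local $sOut = StringRegExpReplace($sText, "\h+", " ")

ConsoleWrite($sOut & @CRLF)
```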

I had one example on YouTube from a while ago, but at the moment it doesn't show the problem I was getting anymore...

I'll dig this thread up again if I run across an example later.

And I'm still hoping other people have needed this and for an elegant solution to give me all displaying characters (+ space) without any control chars & stuff.


Read the documentation for StringRegExp. There you'll see that by enabling Unicode category properties you have access to a whole new world of character classes. Discussing this in detail would have made our help file too complex for newcomers, but you'll find everything explained in full in the official PCRE documentation (linked in my signature) under pcrepattern.

For instance, you can detect all Unicode symbols in a string with the class "(*UCP)[\pS]"
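
A short sketch of what "detecting" could look like in practice: `StringRegExp` with flag 3 returns every match as an array. The sample string is invented and built with `ChrW()` so the result does not depend on how the script file is encoded.

```autoit
#include <Array.au3>

; "5 € + 3 £ = ?" built from code points (U+20AC euro sign, U+00A3 pound sign).
Local $sText = "5 " & ChrW(0x20AC) & " + 3 " & ChrW(0xA3) & " = ?"

; Flag 3 returns an array with every match of the pattern.
Local $aSymbols = StringRegExp($sText, "(*UCP)\pS", 3)
If @error Then
    ConsoleWrite("No symbols found (or the pattern was rejected)." & @CRLF)
Else
    ConsoleWrite(UBound($aSymbols) & " symbols: " & _ArrayToString($aSymbols, " ") & @CRLF)
EndIf
```

Note that `\pS` covers symbols such as currency and math signs; control characters belong to a different category, so this class alone would not find them.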



Thanks for the pointers, jchd!

I do find some clues there, but it may need a total study of RegEx before I can do anything with it, as something like this (although I need the reverse) doesn't seem to do anything:

$text = StringRegExpReplace( $text, '(*UCP)[\pS]', "" )

Maybe I need to activate that PCRE somewhere first. I may look into it further later.

At the moment, I also don't know if ending up with Unicode only would filter out control codes?

-

In the meantime, I did some random YouTube tests and one example is in the comments on http://www.youtube.com/all_comments?v=qTdOxn9MoPg

If you carefully select the line "Trust what you see after you catch bed bugs into a glass jar." and no more, and then paste it somewhere, you'll get an extra kind of space at the end.

I don't even know if that's a control character, but you get it a lot if you accidentally select a little more than the exact word or line in some websites.

If I look at the html source, I don't really get a clue from it... It looks clean.

[...] Trust what you see after you catch bed bugs into a glass jar.</div>

This stuff confuses my program and I'd love to know what kind of character is causing that, so I can filter for it.

Even though in this case it looks to be some kind of space, even this code (just as a test) didn't filter it out:

$text = StringRegExpReplace( $text, '\h', "" )
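
One possible reason `\h` leaves something behind: it only covers horizontal whitespace, not line breaks and not invisible "format" characters such as the zero-width space. The thread does not establish what the stray character actually was, so the sample below is purely hypothetical. The sketch removes Unicode control (`Cc`) and format (`Cf`) characters while protecting TAB, CR and LF.

```autoit
; Hypothetical sample: visible text followed by a zero-width space (U+200B),
; a byte order mark (U+FEFF) and a normal line break.
Local $sText = "glass jar." & ChrW(0x200B) & ChrW(0xFEFF) & @CRLF

; \p{Cc} = control characters, \p{Cf} = invisible format characters.
; The negative lookahead keeps TAB, CR and LF from being removed.
Local $sClean = StringRegExpReplace($sText, "(?![\t\r\n])[\p{Cc}\p{Cf}]", "")

; Compare the lengths to see how many characters were stripped.
ConsoleWrite(StringLen($sText) & " -> " & StringLen($sClean) & @CRLF)
```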

Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things.

Anyway, if you want to remove everything except Unicode letters and digits (whatever language), whitespaces, punctuation and currency symbols (for example) then you can try this:

Local $text = "Abç dêf" & @TAB & "123456.789 - 123000 = 456.789 € (convert to £, ₯ or $ as needed!)" & @CRLF & _
                @TAB & "• First bullet" & @CRLF & _
                @TAB & "‣ Second bullet" & @CRLF & _
                @TAB & "• русский текст" & @CRLF & _
                @TAB & "• 中國文字" & @CRLF & _
                "end of test…" & @TAB & "¿Does that work for you?"
MsgBox(0, "Input text", $text)
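; Two-letter property names need braces in PCRE: \pSc would be read as \pS followed by a literal "c".
Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]|[•‣]", "")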
MsgBox(0, "Filtered text", $str)

Of course this is only a sketch which you'll need to adjust to your own needs.


Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things.

You are right. It seems I *did* select too much there...

You are also right it depends on the browser. If I select too far, Firefox gives me an extra kind of space, IE gives me some kind of newline...

However, your new code pointer is already filtering this out!

So in the first minutes, it looks very promising.

Thank You Very Much  :)

However, I'll still have to figure out how to include important stuff like ".,;:/?)!'"&[](){}*@#" cause it seems to filter all of these out (and more probably) ...?

That makes me wonder what else I'll be missing.

And again, the pdf stuff is the least of my worries. I'd rather keep the bullets for other situations (and that seems an easy fix).

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]", "")

I'm now hoping the chars still missing are a simple "class" or do I have to add them back in manually in some way?

At first glance adding in [:punct:] seems a working fix:

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s[:punct:]]", "")
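
An alternative to `[:punct:]` that stays within the Unicode categories is to whitelist `\p{P}` (punctuation) and `\p{S}` (symbols) explicitly. This is a sketch, not a tested recommendation: the sample text is invented, and whether `[:punct:]` itself reaches beyond ASCII under `(*UCP)` may depend on the PCRE version bundled with AutoIt. Bullets such as U+2022 fall under punctuation, so they survive here.

```autoit
; Sample built with ChrW() so the script does not depend on the file encoding:
; a bullet (U+2022), ordinary text, then a zero-width space (U+200B) and a BEL (0x07).
Local $sText = ChrW(0x2022) & " First bullet: 10 % off (today only)!" & ChrW(0x200B) & Chr(7)

; Keep letters, decimal digits, punctuation, symbols, space separators and whitespace.
; Two-letter property names such as Nd and Zs need braces: \p{Nd}, not \pNd.
Local $sOut = StringRegExpReplace($sText, "(*UCP)[^\pL\p{Nd}\p{P}\p{S}\p{Zs}\s]", "")

; Only the zero-width space and the BEL character should have been removed.
ConsoleWrite(StringLen($sText) & " -> " & StringLen($sOut) & @CRLF)
```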

  • 5 years later...

You noticed that this thread is 6 years old :/

