Filtering out control characters from copied text

Guy_ · August 10, 2014

I often copy text from a website or PDF into a variable, and once in a while pasting it back into WordPad gives weird results.

It used to originate more frequently within larger Facebook texts or YouTube comments.

One example from a PDF is where bullets were changed into a corner-like character.

I assume many of these could be control characters?

What is the best way to filter them out, please?

From reading in the manual, my only guess was something like the following, but it seems to do nothing (not sure though, and less easy to test for me...).

$text = StringRegExpReplace ( $text, '[[:cntrl:]]', "" )

Or is it something with [:print:] ? (meaning, "give me only the characters that would normally print?")

I don't mind if your solution removes Returns too (though ideally not), because I usually remove those myself.

Thank You for any pointers!

Edited August 10, 2014 by Guy_
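For anyone testing the `[[:cntrl:]]` idea from the post above, here is a minimal sketch that removes ASCII control characters while leaving Tab, CR and LF alone. The helper name `_StripControlChars` and the sample string are made up for illustration, and the pattern only targets the ASCII control range, not unusual Unicode spaces or symbols.

```autoit
; Hypothetical helper: strip ASCII control characters but keep Tab, LF and CR.
Func _StripControlChars($sText)
    ; \x00-\x08, \x0B, \x0C, \x0E-\x1F and \x7F are the ASCII controls
    ; minus Tab (\x09), LF (\x0A) and CR (\x0D).
    Return StringRegExpReplace($sText, '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]', '')
EndFunc

; Sample text with a BEL (7) and an ESC (27) character mixed in.
Local $sSample = "one" & Chr(7) & "two" & @CRLF & "three" & Chr(27)
ConsoleWrite(_StripControlChars($sSample) & @CRLF)
```

If line breaks should go too, the plain `[[:cntrl:]]` class covers them as well.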

computergroove · August 10, 2014

This can't be the easiest way, but you can read a character at a time and delete it if it doesn't match a list of characters you know you want to keep. A lot of coding work probably, though.

Guy_ · August 10, 2014

computergroove said: "This can't be the easiest way, but you can read a character at a time and delete it if it doesn't match a list of characters you know you want to keep. A lot of coding work probably, though."

Not necessarily. You can do that sort of thing with StringRegExpReplace probably.

For example, to replace everything that is NOT a-z, A-Z or 0-9 in your text with "" ...

$text = StringRegExpReplace ( $text,  '[^[:alnum:]]', "" )

And then you can add other characters to it that you are still missing, but may need a lot of escape characters and will look a mess...

I would be afraid to miss out on a few characters too, so I am hoping the other way round exists too and is neater code (and/or faster).
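As a rough illustration of the whitelist approach described above, the sketch below keeps ASCII letters, digits, spaces and a hand-picked set of punctuation. The variable names and the chosen punctuation are assumptions for the example. Inside a character class only a few characters (`\`, `]`, `^`, `-`) need special care, so the pattern can stay fairly readable.

```autoit
; Keep ASCII letters, digits, spaces and some punctuation; drop everything else.
; The brackets are escaped and the hyphen is placed last so it is taken literally.
Local $sInput = "Hello, world! (test #1)" & Chr(1) & " [ok]"
Local $sKept = StringRegExpReplace($sInput, '[^[:alnum:] .,;:!?()\[\]#-]', '')
ConsoleWrite($sKept & @CRLF)
```

The obvious drawback is the one already mentioned: any character not listed, including accented letters, is silently dropped.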

jchd · August 10, 2014

There are several options open but there is something unclear: "One example from a pdf is where bullets were changed into a corner like character"

That seems to mean this is some ANSI codepage XYZ blindly transferred to ANSI codepage ABC.

Neither bullets nor framing symbols are control characters.

Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available.

Guy_ · August 10, 2014

jchd said: "Can you paste an example of such an issue? Paste the clipboard directly in the post and try to type what it looks like in the PDF. That, or post the offending PDF if it's publicly available."

I've tried that in the first message, but the "corner" character wouldn't display.

I was prepared for something like your explanation anyway and it's the lesser of my worries.

Weird stuff can happen or be manipulated with pdf files it seems.

I think I even have a pdf that displays normal readable text, but if you copy from it it's a garbled mess of characters, probably on purpose.

-

Since I believe I usually have horizontal spacing problems in my output, for now I've put in these lines and I'll see how that goes...

$text = StringRegExpReplace( $text, '\h', " " )
$text = StringRegExpReplace( $text, '[ ]{2,}', " " )

I'm hoping that should make any amount of horizontal spacing into one space, which I'm very ok with.

I had one example on YouTube from a while ago, but at the moment it doesn't show the problem I was getting anymore...

I'll dig this thread up again if I run across an example later.

And I'm still hoping other people have needed this and for an elegant solution to give me all displaying characters (+ space) without any control chars & stuff.
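The two-step space cleanup in the post above can probably be done in a single pass with a `+` quantifier. This is only a sketch with made-up variable names. Which non-ASCII spaces `\h` covers depends on the regex engine build, so treat that part as an assumption.

```autoit
; Collapse every run of horizontal whitespace (spaces, tabs, ...) into one space.
Local $sSpaced = "a" & @TAB & @TAB & "b    c" & @TAB & " d"
Local $sTidy = StringRegExpReplace($sSpaced, '\h+', ' ')
ConsoleWrite($sTidy & @CRLF)
```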

jchd · August 10, 2014

Read the doc of StringRegExp. There you'll see that by enabling Unicode category properties you have access to a whole new world of character classes. The discussion of this in detail would have rendered our help file too complex for newcomers but you'll find details explained in full in the official PCRE documentation (link below) under pcrepattern.

For instance, you can detect all Unicode symbols of a string with the class "(*UCP)[\pS]"

Edited August 10, 2014 by jchd
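To connect the Unicode categories mentioned above with the original question, here is a hedged sketch that removes control characters (`\p{Cc}`) and invisible format characters (`\p{Cf}`, for example the zero-width space U+200B). It assumes the PCRE engine behind StringRegExpReplace accepts `\p{..}` properties, as the advice above implies. The variable names are illustrative only.

```autoit
; Assumption: \p{..} Unicode properties are available in AutoIt's regex engine.
; \p{Cc} = control characters, \p{Cf} = invisible format characters.
Local $sDirty = "abc" & ChrW(0x200B) & "def" & Chr(7) & "ghi"
Local $sClean = StringRegExpReplace($sDirty, '[\p{Cc}\p{Cf}]', '')
ConsoleWrite($sClean & @CRLF)
```

Note that Tab, CR and LF also belong to `\p{Cc}`, so this pattern removes line breaks too.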

Guy_ · August 11, 2014

Thanks for the pointers, jchd!

I do find some clues there, but it may need a total study of RegEx before I can do anything with it, as something like this (although I need the reverse) doesn't seem to do anything:

$text = StringRegExpReplace( $text, '(*UCP)[\pS]', "" )

Maybe I need to activate that PCRE somewhere first. I may look into it further later.

At the moment, I also don't know if ending up with Unicode only would filter out control codes?

-

In the meantime, I did some random YouTube tests and one example is in the comments on http://www.youtube.com/all_comments?v=qTdOxn9MoPg

If you carefully select the line "Trust what you see after you catch bed bugs into a glass jar." and no more, and then paste it somewhere, you'll get an extra kind of space at the end.

I don't even know if that's a control character, but you get it a lot if you accidentally select a little more than the exact word or line in some websites.

If I look at the html source, I don't really get a clue from it... It looks clean.

[...] Trust what you see after you catch bed bugs into a glass jar.</div>

This stuff confuses my program and I'd love to know what kind of code is causing that that I can filter for.

Even though in this case it looks to be some kind of space, even this code (just as a test) didn't filter it out:

$text = StringRegExpReplace( $text, '\h', "" )

Edited August 11, 2014 by Guy_
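One way to find out what kind of character is being pasted is to print the code of every character in the text. The helper below is a hypothetical diagnostic sketch (the name `_DumpCodePoints` is made up). It relies on AscW, which returns UTF-16 code units, so characters outside the Basic Multilingual Plane would show up as two values.

```autoit
; Hypothetical diagnostic helper: list the code of every character
; so that invisible ones become visible.
Func _DumpCodePoints($sText)
    Local $sOut = ""
    For $i = 1 To StringLen($sText)
        ; AscW returns the UTF-16 code unit; Hex(..., 4) formats it as four hex digits.
        $sOut &= "U+" & Hex(AscW(StringMid($sText, $i, 1)), 4) & " "
    Next
    Return StringStripWS($sOut, 2) ; flag 2 = strip trailing whitespace
EndFunc

; Example: a no-break space (U+00A0) hiding after the final period.
ConsoleWrite(_DumpCodePoints("jar." & ChrW(0x00A0)) & @CRLF)
```

Calling it on `ClipGet()` right after copying the problematic line should show which code to filter for.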

jchd · August 11, 2014

Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things.

Anyway, if you want to remove everything except Unicode letters and digits (whatever language), whitespace and currency symbols (for example), then you can try this:

Local $text = "Abç dêf" & @TAB & "123456.789 - 123000 = 456.789 € (convert to £, ₯ or $ as needed!)" & @CRLF & _
                @TAB & "• First bullet" & @CRLF & _
                @TAB & "‣ Second bullet" & @CRLF & _
                @TAB & "• русский текст" & @CRLF & _
                @TAB & "• 中國文字" & @CRLF & _
                "end of test…" & @TAB & "¿Does that work for you?"
MsgBox(0, "Input text", $text)
Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]|[•‣]", "")
MsgBox(0, "Filtered text", $str)

Of course this is only a sketch which you'll need to adjust to your own needs.

Guy_ · August 12, 2014

jchd said: "Your example doesn't paste gibberish for me, but that heavily depends on how far the end of the highlight goes and how your browser deals with things."

You are right. It seems I *did* select too much there...

You are also right it depends on the browser. If I select too far, Firefox gives me an extra kind of space, IE gives me some kind of newline...

However, your new code pointer is already filtering this off!

So in the first minutes, it looks very promising.

Thank You Very Much

However, I'll still have to figure out how to include important stuff like ".,;:/?)!'"&[](){}*@#" cause it seems to filter all of these out (and more probably) ...?

That makes me wonder what else I'll be missing.

And again, the pdf stuff is the least of my worries. I'd rather keep the bullets for other situations (and that seems an easy fix).

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s]", "")

I'm now hoping the chars still missing are a simple "class" or do I have to add them back in manually in some way?

At first glance adding in [:punct:] seems a working fix:

Local $str = StringRegExpReplace($text, "(*UCP)[^\pL\p{Sc}\p{Nd}\p{Zs}\s[:punct:]]", "")

Edited August 12, 2014 by Guy_

JeffAllenNJ · July 28, 2020

StringRegExpReplace($text, '[^[:print:]]', '')
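A caveat worth checking before relying on the one-liner above: without `(*UCP)`, POSIX classes such as `[:print:]` are generally ASCII-only, so accented letters, currency signs, tabs and line breaks are likely to be removed as well. The sketch below simply prints both variants so the difference can be inspected. The sample string is made up, and the exact behaviour with `(*UCP)` is an assumption to verify on your own AutoIt version.

```autoit
; Sample: "Café", a tab, then "5 €" (built with ChrW to avoid file-encoding issues).
Local $sMixed = "Caf" & ChrW(0xE9) & @TAB & "5 " & ChrW(0x20AC)

; ASCII-only class: non-ASCII characters and the tab are likely stripped too.
ConsoleWrite(StringRegExpReplace($sMixed, '[^[:print:]]', '') & @CRLF)

; With (*UCP) the class is expected to follow Unicode properties instead.
ConsoleWrite(StringRegExpReplace($sMixed, '(*UCP)[^[:print:]]', '') & @CRLF)
```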

water · July 28, 2020

You noticed that this thread is 6 years old?

JeffAllenNJ · August 26, 2020

Yeah, but it still pops up at the top of Google search, so I thought I'd supply the answer for anyone else searching.

Sorry it took me a month to reply!

Edited August 26, 2020 by jaja714
