A Regex for no-break space Unicode Entities

kettle has asked for the wisdom of the Perl Monks concerning the following question:

I have just spent an extremely trying 30 minutes or so trying to regex out no-break space Unicode entities, which appear in my very large raw text file as the bytes \302\240. I was about to post a request for help when I figured out a solution. It may not be the best one, but I could not find anything concise on the web that solved my problem, and I suppose others have had, or will have, the same problem. So here is a very short, very simple solution:

#!/usr/bin/perl -w
use warnings;
use strict;

binmode(STDIN,":bytes");
binmode(STDOUT,":bytes");

while(<>){
   chomp;
   s/\302\240//g;
   s/\s+/ /g;
   print $_."\n";
}

This completely solved my problem. If it is incomplete, or not a very clever thing to do, please improve it. If it solves somebody else's problem as well - GREAT! joe

2006-09-14 Retitled by planetscape, as per Monastery guidelines


Original title: 'Annoying Problem: solved'


Replies are listed 'Best First'.
Re: A Regex for no-break space Unicode Entities by bart (Canon) on Sep 13, 2006 at 09:51 UTC
If your file contains "\302\240" for chr(160), that means to me that the file is in UTF8. So if you'd binmode the source file as ":utf8", then you could just scan for "\240". In theory, it's a better (= fewer possible nasty surprises) solution.

Somehow, I don't think replacing every nbsp with nothing is such a good idea. I'd leave a space in its place. Otherwise, you'd end up joining words into one that should remain separate.
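A minimal sketch of the character-level approach described above, assuming the input really is UTF-8. The helper name nbsp_to_space is hypothetical and not from the thread. It decodes the bytes, turns each U+00A0 into an ordinary space rather than deleting it, and encodes the result again.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (an illustration, not code from the thread):
# takes raw UTF-8 bytes and returns raw UTF-8 bytes in which every
# no-break space has been turned into an ordinary space.
sub nbsp_to_space {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);   # bytes -> characters
    $chars =~ tr/\x{A0}/ /;                # U+00A0 -> plain space
    return encode('UTF-8', $chars);        # characters -> bytes
}

print nbsp_to_space("foo\302\240bar"), "\n";
```

Replacing rather than deleting keeps "foo" and "bar" as separate words.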
Re^2: A Regex for no-break space Unicode Entities by ysth (Canon) on Sep 14, 2006 at 08:17 UTC
"In theory, it's a better (= fewer possible nasty surprises) solution."

I see more possible nasty surprises. Can you elaborate?
Re^3: A Regex for no-break space Unicode Entities by bart (Canon) on Sep 14, 2006 at 09:47 UTC
"I see more possible nasty surprises. Can you elaborate?"

Huh? Can you elaborate?

The theoretical danger is that by matching individual bytes instead of characters, you might inadvertently match bytes that actually belong to other characters. And by changing just a few bytes instead of the whole sequence making up a character, you might even be creating invalid UTF8.

Of course, one of the reasons for the popularity of UTF8 (as opposed to Windows native "2 bytes for each character") is that it's resyncing: it's always possible to recognize start and continuation bytes for multibyte characters, so this problem isn't as stringent as it could have been using other multibyte character representations.

There are no whitespace characters with a character code of 128 or above; nbsp (160) is the only almost-whitespace character I know of in that situation. So for this particular application, you're probably in the clear. Still, there's danger lurking in treating byte sequences in a different manner than intended, that is, in treating UTF8 as a byte sequence.
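The danger described here can be illustrated with a small sketch. The sample string and the helper is_valid_utf8 are made up for the illustration. The letter à (U+00E0) is encoded in UTF-8 as \303\240, so deleting every lone \240 byte removes its continuation byte and leaves an invalid sequence.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "à la carte" as UTF-8 bytes; the à is the two bytes \303\240.
my $bytes = encode('UTF-8', "\x{E0} la carte");

# Byte level: deleting every \240 byte also hits the second byte of à.
(my $damaged = $bytes) =~ s/\xA0//g;

# Character level: the text contains no U+00A0, so nothing changes.
my $chars = decode('UTF-8', $bytes);
(my $intact = $chars) =~ s/\x{A0}//g;

# Hypothetical helper: utf8::decode returns false when its argument
# is not well-formed UTF-8; it works on a copy here.
sub is_valid_utf8 { my $copy = shift; return utf8::decode($copy) ? 1 : 0 }

printf "byte-level result still valid UTF-8? %s\n",
    is_valid_utf8($damaged) ? "yes" : "no";
printf "character-level result unchanged?    %s\n",
    $intact eq $chars ? "yes" : "no";
```

Matching the full pair \302\240, as the original script does, avoids this particular trap: in well-formed UTF-8 the byte \302 only ever starts a two-byte character, so that pair can only be U+00A0.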
Re^4: A Regex for no-break space Unicode Entities by ysth (Canon) on Sep 14, 2006 at 15:05 UTC
Re: A Regex for no-break space Unicode Entities by graff (Chancellor) on Sep 13, 2006 at 13:00 UTC
bart is right -- this is a cleaner, safer way:

#!/usr/bin/perl -w
use warnings;
use strict;

binmode(STDIN,":utf8");
binmode(STDOUT,":utf8");

while(<>) {
    # if you just want to get rid of non-breaking spaces, do this:
    tr/\xA0/ /;

    # if you really want to change every kind of whitespace and every string
    # of two or more whitespace to a single space, do this instead:
    s/\s+/ /g;  # in utf8 strings, \s matches non-breaking space
    s/ $/\n/;   # (puts back the \n at the end of the line)
    print;
}

(updated to remove incorrect use of "g" modifier on tr///)
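The comment "in utf8 strings, \s matches non-breaking space" can be checked with a short sketch. It assumes Perl's default regex rules, with no locale and no unicode_strings feature in effect. The helper collapse_ws is hypothetical and simply wraps the substitution used above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: collapse each run of whitespace to one space.
sub collapse_ws {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return $s;
}

my $raw     = "a\302\240b";             # UTF-8 bytes, as read with no layer
my $decoded = decode('UTF-8', $raw);    # characters: a, U+00A0, b

# On the undecoded bytes, neither \302 nor \240 counts as \s by default.
print "raw bytes: ",
    (collapse_ws($raw) eq $raw ? "unchanged" : "changed"), "\n";

# On the decoded string, U+00A0 is treated as whitespace.
print "decoded:   ",
    (collapse_ws($decoded) eq "a b" ? "nbsp collapsed" : "not collapsed"), "\n";
```

If the input never gets decoded, for example because the ":utf8" layer was applied to a handle the data is not actually read from, s/\s+/ /g leaves the no-break spaces in place.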
Re^2: A Regex for no-break space Unicode Entities by kettle (Beadle) on Sep 13, 2006 at 13:40 UTC
# of two or more whitespace to a single space, do this instead:
s/\s+/ /g; # in utf8 strings, \s matches non-breaking space

I read this on a webpage somewhere, but for one reason or another, it did not produce the desired results. The binmode utf8 thing did not work either. Though more unpredictable, and for reasons I cannot completely explain, the byte mode solution was the only one I could get to produce the desired results.
Re^3: A Regex for no-break space Unicode Entities by graff (Chancellor) on Sep 13, 2006 at 23:43 UTC
"but for one reason or another, it did not produce the desired results."

It would be neat if you could show a minimal self-contained example to demonstrate this. It could be you were still missing something simple, like you did `binmode STDOUT, ":utf8";` but then actually read your input from some other file handle (e.g. ARGV), instead of actually piping or redirecting data to the script. Seeing what the results actually were could help as well.
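One way to rule out the file handle mix-up mentioned here is to open the input file explicitly with a decoding layer, instead of relying on binmode(STDIN) while <> reads a file named on the command line through ARGV. The sketch below assumes UTF-8 input. The helper normalize_file and the scratch file are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a named file as UTF-8 and return its lines,
# with each run of whitespace (no-break spaces included) collapsed to
# a single space.
sub normalize_file {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    my @lines;
    while (my $line = <$in>) {
        chomp $line;
        $line =~ s/\s+/ /g;
        push @lines, $line;
    }
    close $in;
    return @lines;
}

# Demo input: a scratch file holding the raw bytes from the question.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;                          # write the bytes untouched
print {$tmp} "foo\302\240 bar\n";
close $tmp;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for normalize_file($path);
```

Because the layer is attached to the handle that is actually read, it does not matter whether the data arrives via a pipe or as a file name.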
Re: A Regex for no-break space Unicode Entities by kettle (Beadle) on Sep 13, 2006 at 13:38 UTC
Both comments: thanks, and very true. It is definitely better to change it to an ordinary whitespace character first, then do the subsequent reduction. For my data this didn't happen to be a problem, thankfully, but in general that is definitely a better practice, and what I would have implemented had it occurred to me at the time. Thanks!

