Discussion:
removing hidden characters
(too old to reply)
Bucher Scott
2008-08-07 17:47:08 UTC
Permalink
I am attempting to import a csv file (although the same problem applies to tab-delimited .txt file) that I created from an excel file I received. Apparently the file contains hidden characters which produce errors when importing to SAS when they occur in numeric fields. I have opened this file in Word, where the characters are visible as áá. I deleted these characters and the import then worked fine.

It seems there must be better ways to dealing with hidden characters, but I do not know where to begin. Is there a way to view characters in notepad? What would be a good approach to determining if a file contains hidden characters or scrubbing the file of these characters prior or during importing to SAS?

Thanks,
Scott Bucher
SAS Programmer
Office of Accountability
NYC Dept. of Education
./ ADD NAME=Data _null_,
2008-08-07 19:48:00 UTC
Permalink
You may be able to just remove the characters without knowing what the are...

input @;
_infile_ = compress(_infile_,,'s');
input @1 <your input specs>;

Assuming that they are in the group define by 's'.
Post by Bucher Scott
I am attempting to import a csv file (although the same problem applies to tab-delimited .txt file) that I created from an excel file I received. Apparently the file contains hidden characters which produce errors when importing to SAS when they occur in numeric fields. I have opened this file in Word, where the characters are visible as áá. I deleted these characters and the import then worked fine.
It seems there must be better ways to dealing with hidden characters, but I do not know where to begin. Is there a way to view characters in notepad? What would be a good approach to determining if a file contains hidden characters or scrubbing the file of these characters prior or during importing to SAS?
Thanks,
Scott Bucher
SAS Programmer
Office of Accountability
NYC Dept. of Education
Patrick
2008-08-07 20:15:50 UTC
Permalink
Hi Scott

To get rid of hidden characters:
You could use "prxchange" to remove hidden characters (eg. using the
RegEx meta character \s ) -> http://support.sas.com/onlinedoc/913/getDoc/en/lrdict.hlp/a002288677.htm
Compress with appropriate modifiers or using a var which contains all
hidden characters might be another way.

As for your special case where it's only about numbers: You could read
in the string into a temporary character variable, compress everything
except digits (using this new SAS9 modifiers for the compress
function) and then input the temporary char var into the desired
numeric var.

To find hidden characters:
prxmatch could be used, or find (with a variable containing all hidden
characters), or indexc - and lots of other ways of doing the job.

I personally prefer Regular Expressions for these kind of tasks.

Have a look at an ASCII table which values you don't want to accept.

You need a HEX editor/viewer to "see" hidden characters. I myself use
a free tool called HxD - but there might be better ones. It does the
job for me though.

HTH
Patrick
./ ADD NAME=Data _null_,
2008-08-07 21:10:18 UTC
Permalink
On Thu, 7 Aug 2008 14:48:00 -0500, ./ ADD NAME=Data _null_,
Post by ./ ADD NAME=Data _null_,
You may be able to just remove the characters without knowing what the are...
hi,
i would not recommend a cure without knowing the cause. ... Well, dr. house
(on tv) seems to be doing this all the time, so maybe it is ok for such an
expert like data _null_ to do this...
Yes I agree knowing the cause is good. Plus I try to be nicer to
people than Dr. H.

I was not actually suggesting not knowing the "cause" just not knowing
the values that need to be "cleaned".
before reading it into sas, i would either checksum the file, or even ftp
back to the source and compare the files just to make sure that the data are
not corrupted during the file transfer.
The LIST statement is also good to examine input records, using SAS.
Plus the automatic HEX display is nice when there are non printable
characters.
just my two cents...
cheers,
chang
Chang Chung
2008-08-07 21:01:18 UTC
Permalink
On Thu, 7 Aug 2008 14:48:00 -0500, ./ ADD NAME=Data _null_,
Post by ./ ADD NAME=Data _null_,
You may be able to just remove the characters without knowing what the are...
hi,

i would not recommend a cure without knowing the cause. ... Well, dr. house
(on tv) seems to be doing this all the time, so maybe it is ok for such an
expert like data _null_ to do this...

before reading it into sas, i would either checksum the file, or even ftp
back to the source and compare the files just to make sure that the data are
not corrupted during the file transfer.

just my two cents...

cheers,
chang
Bucher Scott
2008-08-07 21:24:34 UTC
Permalink
Hi,

The 'cause' is that Excel files were distributed as a means of data
collection, but there were no restrictions placed in the workbooks as to
what could be entered. The values themselves are not a problem, but on
some workbooks hidden characters crept in; I have no idea how or why.

All the suggestions have been great, I will have to chew over the
different approaches presented thus far (use of rank, compress,
prxchange, etc.) and try to put together a general data cleaning macro
as I expect this will be a recurring event.

Many thanks.

Regards,
Scott


-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-***@LISTSERV.UGA.EDU] On Behalf Of
Chang Chung
Sent: Thursday, August 07, 2008 5:01 PM
To: SAS-***@LISTSERV.UGA.EDU
Subject: Re: removing hidden characters

On Thu, 7 Aug 2008 14:48:00 -0500, ./ ADD NAME=Data _null_,
Post by ./ ADD NAME=Data _null_,
You may be able to just remove the characters without knowing what the
are...

hi,

i would not recommend a cure without knowing the cause. ... Well, dr.
house
(on tv) seems to be doing this all the time, so maybe it is ok for such
an
expert like data _null_ to do this...

before reading it into sas, i would either checksum the file, or even
ftp
back to the source and compare the files just to make sure that the data
are
not corrupted during the file transfer.

just my two cents...

cheers,
chang
./ ADD NAME=Data _null_,
2008-08-07 22:10:40 UTC
Permalink
To follow up. Using 's' as I suggested is not correct. Or at least
not for what I think we/I want to do. I was doing similar by with a
variable in a SAS data set.

Any way the correct COMPRESS that I ended up using is

x = compress(x,,'wk');

Keep writable characters.
Post by ./ ADD NAME=Data _null_,
On Thu, 7 Aug 2008 14:48:00 -0500, ./ ADD NAME=Data _null_,
Post by ./ ADD NAME=Data _null_,
You may be able to just remove the characters without knowing what the are...
hi,
i would not recommend a cure without knowing the cause. ... Well, dr. house
(on tv) seems to be doing this all the time, so maybe it is ok for such an
expert like data _null_ to do this...
Yes I agree knowing the cause is good. Plus I try to be nicer to
people than Dr. H.
I was not actually suggesting not knowing the "cause" just not knowing
the values that need to be "cleaned".
before reading it into sas, i would either checksum the file, or even ftp
back to the source and compare the files just to make sure that the data are
not corrupted during the file transfer.
The LIST statement is also good to examine input records, using SAS.
Plus the automatic HEX display is nice when there are non printable
characters.
just my two cents...
cheers,
chang
Loading...