Focal Point
[SOLVED] how to validate csv input file

This topic can be found at:
https://forums.informationbuilders.com/eve/forums/a/tpc/f/7971057331/m/7467085276

September 09, 2014, 05:16 AM
Martin vK
[SOLVED] how to validate csv input file
Hi,

I need to read, validate and process a CSV (comma-delimited, COM) source file.
I know how to make a synonym, and I can read and process the file with a WebFOCUS procedure or a DataMigrator flow.

My question is about validating the input. We would like to validate that each record has the correct number of fields, that fields are not too long, that numeric fields are numeric and date fields contain valid dates, et cetera. If the source is a database table you do not have to do such validations, but a CSV is just a flat text file, so it can contain anything that does not match the definitions.

What are best practices (or suggestions) for doing such validations?
I could just run it and WebFOCUS will surely throw some error if the input is not correct, but we would prefer to give more specific feedback about what exactly is wrong.
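The kinds of checks described above (field count, field lengths, numeric fields, valid dates) can be sketched generically. The following Python illustration is not WebFOCUS code; the three-field layout, the field names, and the date format are hypothetical stand-ins for the actual file definition:

```python
import csv
from datetime import datetime

# Hypothetical layout: (name, max_length, kind) per field; kind is one of
# "alpha", "numeric", "date" (dates assumed here to arrive as YYYY/MM/DD).
LAYOUT = [("CUSTID", 8, "numeric"),
          ("NAME", 30, "alpha"),
          ("ORDER_DATE", 10, "date")]

def validate_record(fields, lineno):
    """Return a list of human-readable error messages for one CSV record."""
    errors = []
    # Check the field count first; the remaining checks assume it is right.
    if len(fields) != len(LAYOUT):
        return [f"line {lineno}: expected {len(LAYOUT)} fields, got {len(fields)}"]
    for value, (name, max_len, kind) in zip(fields, LAYOUT):
        if len(value) > max_len:
            errors.append(f"line {lineno}: {name} longer than {max_len} characters")
        if kind == "numeric" and not value.strip().lstrip("-").isdigit():
            errors.append(f"line {lineno}: {name} is not numeric: {value!r}")
        elif kind == "date":
            try:
                datetime.strptime(value, "%Y/%m/%d")
            except ValueError:
                errors.append(f"line {lineno}: {name} is not a valid date: {value!r}")
    return errors
```

Collecting messages per line rather than stopping at the first error is what makes it possible to report specifically what is wrong with each record.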

Martin.



WebFocus 8206M, iWay DataMigrator, Windows, DB2 Windows V10.5, MS SQL Server, Azure SQL, Hyperstage, ReportCaster
September 09, 2014, 02:25 PM
susannah
Write a process that does a -READ against that file, record by record.
Evaluate the contents of each of the variables that you read in, one at a time.
Then do a -WRITE of the cleaned record to a new file.
E.g., if a field is expected to be an integer but has rubbish in it:
-READ MYRECORD &field1.A3. &field2.A4. etc
and &field1 is meant to be an integer,
you can test &field1.TYPE to see whether it is 'N' (numeric) or 'A' (alphanumeric).
It will take a lot of coding to properly handle each one of your input fields,
but that's the way to do it. And it can be fun.
I had a case where I had to -READ my file and parse the results into two separate entities, because one field was a text box with a ton of special characters in it. It had come from a comment box on an input form somewhere.

In Focus since 1979///7706m/5 ;wintel 2008/64;OAM security; Oracle db, ///MRE/BID
September 09, 2014, 06:07 PM
Alan B
As Susannah has said, it is not straightforward.

However, interestingly, I see from your signature that you have Hyperstage. I will assume this is not a one-off file.

If this assumption is correct then one approach I might take is to utilise the power of the Hyperstage DLP (Distributed Load Processor).

Create a Hyperstage table that mimics the structure the CSV file should have, with all the correct field formats, using standard SQL CREATE SCHEMA/CREATE TABLE syntax.

Then create a DLP call to load the CSV file into this Hyperstage table, placing the log (-l) and reject (-r) files into an accessible folder, and ensure that the load continues after errors (-c -1). The reject and log files will give a picture of what, if anything, was wrong with the incoming data relative to the Hyperstage table definition (mainly field lengths here, but also invalid numeric values and dates).

Now you can TABLE the Hyperstage data and further validate any data items that have to meet certain criteria.


Alan.
WF 7.705/8.007
September 11, 2014, 02:12 AM
Martin vK
Thanks Susannah and Alan for your suggestions. I will work with both to see what works best for us.

Alan, where can I find documentation on the Hyperstage DLP? Since we bought Hyperstage 2 years ago we have received little documentation.

Martin.


WebFocus 8206M, iWay DataMigrator, Windows, DB2 Windows V10.5, MS SQL Server, Azure SQL, Hyperstage, ReportCaster
September 11, 2014, 10:06 AM
Clif
When you create a synonym for a delimited flat file, it's analyzed to determine the least restrictive data type that describes each field. You can specify how many rows to scan so that it won't take forever for a large file, but that can also affect the results if later rows contain different types of values.

Of course this automatic process doesn't know the intent. A field that contains all digits in the rows that are examined will be described as numeric, but if a row contains a non-numeric character then it's described as alphanumeric.

But what's correct? You may want to treat any non-numeric value as an error. Or it may be a field where some values are legitimately character values, for example a ZIP code field that mostly contains USA ZIP codes but also some Canadian postal codes.
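The scan-and-infer behaviour described above can be illustrated with a small Python sketch. infer_type is a hypothetical, much-simplified stand-in for what synonym creation actually does, reduced to an integer-vs-alphanumeric decision plus a width:

```python
def infer_type(values):
    """Infer the least restrictive (kind, width) for a column from sampled
    values: "I" (integer) only if every sampled value is all digits,
    otherwise "A" (alphanumeric); width is the longest value seen."""
    kind, width = "I", 0
    for v in values:
        width = max(width, len(v))
        if not v.strip().lstrip("-").isdigit():
            kind = "A"
    return kind, width
```

Running it over only the first few rows versus the whole file shows why the row-scan limit matters: a postal-code column can look numeric in the sample and alphanumeric in the full data.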

Once you have a synonym that describes your delimited flat file the way you want, you can use DataMigrator to load the data. In the target transformations, on the Validate tab, you can reject records that don't meet your criteria. With logging enabled for invalid records, they can be written to a separate file for later review and analysis.


N/A