SCD flow in DMC taking extremely long

July 07, 2009, 08:28 AM

Hi all,
I have a performance issue with a Slowly Changing Dimension flow taking an unacceptable amount of time to load; and i'm looking for some feedback/advice.

It is a relatively simple flow, with one source file with only 63000 records.

The target table has 1.3 million rows.

I have 4 fields designated as logical key fields (Company, Program, Prefix Code, and Unit#).

Does the flow process have to read every record in the target table? every record for each logical? (1.3 mill X 4)?

Would it be beneficial to create one Unique ID in the source to use as "1" logical key field to the target (create the same Unique ID in Target)?

Does anyone have any similar experiences, troubleshooting, etc..?

thank you,
James Golden

WebFocus Developer Studio
DMC v. 768
MySQL/Linux
All different outputs including map technology and using the DMC for our ETL process as well

July 08, 2009, 09:19 AM

Jessica Bottone

First, a few questions.....

How much time is it taking to load?

What do you have the Target properties set to? Particularly, the *LOAD TYPE and the *COMMIT EVERY ROW(S).

Are the 63,000 records from the Source all new records to be inserted into the Target, all existing to be updated in the Target, or some of both?

You mention logical keys. Is there an index on the Target using those logical keys? Are these 4 fields listed first in the Target synonym?

Does the Target already have a unique key?

How many indexes does the Target have?

July 08, 2009, 11:16 AM

Clif

Presumably load type is set to "Slowly Changing Dimensions" and so for each input row a select with a where condition on the logical key values is used to see if they exist in the target. As Jessica suggests indexes on those columns would help.

Otherwise a single key column may also be helpful. But what data type is "Company" and the other key columns? If they are each 4 byte integers then you wouldn't see so much benefit. But if Company is a 64 character string, and so is "Program" then you would.

Also there are performance improvements to SCD processing in 7.6.10 gen 363 or later which should be available for download shortly.

July 09, 2009, 08:31 AM

jgolden

Hi all,
first, thanks all for replying. I currently have no indexes relating to this flow in question. my target is set to "Slowly Changing" and my logical keys, company for example, are not very large for size/type - Int4. in my target, my first field is my surrogate key, and yes, it was previously exisitng. i have that field left as "blank" as the surrogate, followed by 4 fields that are set to "logical Key".

Since my original posting, i have modified my source/target tables and have created a unique ID, basically a CONCAT on those four logical fields and then used that new field as my one "Logical Key" field. This did not improve run time. i tried running it again yesterday, and it was still running after 5 hours before i killed it.

I also have tried switching my "Commit every rows" field to 100000 instead of the default.

I am not looking to load all these 64000 rows, but only those i have modified.

For example, i changed on field in my source where a Type II field would be affected. This would end the pre-loaded key record and create a new key with an open end date. I also changed a record for a Type I field - this would only update a few records. Lastly, i changed a source record for both a Type I field and a Type II field - i would expect to have a new record created, plus updating the Type I fields on the other similar logical records.

That said, i would only see 2 new records, and less than 5 others to be updated. The read seems to be taking all the time.

We're also using a JDBC generic connection to get the source data from IS400 box, and using the MySQL JDBS connection to load the target on a MySQL db.

thanks again for your feedback

July 11, 2009, 09:23 PM

Endre

We ususally find it very helpful to use the dimension both as a target and as a source on the same data flow.

We join the dimension with the original source object on the natural/logical key with outer join on the original source.

In the sql filter we only allow data come through from the sources where the dimension.Active_flag is NULL or (dimension.active_flag = 1 and ((dimension_column_1 <> source_column_1) or
(dimension_column_2 <> source_column_2) or
(dimension_column_n <> source_column_n)))

this way the only records that are bounced against the dimension where at least one of the type1 or type2 columns are changed or it is new record.

Endre