
Merging Files and Dropping Duplicates
(Read-Only Topic)
 
Member
posted
I am using Mainframe Focus for CMS, and would like some suggestions about designing my merge process. I have 2 separate flat sequential files with personnel data - name, address, DOB, etc. I want to bring these 2 files together into 1 Focus DB, and ultimately drop the duplicates, based on a subset of the fields in the data, which I call the "match" fields. Making the match fields the keys to the database, and rejecting matches on the load, would be the easiest, but I'm afraid I might need to use some logic to determine some of the matches. For example, in one input file DOB isn't in every record. I might have to determine if DOB is present in a record, and if so, use that to match on. Otherwise, ignore it. I'm not sure how I could do this with the match and reject logic. Another possibility would be to load every record from both files, then go through the data afterwards, deleting the duplicates. I would probably have more control over the logic to define the matches. Any ideas?
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Do you have an estimate of the number of records in both files? How you process them will depend on volume. Less than 100K is not a problem with MODIFY or MATCH, but 200K-1M introduces more constraints.


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Jim,
One file has about 462,600 records at lrecl 128, and the other has about 430,000 records at about the same lrecl.
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
That's a good-sized FOCUS DB in the end.

Here are a few more questions and recommendations.

1) Do you have the keys worked out, e.g. SSN or employee ID makes up a unique key for the employee, or will you have to generate a key from a combination of fields to get uniqueness? (The latter is a little more complex, where it takes full name, DOB, sex, and maybe another qualifier to represent one person.)

2) If you do have the keys worked out, see #3. I'd extract only the relevant key information from both files using TABLEF (no sorting) and put it into HOLD1A (file A) and HOLD1B (file B). Then sort these by the keys you want into HOLD2A and HOLD2B respectively. You can then query these files and use MATCH FILE to see just how many match and output all the desired key data to a single HOLD file (see the sketch after this list).

3) Are there any delimiters that could aid in splitting the data, say active vs. inactive employees, or some other criteria that would simplify (and reduce) the number of records you need to process to generate reports/graphs/stats, etc.? The time to do it is now, but if this is just a short-term exercise, it probably won't matter.

4) If you're loading into a hierarchy, and you have the primary keys defined, you can load from a sorted HOLD file (HOLD2A or HOLD2B from #2). With < 500K records, and if you INCLUDE (but reject matches on the first load), you should be all right interactively, just as long as you concentrate on the parent segment with keys only. The second update can reject ON MATCH, but UPDATE (if there are other non-key fields on the segment), or just GOTO TOP and bypass since the first load has satisfied the unique load.

5) If your design is a hierarchy, and you know in advance where to put everything, then extract from the source files the keys and all data with TABLE and HOLD as HOLD3A/3B. Do a MODIFY update per segment, etc. You can update segments/fields from each file, A or B, as needed.
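
A minimal sketch of item 2, as promised above. The master names FILEA/FILEB, the ID fields UNIQ_ID/REC_ID, and the match fields LNAME, FNAME, DOB are placeholders for whatever you settle on; the match (BY) fields are assumed to carry the same names in both HOLD files.

-* Extract only the candidate match fields (no sorting).
TABLEF FILE FILEA
PRINT UNIQ_ID LNAME FNAME DOB
ON TABLE HOLD AS HOLD1A
END
-RUN
-* Sort by the match fields.
TABLE FILE HOLD1A
PRINT UNIQ_ID
BY LNAME BY FNAME BY DOB
ON TABLE HOLD AS HOLD2A
END
-RUN
-* Repeat both steps for file B (HOLD1B, HOLD2B), then see how many keys match.
MATCH FILE HOLD2A
SUM UNIQ_ID
BY LNAME BY FNAME BY DOB
RUN
FILE HOLD2B
SUM REC_ID
BY LNAME BY FNAME BY DOB
AFTER MATCH HOLD AS KEYBOTH OLD-AND-NEW
END
-RUN

KEYBOTH then holds one record per key combination found in both files; OLD-NOT-NEW or NEW-NOT-OLD would give the keys found in only one of them.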

I hope this helps. There’s more to it. Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Gold member
posted
Just wondering how you are doing on this.
If you want to give me a call, I'm available today at 2PM EST. Thanks, Jim (443) 348-4219


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Jim, regarding item...
1) The files contain people's names and address information from different sources, and we want to bring them together into one file, while trying to eliminate some obvious duplicates based on some matching data, such as last name, first name, address, DOB (these matching criteria are still being worked out). I have no keys determined yet. I have a unique ID in one file I can use, and some possible unique fields in the second (not the same as the unique ID in the first). Or I can generate one there to enforce uniqueness.
2) The fields that I want to match on may not always contain information, and will probably be different from the unique fields in #1, above. That's what concerns me about the standard Modify MATCH process. But I do like the idea of the MATCH FILE process.
3) I can split the data up by county (14 of these) if I want. I don't expect to do much reporting - mainly just the load to merge the files (and drop any matches), then dump the resulting records out to a file for further processing elsewhere. Performance won't be an issue (other than me waiting forever as I load :-)
4) My original design was a single-segment Focus file, just to keep processing simple. I didn't think of any advantages for a hierarchy.
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Fred:

Here’s some ideas and suggestions.

1) Take the MFD below (in the next post), edit it, save it, and print it for reference. It's an example that can help significantly.
2) Do the CHECK files against it.
CHECK FILE HASH1 PICT
CHECK FILE HASH1.PF_TRANS_NO PICT
CHECK FILE HASH1.SF_TRANS_NO PICT
3) Print the CHECK FILE output listings and study them for a while.
4) Based on your information, I’d look at file 1 with 460K records as ‘primary’, the second file as secondary (430K records).
5) Based on information in file 1 (I’m hoping you have at least Name, DOB as a minimum in both files), you’re trying to ‘hash’ a unique key for each employee record.
6) From my first posting, I strongly recommend doing a TABLEF/HOLD and computing a transaction number for both file 1 and file 2. You want to use these for queries, and to load from.
7) With 14 countries or so, I would guess that with US data you may have an SSN; with European data, other identifiers. For other countries, I'm uncertain. (I hope you have at least LN/FN, MI/SEX, DOB in both; if not, the problem requires deeper analysis.)
8) I’d do queries by country and assign a numeric country code prefix to a unique incremental counter during the load. That way, it may help in solving ‘nomatches’ etc.
9) I realize you have key type data in file 2 that needs to be with key type data in file 1, but you must solve the file 1 unique key issue first, and then update the file 2 data to it (as alternate keys or just data).
10) Once you process file 1, hashing out a unique key (you can use the subcommands START 1, STOP 1000, etc. to do initial incremental loads and see how much gets loaded), you should have 1 employee for every record in primary file 1 (unless dups are in the file and they are there to supersede prior data or are erroneous – I can't tell).
11) Your initial load should be just to hash out the keys, no other data stored, FIXFORM FROM the HOLD file for primary file 1 (a sketch of this load follows the list).
12) With FOCUS files, the objective is to store as many instances (employees) on a 4K page as possible. That requires design, else the file is very inefficient (holding, say, just a few employees per page). In a VM environment, when you load inefficiently, the 4K paging for your login region escalates at a rapid rate, at which point the OS will flag your region and slow down your paging rate because too many system resources are being consumed (I worked on VM for 8 years).
13) MATCH FILE won’t buy you anything either if you’re comparing the original source files or the first set of HOLD files. The less it has to work with, keys and data, the better, else it will consume resources.
14) On the parent segment in a FOCUS file, you want high-level key information, and the length of the fields plus 4-byte pointers adds to the space needed to store one employee. With the design below, you have one pointer for the parent segment, and 2 more for the 'transaction segments', or 12 bytes. Add that to your total field lengths (try to keep them in multiples of 4, or pad up) and you get 60 bytes in the example. The smaller, the better for getting more on a page. (Adding indexes adds more space to the overall DB size too.)
15) Keep in mind that the keys used, and their lengths, are up to you. The only thing stored other than keys from file 1 is the unique PF_TRANS_NO (from the HOLD file) for every INCLUDEd record. It keeps MODIFY fast, the paging rate is nominal, and the post-load stats with ? FILE and ? FDT will be much lower than what you have now.
16) If this design processes file 1 completely, then add an INDEX to LN and REBUILD. This will help with the file 2 update process (you MATCH LN, and on NOMATCH you can GOTO another MATCH against all the keys, etc.).
17) The MODIFY with file 2 should also run faster since it is MATCHing to the keys (again, whatever the key structure is that is worked out from file 1). On a MATCH, CONTINUE to MATCH SF_TRANS_NO, ON NOMATCH INCLUDE (from the HOLD file; always unique; ON MATCH REJECT never happens). The time to process should not be too long either.
18) Once you have these components, you can update the database from file 1 with a MODIFY #3: MODIFY FILE HASH1.PF_TRANS_NO. This is an inverted view, just like the CHECK FILE picture. FIXFORM FROM HOLD… and this time you MATCH WITH-UNIQUES PF_TRANS_NO / ON MATCH UPDATE fields / ON NOMATCH INCLUDE. When done, you have all file 1 data in the database, keys and data.
19) Do the same for the secondary file.
20) Note that if a field exists in the unique segment for file 1, and it comes in file 2, you can control (with CASE logic in MODIFY) what you want to do with the field from file 2: You can MATCH the trans segment, NEXT LN, NEXT PF_TRANS.. NEXT PF01_DATA to get to that segment and either update the value or include it.
21) Analyze/query what you have, determine what key fields you want in the parent segment (that reside in the uniques). Maybe you want to change the updates etc. – either way, you can start over with a modified MFD, or with REBUILD etc.
22) At this point, what you have may satisfy the requirements, if not, you can come up with a different file structure and load it with extracts from the first in whatever order you desire etc. by keys.
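
A minimal sketch of the initial key-only load from items 6, 10, 11, and 17, assuming the field names in the HOLD file match the HASH1 MFD in the next post. FILEA and HOLDPF are placeholder names.

-* Build the primary HOLD file with an incremental transaction number (item 6).
TABLEF FILE FILEA
PRINT LN FN MI SEX DOB SSN
COMPUTE PF_TRANS_NO/I9 = LAST PF_TRANS_NO + 1;
ON TABLE HOLD AS HOLDPF
END
-RUN
-* Hash out the keys only, plus the transaction number (items 10, 11, 17).
MODIFY FILE HASH1
FIXFORM FROM HOLDPF
MATCH LN FN MI SEX DOB SSN
 ON NOMATCH INCLUDE
 ON MATCH CONTINUE
MATCH PF_TRANS_NO
 ON NOMATCH INCLUDE
 ON MATCH REJECT
DATA ON HOLDPF
END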

I hope this helps!! Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Gold member
posted
$
FILE=HASH1 ,SUFFIX=FOC,$
$
SEGNAME=SEG01 ,SEGTYPE=S06,$ < Could be S0 (long story)
FIELDNAME =LN ,H01, USAGE=A16, ACTUAL=A16,$ > ‘Hash’ keys; depends on
FIELDNAME =FN ,H02, USAGE=A12, ACTUAL=A12,$ your criteria and what
FIELDNAME =MI ,H03, USAGE=A1, ACTUAL=A1,$ you have from PF 1.
FIELDNAME =SEX ,H04, USAGE=A1, ACTUAL=A1,$
FIELDNAME =DOB ,H05, USAGE=YMD, ACTUAL=DATE,$
FIELDNAME =SSN ,H06, USAGE=I9, ACTUAL=I4, $
FIELDNAME =PF_KEY1 ,H07, USAGE=I9, ACTUAL=I4, $
FIELDNAME =STATUS ,H08, USAGE=A1, ACTUAL=A1, $ A OR I
FIELDNAME =FILL1 ,H09, USAGE=A1, ACTUAL=A1, $ 44 BYTES SEG LENGTH + 12 for pointers
$ = 60 bytes ; about 66 emp. per 4K page
SEGNAME=PF01, PARENT=SEG01 ,SEGTYPE=S01 ,$ PRIMARY FILE TRANS KEY
FIELDNAME =PF_TRANS_NO ,H10, USAGE=I9, ACTUAL=I4,$
$
SEGNAME=PF01UNQ1, PARENT=PF01, SEGTYPE=U ,$
FIELDNAME =PF01_DATA1 ,H11, USAGE=A14, ACTUAL=A14,$ PR FILE DATA FIELDS
FIELDNAME =PF01_DATA2 ,H12, USAGE=YYMD,ACTUAL=DATE,$
$
SEGNAME=SF02, PARENT=SEG01 ,SEGTYPE=S01 ,$ SECONDARY FILE TRANS KEY
FIELDNAME =SF_TRANS_NO ,H20, USAGE=I9, ACTUAL=I4,$
$
SEGNAME=SF02UNQ1, PARENT=SF02, SEGTYPE=U ,$ SEC FILE DATA FIELDS
FIELDNAME =SF02_DATA1 ,H21, USAGE=A14, ACTUAL=A14,$
FIELDNAME =SF02_DATA2 ,H22, USAGE=YYMD,ACTUAL=DATE,$


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Jim,
I have received your posts (notes and MFDs). I want to print them out and take a closer look. Thank you!

Fred
 
Posts: 8 | Registered: October 13, 2006
Member
posted
Hi Jim,
I have looked at your posts, and I can see many advantages in your design. I have a few questions:
1) I'm afraid you misunderstood my delimiter field - it should be "county". We have 14 in the state. Where would the numeric county code prefix go - on the Trans_No keys?
2) We could possibly be adding a 3rd file, with a new set of unique fields. Would I then rebuild the file with a new order, and create a new set of child segments similar to the structure I have now?
3) An issue I might be facing - some records would need one set of unique fields, and some records would need another. For example, in your MFD you have DOB in the unique key fields. All records in file 1 contain DOB, but only some records in file 2 will have DOB. So if it is there, we can use it to help, but if it is not, we don't want to reject the record - we'll have to ignore DOB and use another field. Same thing for SSN. Not sure how to handle this. Would it be better to leave DOB and SSN out of the set of unique fields altogether in this case, and focus on fields that all files have in common?
4) One query I will need to run is to create a unique list of records. After loading, I presume I will have some records with data only in segments PF01, some with data only in segments SF02, and some in both. After deciding which segments I would use for the last group, how would the query look to generate that list?

Hope these questions are clear.
Fred
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Fred,

I’ll respond by your question numbers:

1) Great! Working with international data is a real challenge; counties, much easier. If you will be querying on County (or any other identifier field like State, DOB, Sex, etc.), put those fields on the parent segment, SEG01. Anything else, on the Unique segment for that trans file (PF or SF).

2) REBUILD would be easy, and yes, you could just ‘shadow’ the same structure for the 3rd file as SF03 - ‘secondary file 03’.

3) Any key-type identifier fields like SSN and DOB should be in the parent segment. If you MATCH on the keys with SF02, and "SF02_DOB NE D.DOB", then flag that segment using the STATUS code (you can make up a code for the type of error: DOB = 01, etc.), and LOG the REJECT to a file for review after the MODIFY (it might be easier just to log the SF02 TRANS_NO and use that to JOIN back to the HOLD file and see the data in a formatted list).

An order of MATCH may be: SSN first; if that is found, then you can compare LN, FN, MI, and if that matches, fine, continue to update. If SSN doesn't match, then try to match on LN, FN, MI (SEX?), DOB (see the sketch after these numbered answers). I'd REBUILD/INDEX on the FOCUS file only after doing the initial load with just keys from PF01. Index: LN, SSN, and DOB. Search time can be reduced if these key fields are indexed. (The only concern I have is SEGTYPE: S0 or Sn, and at what point I'd use both methods on the parent segment. With S0, it is going to search from the beginning of the chain every time, but it gives more flexibility to compare the transaction values to the db values.) (I need to think about this a little more – pressed for time today.)

4) CHECK FILE HASH1.PF_TRANS_NO (or SF_TRANS_NO) - Either way, all the data is on one retrieval path. With a view using PF, segment 05 (on the right) is descendant to the parent as a unique and considered 'part of the parent', so it and anything descendant (segments 02, 03, and 04) are on the same retrieval path. You get to choose what you want in your TABLE FILE HASH1.PF_TRANS_NO (inverted view).
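
A minimal sketch of that MATCH order for the secondary-file update, using MODIFY case logic and assuming LN, SSN, and DOB have been indexed as suggested and that a HOLD file HOLDSF carries the secondary keys plus SF_TRANS_NO (all names are placeholders; the STATUS flagging and reject logging described in item 3 are left out):

MODIFY FILE HASH1
FIXFORM FROM HOLDSF
-* Try SSN first.
MATCH SSN
 ON MATCH CONTINUE
 ON NOMATCH GOTO TRYNAME
MATCH SF_TRANS_NO
 ON NOMATCH INCLUDE
 ON MATCH REJECT
-* Fall back to name/sex/DOB when SSN does not match.
CASE TRYNAME
MATCH LN FN MI SEX DOB
 ON NOMATCH INCLUDE
 ON MATCH CONTINUE
MATCH SF_TRANS_NO
 ON NOMATCH INCLUDE
 ON MATCH REJECT
ENDCASE
DATA ON HOLDSF
END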

Other tips to consider:

a) SOUNDEX of LN to get a unique sys gen ID for a last name
b) Unique segments really help when you know, say, only 20% of the records have a value. Placing fields on a unique (even multiple “uniques”) with other fields that may have values 40% of the time etc. can be done. The only penalty the is the additional 4 bytes in the parent for a pointer.
c) REBUILD, REBUILD REBUILD!! A great utility during this type of quick development since you may drop fields, promote to a parent, demote to a unique, create another unique etc. Just keep track of your MFDs and datasets!
d) Queries like COUNT FIELD BY FIELD against the initial HOLD files will give you good stats. on what you have and what is missing. This aids in knowing what to expect and what fields are candidates for (additional) unique segments.
e) Try to keep the uniques on full word boundaries and those that are sparse “x %” of the time, then consider putting on another unique to optimize what you have on the primary unique segment(s).

I hope this all “gels” a little now, especially on how to use the HOLD files with an incremental trans. no. and then use that in the load/updates. When viewing the file structure, one thinks it is inefficient at first glance, but with just one transaction instance and one unique for each parent, it’s just splitting data for fast update and retrieval (this is an atypical hierarchical design).

I’ll be out ‘till Friday. Thanks, Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Hi Jim,
I have a couple more questions about your latest responses. Re item...
3) Will Focus let you skip over keys, if you don't want to match to them? For example, in your MFD for Hash1, if I wanted to bypass DOB if there is no data for it, could I do it?
4) If I use an inverted view on PF_TRANS_NO, what happens to the segments that have no data in PF01, but do have segments in SF02? How would I pick these up in the full-file query?

I do think this is generally making sense to me. Thanks.
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Hi Fred:

In response to your questions, on:

3a) If you MATCH on the keys and DOB is part of the MATCH and no value exists, on a NOMATCH INCLUDE the DOB will be zero. So, on an INCLUDE with no value present, it will insert zeroes (or blanks for alpha fields). You can, however, set the default values to what you want, say '2006-01-01' for a DOB.

3b) During an update with SF02, your first MATCH on keys could be SSN/LN/FN/MI/DOB. But, since DOB is a key, you won’t be able to update it unless before the update you change the “S5” SEGTYPE to an “S4” (just have a copy of the MFD available to rename etc.). Then, ON NOMATCH REJECT/ON MATCH UPDATE DOB.

3c) You can control if you want to use the DOB from SF02 or keep the PF01 DOB:
ON MATCH IF D.DOB EQ SF02_DOB GOTO TOP;
ON MATCH UPDATE DOB

Once the update is done, you can always use the original “S5” SEGTYPE MFD for access.

I’m not familiar with your data, and it has been a long time since I did an exercise like this one. If duplicates can exist for the same SSN/LN/FN in SF02, then your CASE logic may be a little more involved than this example. Can you have duplicates in PF01? With identifier data being so critical, I’d be sure to LOG the REJECT transaction number to a file.

4) Looking at the inverted view with PF_TRANS_NO at the top, data in the PF01UNQ1 (unique) segment is available for compare against PF02UNQ1 (segment 04). A query to get one set of data values from both could look like this:

DEFINE FILE HASH1.PF_TRANS_NO
OUT_SALARY / D12.2 = IF PF01_SALARY NE 0 THEN PF01_SALARY ELSE PF02_SALARY ;
OUT_YTDNET / D12.2 = IF PF01_YTDNET NE 0 THEN PF01_YTDNET ELSE PF02_YTDNET ;
OUT_YTDHRS / I9 = . .

-* For each common field, use a DEFINE to pull one or the other.

FLAG1 / I1 = IF PF01_field EQ 0 AND PF02_field EQ 0 THEN 1 ELSE 0 ;
END
TABLE FILE HASH1.PF_TRANS_NO
BY keys
PRINT PF01_DATA_fields . . . OUT_SALARY OUT_YTDNET . . . PF02_DATA_fields FLAG1 FLAGn
ON TABLE . .
WHERE FLAG1 EQ 0 etc.
END

- When no instance exists, the default values are 0 or blanks for those fields in that unique segment.
- Other methods are possible using MISSING, but this should work.
- If critical values must be present, then set flags to use for queries and analysis.

If you’ve made it this far, you’re doing fine. I just hope your data is clean and you can automate the process if this is going to be an ongoing operation. Thanks!! Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Hi Jim,
I think I'm getting close, just a few more clarifications:
4) Would this inverted view and comparison between unique segments work, even if segment PF01 (and child PF01UNQ1) does not exist for a specific record (would have SEG01 and SF02)? Seems like there would be no path to get to that SEG01 in that case.

Also new question: I missed what field PF_KEY1 (in segment SEG01) is being used for. Could you go over that?

Thanks for your patience.
Fred
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Fred:

“It’s coming back to me slowly” i.e. you’re right in that if you loaded just SF02 data, retrieval from a PF01 View would not get any SF02 data; that would include LN, FN, etc. and anything in the PF01UNQ1 (unique) segment.

Update Solution: In SEG01, you need a flag field: SOURCE – A1; values: P, S, or B ("Primary", "Secondary", or "Both"). During the primary file load, the field should be set to "P" for all instances loaded into SEG01. When updating with SF02, on a NOMATCH INCLUDE, the SOURCE should be set to "S". If a MATCH occurs during the SF02 update process, set the field to "B" (this should be done only if any one field from SF02 was used to update any field in SEG01, or was INCLUDEd into SF02 and SF02UNQ1).

Retrieval solution: Using my last example on how to compare and retrieve from both unique segments and output the desired value, that TABLE can include a WHERE SOURCE EQ ‘P’ OR ‘B’. This would be extract 1 to a HOLD file. The HOLD file can be appended to (DYNAM .. MOD REUSE ) and then extract 2 would select all fields from PF01, SEG01, and PF01UNQ1 WHERE SOURCE EQ ‘P’. Just copy this code and rename the “PF01_” prefixes to “SF02_” for extract 3 and use WHERE SOURCE EQ ‘S’. By selecting the same set of fields for each query, you’re essentially doing a SQL UNION ALL. Once extract 3 is complete, you will have one HOLD file with varying content (be sure to HOLD the SOURCE field too for subsequent queries). You can reference fields in the retrieval path even if you know they are missing, and the default values will be set for them e.g. 0 for integer and blank for alpha.

Another view: Try this: CHECK FILE HASH1.PF01UNQ1 PICT. This lets you see it in one straight path. It doesn't buy you anything, but may be easier to understand. As for extract performance, I'd stay away from this view unless I was screening on fields in PF01UNQ1. With SOURCE being at the third level, I'd rather use the PF01 view. But, depending on your needs and performance, either way is valid.

PF_KEY1 in SEG01 is just a numeric incremental key that you could use during the initial load with PF01 data. It gives more flexibility to the load and rebuild processes during the "hash out the keys" process. Since it is unique, it could be used as a primary key during a REBUILD with a new MFD (SEGTYPE=S1/S0) where PF_KEY1 is the first field. The other "key" fields like SSN, LN, etc. could then be considered data fields during the REBUILD. (Other options, like S0 segments with indexes placed on LN_FN_MI or SSN, can be used for MATCHing and special updates, etc. – "long story".)

I hope this helps. I’ll be out this afternoon for meetings and errands.

Thanks!
Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
Member
posted
Hi Jim,
I think I am set now. I have begun running some large tests, and things are coming together well.
Thank you very much for your assistance.
Take care,
Fred
 
Posts: 8 | Registered: October 13, 2006
Gold member
posted
Fred:

Sounds great!

Jim


WebFOCUS 7.6.11, WINDOWS, HTML, PDF, EXCEL
 
Posts: 77 | Location: Baltimore | Registered: May 31, 2006
<Umar Farook S>
posted
Hi Jim
I am trying to do something with external files here. I need some help from you.

Objective: 2 external files to be compared, and matching and unmatching rows to be displayed as the output of a WebFOCUS report.

2 external files, for example Excel files with 10 columns in each file.
My key fields are 5 fields in each Excel, based on which I need to compare both Excels in WebFOCUS and produce results.
I would like to know
1. how to have the 2 files as input to WebFOCUS
2. how to do the compare based on the 5 fields between both Excels. Like if Id matches, check for name, then ...if all the 5 fields match in both the files, display them.
3. we may also need to display the unmatched rows from both the external files.

Please help me out on this issue to start off with.
 
Master
posted
I think IBI has an extra-cost adapter that will read Excel files. If that is not an option, here's what I do.

1. Save excel files as tab delimited.
2. Create a master to read them. For example:
FILENAME=hb1143data,SUFFIX=FIX, $
SEGNAME=SEG1 ,$
  FIELD=LDATA,,A160,A160, $

3. Write fex to unstring data. For example,
 FILEDEF HB1143DATA DISK &TMPDIR..hb1143_0708.txt ( LRECL 160 RECFM V
-*
 DEFINE FILE HB1143DATA
   TAB/A1=HEXBYT(9,'A1');
-*
   WID/A4=GETTOK(LDATA,160,1,TAB,4,WID);
   ROWID/I4S=EDIT(WID);
-*
   F3LST/A3=GETTOK(LDATA,160,2,TAB,3,F3LST);
-*
   SSN/A9=GETTOK(LDATA,160,3,TAB,9,SSN);
-*
   ACASE1/A8=GETTOK(LDATA,160,4,TAB,8,ACASE1);
   CASE1/I8=EDIT(ACASE1);
-*
   ACASE2/A8=GETTOK(LDATA,160,5,TAB,8,ACASE2);
   CASE2/I8=EDIT(ACASE2);
-*
   ETHNIC/A3=GETTOK(LDATA,160,6,TAB,3,ETHNIC);
-*
   GNDR/A1=GETTOK(LDATA,160,7,TAB,1,GNDR);
-*
   WDOB/A10=GETTOK(LDATA,160,8,TAB,10,WDOB);
   ADOB/A8YYMD=EDIT(WDOB,'9999$99$99');
   CLDOB/YYMD=DATECVT(ADOB,'A8YYMD','YYMD');
-*
   MATCH_KEY/A13=
     IF F3LST NE ' ' AND GNDR NE ' ' AND ADOB NE ' '
       THEN F3LST | GNDR | ADOB ELSE ' ';
 END
-*
 TABLEF FILE HB1143DATA
  PRINT ROWID CASE1 CASE2 SSN MATCH_KEY
     IF ROWID GT 0
     ON TABLE HOLD
 END


The data is now in a form that you can do pretty much whatever you want.
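
A minimal sketch of the comparison step that could follow, assuming each spreadsheet has been parsed into its own HOLD file (HOLDA and HOLDB here) and that the five comparison fields carry the same names (KEY1 through KEY5) in both; DATA_A and DATA_B stand in for whatever non-key columns you keep:

MATCH FILE HOLDA
SUM DATA_A
BY KEY1 BY KEY2 BY KEY3 BY KEY4 BY KEY5
RUN
FILE HOLDB
SUM DATA_B
BY KEY1 BY KEY2 BY KEY3 BY KEY4 BY KEY5
AFTER MATCH HOLD AS BOTHFILE OLD-AND-NEW
END

OLD-AND-NEW keeps only the rows whose five keys appear in both files; re-running with OLD-NOT-NEW or NEW-NOT-OLD gives the rows found in only one of the two files.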


In FOCUS since 1985. Prod WF 8.0.08 (z90/Suse Linux) DB (Oracle 11g), Self Serv, Report Caster, WebServer Intel/Linux.
 
Posts: 975 | Location: Oklahoma City | Registered: October 27, 2006
<Umar Farook S>
posted
Hi jgelona

Can you please explain the process you have used?
It will be helpful for me.
 
<Umar Farook S>
posted
Hi Jgelona,

As an initial step I just tried to read one external TAB-delimited file in WebFOCUS, but it turned out to be wrong.
Please guide me.

::MASTER FILE::
FILE=mrx_external, SUFFIX=FIX
SEGNAME=ROOT_SEG, SEGTYPE=S1, $
FIELD=NUM, ALIAS=SerialNo, USAGE=P12.2, ACTUAL=P8, $
FIELD=NAME, ALIAS=EMPname, USAGE=A20, ACTUAL=A20, $
FIELD=AGE, ALIAS=age, USAGE=I5, ACTUAL=I4, $
FIELD=GENDER, ALIAS=MorF, USAGE=A20, ACTUAL=A20, $

::External file::
NUM NAME AGE GENDER
1 Umar 23 M
2 Vini 24 M
3 Chris 30 M
4 Deb 31 F
5 Ken 40 M

::Fex code::
-* File mrx_externalfile1.fex
-TOP
FILEDEF mrx_external DISK C:\Book1.txt
-RUN

TABLE FILE mrx_external
PRINT NAME
GENDER
AGE
BY NUM
END

Please help me out on this issue.
Please let me know where I went wrong in the above pieces of code.
It would be helpful if you can guide me in solving this.
 
Expert
posted
Umar,

You might want to read the manual section regarding master files and particularly about USAGE, ACTUAL and FORMAT keywords.

This may help you understand a little -
FILEDEF DATAMAS DISK DATAIN.MAS
FILEDEF DATAIN  DISK DATAIN.FTM
-RUN
-WRITE DATAMAS FILE=DATAIN, SUFFIX=FIX
-WRITE DATAMAS SEGNAME=ROOT_SEG, SEGTYPE=S1, $
-WRITE DATAMAS FIELD=NUM,    ALIAS=SerialNo, USAGE=P12.2, ACTUAL=A8, $
-WRITE DATAMAS FIELD=NAME,   ALIAS=EMPname,  USAGE=A20,   ACTUAL=A20, $
-WRITE DATAMAS FIELD=AGE,    ALIAS=age,      USAGE=I5,    ACTUAL=A8, $
-WRITE DATAMAS FIELD=GENDER, ALIAS=MorF,     USAGE=A20,   ACTUAL=A20, $
-*                                       2                           2
-*            .......8...................0.......8...................0
-WRITE DATAIN        1                Umar      23                   M
-WRITE DATAIN        2                Vini      24                   M
-WRITE DATAIN        3               Chris      30                   M
-WRITE DATAIN        4                 Deb      31                   F
-WRITE DATAIN        5                 Ken      40                   M
-RUN

TABLE FILE DATAIN
PRINT *
END

T



In FOCUS
since 1986
WebFOCUS Server 8.2.01M, thru 8.2.07 on Windows Svr 2008 R2  
WebFOCUS App Studio 8.2.06 standalone on Windows 10 
 
Posts: 5694 | Location: United Kingdom | Registered: April 08, 2004
Master
posted
Umar,

Sorry about not getting back, I've been on vacation for a few days.

Anyway, the way my job works is that the master has just one field, LDATA. The file has variable length records. In this case, the limit is 160 characters and the file is a tab delimited file. The original file was an Excel spreadsheet that I saved using the .txt (tab delimited) option. I delete the heading row before saving and ftp the file to the Reporting Server. Then I just use the GETTOK function to "parse" the data fields, pulling out those I need.

Here, the State has a contract with an outside consultant. The consultant sends me client data (name, dob, ssn, gender, etc.). I have to take that data and attempt to find those clients in our database and send him back the matching and closely matching data records.


In FOCUS since 1985. Prod WF 8.0.08 (z90/Suse Linux) DB (Oracle 11g), Self Serv, Report Caster, WebServer Intel/Linux.
 
Posts: 975 | Location: Oklahoma City | Registered: October 27, 2006
<Umar Farook S>
posted
Thanks very much. It's not a problem. I still have a little time to sort this issue out.

I will explain to the best of my ability. I have 2 CSV files, so I created 2 master files. I am able to read them in WebFOCUS and display them. I have 5 fields in my CSV files, and all 5 fields are key fields which are to be checked for mismatches between the 2 CSV files. I have used the MATCH FILE concept here to create one single HOLD file using OLD-NOR-NEW, since I need only the mismatches in my final output.
I am posting my 2 master files and fex code here for your reference.
Please let me know what changes I need to make and where I have gone wrong.
::Master file1::
FILE=qmtrade1, SUFFIX=COM
SEGNAME=QMTRADE1, SEGTYPE=S1, $
FIELD=MMRefNumber, ALIAS=M01, USAGE=A40, ACTUAL=A40, $
FIELD=TradeDate, ALIAS=M02, USAGE=MDY, ACTUAL=DATE, $
FIELD=Portfolio, ALIAS=M03, USAGE=A40, ACTUAL=A40, $
FIELD=NotionalAmt, ALIAS=M04, USAGE=P16.2C, ACTUAL=P16, $
FIELD=TradeCounterParty, ALIAS=M05, USAGE=A40, ACTUAL=A40, $
FIELD=AssetSubType, ALIAS=M06, USAGE=A40, ACTUAL=A40, $
FIELD=AssetType, ALIAS=M07, USAGE=A40, ACTUAL=A40, $

::Master file2::
FILENAME=qmtrade3,SUFFIX=COM
SEGNAME=QMTRADE3, SEGTYPE=S1, $
FIELD=Number, ALIAS=M001, USAGE=A40, ACTUAL=A40, $
FIELD=Date, ALIAS=M002, USAGE=MDY, ACTUAL=DATE, $
FIELD=Portf, ALIAS=M003, USAGE=A40, ACTUAL=A40, $
FIELD=Notional, ALIAS=M004, USAGE=P16.2C, ACTUAL=P16, $
FIELD=CounterParty, ALIAS=M005, USAGE=A40, ACTUAL=A40, $
FIELD=ASubType, ALIAS=M006, USAGE=A40, ACTUAL=A40, $
FIELD=AType, ALIAS=M007, USAGE=A40, ACTUAL=A40, $


::FEX code::
-* File qmtrade3csv.fex
FILEDEF QMTRADE1 DISK \\IMNT14\SHARED\TRANSFER\NEW_TRADE1.CSV
FILEDEF QMTRADE3 DISK \\IMNT14\SHARED\TRANSFER\NEW_TRADE2.CSV
SET ASNAMES=ON
SET NULL=ON
TABLE FILE QMTRADE1
PRINT
AssetSubType
AssetType
BY MMRefNumber
BY TradeDate
BY Portfolio
BY NotionalAmt
BY TradeCounterParty
ON TABLE HOLD AS QMTRADEONE
END
-RUN
-*
TABLE FILE QMTRADE3
PRINT
ASubType
AType
BY Number
BY Date
BY Portf
BY Notional
BY CounterParty
ON TABLE HOLD AS QMTRADEThree
END
-RUN
-*
MATCH FILE QMTRADEONE
PRINT
AssetSubType
AssetType
BY MMRefNumber
BY TradeDate
BY Portfolio
BY NotionalAmt
BY TradeCounterParty
RUN
FILE QMTRADEThree
SUM
ASubType
AType
BY Number AS NUM
BY Date AS DAT
BY Portf AS PORT
BY Notional AS NOTI
BY CounterParty AS CPTY
AFTER MATCH HOLD OLD-NOR-NEW
END
-RUN
-*
DEFINE FILE HOLD
DELIM/A1 = '|';
END
-RUN
SET NODATA=' '
-*
TABLE FILE HOLD
HEADING
"COMPARE OF TWO CSV FILES"
"---- FILE1 ---------------------------- <65 -----------------------FILE2 --------"
"PRINT
MMRefNumber AS 'MM,Ref,Number'
TradeDate AS 'Trade,Date'
Portfolio AS 'Port,folio'
NotionalAmt AS 'Notional,Amt'
TradeCounterParty AS 'Trade,Counter,Party'
AssetSubType AS 'Asset,Sub,Type'
AssetType AS 'Asset,Type'
DELIM AS ' ' IN +4
NUM
DAT
PORT
NOTI
CPTY
ASubType
AType
BY MMRefNumber NOPRINT
BY TradeDate NOPRINT
BY Portfolio NOPRINT
BY NotionalAmt NOPRINT
BY TradeCounterParty NOPRINT
BY NUM NOPRINT
BY DAT NOPRINT
BY PORT NOPRINT
BY NOTI NOPRINT
BY CPTY NOPRINT
ON TABLE NOTOTAL
etc etc....
 
Silver Member
posted
Hi Umar,

You have posted this question before; why not just refer to that topic and ask people to read it? Still, I say you will need to read up on how the MATCH statement works. But you can also use other methods (not using MATCH) to solve your problem. You can HOLD the first set of data in a FOCUS database and use MODIFY to extract the answer.

Or you can hold OLD-AND-NEW and then use the result to extract what is not found in the first hold file, and repeat this process with the second file.


A message from the darks side, there is!
 
Posts: 39 | Registered: August 28, 2007
Master
posted
Umar,

Your masters are not correct for COM files. COM files are generally text files. As a result, your ACTUAL is almost always ANN, where NN is the length.
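
For example, a hedged rework of the first master along those lines; the An lengths are guesses you would adjust to how the values actually appear in the CSV:

FILENAME=qmtrade1, SUFFIX=COM, $
SEGNAME=QMTRADE1, SEGTYPE=S1, $
FIELD=MMRefNumber, ALIAS=M01, USAGE=A40, ACTUAL=A40, $
FIELD=TradeDate, ALIAS=M02, USAGE=A10, ACTUAL=A10, $ read as text; convert later if needed
FIELD=Portfolio, ALIAS=M03, USAGE=A40, ACTUAL=A40, $
FIELD=NotionalAmt, ALIAS=M04, USAGE=P16.2C, ACTUAL=A16, $ assumes a plain number in the file
FIELD=TradeCounterParty, ALIAS=M05, USAGE=A40, ACTUAL=A40, $
FIELD=AssetSubType, ALIAS=M06, USAGE=A40, ACTUAL=A40, $
FIELD=AssetType, ALIAS=M07, USAGE=A40, ACTUAL=A40, $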


In FOCUS since 1985. Prod WF 8.0.08 (z90/Suse Linux) DB (Oracle 11g), Self Serv, Report Caster, WebServer Intel/Linux.
 
Posts: 975 | Location: Oklahoma City | Registered: October 27, 2006
<Umar Farook S>
posted
Hi,
I enter data in Excel and save it as a CSV file and use it.
Will this be a problem?
Do I have to save it as .TXT and use that instead?
 


Copyright © 1996-2020 Information Builders