[CLOSED] How to detect no agents available from a client?

April 30, 2019, 05:07 AM

Wep5622

I'm hoping that someone ran into this earlier and knows the answer.

Our situation is as follows:

We fire off a bunch of WFServlet URL calls that are supposed to start an ETL process each. We log that fact in a database directly (bypassing WF), so we know which processes were (supposedly) started.

The initial WFServlet calls are fairly quick, as they just initialize a CMASAP call that asynchronously moves the real work to a background agent from the DFM_DEFAULT services. Those agents were expected to become available quickly enough (unfortunately, we forgot to turn on queuing on the agents; I have just remedied that).

There is however no trace of some of those processes that were reported as started in the database table.
Their progress within WF normally gets logged to another database table, errors and warnings (which we always have) go to a log-file per ETL source and we should see entries in the log of our HTTP server (IIS 7) for each request as well.
However, we don't see any of that for 70 of our 98 processes. No trace at all!

Unfortunately, we omitted logging the response at the client side, so we have no idea what the client actually saw that could indicate a failed or aborted connection. Does anyone happen to know what that looks like?

We're nowhere near the Bermuda triangle, so we should be able to find some evidence of what failed...

April 30, 2019, 10:50 AM

FP Mod Chuck

Wep5622

I can't tell if you tried to query the etllog or etlstats tables to see if there are any indicators in there.

May 01, 2019, 05:12 AM

Frans

Are you sure the processes were fired off to WebFOCUS in the first place? Do they appear with a 200 status in the access log?

May 01, 2019, 08:54 AM

Wep5622

@Chuck: I tried neither. Instead, I looked at IIS's HTTP logs, and at the error-log that our procedure creates by re-allocating EMGFILE to a new FILEDEF, per:

FILEDEF EMGFILE DISK &LOGPATH (APPEND
SET EMGSRV=FILE
-RUN

However, now that I checked, there are some 30-ish of these processes in the ETLLOG, which is again far too few. Those are most likely the processes that did get through and that we also witnessed in the other logs.

We would actually prefer not to write to the ETLLOG at all. With debug flags enabled (through -SET parameters), these fexes sometimes create so much output that we reach the maximum file size for an XFOCUS file (32GB). We then get into a crash loop of the FOCUS service that eats up pretty much all available CPU time, leaving none for the ETL processes that were supposed to be running. Thankfully, with debugging off, log sizes are much more reasonable.

@Frans: I only see any statuses at all for the 30-ish processes that we know ran. The other 70 processes left no trace: certainly no HTTP 200, as there are no lines for them in the HTTP logs at all.
Whether they were fired off (by Apache NiFi) is one of the things I'm trying to investigate. Unfortunately, it doesn't look like NiFi logged anything useful; we don't see anything in there that could point to either the successful or the lost jobs (we need to set up some logging there).

That is in fact the reason I was wondering what a client sees when the HTTP connection to the reporting server cannot be accepted, for whatever reason.
We're effectively causing a DoS scenario here; some scaling is clearly necessary. That's one of the lessons learned from this run.

May 01, 2019, 10:53 AM

FP Mod Chuck

Wep5622

You had better open a case with tech support.

May 01, 2019, 01:11 PM

Frans

I think the message they get when the server is otherwise running fine is: "No free agents available on the server to process the request at this time."

But it could also be that the server was in stress and the requests timed out.

As you mentioned, setting up a queue is a good first step, as is limiting the number of agents so the server can handle the requests.
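
To make those two cases visible on the client side, the caller could record what it gets back for every request. This is only a rough sketch in Python, not something taken from WebFOCUS or NiFi: the function names and the outcome labels are made up for illustration, and the search text assumes the message quoted above actually shows up in the response body, which is worth confirming against a real response.

```python
import socket
import urllib.error
import urllib.request

# Assumption: the "no agents" message quoted above appears in the response body.
NO_AGENTS_TEXT = "No free agents available"


def classify_body(status, body):
    """Turn an HTTP status and response body into a short outcome label."""
    if NO_AGENTS_TEXT.lower() in body.lower():
        return "no_agents"
    if status == 200:
        return "ok"
    return f"http_{status}"


def submit(url, timeout=30.0):
    """Call the URL once and return an outcome label that the caller can log."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify_body(resp.status, body)
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        body = exc.read().decode("utf-8", errors="replace")
        return classify_body(exc.code, body)
    except urllib.error.URLError as exc:
        # No usable HTTP response: refused, unreachable, or timed out on connect.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return f"connection_failed: {exc.reason}"
    except socket.timeout:
        # The connection was accepted, but reading the response timed out.
        return "timeout"
    except OSError as exc:
        # For example, the connection was reset while reading the response.
        return f"connection_failed: {exc}"
```

Logging the label returned by `submit()` next to the job name on the client side would at least distinguish "the server answered that it had no agents" from "the request never got through".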

May 03, 2019, 08:24 AM

dhagen

Have you considered building some type of correlated submission process?

Create a table that has the job name, submission date, job status, status date, etlid. Have your WF jobs write a record to the table with status=QUEUED.

On the ETL side, have a scheduled pickup process that runs every 5 minutes or so and queries the table for QUEUED jobs (you can even put a limit of 5 or 10 on it to control the number running). The pickup job then updates the status to PICKED and submits the jobs with a CMASAP.

The running jobs then change the status to RUNNING, and eventually to COMPLETED when done.

This way, you have an independent table with the current status of the jobs without having to rely on direct submission. You can always then build a report to show the status, then drill down to get the stats and log when necessary.
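
The status flow described above could look roughly like this. It is a minimal sketch using Python and SQLite purely for illustration: the table name `etl_jobs`, the function names, and the use of SQLite are all assumptions, and in practice the table would live in your own database and be updated by the fexes themselves. The actual CMASAP submission is left out.

```python
import sqlite3
from datetime import datetime, timezone


def now():
    """Current UTC time as an ISO string, used for the date columns."""
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def create_table(conn):
    # Columns follow the list above; the table name is hypothetical.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_jobs (
               job_name        TEXT NOT NULL,
               submission_date TEXT NOT NULL,
               job_status      TEXT NOT NULL,
               status_date     TEXT NOT NULL,
               etlid           TEXT
           )"""
    )


def enqueue(conn, job_name):
    """Called by the submitting job: record the request as QUEUED."""
    ts = now()
    conn.execute(
        "INSERT INTO etl_jobs VALUES (?, ?, 'QUEUED', ?, NULL)",
        (job_name, ts, ts),
    )


def pick_up(conn, limit=5):
    """Scheduled pickup: mark up to `limit` QUEUED jobs as PICKED, oldest first."""
    rows = conn.execute(
        "SELECT rowid, job_name FROM etl_jobs WHERE job_status = 'QUEUED' "
        "ORDER BY submission_date, rowid LIMIT ?",
        (limit,),
    ).fetchall()
    for rowid, _ in rows:
        conn.execute(
            "UPDATE etl_jobs SET job_status = 'PICKED', status_date = ? "
            "WHERE rowid = ?",
            (now(), rowid),
        )
    # The caller would submit each returned job here (not shown).
    return [name for _, name in rows]


def set_status(conn, job_name, status, etlid=None):
    """Called by the running job: RUNNING at start, COMPLETED at the end."""
    conn.execute(
        "UPDATE etl_jobs SET job_status = ?, status_date = ?, "
        "etlid = COALESCE(?, etlid) WHERE job_name = ?",
        (status, now(), etlid, job_name),
    )
```

Any job that stays QUEUED or PICKED for longer than expected then shows up in a simple query on this table, without relying on the HTTP or ETL logs.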