How to pass a set of rows from one function into another?
I perform very high speed, complex database migrations for a living, using SQL as both the client and server language (no other language is used), all running server side, where the code rarely surfaces from the database engine. Table functions play a HUGE role in my work. I don't use "cursors" since they are too slow to meet my performance requirements, and everything I do is result set oriented. Table functions have been an immense help to me in completely eliminating use of cursors, achieving very high speed, and have contributed dramatically towards reducing code volume and improving simplicity.
In short, you use a query that references two (or more) table functions to pass the data from one table function to the next. The select query result set that calls the table functions serves as the conduit to pass the data from one table function to the next. On the DB2 platform / version I work on, and it appears based on a quick look at the 9.1 Postgres manual that the same is true there, you can only pass a single row of column values as input to any of the table function calls, as you've discovered. However, because the table function call happens in the middle of a query's result set processing, you achieve the same effect of passing a whole result set to each table function call, albeit, in the database engine plumbing, the data is passed only one row at a time to each table function.
Table functions accept one row of input columns, and return a single result set back into the calling query (i.e. select) that called the function. The result set columns passed back from a table function become part of the calling query's result set, and are therefore available as input to the next table function, referenced later in the same query, typically as a subsequent join. The first table function's result columns are fed as input (one row at a time) to the second table function, which returns its result set columns into the calling query's result set. Both the first and second table function result set columns are now part of the calling query's result set, and are now available as input (one row at a time) to a third table function. Each table function call widens the calling query's result set via the columns it returns. This can go on an on until you start hitting limits on the width of a result set, which likely varies from one database engine to the next.
Consider this example (which may not match Postgres' syntax requirements or capabilities as I work on DB2). This is one of many design patterns in which I use table functions, is one of the simpler ones, that I think is very illustrative, and one that I anticipate would have broad appeal if table functions were in heavy mainstream use (to my knowledge they are not, but I think they deserve more attention than they are getting).
In this example, the table functions in use are: VALIDATE_TODAYS_ORDER_BATCH, POST_TODAYS_ORDER_BATCH, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH. On the DB2 version I work on, you wrap the table function inside "TABLE( place table function call and parameters here )", but based on quick look at a Postgres manual it appears you omit the "TABLE( )" wrapper.
create table TODAYS_ORDER_PROCESSING_EXCEPTIONS as (select TODAYS_ORDER_BATCH.* ,VALIDATION_RESULT.ROW_VALID ,POST_RESULT.ROW_POSTED ,WAREHOUSE_RESULT.ROW_WAREHOUSEDfrom TODAYS_ORDER_BATCHcross join VALIDATE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] ) as VALIDATION_RESULT ( ROW_VALID ) --example: 1/0 true/false Boolean returnedleft join POST_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] ) as POST_RESULT ( ROW_POSTED ) --example: 1/0 true/false Boolean returned on ROW_VALIDATED = '1'left join DATA_WAREHOUSE_TODAYS_ORDER_BATCH ( ORDER_NUMBER, [either pass the remainder of the order columns or fetch them in the function] ) as WAREHOUSE_RESULT ( ROW_WAREHOUSED ) --example: 1/0 true/false Boolean returned on ROW_POSTED = '1'where coalesce( ROW_VALID, '0' ) = '0' --Capture only exceptions and unprocessed work. or coalesce( ROW_POSTED, '0' ) = '0' --Or, you can flip the logic to capture only successful rows. or coalesce( ROW_WAREHOUSED, '0' ) = '0') with data
- If table TODAYS_ORDER_BATCH contains 1,000,000 rows, thenVALIDATE_TODAYS_ORDER_BATCH will be called 1,000,000 times, once foreach row.
- If 900,000 rows pass validation inside VALIDATE_TODAYS_ORDER_BATCH, then POST_TODAYS_ORDER_BATCH will be called 900,000 times.
- If only 850,000 rows successfully post, then VALIDATE_TODAYS_ORDER_BATCH needs some loopholes closed LOL, and DATA_WAREHOUSE_TODAYS_ORDER_BATCH will be called 850,000 times.
- If 850,000 rows successfully made it into the Data Warehouse (i.e. no additional exceptions were generated), then table TODAYS_ORDER_PROCESSING_EXCEPTIONS will be populated with 1,000,000 - 850,000 = 150,000 exception rows.
The table function calls in this example are only returning a single column, but they could be returning many columns. For example, the table function validating an order row could return the reason why an order failed validation.
In this design, virtually all the chatter between a HLL and the database is eliminated, since the HLL requestor is asking the database to process the whole batch in ONE request. This results in a reduction of millions of SQL requests to the database, in a HUGE removal of millions of HLL procedure or method calls, and as a result provides a HUGE runtime improvement. In contrast, legacy code which often processes a single row at a time, would typically send 1,000,000 fetch SQL requests, 1 for each row in TODAYS_ORDER_BATCH, plus at least 1,000,000 HLL and/or SQL requests for validation purposes, plus at least 1,000,000 HLL and/or SQL requests for posting purposes, plus 1,000,000 HLL and/or SQL requests for sending the order to the data warehouse. Granted, using this table function design, inside the table functions SQL requests are being sent to the database, but when the database makes requests to itself (i.e from inside a table function), the SQL requests are serviced much faster (especially in comparison to a legacy scenario where the HLL requestor is doing single row processing from a remote system, with the worst case over a WAN - OMG please don't do that).
You can easily run into performance problems if you use a table function to "fetch a result set" and then join that result set to other tables. In that case, the SQL optimizer can't predict what set of rows will be returned from the table function, and therefore it can't optimize the join to subsequent tables. For that reason, I rarely use them for fetching a result set, unless I know that result set will be a very small number of rows, hence not causing a performance problem, or I don't need to join to subsequent tables.
In my opinion, one reason why table functions are underutilized is that they are often perceived as only a tool to fetch a result set, which often performs poorly, so they get written off as a "poor" tool to use.
Table functions are immensely useful for pushing more functionality over to the server, for eliminating most of the chatter between the database server and programs on remote systems, and even for eliminating chatter between the database server and external programs on the same server. Even chatter between programs on the same server carries more overhead than many people realize, and much of it is unnecessary. The heart of the power of table functions lies in using them to perform actions inside result set processing.
There are more advanced design patterns for using table functions that build on the above pattern, where you can maximize result set processing even further, but this post is a lot for most to absorb already.