Complex Data Structures in PL/SQL

By aziegler in Technology
Thu Nov 21, 2002 at 05:28:16 AM EST
Tags: Software

In an ideal software development project, the development environment -- the languages, tools, and libraries -- is chosen to best suit the nature of the problem domain and the skills of the professionals involved. In the real world, the development environment is dictated by any number of constraints, the most common being that the project is a modification to or enhancement of an existing project. Just because one cannot -- for whatever reason -- use the language most suited to the solution does not mean that one can't take lessons from those languages.

In a recent project I worked on, the code was written in PL/SQL called by a driver written in C. While a lot of developers deride PL/SQL (mostly because they don't know the language), it's a very useful, powerful, and expressive language. As useful as it is, PL/SQL is not without its limitations, mostly surrounding complex data structures. These limitations presented what seemed to be an insurmountable roadblock to the project, but the time-frame of this subproject did not allow for the language to be changed to one where the required data structures could be represented more "naturally." This article demonstrates how I solved this problem. The techniques that I used -- both the solution and the approach to the solution -- are easily adaptable to other languages and problem domains.


A lot of developers, in my experience, don't think much of PL/SQL because they believe that it performs slowly, or because its functionality is limited. For this project, PL/SQL performed the tasks (mostly data manipulation) as fast as or faster than an equivalent C implementation could have, because there were fewer context switches and no network communication involved. (Developers unfamiliar with PL/SQL can refer to the original article's sidebar for a brief introduction; the language is loosely based on Ada.)

The problem for which I had to develop a solution required that time-series data be viewed in ways which are not supported by Oracle SQL queries on the original data; it was necessary to restructure the data. This turned out to be more of a problem than I had originally anticipated, and the eventual solution came from an interesting combination of techniques. In the end, I adapted to PL/SQL data structures that are more naturally represented in languages like C, C++, and Java.

The Shape of the Problem

The data was stored in the database as multiple records representing the data class, the starting and ending dates, and the value. Records could have any value (positive or negative) and could span any amount of time. Multiple entries of the same data class for overlapping time periods could be present. Data 1, below, shows how records for a five-day period might be represented.

Type : Value : Dates
ABC:1:06/01/01 - 06/01/01
DEF:15:06/01/01 - 06/05/01
ABC:3:06/01/01 - 06/03/01
GHI:10:06/02/01 - 06/03/01
Data 1: Sample Data

This particular program needed to divide and combine the records such that there was one record for each data class per day, containing the portions of values which occurred on that day. If a record crossed multiple days, the program needed to see it within the daily summary for each of the days in which it existed and only for the proportion of the record that existed in that day. That is, because record GHI spans two days, the program has to treat it as if it were two separate GHI records, each of value 5. Similarly, there will be a total of three ABC records; the first record will have value 2 because that is the total of the portions of ABC records on that date. Data 2 shows how the data needs to be reorganized on a daily basis.

Type : Value : Date
ABC:2:06/01/01
DEF:3:06/01/01
ABC:1:06/02/01
DEF:3:06/02/01
GHI:5:06/02/01
ABC:1:06/03/01
DEF:3:06/03/01
GHI:5:06/03/01
DEF:3:06/04/01
DEF:3:06/05/01
Data 2: Sample Data Converted into Daily Records

The ideal way to handle this problem would be for a trigger, fired as each record is added to the original table, to perform the record division shown in data 2 and add the resulting records to a daily summary table (a sketch of such a trigger appears after this list). This solution has three major problems:

  • The source table from which the original records were pulled was a very high volume table (and significantly larger than the sample data shown). Performing the split on a per-insert basis is computationally expensive, as each data class was generally expected to have at least one record and more likely several records each day (note that the examples given in this article assume only a single data source).
  • The total volume of the data in the source table was very high, and the daily summary table would be at least as large, and most likely several times larger than the source table. Since these values are computed and are used in a single subsystem (this one) only, it didn't make sense to use that much storage for what was essentially a temporary table.
  • Each of the records in the data source table had a limited lifespan; once the analysis had been run on the record successfully, the analyzing program would never see it again. This would have increased the volume of the summary table unacceptably, because the summaries would have to be differentiated by visibility.
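For concreteness, the rejected approach might have looked something like the sketch below. This is purely illustrative: the DAILY_SUMMARY table, its columns, and the column names on DATA_VALUE are hypothetical, and a real version would also need an update-or-insert step to combine same-class rows for a day, compounding the per-insert cost.

-- Hypothetical sketch of the rejected per-insert trigger; the
-- DAILY_SUMMARY table and all column names are illustrative only.
CREATE OR REPLACE TRIGGER data_value_split
AFTER INSERT ON data_value
FOR EACH ROW
DECLARE
    num_days NUMBER := (:NEW.end_date - :NEW.start_date) + 1;
BEGIN
    FOR i IN 0 .. num_days - 1 LOOP
        -- A real trigger would have to update an existing row for the
        -- same class and day, inserting only when none exists.
        INSERT INTO daily_summary (type_code, amount, summary_date)
        VALUES (:NEW.type_code, :NEW.amount / num_days,
                :NEW.start_date + i);
    END LOOP;
END;
/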

Without a temporary summary table, there was no way to solve this problem with SQL queries (after all, the temporary summary table solved the problem of the data structure), so I needed to approach this programmatically. The data structure immediately suggested by data 2 is an array of daily records, each of which has a dynamically-resized array of item records. If I were using a C-like language, I might use data structures like example 1 to solve this.

Example 1: The C Array
typedef struct item_rec
{
    char grouping[4];
    double value;
} item_rec;

typedef struct day_rec
{
    int date; /* an integer in yyyymmdd format */
    item_rec *item_list;
} day_rec;

day_rec *day_list;

The Limitations of the Language

This realization was all very well, except that I couldn't rewrite the program in C or C++ for this project. As stated previously, there was no time to explore that option. The answer had to be reached, quickly, with the tools available. The simple solution was not an option because of the PL/SQL limitations around complex data structures: PL/SQL simply does not support nested collections, so an array of structures containing arrays would not work.

I looked through my library of PL/SQL books for a hint on how to solve this problem, but no one had publicly tackled it. There was, however, a hint toward a possible solution in Oracle PL/SQL Programming (2nd Edition) [Feuerstein 1998]. Steve Feuerstein has noted and lamented this limitation of PL/SQL, and he presented a section on how fixed-size multi-dimensional arrays could be simulated. He presented this as a package allowing the developer to predefine (almost "allocate") an array of size r × c, complete with accessor functions. Because the implementation is hidden in the body of the package, it looks and works much like a two-dimensional array in other languages.

The technique presented was to use a PL/SQL INDEX BY BINARY_INTEGER memory table and map the two-dimensional array onto this single table. The position of an element at [m, n] in an array of r × c elements is found in the index-by table with the formula:

i := ((m − 1) × c) + n

That is, in a 10 × 10 array, position [7, 3] is at position 63 [= ((7 − 1) × 10) + 3] in a single-dimensional array of size 100. Similar calculations could be used for more than two dimensions, but the technique only works with fixed-size dimensions. There are other limitations, best explained by Mr. Feuerstein in his book.
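Something along the following lines captures the idea. This is my own minimal sketch of the technique, not Feuerstein's code; the package name, accessor names, and fixed 10 × 10 dimensions are all illustrative.

-- Minimal sketch of a simulated fixed-size 2-D array over an
-- index-by table; all names and dimensions here are hypothetical.
CREATE OR REPLACE PACKAGE twodim AS
    PROCEDURE set_cell (m IN BINARY_INTEGER, n IN BINARY_INTEGER,
                        val IN NUMBER);
    FUNCTION get_cell (m IN BINARY_INTEGER, n IN BINARY_INTEGER)
        RETURN NUMBER;
END twodim;
/
CREATE OR REPLACE PACKAGE BODY twodim AS
    c CONSTANT BINARY_INTEGER := 10;  -- fixed column count (10 x 10)
    TYPE cell_tab IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
    cells cell_tab;

    -- Map the 1-based pair [m, n] onto a single index.
    FUNCTION idx (m IN BINARY_INTEGER, n IN BINARY_INTEGER)
        RETURN BINARY_INTEGER IS
    BEGIN
        RETURN ((m - 1) * c) + n;
    END idx;

    PROCEDURE set_cell (m IN BINARY_INTEGER, n IN BINARY_INTEGER,
                        val IN NUMBER) IS
    BEGIN
        cells(idx(m, n)) := val;
    END set_cell;

    -- Reading a cell that was never set raises NO_DATA_FOUND.
    FUNCTION get_cell (m IN BINARY_INTEGER, n IN BINARY_INTEGER)
        RETURN NUMBER IS
    BEGIN
        RETURN cells(idx(m, n));
    END get_cell;
END twodim;
/

Because callers only ever see set_cell and get_cell, the flattened representation stays an implementation detail, which is exactly what makes it feel like a real two-dimensional array.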

As noted earlier, the problem to be solved had an indeterminate number of columns (dates) and rows (data class entries per day), so this technique could not offer a solution. It did, however, offer some ideas on how more complex data structures might be emulated in PL/SQL.

The Shape of the Solution

Part of the problem I faced when originally implementing this was that I was letting my knowledge of PL/SQL and its limitations dictate where I was willing to look for solutions. Because I knew that I couldn't do nested arrays, I didn't look as closely as I should have at first. However, PL/SQL is a general-purpose language in which one can define new data types, so it was simply a matter of adapting to the limitations. To work out how, I considered alternative approaches in C-like languages, knowing that I would be able to adapt the solution to PL/SQL.

Example 1 represents the data as it might be implemented in C. It's a single array (day_list) where each entry in that array is linked to another array (item_list). Because it needed to be a dynamically resizable array, I would probably implement it as item_rec *item_list, a pointer to an item. If I allocated memory in increments of sizeof(item_rec), then I could take advantage of C pointer arithmetic and treat the pointer as if it were an array -- making it an implicitly linked list.

Most people don't consider an array to be a linked list, because it doesn't act like one. While most arrays or vectors are implemented as contiguous memory blocks, this is not necessarily the case. Any given vector class could be implemented as a contiguous memory block or as a linked list, and the programmer using that vector class should never know the difference. The linking relationship in an array is based on proximity, not on an explicit link. If I modified the item_rec definition to include an explicit link to the next item in the array, as in example 2, the linked-list relationship becomes explicit, allowing me to consider the implementation of the solution from a different angle.

Example 2: The C Linked-List
typedef struct item_rec
{
    char grouping[4];
    double value;
    struct item_rec *next; /* the self-reference needs the struct tag */
} item_rec;

typedef struct day_rec
{
    int date; /* an integer in yyyymmdd format */
    item_rec *item_list;
} day_rec;

day_rec *day_list;

Further analysis of the problem revealed that it did not need the biggest strength of arrays: random access to the records stored in them. I only needed to access the records in date order and then in the order in which they appeared in the database, so sequential access was sufficient, making it very clear that a linked list was most likely to lead to the "correct" implementation. It does not, however, reach the solution, because PL/SQL has only automatic memory management, and references are allowed only to SQL cursor variables (in Oracle8i). Since there is no way to allocate memory programmatically as one would with malloc or new, I needed a modified approach for PL/SQL.

As it happens, PL/SQL has three different collection types, and only one of them is an array in the traditional sense (the varying array, VARRAY). Varying arrays and nested tables are the newest additions to the PL/SQL language and are present to support the Oracle8i object-relational extensions. Both data types are very useful, but they are Oracle8i-specific features and are not available in Oracle7; they also require more effort and planning to use than the normal PL/SQL index-by tables. Both varying arrays and nested tables are close in interface to arrays in other languages (the developer is responsible for extending the collection to add values to the end, and the indexes are always created in order).
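The difference in feel is easy to show; a minimal sketch, with type names of my own choosing:

-- Illustrative declarations only; the type names are hypothetical.
DECLARE
    TYPE amount_varray IS VARRAY(31) OF NUMBER;    -- bounded, dense
    TYPE amount_table IS TABLE OF NUMBER
        INDEX BY BINARY_INTEGER;                   -- sparse, unbounded
    v amount_varray := amount_varray();            -- must be initialised
    t amount_table;                                -- ready to use as-is
BEGIN
    v.EXTEND;                                      -- make room explicitly
    v(1) := 10;
    t(100) := 10;                                  -- any index, no EXTEND
END;
/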

The third type of collection in PL/SQL, the index-by table, was my only option in Oracle7, but index-by tables are still probably the optimal choice for this sort of problem. Index-by tables are sparse tables indexed by a binary integer value. A sparse table is rather like a set or a Perl hash (using integer keys): only one value may exist at any given index. Unlike varying arrays and nested tables, index-by tables automatically allocate the space required for the variables in the collection, more like a vector class than a C array. Index values may be non-contiguous (e.g., 1, 2, 5, 7). If the keys are created contiguously, index-by tables look and act in many ways like a C array, but otherwise they allow for smart indexing just as a set or hash does.

Because index-by tables can be non-contiguous, PL/SQL provides alternate means for navigating them. There are four navigation methods associated with index-by tables: FIRST, NEXT, PRIOR, and LAST. Using FIRST or LAST will provide the index of the first or last entry in the table, respectively. NEXT and PRIOR are called with a parameter (e.g., day_list.NEXT(day_idx)): the index for which you wish to find the next (or previous) index value. If there is nothing in the table, FIRST and LAST will return NULL; if no record exists after (before) the requested index, NEXT (PRIOR) will return NULL. One other interesting behaviour of a non-contiguous sparse table is that the index value provided as a parameter to NEXT or PRIOR does not need to represent a value that actually exists. In this way, I can have an index-by table that has values at [5, 10, 15, 20], and NEXT(7) will return 10. The possibilities for this are extensive and exciting.
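A minimal demonstration of that navigation, using the [5, 10, 15, 20] table just described:

-- Walking a sparse index-by table in index order.
DECLARE
    TYPE num_tab IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
    t num_tab;
    idx BINARY_INTEGER;
BEGIN
    t(5) := 50; t(10) := 100; t(15) := 150; t(20) := 200;
    idx := t.FIRST;                            -- 5
    WHILE idx IS NOT NULL LOOP
        DBMS_OUTPUT.PUT_LINE(idx || ' => ' || t(idx));
        idx := t.NEXT(idx);                    -- 10, 15, 20, then NULL
    END LOOP;
    idx := t.NEXT(7);                          -- 10: the probe need not exist
END;
/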

Using NEXT and PRIOR with an index-by table makes it appear to work much like a doubly linked list, so if I consider the binary integer type as if it were a pointer, I can interweave several linked lists within a single contiguous vector.

By mapping the linked list from example 2 onto a vector structure, I get something that looks like example 3 in PL/SQL. The first index-by table (day_list) contains an index reference to the head of its item list, which is an entry in the second index-by table (item_list). The item records have index references to the next item in their chain, with a NULL value representing the end of the chain for that particular entry. In this way, I'm using the random-access nature of array indexing to my advantage while sequentially winding my way through the linked list imposed on top of the index-by table.

Example 3: The PL/SQL Linked-Array
TYPE day_rec IS RECORD
    (day_date  DATE,
     item_head BINARY_INTEGER);
TYPE day_tab IS TABLE OF day_rec
    INDEX BY BINARY_INTEGER;

TYPE item_rec IS RECORD
    (type_code VARCHAR2(3),
     amount    NUMBER,
     next_item BINARY_INTEGER,
     prev_item BINARY_INTEGER);
TYPE item_tab IS TABLE OF item_rec
    INDEX BY BINARY_INTEGER;

The lists in question must be treated as an inseparable pair, and the "real" lists must be declared in one place and one place only. In Oracle7 or Oracle8, it is in your best interest to declare them as private package globals (not public package globals, unless you want the users of your package messing with these values and possibly corrupting your linked list). You can declare them in a function and pass them around as a pair to other functions that use them, but remember that if one of them is IN OUT, the other must be IN OUT as well. This is not recommended for large data sets transformed in this way unless you're using Oracle8i and can use the NOCOPY keyword to pass the tables by reference between functions and procedures.
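If you do pass the pair between subprograms, the signatures might look like this; the real listing may differ, so treat it as an illustration of the calling convention rather than the actual code:

-- Hypothetical signature: the two tables travel together, both IN OUT,
-- and NOCOPY (Oracle8i) avoids copying large collections on each call.
PROCEDURE split_records (day_list  IN OUT NOCOPY day_tab,
                         item_list IN OUT NOCOPY item_tab);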

The Solution

I have provided a sample implementation that reads a data value table (like data 1), parses the values according to the rules stated above (assuming a single data source) into the structures shown in example 3, and then writes the resulting data to a summary table (like data 2). In the real application for which this technique was used, the summary is kept only in memory while analysis is performed, and then only the results of the analysis are saved (in part because the summary is easily replicable). I have written the code such that a summary covering only a portion of the dates can be returned. The results differ if run for 2001-06-01 through 2001-06-05 and for 2001-06-03 through 2001-06-04.

Listing 1 provides the specification for my package. I have only provided the one function as publicly visible because there is no need to expose the intimate operational details.

Listing 2 defines three different data types: data_value_tab, which is used exclusively for the storage of the incoming data in load_data; day_tab, which defines the daily list of values; and item_tab, which defines the list to which the linked list will be mapped. Special attention should be paid to how I use the index in day_tab later in the code: it illustrates the flexibility of the sparse index-by table.

I provide a simple wrapper procedure, pl, that wraps the Oracle built-in package DBMS_OUTPUT procedure PUT_LINE. This isn't absolutely required, and other debug mechanisms can be used as desired. Following pl, there are three more private functions that are called by the implementation of process_data: load_records, split_records, and save_records.

load_records reads the values from the DATA_VALUE table. Because I wanted the program to be able to work with a partial date range, I did not immediately start splitting the records into the data structures that, as I discovered, solve the overall problem.

split_records is the meat of the program. After determining the number of data values, it gets the first and last date as Julian day values. This conversion is significant. It gives us the number of days between the two dates (last − first + 1) and the per diem amount (amount ÷ (last − first + 1)). More importantly for the day_list, it gives us a smart index for use with our index-by table: the index value represents the date on which the item_list will be used. Because values may arrive out of date order, and because division may create days for which no record has yet been seen, the sparse index lets us fill in the blanks as we need to, not as the code forces us to.
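In Oracle, that conversion and the per diem arithmetic look roughly like the following; the variable names are mine, and the values are the GHI record from data 1:

-- Sketch of the Julian-day index and per diem calculation; variable
-- names are illustrative, using the GHI record from data 1.
DECLARE
    start_date DATE := TO_DATE('2001-06-02', 'YYYY-MM-DD');
    end_date   DATE := TO_DATE('2001-06-03', 'YYYY-MM-DD');
    amount     NUMBER := 10;
    first_day  BINARY_INTEGER;
    last_day   BINARY_INTEGER;
    per_diem   NUMBER;
BEGIN
    first_day := TO_NUMBER(TO_CHAR(start_date, 'J'));  -- Julian day number
    last_day  := TO_NUMBER(TO_CHAR(end_date, 'J'));
    per_diem  := amount / (last_day - first_day + 1);  -- 10 / 2 = 5
    -- first_day .. last_day then serve directly as day_list indexes.
END;
/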

Thus, for each date in the duration of the data value, we check to see if it's within the requested boundaries. If it is, we manage our item_list. If the date hasn't been added yet, we know that this partial value is going to be the first entry on the item_list, and we process it as such. If the date does exist, we need to see whether this particular type_code has already been added to the list; if it hasn't, we add the value to the end of the item_list and make the required links from the previous item in the list to the current item. If the type_code already exists for the date in question, we simply add onto the existing record. In this way, we obtain a structure in memory that looks substantially like example 3 and allows us to produce the required summary data at will, in a predictable order, for the day or days in question. The key act of item creation is found here:

item_idx := NVL(item_list.LAST, 0) + 1;

Here, we take the index of the last item created on the list (if the list is empty, LAST returns a NULL value that NVL transforms to 0) and add one to it. In this way, we're extending the item vector incrementally while maintaining the chain relationships through the prev_item and next_item values of the records we're dealing with.
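Put together, appending a new item to a day's chain comes out roughly as below. This is a condensed sketch, not the actual listing: it assumes the types from example 3, package-global day_list and item_list variables, and an existing day_list entry at day_idx; the helper name append_item is my own.

-- Sketch of appending to the interwoven lists; assumes example 3's
-- types and package-global day_list/item_list. Not the exact listing.
PROCEDURE append_item (day_idx  IN BINARY_INTEGER,
                       rec_type IN VARCHAR2,
                       per_diem IN NUMBER) IS
    item_idx BINARY_INTEGER;
    tail_idx BINARY_INTEGER;
BEGIN
    item_idx := NVL(item_list.LAST, 0) + 1;      -- next free vector slot
    item_list(item_idx).type_code := rec_type;
    item_list(item_idx).amount    := per_diem;
    item_list(item_idx).next_item := NULL;       -- new tail of the chain

    tail_idx := day_list(day_idx).item_head;
    IF tail_idx IS NULL THEN
        day_list(day_idx).item_head := item_idx;     -- first item this day
        item_list(item_idx).prev_item := NULL;
    ELSE
        WHILE item_list(tail_idx).next_item IS NOT NULL LOOP
            tail_idx := item_list(tail_idx).next_item;  -- find the tail
        END LOOP;
        item_list(tail_idx).next_item := item_idx;   -- link old tail forward
        item_list(item_idx).prev_item := tail_idx;
    END IF;
END append_item;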

Finally, save_records walks through the day_list records (which need not be contiguous, but in the sample data are contiguous) and then follows the item_list chain for each day, saving it to the database in the DATA_OUTPUT table. This table is the summary table that I noted would be prohibitive space-wise, and the number of records produced suggests this very clearly: ten output records are created in DATA_OUTPUT for four input records. This procedure demonstrates the navigation of both non-contiguous index-by tables and the linked list that I've placed on top of a contiguous index-by table.
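That walk combines both navigation styles. Condensed -- and again assuming example 3's types, the package-global day_list and item_list, and illustrative DATA_OUTPUT column names -- it amounts to:

-- Condensed sketch of the save_records traversal; assumes example 3's
-- types and globals, plus declared day_idx/item_idx BINARY_INTEGER
-- locals. The DATA_OUTPUT column names are hypothetical.
day_idx := day_list.FIRST;
WHILE day_idx IS NOT NULL LOOP                     -- sparse outer walk
    item_idx := day_list(day_idx).item_head;
    WHILE item_idx IS NOT NULL LOOP                -- linked-list inner walk
        INSERT INTO data_output (type_code, amount, out_date)
        VALUES (item_list(item_idx).type_code,
                item_list(item_idx).amount,
                day_list(day_idx).day_date);
        item_idx := item_list(item_idx).next_item;
    END LOOP;
    day_idx := day_list.NEXT(day_idx);
END LOOP;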

Further Options

This technique could be made more general by writing it as a package -- encapsulating the functionality so that it appears to be an object -- but each type of list would require its own package, because the details of the data can't be generalized (PL/SQL has no template support), even though the techniques used in this article are highly portable.

Lessons Learned

It's difficult, sometimes, to think outside of the language that you're working with. Reaching beyond the limitations of the language I was using allowed me to quickly solve the problem at hand without undertaking a change in language that would have had a prohibitive cost in time and effort. With languages like Java and C++, importing techniques from other languages is easier, because those languages are readily extended to handle new data structures with user-defined data types. But as long as a language supports a minimal set of features -- user-definable types and arrays that can be composed of them -- this particular implementation can be carried to it.

One of the reasons I try to stay aware of the capabilities and use of several programming languages is so that I have a variety of ways of looking at a problem. Because of this tendency, I was able to identify a way to solve this problem within the limited time available for the project. I intend to use this experience as a reminder to keep looking elsewhere for ideas to solve the problems at hand -- and to apply the ideas I find creatively, within the capabilities of the language in front of me.

Further Reading

If you're not familiar with the PL/SQL programming language and would like to read further, I heartily recommend any book by Steve Feuerstein published by O'Reilly. In particular, Oracle PL/SQL Programming (2nd Edition) and Oracle Built-in Packages are extremely valuable. There are also pocket guides that usefully summarise the main points of usage.

Complex Data Structures in PL/SQL | 33 comments (18 topical, 15 editorial, 0 hidden)
I've done some C++ to pl/sql interfacing (4.00 / 2) (#10)
by wiredog on Wed Nov 20, 2002 at 08:10:16 PM EST

Did you use the OCI library? I've found it quite useful.

More Math! Less Pr0n! K5 For K5ers!
--Rusty

No, just Pro*C/C++ (none / 0) (#14)
by aziegler on Wed Nov 20, 2002 at 08:32:42 PM EST

In considering the possibilities, I always found OCI too complex compared to the simplicity of:

EXEC SQL SELECT foo
  INTO :oFOO
  FROM bar;

where Oracle's precompiler does all the hard work.

The calls to PL/SQL are similar (EXEC SQL CALL foo(:b);, if I remember correctly; it's been a little while since I've dealt with Oracle).

With the right syntax highlighting, it becomes trivial to integrate the Oracle code in a simple manner.

-austin

[ Parent ]

Complexity is good (5.00 / 1) (#17)
by wiredog on Wed Nov 20, 2002 at 08:44:44 PM EST

Well, in my application. Parsing and loading XML files into the db. Had to make sure the links between elements were proper before doing the commit.

More Math! Less Pr0n! K5 For K5ers!
--Rusty

[ Parent ]
Use what's appropriate, of course... (none / 0) (#19)
by aziegler on Wed Nov 20, 2002 at 08:58:05 PM EST

In other applications, I might choose OCI, but in the applications that I've done, the application had legacy Pro*C/C++ code and OCI didn't offer any real benefit over Pro*C/C++ (there were a few times that it looked like it might, but there are techniques to take advantage of the SQLCA data structure). If I remember correctly, the biggest advantages of OCI over Pro*C/C++ are:
  1. Better support for the latest O/R features of Oracle8i and 9i, and:
  2. Easier multi-connection support (there's a SQLCA-like structure for each connection in OCI).
Despite this article and my general respect and liking for PL/SQL, I'm leaning toward doing most projects outside of the database these days. (Like my "Bug Traction" program.)

-austin

[ Parent ]

API (4.00 / 1) (#20)
by kb5 on Wed Nov 20, 2002 at 09:30:30 PM EST

Why don't you create table APIs? Isn't the simplest solution to replace the INSERT on the table with a PL/SQL API call?

Too hard and/or limiting (5.00 / 1) (#21)
by ph317 on Wed Nov 20, 2002 at 09:41:36 PM EST


People have tried many times to wrap SQL in a C API, and it never works out as a general solution. SQL is a rich language, and you end up either limiting your choices or making the interface way too complicated, to the point where plain SQL was better.

OTOH, for specific applications and databases, it's a good idea to use such wrappers. Make a library of SQL calls in your app, so that the rest of your code calls insert_user(char* name, char* x, char* y) and get_widgetdata(long widgetnum) instead of embedding SQL statements throughout all your code. The same applies in other languages besides C, of course. Chances are you'll use some of your SQL constructs several times in your code anyway, so you eliminate some redundancy to boot.

[ Parent ]

Absolutely! (none / 0) (#23)
by aziegler on Wed Nov 20, 2002 at 09:46:14 PM EST

Indeed, in most of the application that this program was a subsystem of, that's exactly what we did. This program was itself an API, to most folks' thinking.

-austin

[ Parent ]

Not really. (4.00 / 1) (#22)
by aziegler on Wed Nov 20, 2002 at 09:44:27 PM EST

As I noted in the article, the data was of a temporary nature -- it was needed only once a month per customer. Each customer could have dozens of records of varying lengths and amounts, and the application was designed for hundreds of thousands of customers. Each main data record was ~500 bytes. For 100,000 customers, that's ~475 megabytes per month if each customer has ten records (not unheard of in this application). If, as was often the case, most of the records "spanned the month" (i.e., lasted ~30 days), we're talking about a simplified-record (assume ~200 bytes) summary table of some 5,722Mb -- roughly 5.6Gb -- per month (and that's a lot of money for a temporary table).

Thus, creating this data on insert into the main data table is wholly inappropriate. Under certain circumstances, a view (as suggested in an editorial comment) might work. As I looked at the problem, it was better to do a programmatic test.

I should note that the resulting code was very slow, but within acceptable performance limits. Using the same techniques outlined in this article, but reimplementing in C++ with proper vector classes, we gained significant performance and put the calculation responsibilities outside of the database, where they are most appropriate. (PL/SQL is very good for a lot of things, but heavy-duty calculation isn't one of them. This article was intended to show how flexible PL/SQL is, and also how having a wide variety of languages under one's belt can help solve problems when your tools themselves are limited.)

-austin

[ Parent ]

I still don't get it (4.00 / 1) (#24)
by KWillets on Wed Nov 20, 2002 at 11:54:19 PM EST


Without a temporary summary table, there was no way to solve this problem with SQL queries (after all, the temporary summary table solved the problem of the data structure), so I needed to approach this programmatically.

What's wrong with creating a date table, joining to the intervals with BETWEEN, and doing a GROUP BY to sum the averaged amounts?

create table dates(  dt date );

...insert a date for each day of interest...

select d.dt, t.type,
     sum( t.value / (t.end - t.start + 1))
from dates d, that_interval_table t
where d.dt between t.start and t.end
group by d.dt, t.type

Disclaimer:  the author is considered unqualified for employment as a database engineer, possibly due to producing too few lines of code per day.

I can answer this on several levels... (3.50 / 2) (#25)
by aziegler on Thu Nov 21, 2002 at 03:28:23 AM EST

Resource use: Computation time, on the data scales that we're talking about, is comparatively cheaper than disk space. Somewhat paradoxical, I know, but the data in the examples given is highly simplified, sitting around 21 bytes per row in size. The actual data was 250 - 500 bytes (there was a lot of data, and no, not all of it was actually useful, but there was only so much I could do to fix the application without breaking it). Of this, about 50 - 100 bytes was needed for the program that this example models. The actual data looked something like this:

  id_key: number(10)
  custnum: number(10)
  package: varchar2(10)
  service: varchar2(20)
  activity: varchar2(10)
  activity_date: date
  duration: number(10)
  description: varchar2(80)
  amount: number(10,2)
  bill_date: date

I'm missing about 8 other fields, but this gives a good idea. If the necessary temporary summary data were stored in its own table, it would require somewhere between 1.4 and 2.8 gigabytes of data for 100,000 customers on a monthly basis. (I know that this isn't what you suggested; bear with me, as I will address that specifically.) On the machines our customers were using, disks as small as 5 Gb were horrendously expensive from the customer's perspective (tightwads, I say, but they were spending a LOT of money on machines and our software, so ...)

Granularity: It gets quite a bit uglier when a few other specifics behind the original problem are reintroduced -- things I left out for simplicity's sake in explaining the technique and the solution. Not all records in the transaction table could be considered for inclusion -- some because the customer didn't have the necessary preconditions, some because the transactions themselves were illegal for consideration by this process.

I'm going to make a brief aside here -- if not all customers have the necessary preconditions for consideration by this process, why do I include them in the space calculations? Variability and performance. The preconditions could be met by a customer for multiple reasons in a given period, and the preconditions could be cleared for multiple reasons in a given period. Determining this on the fly (as my putative temporary summary table would theoretically need to do) would have prohibitive performance results on simple inserts into the transaction table (something that is supposed to be FAST). Thus, to keep a summary table, you just don't do lookup on this information -- you calculate it for everyone and let the process sort it out. It's still a performance hit, but not nearly as bad as having to do a lookup on every insert.
Back to granularity. The preconditions for the process were themselves user-definable, and a single account could have multiple preconditions. This meant that the process could be considering unbilled data +/- 1 year, which in part invalidates the instance table you suggested (I call it that because it records the "instance" of an event on a date). But that's not the end of the problems with granularity.

I alluded to one of the problems in the article, but I deliberately did not expand upon it for clarity's sake. If I have a transaction which spans from 2002.11.02 12.00.00 - 2002.11.05 11.59.59, how do I calculate the daily values? It's only three days long, but it spans four calendar days. By taking a simplistic approach, I will get an incorrect proportion calculation -- 1, 1, 1, 1. This isn't valid, as the proper proportions are 0.5, 1, 1, 0.5. With the summary table, this is soluble. With the instance table you suggested, it's not soluble at all.

More critical than the proportional calculations (and it was critical) is the amount of data necessary to make either the instance table or the summary table workable. The instance table, having less information to store, is smaller, but less suited to the problem as a whole (as I've noted above).

Performance: The suggested solution also has a significant performance problem in that it requires that the database do the calculations. In almost all cases (excepting dates, and then only sometimes), it will be faster for a userland program to do grouping and calculations than a database query. PL/SQL programs fall into a funny category in that they're usually faster and more flexible than straight queries in this regard, especially since they can work from memory tables, but PL/SQL isn't really suited for calculations as opposed to data manipulation, so it will perform slower than external languages. I didn't do performance tests on an idea like yours at the time, but similar experience leads me to believe that it would perform 50% - 200% (or more) slower than the PL/SQL program. When we switched the program to C++ (using the same basic data structures without the hacks necessary to make it work in PL/SQL and with some other enhancements that I wanted in there), we got a performance boost of up to 100x.

There's an additional performance consideration here -- by processing the data programmatically, we're only dealing with a limited context. That is, we select the data for a particular customer. Therefore, we're dealing with roughly 10 - 30 kilobytes of data in the memory structures at a time, instead of a larger table on disk.

Both the instance table and the summary table have two other major problems which are closely related. The first problem is that of philosophy; the second problem is that of repeatability.

Philosophy: One of the things that I've learned as a database designer -- and one of the keys of normalization -- is that you don't generally store computed values. Aside from having an unnecessary duplication of data, you run the risk that the computed data will become out of sync with the actual data. Thus, in general, you don't want to keep a balance table, but rather a balance view which adds the credit and debit sides of the customer's balance. Performance considerations may push one toward a bit of denormalization in this matter, but that means necessary programmatic updates, which reduce the integrity of one's data. The summary table violates this far more than the instance table, but the instance table violates it because it repeats the date information already available in the transaction table.

Repeatability: This is actually the big one. By using the original data table to calculate the temporary data -- including date information -- the process is perfectly and infinitely repeatable with predictable results (same input results in same output). There's no chance of the input data becoming out of sync with the output data (although to be fair, the chance of sync errors is far greater with the summary table rather than the instance table because the calculable data is drawn from the transaction table).

Hopefully, that explains why I wasn't ultimately able to consider a table-based approach, and given an opportunity to do it again, would do it the same (although I would probably do the C++ first, and damn the time constraints).

-austin

[ Parent ]

I'm in a hurry and my head hurts (none / 0) (#31)
by KWillets on Thu Nov 21, 2002 at 06:02:20 PM EST


This meant that the process could be considering unbilled data +/- 1 year, which in part invalidates the instance table you suggested (I call it that because it records the "instance" of an event on a date). But that's not the end of the problems with granularity.

Nope: the table I suggested records each "instance" of a day -- 365 or 366 per year. The join adds the other keys.


I alluded to one of the problems in the article, but I deliberately did not expand upon it for clarity's sake. If I have a transaction which spans from 2002.11.02 12.00.00 - 2002.11.05 11.59.59, how do I calculate the daily values? It's only three days long, but it spans four calendar days. By taking a simplistic approach, I will get an incorrect proportion calculation -- 1, 1, 1, 1. This isn't valid, as the proper proportions are 0.5, 1, 1, 0.5. With the summary table, this is soluble. With the instance table you suggested, it's not soluble at all.


     sum( least( (d.dt-t.start)+1.0, t.end - d.dt, 1.0 )*t.value / (t.end - t.start))

Must...stop...coding...

[ Parent ]

Why I don't like PL/SQL (2.00 / 1) (#26)
by ggeens on Thu Nov 21, 2002 at 03:35:49 AM EST

Error reporting.

For the few PL/SQL scripts I wrote, I spent more time divining why Oracle didn't like my stored procedure than actually making it work correctly.

But still, it has its place.

L'enfer, c'est les huîtres.


Oh yeah, it's the worst! (4.00 / 1) (#28)
by Hobbes on Thu Nov 21, 2002 at 04:09:54 AM EST

I forgot to put a closing parenthesis (I think that's what it was...) on a cursor I was writing, and Oracle kept telling me that it was missing some completely different character in the same spot. I kept trying to figure out how I was missing the character that Oracle reported, when the real problem was totally obvious once I ignored the stupid Oracle error.

Ok, so maybe you had to be there. But I've found that more often than not, the error that Oracle reports is so wrong-headed that it throws me off, and I have a harder time figuring out what is actually wrong.

Drives me batty.

Growr.
++++++
As bad as I am, I'm proud of the fact that I'm worse than I seem.
[ Parent ]

I've deduced ... (none / 0) (#30)
by Simon Kinahan on Thu Nov 21, 2002 at 11:22:50 AM EST

... that Oracle has two error messages. "missing right parenthesis", which actually means "syntax error of some indeterminate kind", and "missing left parenthesis", which means "missing right parenthesis".

Simon

If you disagree, post, don't moderate
[ Parent ]
Seems like a complex solution to a simple problem (none / 0) (#27)
by Hobbes on Thu Nov 21, 2002 at 03:53:07 AM EST

I don't know if I am just being thick skulled or not, but I swear I was faced with a similar problem at work not too long ago.

I wound up using a view which called a function. I don't know how far back that is supported, though, as we are using 9.

I saw somebody else post a much less complicated looking SQL statement. Is there a reason you could not do something similar? I'm curious as to why you chose this solution, I guess.

Growr.

++++++
As bad as I am, I'm proud of the fact that I'm worse than I seem.

Model isn't the problem... (none / 0) (#29)
by aziegler on Thu Nov 21, 2002 at 10:01:01 AM EST

I've posted a couple of responses, but my most extensive was this one. It amounts to the fact that the model used here, in this article, isn't the whole problem. Some of the problem can't be revealed (even though I no longer work for the company in question), but the problem was significantly more complex than what is stated in the article.

[ Parent ]
Not a bad language, except for Booleans (none / 0) (#32)
by israfil on Mon Nov 25, 2002 at 09:37:16 AM EST

I don't mind PL/SQL, except that you can only define booleans within PL/SQL.  There is no exportable BOOLEAN type, or BIT, or BINARY(1), or equivalent that immediately and obviously translates into a single, indexable bit of boolean information.

It's so frustrating.  You have to use larger data structures to simulate it, or you have to use a char or something and do bit-shifting math to get your bit.  Great for space saving, but crap on the indexing. ;)

i.
i. - this sig provided by /dev/arandom and an infinite number of monkeys with keyboards.

clarification... (none / 0) (#33)
by israfil on Mon Nov 25, 2002 at 09:39:28 AM EST

I meant booleans that you can get as a result of a SQL call from another language.  You can use booleans internally in PL/SQL, you just can't return one from a stored-procedure, or have a BOOLEAN datatype as a column in a table.

Having said that, I've used a really good bit-shifting library that does nice things with integers and masking.

i.
i. - this sig provided by /dev/arandom and an infinite number of monkeys with keyboards.
[ Parent ]
