Thursday, May 10, 2007

What's faster? SQL or looping to return a set of records?

When you need to get a set of records (for example, to return a list of items from a lookup table, to prepare a report, or to fill a read-only grid), which method should you use?

Of course, the answer is 'it depends' and many factors are at play. It is part of our job as developers (or programmers, if you prefer) to pick the solution that best fits the scenario. Much has been written on this topic already, so I won't be revealing anything new here: just a repackaging, if you will.

First, the basics. There are several ways to go through a table with the intention of producing a set of resulting records. You can SCAN, DO WHILE, use a FOR...NEXT loop, or use SQL syntax. The first three options are procedural and process one record at a time. The last (SQL) is set-based. To get an idea of which looping mechanism to use, take a look at Andy Kramek's blog entry from March 2006 called "Writing Better Code (Part 2)". In it, Andy does a great job of showing the effects of indexes on the three looping mechanisms; it should give you a good idea of which one to use and when.

SQL syntax can return a set of records as well: you can replace your looping code with a single SQL statement and get the same results (using much less code). When gathering data for a report, for example, I have often used the loop method to populate a special report cursor. I was inclined to stick with the procedural approach in the past because (a) it was easy enough to write, (b) it was easy to debug, (c) it allowed me to easily insert a custom progress bar if needed, and (d) I thought it was best. As my SQL skills improved, I found that SQL syntax is actually (a) easier to write, (b) just as easy to debug, (c) capable of supporting a custom progress bar, and (d) may in fact be best under many circumstances. This is especially true if you are trying to write portable code (duh!).

To demonstrate my point, I first took a table with 125,000 records and no index (CDX) file. This first case processes every record in the table. My benchmarking methodology here would make the blood of anyone who has ever benchmarked in a scientific way boil, but hey, I never claimed to be a scientist (I don't even play one on TV)!

First, are we all on the same page?


SET ANSI OFF
SET EXACT OFF
SET TALK OFF
SET DELETED ON
CLEAR


Carry on...


USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    INSERT INTO crName VALUES (name_list.name)
    SKIP
ENDDO
? SECONDS() - nSecs && results: 0.313 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    INSERT INTO crName VALUES (name_list.name)
NEXT
? SECONDS() - nSecs && results: 0.281 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list INTO CURSOR crName
? SECONDS() - nSecs && results: 0.000 seconds (10 runs)
USE IN name_list


So what on earth is going on? The SQL statement processes the entire result in (literally) no time at all! The answer can be found in a comment made on a recent blog posting of mine. Sergey explained that VFP doesn't actually run any SQL here. All it does is create a filtered cursor on the underlying table. This is why RECCOUNT() could be "wrong", as discussed in my blog entry. Check out that link to Sergey's site to read more about this. You can check what you got simply by issuing ? JUSTEXT(DBF()) (as Sergey does): "DBF" means a filtered cursor was created, while "TMP" means a new cursor was created.

Additional Notes: When there is a CDX file (with no index order set), the FOR loop seems to suffer a little: on average, it takes an additional .020 seconds to run. The existence of a CDX file does not seem to bother the DO WHILE or SQL methods in the above case.

Now, let's add a condition that the name was added to the table on or after 01/01/2005. The first run of this test is with no CDX file at all. If a CDX file exists, but no index set, then the results are a bit different (see notes after the code):


USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created >= {^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs && results: 0.234 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created >= {^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs && results: 0.235 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs && results: 0.125 seconds (10 runs)
USE IN name_list


Here, with no CDX file, VFP actually processes the SQL (it doesn't create a filtered cursor), and does so rather quickly. If you add an index on the created date, FoxPro goes ahead and creates the filtered cursor (which, again, is "processed" in zero seconds).
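You can watch this switch happen with the JUSTEXT(DBF()) check from earlier. This is just a sketch: the tag name created is my own assumption, and the extensions shown are the outcomes described above:


* No index on created: VFP runs the query and builds a brand-new cursor
SELECT name FROM name_list WHERE created >= {^2005-01-01} INTO CURSOR crName
? JUSTEXT(DBF("crName")) && "TMP" (new cursor)

* Add an index on created, and the same query hands back a filtered cursor
INDEX ON created TAG created
SELECT name FROM name_list WHERE created >= {^2005-01-01} INTO CURSOR crName
? JUSTEXT(DBF("crName")) && "DBF" (filtered cursor on the base table)
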

Additional Notes: When there is a CDX file (even with no index set), the FOR loop seems to suffer a little: On average, the FOR loop takes an additional .020 seconds to run. The existence of a CDX file does not seem to bother the DO WHILE statement.

OK, so now what if I actually use the INDEX, and SET ORDER TO before all of this?


USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
LOCATE
DO WHILE NOT EOF()
    IF created >= {^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
    SKIP
ENDDO
? SECONDS() - nSecs && results: 1.047 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
CREATE CURSOR crName (name c(20))
SELECT name_list
FOR nRec = 1 TO RECCOUNT()
    GOTO nRec
    IF created >= {^2005-01-01}
        INSERT INTO crName VALUES (name_list.name)
    ENDIF
NEXT
? SECONDS() - nSecs && results: 0.235 seconds (10 runs)
USE IN name_list

*-----

USE name_list IN 1 EXCL ORDER created
nSecs = SECONDS()
SELECT name FROM name_list WHERE created>={^2005-01-01} INTO CURSOR crName
? SECONDS() - nSecs && results: 0.000 seconds (10 runs)
USE IN name_list



The DO WHILE really gets hit hard, as you can see (presumably because SKIP now walks the records in index order, forcing non-sequential reads of the table). The FOR loop works as well as before, and the SQL once again gives us a filtered cursor.

As you can see, you can really take advantage of this behavior to create simple cursors from simple predicates. Even when VFP can't create a filtered cursor (for example, when you apply an ORDER BY clause to the SQL statement), the results are still usually faster than looping.
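One related note: if you need the result as a true, independent cursor (to make it writable, or to compare apples to apples against the loops above), you can tell VFP to skip the filter optimization with the NOFILTER clause:


SELECT name FROM name_list WHERE created >= {^2005-01-01} ;
    INTO CURSOR crName NOFILTER


This forces VFP to materialize a new cursor, so expect the timing to look more like the no-index cases above rather than the zero-second filtered-cursor runs.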

In a subsequent posting, I'll increase the complexity of these tests. In the meantime, I would be interested in any feedback to these cases.

4 comments:

Anonymous said...

Why not use the NOFILTER clause on your SQL statements to allow the results to be compared?

Anonymous said...

Oh, and having the table ordered by an index can affect SQL performance *negatively*.. try with and without the ORDER clause in the USE statement (the index doesn't have to be active for Rushmore to use it in a SQL query)

Tod McKenna said...

Hello anonymous, and thanks for the feedback! Certainly NOFILTER in the SQL will make some difference and I'll add that to my next tests. The second suggestion is already accounted for in the examples (tried with and without the ORDER clause, and with and without the index). Unless I am misunderstanding?

Anonymous said...

I am sorry, but I just don't see why anyone would do a loop with an insert as compared to a select from x into y where z. The select has to be faster.