How to count rows read (scanned) in sqlite

Table of Contents

TLDR; Use "scanstats" to get the rows read (scanned)

I am using hosted sqlite service for Easyanalytica . These services bill by the number of rows read(scanned). I was curious to know how my queries are performing but i was surprised to find that this data is not provided by default in sqlite. In rest of the article i will cover the need for counting rows read, current data provided by sqlite and how to get rows scanned by a query in sqlite.

Need for counting row reads (scans) in sqlite

Managed database services like turso and Cloudflare d1 have become increasingly popular. They offer the simplicity of SQLite with the scalability of cloud infrastructure. These services often charge based on the number of rows read by your queries.

While these services provide built-in tools to measure row reads, using them incurs charges. You want to optimize your queries to reduce costs, but measuring them costs money. That's why it's essential to optimize queries locally before deploying them to these managed environments.

The cost implications of excessive row reads can be significant. Even seemingly simple queries can sometimes read far more rows than necessary, leading to inflated bills and reduced performance. This is why understanding and optimizing row reads is crucial for both performance and cost-effectiveness.

Problem in counting rows reads (scans) in sqlite locally

SQLite does come with built-in profiling features. For example, you can use the EXPLAIN QUERY PLAN command to get insights into how SQLite executes your queries. Here's an example:

EXPLAIN QUERY PLAN SELECT COUNT()
       FROM events
       WHERE orgid = '01'
       AND eventname != ''
       AND createdat >= '2024-06-20'
       AND createdat < '2024-07-05';

This will result in output like:

QUERY PLAN
`-- SCAN events USING COVERING INDEX eventOrgEventCreatedIdx

While this information is useful, it doesn't give us the specific number of rows read. It just tells us that index was used. SQLite's default profiling provides a high-level query plan, which is great for understanding the overall strategy SQLite uses to execute your query, but it falls short when we need detailed row read information.

The difference between this default profiling and actual row read counting is significant. The former gives you a general idea of how SQLite approaches your query, while the latter provides precise information about the query's efficiency in terms of data access.

How to Count Rows Read Locally

After some digging, I discovered the ".scanstats" option in SQLite. This feature allows us to get the exact number of rows read by our queries. However, it's not available in the standard SQLite build - you need a custom build with this option enabled.

Here's a step-by-step guide to enable the ".scanstats" option:

Clone this github repository sqlean shell
Modify the `Makefile` file and add -DSQLITE_ENABLE_STMT_SCANSTATUS to enable the sqlite3_stmt_scanstatus_v2() API.
Compile SQLite with this option enabled.
Use the newly compiled SQLite binary in your development environment.

This build contains additional extensions enabled as well. Once you have this custom build, you can use the ".scanstats on" command to enable row read counting. Here's an example of how it works:

Turn on the ".scanstats" option

sqlite> .scanstats on

Query:

SELECT COUNT()
       FROM events
       WHERE orgid = '01'
       AND eventname != ''
       AND createdat >= '2024-06-20'
       AND createdat < '2024-07-05';

Query Plan:

760
QUERY PLAN (cycles=7629278 [100%])
`--SEARCH events USING INDEX eventOrgEventCreatedIdx (orgId=?)
     (cycles=6550297 [86%] loops=1 rows=17962)

This output was initially baffling. This table contained data for a single orgid with single eventname so ideally if we are using index it should limit itself to the events falling in the createdat range given in the query. Despite giving specific conditions and using an index, it appeared that all rows of the table were being scanned. The query plan showed that 17,962 rows were read, even though the result was only 760.

This discrepancy led me to investigate further. The query plan provided by ".scanstats" was crucial here, as it gave both the row count and the information about index usage. If i would have just seen index i would not have noticed anything unusual. In case you are wondering how i knew it was using all the rows is because i already knew the number of rows in this table but the point still remains that without this combination i would not be able to track down the problem.

After some experimentation, I found that index was corrupt. So i recreated it. New query plan for the same query looked like this

Query:

SELECT COUNT()
       FROM events
       WHERE orgid = '01'
       AND eventname != ''
       AND createdat >= '2024-06-20'
       AND createdat < '2024-07-05';

Query Plan:

QUERY PLAN (cycles=3865743 [100%])
`--SEARCH events USING INDEX eventOrgEventCreatedIdx (orgId=?)
     (cycles=3622521 [94%] loops=1 rows=760)

This new query plan shows that only 760 rows were read - a massive improvement from the previous 17,962 rows for the same query.

Conclusion

Understanding and optimizing row reads in SQLite is crucial, especially when working with managed database services. While these services offer great scalability, their pricing models make it essential to optimize queries before deployment.

The ".scanstats" option in SQLite provides a powerful tool for this optimization process. By giving us precise information about row reads, it allows us to fine-tune our queries for maximum efficiency.

PS: In case you want to know how i used chatgpt/claude to optimize sql let me know i will write another article on it.