
CPU Bottleneck and SQLite Pooling: The Impact of Core Count

5 min read

Intro

For some background, I am working on a side project where the tech stack is:

| Category | Component |
| --- | --- |
| Backend | Axum |
| Backend | rusqlite |
| Backend | SQLite |
| Frontend | Remix |
| Frontend | shadcn |
For now I am using SQLite as my database; once I add MySQL, SQLite will become the cache layer.

I’ve configured SQLite to use WAL mode because of the following note in the SQLite docs:

WAL provides more concurrency as readers do not block writers and a writer does not block readers. Reading and writing can proceed concurrently.
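For reference, WAL mode can be enabled with rusqlite’s pragma_update (a minimal sketch; the open_cache name and db_path are just illustrative):

use rusqlite::Connection;

fn open_cache(db_path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(db_path)?;
    // journal_mode is persistent, so this only needs to run once per database file
    conn.pragma_update(None, "journal_mode", "WAL")?;
    Ok(conn)
}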

In my naive approach, I created a state like the following:

#[derive(Clone)]
struct AppState {
    // `Cache` wraps the single rusqlite::Connection; the tokio Mutex
    // serializes every database operation behind one lock
    cache: Arc<Mutex<Cache>>,
}

And in each route handler I would take the lock for every read and write, like the following:

pub async fn create_thing(
    State(state): State<AppState>,
    Json(thing): Json<CreateThingRequest>,
) -> Result<String, CreateThingError> {
    // take the global lock before touching the database
    let cache = state.cache.lock().await;

    cache
        .insert_thing(thing.into())
        .map_err(CreateThingError::from) // assumes a From impl on the error type
}

// ......

pub async fn get_thing(
    State(state): State<AppState>,
    Path(id): Path<String>,
) -> Result<String, CreateThingError> {
    // even a pure read has to wait on the same lock
    let cache = state.cache.lock().await;

    cache
        .get_thing(&id)
        .map_err(CreateThingError::from)
}

The issue with this naive approach of locking around every read and write is that it limits me to one database operation at a time. If 10 read requests come in, they are served one at a time, sequentially.

This is suboptimal: multiple simultaneous reads should be allowed, since reads don’t mutate data and can’t conflict with each other. And since I have WAL mode enabled, I shouldn’t need this lock() pattern around every SQLite operation at all; from my reading of the docs, concurrency and blocking are already handled at the database level. Adding another blocking mechanism at the application layer is overkill and only decreases overall performance.

A quick solution, I thought, would be to use a read-write lock on the rusqlite connection:

This way multiple readers can access the connection simultaneously, while writes are limited to one writer at a time.

use std::sync::{Arc, RwLock};

struct AppState {
    conn: Arc<RwLock<rusqlite::Connection>>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let state = AppState {
        conn: Arc::new(RwLock::new(rusqlite::Connection::open("cache.db")?)), // illustrative path
    };
    // ... hand `state` to the Axum router
    Ok(())
}

But you can’t, because rusqlite’s Connection holds a RefCell internally, making it Send but not Sync:

RefCell<rusqlite::inner_connection::InnerConnection> cannot be shared between threads safely
    within Connection, the trait Sync is not implemented for RefCell<rusqlite::inner_connection::InnerConnection>
    note: if you want to do aliasing and mutation between multiple threads, use std::sync::RwLock instead
    note: required for Arc<Connection> to implement Send

Connection Pool

After some quick searching, I found the r2d2_sqlite crate, which gives me a pool of connections so simultaneous requests can run in parallel. A pooled checkout also hands each request its own Connection rather than sharing one across threads, which sidesteps the Sync problem above. This should be good enough, and I shouldn’t need a write lock because SQLite only allows one writer at a time (at least from what I skimmed online and in the SQLite docs), so adding one is unnecessary. When I add MySQL, that’s when I’ll need to worry about multiple writers, since it supports parallel inserts into the same table.

It was an easy add:

// main.rs
#[derive(Clone)]
struct AppState {
    cache: Arc<CacheConfig>,
}

// cache.rs
use r2d2::Pool;
use r2d2_sqlite::SqliteConnectionManager;
use tracing::error;

/*
https://github.com/brettwooldridge/HikariCP/wiki/About-Pool-Sizing
connections = ((core_count * 2) + effective_spindle_count)
e.g. for the 2 vCPU server with an SSD (no spindles): (2 * 2) + 0 = 4
*/
const MAX_CONNECTIONS: u32 = 4;

pub struct CacheConfig {
    pool: Pool<SqliteConnectionManager>,
}

impl CacheConfig {
    fn configure_pool(db_path: &str) -> Pool<SqliteConnectionManager> {
        let manager = SqliteConnectionManager::file(db_path);
        Pool::builder()
            .max_size(MAX_CONNECTIONS)
            .build(manager)
            .unwrap_or_else(|e| {
                error!("Failed to configure connection pool: {}", e);
                panic!("Pool not configured");
            })
    }
    // ...
}
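For context, this is roughly how the pooled cache gets wired into the router (a sketch assuming axum 0.7; the CacheConfig::new constructor and bind address are illustrative, and the route matches the /thing/:id endpoint benchmarked below):

// main.rs (sketch)
use std::sync::Arc;
use axum::{routing::{get, post}, Router};

#[tokio::main]
async fn main() {
    let state = AppState {
        cache: Arc::new(CacheConfig::new("cache.db")), // hypothetical constructor
    };

    let app = Router::new()
        .route("/thing", post(create_thing))
        .route("/thing/:id", get(get_thing))
        .with_state(state);

    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}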

And then I switched all my queries from using the single connection to grabbing a connection from the pool and then executing:

-        stmt = self
-            .conn
+        // bind the pooled connection first so the statement can borrow from it
+        let conn = self.pool.get()?;
+        stmt = conn
+            .prepare(
+                "SELECT * FROM
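For completeness, a full pooled read might look something like this (a sketch; the things table, body column, and boxed error type are assumptions for illustration):

use rusqlite::params;

impl CacheConfig {
    pub fn get_thing(&self, id: &str) -> Result<String, Box<dyn std::error::Error>> {
        // get() blocks until a connection is free; it returns to the pool on drop
        let conn = self.pool.get()?;
        let thing = conn.query_row(
            "SELECT body FROM things WHERE id = ?1", // hypothetical schema
            params![id],
            |row| row.get(0),
        )?;
        Ok(thing)
    }
}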

Results

For benchmarking, I went with wrk, a simple CLI tool for HTTP benchmarking. This approach is admittedly the easy way out, and likely not the best way to capture the performance of my SQLite database, but my API is essentially a wrapper around writing to and fetching from the database. We can still get some takeaways on whether the pooled connections improved performance overall.

Before using the SQLite pool, on my 2 vCPU server in Hetzner:

❯ wrk -t4 -c100 -d10s \
  -H "accept: text/plain" \
  https://test-server.com/thing/0a27ea9d-964a-4eea-a1f9-3d0c049ad973

Running 10s test @ https://test-server.com/thing/0a27ea9d-964a-4eea-a1f9-3d0c049ad973
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    48.83ms   50.18ms 540.68ms   96.15%
    Req/Sec   612.14    123.49   800.00     81.72%
  23607 requests in 10.03s, 84.81MB read
Requests/sec:   2353.80
Transfer/sec:      8.46MB

Testing locally on my 2021 MacBook Pro, where sysctl -a | grep cpu reports hw.activecpu: 8:

❯ wrk -t4 -c100 -d10s \
  -H "accept: text/plain" \
  http://localhost:8080/thing/3ec05a57-ef02-4241-8dd5-12d6ccd37c5a

Running 10s test @ http://localhost:8080/thing/3ec05a57-ef02-4241-8dd5-12d6ccd37c5a
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.00ms    2.35ms  48.90ms   96.32%
    Req/Sec     2.30k   217.71     2.45k    92.25%
  91602 requests in 10.01s, 287.93MB read
Requests/sec:   9150.67
Transfer/sec:     28.76MB

After updating to a pool of connections.

2 vCPU Server with pool of 4 connections:

❯ wrk -t4 -c100 -d10s \
  -H "accept: text/plain" \
  https://test-server.com/thing/0a27ea9d-964a-4eea-a1f9-3d0c049ad973

Running 10s test @ https://test-server.com/thing/0a27ea9d-964a-4eea-a1f9-3d0c049ad973
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    57.08ms   72.66ms 632.56ms   95.54%
    Req/Sec   582.25    127.04   800.00     75.46%
  22172 requests in 10.09s, 79.64MB read
Requests/sec:   2197.00
Transfer/sec:      7.89MB

Local MacBook Pro with a pool of 4:

❯ wrk -t4 -c100 -d10s \
  -H "accept: text/plain" \
  http://localhost:8080/thing/3ec05a57-ef02-4241-8dd5-12d6ccd37c5a
Running 10s test @ http://localhost:8080/thing/3ec05a57-ef02-4241-8dd5-12d6ccd37c5a
  4 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.02ms    8.28ms 120.77ms   95.03%
    Req/Sec     5.36k     1.64k    7.95k    75.50%
  213465 requests in 10.01s, 670.99MB read
Requests/sec:  21316.27
Transfer/sec:     67.00MB
| Environment | Database Setup | Avg Latency | Requests/sec | Transfer/sec |
| --- | --- | --- | --- | --- |
| 2 vCPU Server | Before Pool | 48.83ms | 2353.80 | 8.46MB |
| Local MacBook Pro | Before Pool | 11.00ms | 9150.67 | 28.76MB |
| 2 vCPU Server | After Pool (4) | 57.08ms | 2197.00 | 7.89MB |
| Local MacBook Pro | After Pool (4) | 6.02ms | 21316.27 | 67.00MB |

Conclusion

We see that on the 8-core MacBook Pro, the pooled connections significantly improved requests per second: from 9,150 requests/second to 21,316 requests/second.

But there was actually a performance regression on the 2 vCPU server: from 2,353 requests/second down to 2,197 requests/second.

At first I was surprised to see a regression, but when I checked CPU utilization while the tests were running, I observed nearly 92% CPU utilization.

[CPU Util % dashboard]

This tells us that the server is not benefiting from the extra parallelism because the CPU is the limiting factor on how much work can be done, whereas my 8-core machine benefits greatly from having multiple connections.

It will be fun to experiment further: I’ll likely add an in-memory cache, dig into how the WAL file can be read and used, and optimize the queries…