path: root/fs/bcachefs/btree_write_buffer.c
Commit log:
* bcachefs: silence silly kdoc warning (Kent Overstreet, 2024-07-18, 1 file, -1/+1)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Eytzinger accumulation for accounting keys (Kent Overstreet, 2024-07-14, 1 file, -2/+52)
  The btree write buffer takes as input keys from the journal, sorts them, deduplicates them, and flushes them back to the btree in sorted order. The disk space accounting rewrite is moving accounting to normal btree keys, with updates (in this case deltas) accumulated in the write buffer and then flushed to the btree; but this is going to increase the number of keys handled by the write buffer by perhaps as much as a factor of 3x-5x.

  The overhead from copying around and sorting this many keys would cause a significant performance regression, but there is huge locality in updates to accounting keys that we can take advantage of. Instead of appending accounting keys to the list of keys to be sorted, this patch adds an eytzinger search tree of recently seen accounting keys. We look up the accounting key in the eytzinger search tree and apply the delta directly, adding it if it doesn't exist, and periodically prune the eytzinger tree of unused entries.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
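
  A minimal C sketch of the accumulate-in-place idea described above, with simplified stand-in types and a linear lookup standing in for the eytzinger-ordered search tree; none of the names below are the actual bcachefs identifiers:

    #include <stdbool.h>
    #include <stdint.h>

    struct acct_key   { uint64_t id; };     /* hypothetical accounting key */
    struct acct_entry { struct acct_key k; int64_t delta; bool used; };

    #define ACCT_CACHE_SIZE 64

    struct acct_cache {
        struct acct_entry e[ACCT_CACHE_SIZE];
        unsigned nr;
    };

    /* Accumulate a delta into an existing entry, or add the key if it isn't cached yet. */
    static bool acct_cache_add(struct acct_cache *c, struct acct_key k, int64_t delta)
    {
        for (unsigned i = 0; i < c->nr; i++)
            if (c->e[i].k.id == k.id) {
                c->e[i].delta += delta;     /* apply the delta directly */
                c->e[i].used = true;
                return true;
            }

        if (c->nr == ACCT_CACHE_SIZE)
            return false;                   /* caller falls back to the ordinary key list */

        c->e[c->nr++] = (struct acct_entry) { .k = k, .delta = delta, .used = true };
        return true;
    }

    /*
     * Periodic pruning of entries not touched since the last pass; in a real
     * implementation their accumulated deltas would be emitted to the normal
     * sorted key list before being dropped.
     */
    static void acct_cache_prune(struct acct_cache *c)
    {
        unsigned dst = 0;

        for (unsigned i = 0; i < c->nr; i++)
            if (c->e[i].used) {
                c->e[i].used = false;
                c->e[dst++] = c->e[i];
            }
        c->nr = dst;
    }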
* bcachefs: btree write buffer knows how to accumulate bch_accounting keys (Kent Overstreet, 2024-07-14, 1 file, -9/+75)
  Teach the btree write buffer how to accumulate accounting keys - instead of having the newer key overwrite the older key as we do with other updates, we need to add them together.

  Also, add a flag so that write buffer flush knows when journal replay is finished flushing accounting, and teach it to hold accounting keys until that flag is set.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
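
  A short sketch of the accumulate-versus-overwrite distinction above; the struct layout and names are illustrative, not the bcachefs types:

    #include <stdbool.h>
    #include <stdint.h>

    #define NR_COUNTERS 3

    struct wb_update {
        uint64_t pos;
        bool     is_accounting;
        int64_t  v[NR_COUNTERS];            /* counter deltas for accounting keys */
    };

    /* Merge @src into @dst when both updates refer to the same position. */
    static void wb_update_merge(struct wb_update *dst, const struct wb_update *src)
    {
        if (dst->is_accounting) {
            /* accounting keys: deltas are summed */
            for (unsigned i = 0; i < NR_COUNTERS; i++)
                dst->v[i] += src->v[i];
        } else {
            /* ordinary updates: the newer key simply replaces the older one */
            *dst = *src;
        }
    }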
* bcachefs: bch2_btree_write_buffer_maybe_flush() (Kent Overstreet, 2024-06-29, 1 file, -0/+37)
  Add a new helper for checking references to write buffer btrees, where we need a flush before we definitively know we have an inconsistency.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: iter/update/trigger/str_hash flag cleanup (Kent Overstreet, 2024-05-08, 1 file, -4/+4)
  Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later.

  These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
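
  A minimal sketch of the x-macro pattern mentioned above: one list expands into both the enum and a name table that a to_text()-style helper can use. The flag names and macro names here are placeholders, not the actual bcachefs definitions:

    #include <stdio.h>

    #define ITER_FLAGS()    \
        x(slots)            \
        x(intent)           \
        x(prefetch)         \
        x(nopreserve)

    enum iter_flag_bit {
    #define x(n)    ITER_FLAG_##n##_bit,
        ITER_FLAGS()
    #undef x
    };

    static const char * const iter_flag_names[] = {
    #define x(n)    #n,
        ITER_FLAGS()
    #undef x
        NULL
    };

    int main(void)
    {
        for (unsigned i = 0; iter_flag_names[i]; i++)
            printf("bit %u: %s\n", i, iter_flag_names[i]);
        return 0;
    }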
* bcachefs: Fix btree node merging on write buffer btrees (Kent Overstreet, 2024-04-13, 1 file, -2/+12)
  The btree write buffer flush fastpath that avoids the main transaction commit path had the unfortunate side effect of not doing btree node merging.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix journal pins in btree write buffer (Kent Overstreet, 2024-03-31, 1 file, -0/+14)
  The btree write buffer flush has two phases: first in natural key order, which is more efficient but may fail, then in journal order.

  The journal order flush was assuming that keys were still correctly ordered by journal sequence number - but due to coalescing by the previous phase, we need an additional sort.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
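
  A sketch of the extra sort described above, with simplified types (the struct and names are illustrative, not bcachefs's): after the key-order pass has coalesced entries, the survivors are re-sorted by journal sequence number before the journal-order fallback runs.

    #include <stdint.h>
    #include <stdlib.h>

    struct wb_flush_entry {
        uint64_t journal_seq;
        uint64_t keypos;
    };

    static int cmp_journal_seq(const void *l, const void *r)
    {
        const struct wb_flush_entry *a = l, *b = r;

        return a->journal_seq < b->journal_seq ? -1
             : a->journal_seq > b->journal_seq ?  1 : 0;
    }

    /* Re-establish journal order before the journal-order (slow path) flush. */
    static void sort_for_journal_order_flush(struct wb_flush_entry *e, size_t nr)
    {
        qsort(e, nr, sizeof(*e), cmp_journal_seq);
    }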
* bcachefs: Improve bch2_fatal_error() (Kent Overstreet, 2024-03-18, 1 file, -1/+1)
  Error messages should always include __func__.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: jset_entry for loops declare loop iter (Kent Overstreet, 2024-03-13, 1 file, -2/+0)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix journal_buf bitfield accesses (Kent Overstreet, 2024-03-13, 1 file, -0/+2)
  All journal_buf bitfield updates must happen under the journal lock - perhaps we should just switch these to atomic bit flags.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
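
  A small illustration of the rule above, with placeholder field names and a pthread mutex standing in for the journal lock: adjacent C bitfields share a storage unit, so read-modify-write of different fields from different contexts races unless every update happens under the same lock.

    #include <pthread.h>

    struct journal_buf_flags {
        unsigned noflush:1;
        unsigned must_flush:1;
        unsigned separate_flush:1;
    };

    struct journal_example {
        pthread_mutex_t          lock;      /* stand-in for the journal lock */
        struct journal_buf_flags buf;
    };

    static void set_must_flush(struct journal_example *j)
    {
        pthread_mutex_lock(&j->lock);       /* every bitfield update takes the lock */
        j->buf.must_flush = 1;
        pthread_mutex_unlock(&j->lock);
    }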
* bcachefs: Prep work for variable size btree node buffers (Kent Overstreet, 2024-01-21, 1 file, -4/+3)
  bcachefs btree nodes are big - typically 256k - and btree roots are pinned in memory. As we're now up to 18 btrees, that adds up to significant memory overhead in mostly empty btree roots.

  And in the future we're going to start enforcing that certain btree node boundaries exist, to solve lock contention issues - analogous to XFS's AGIs.

  Thus, we need to start allocating smaller btree node buffers when we can. This patch changes code that refers to the filesystem constant c->opts.btree_node_size to refer to the btree node buffer size - btree_buf_bytes() - where appropriate.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: __bch2_journal_key_to_wb -> bch2_journal_key_to_wb_slowpath (Kent Overstreet, 2024-01-05, 1 file, -1/+1)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: wb_key_cmp -> wb_key_ref_cmp (Kent Overstreet, 2024-01-05, 1 file, -6/+6)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree_iter -> btree_path_idx_t (Kent Overstreet, 2024-01-01, 1 file, -8/+12)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_path_make_mut() -> btree_path_idx_t (Kent Overstreet, 2024-01-01, 1 file, -1/+1)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: darray_for_each() now declares loop iter (Kent Overstreet, 2024-01-01, 1 file, -2/+0)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Inline btree write buffer sort (Kent Overstreet, 2024-01-01, 1 file, -11/+82)
  The sort in the btree write buffer flush path is a very hot path, and it's particularly performance sensitive since it's single threaded and can block every other thread on a multithreaded write workload. It's well worth doing a sort with inlined cmp and swap functions.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
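
  A simplified illustration of the inlined-sort idea: compare and swap are ordinary inlinable functions rather than indirect callbacks, so the hot flush path avoids per-element call overhead. This is a plain insertion sort over a toy key type, not the algorithm or the types bcachefs actually uses:

    #include <stddef.h>
    #include <stdint.h>

    struct wb_sort_key { uint64_t hi, lo; };

    static inline int wb_sort_key_cmp(const struct wb_sort_key *a,
                                      const struct wb_sort_key *b)
    {
        return a->hi < b->hi ? -1 : a->hi > b->hi ? 1 :
               a->lo < b->lo ? -1 : a->lo > b->lo ? 1 : 0;
    }

    static inline void wb_sort_key_swap(struct wb_sort_key *a, struct wb_sort_key *b)
    {
        struct wb_sort_key t = *a;
        *a = *b;
        *b = t;
    }

    /* Insertion sort with inlined cmp/swap; no function-pointer calls in the loop. */
    static void wb_sort(struct wb_sort_key *keys, size_t nr)
    {
        for (size_t i = 1; i < nr; i++)
            for (size_t j = i; j && wb_sort_key_cmp(&keys[j - 1], &keys[j]) > 0; j--)
                wb_sort_key_swap(&keys[j - 1], &keys[j]);
    }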
* bcachefs: btree write buffer now slurps keys from journal (Kent Overstreet, 2024-01-01, 1 file, -133/+305)
  Previously, the transaction commit path would have to add keys to the btree write buffer as a separate operation, requiring additional global synchronization.

  This patch introduces a new journal entry type, which indicates that the keys need to be copied into the btree write buffer prior to being written out. We switch the journal entry type back to JSET_ENTRY_btree_keys prior to write, so this is not an on disk format change.

  Flushing the btree write buffer may require pulling keys out of journal entries yet to be written, and quiescing outstanding journal reservations; we previously added journal->buf_lock for synchronization with the journal write path.

  We also can't put strict bounds on the number of keys in the journal destined for the write buffer, which means we might overflow the size of the preallocated buffer and have to reallocate - this introduces a potentially fatal memory allocation failure. This is something we'll have to watch for; if it becomes an issue in practice we can do additional mitigation.

  The transaction commit path no longer has to explicitly check if the write buffer is full and wait on flushing; this is another performance optimization. Instead, when the btree write buffer is close to full we change the journal watermark, so that only reservations for journal reclaim are allowed.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: more write buffer refactoring (Kent Overstreet, 2024-01-01, 1 file, -40/+41)
  prep work for big rewrite - no functional changes in this patch.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: wb_flush_one_slowpath() (Kent Overstreet, 2024-01-01, 1 file, -28/+29)
  A bit of refactoring for better inlining in the main btree write buffer flush path.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush() (Kent Overstreet, 2024-01-01, 1 file, -1/+1)
  More accurate naming.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_write_buffer_flush_locked() (Kent Overstreet, 2024-01-01, 1 file, -10/+17)
  Minor refactoring - improved naming, and move the responsibility for flush_lock to the caller instead of having it be shared.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
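
  A sketch of the caller-holds-lock convention referred to above, with placeholder names and a pthread mutex standing in for the real lock; it mirrors the pattern only, not the actual bcachefs functions:

    #include <pthread.h>

    struct wb_example {
        pthread_mutex_t flush_lock;
        /* ... buffered keys ... */
    };

    /* Caller must already hold wb->flush_lock. */
    static int write_buffer_flush_locked(struct wb_example *wb)
    {
        /* sort, deduplicate and apply the buffered keys */
        return 0;
    }

    /* Convenience wrapper that owns the locking. */
    static int write_buffer_flush(struct wb_example *wb)
    {
        pthread_mutex_lock(&wb->flush_lock);
        int ret = write_buffer_flush_locked(wb);
        pthread_mutex_unlock(&wb->flush_lock);
        return ret;
    }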
* bcachefs: Clean up btree write buffer write ref handling (Kent Overstreet, 2024-01-01, 1 file, -12/+26)
  __bch2_btree_write_buffer_flush() now assumes a write ref is already held (as called by the transaction commit path), and the wrappers bch2_write_buffer_flush() and flush_sync() take an explicit write ref.

  This means that internally the write buffer code can always use BTREE_INSERT_NOCHECK_RW, instead of, as in the previous code, passing flags around and hoping the NOCHECK_RW flag was always carried around correctly.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve btree write buffer tracepoints (Kent Overstreet, 2024-01-01, 1 file, -2/+6)
  - add a tracepoint for write_buffer_flush_sync; this is expensive
  - fix the write_buffer_flush_slowpath tracepoint

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Rename BTREE_INSERT flags (Kent Overstreet, 2024-01-01, 1 file, -9/+9)
  BTREE_INSERT flags are actually transaction commit flags - rename them for clarity.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Avoiding dropping/retaking write locks in bch2_btree_write_buffer_flush_one() (Kent Overstreet, 2024-01-01, 1 file, -9/+7)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill BTREE_UPDATE_PREJOURNAL (Kent Overstreet, 2024-01-01, 1 file, -4/+10)
  With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can now switch the btree write buffer to use it for flushing. This has the advantage that transaction commits don't need to take a journal reservation at all.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Journal pins must always have a flush_fn (Kent Overstreet, 2024-01-01, 1 file, -11/+7)
  flush_fn is how we identify journal pins in debugfs - this is a debugging aid.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Heap allocate btree_trans (Kent Overstreet, 2023-10-22, 1 file, -1/+1)
  We're using more stack than we'd like in a number of functions, and btree_trans is the biggest object that we stack allocate. But we have to do a heap allocation to initialize it anyways, so there's no real downside to heap allocating the entire thing.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix btree write buffer with snapshots btrees (Kent Overstreet, 2023-10-22, 1 file, -3/+6)
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: use prejournaled key updates for write buffer flushes (Brian Foster, 2023-10-22, 1 file, -2/+28)
  The write buffer mechanism journals keys twice in certain situations. A key is always journaled on write buffer insertion, and is potentially journaled again if a write buffer flush falls into either of the slow btree insert paths. This has shown to cause journal recovery ordering problems in the event of an untimely crash.

  For example, consider if a key is inserted into index 0 of a write buffer, the active write buffer switches to index 1, the key is deleted in index 1, and then index 0 is flushed. If the original key is rejournaled in the btree update from the index 0 flush, the (now deleted) key is journaled in a seq buffer ahead of the latest version of the key (which was journaled when the key was deleted in index 1). If the fs crashes while this is still observable in the log, recovery sees the key from the btree update after the delete key from the write buffer insert, which is the incorrect order. This problem is occasionally reproduced by generic/388 and generally manifests as one or more backpointer entry inconsistencies.

  To avoid this problem, never rejournal write buffered key updates to the associated btree. Instead, use prejournaled key updates to pass the journal seq of the write buffer insert down to the btree insert, which updates the btree leaf pin to reflect the seq of the key. Note that tracking the seq is required instead of just using NOJOURNAL here because otherwise we lose protection of the write buffer pin when the buffer is flushed, which means the key can fall off the tail of the on-disk journal before the btree leaf is flushed and lead to similar recovery inconsistencies.

  Signed-off-by: Brian Foster <bfoster@redhat.com>
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Add a race_fault() for write buffer slowpath (Kent Overstreet, 2023-10-22, 1 file, -0/+3)
  We haven't hooked up dynamic fault injection quite yet, but we will soon.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill BTREE_INSERT_USE_RESERVE (Kent Overstreet, 2023-10-22, 1 file, -2/+4)
  Now that we have journal watermarks and alloc watermarks unified, BTREE_INSERT_USE_RESERVE is redundant and can be deleted.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill JOURNAL_WATERMARK (Kent Overstreet, 2023-10-22, 1 file, -1/+1)
  This unifies JOURNAL_WATERMARK with BCH_WATERMARK; we're working towards specifying watermarks once in the transaction commit path.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Write buffer flush needs BTREE_INSERT_NOCHECK_RW (Kent Overstreet, 2023-10-22, 1 file, -0/+1)
  btree write buffer flush is only invoked from contexts that already hold a write ref, and checking if we're still RW could cause us to fail to completely flush the write buffer when shutting down.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: more aggressive fast path write buffer key flushing (Brian Foster, 2023-10-22, 1 file, -21/+22)
  The btree write buffer flush code is prone to causing journal deadlock due to inefficient use and release of reservation space. Journal space is not pre-reserved for write buffered keys (as it is for key cache keys, for example), because the write buffer flush side uses a fast path that attempts insertion without needing any reservation at all. The write buffer flush attempts to deal with this by inserting keys using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on journal reservations that require blocking. Upon first error, it falls back to a slow path that inserts in journal order and supports moving the associated journal pin forward.

  The problem is that under pathological conditions (i.e. smaller log, larger write buffer and journal reservation pressure), we've seen instances where the fast path fails fairly quickly without having completed many insertions, and then the slow path is unable to push the journal pin forward enough to free up the space it needs to completely flush the buffer. This problem is occasionally reproduced by fstest generic/333.

  To avoid this problem, update the fast path algorithm to skip key inserts that fail due to inability to acquire needed journal reservation without immediately breaking out of the loop. Instead, insert as many keys as possible, zap the sequence numbers to mark them as processed, and then fall back to the slow path to process the remaining set in journal order. This reduces the amount of journal reservation that might be required to flush the entire buffer and increases the odds that the slow path is able to move the journal pin forward and free up space as keys are processed.

  Signed-off-by: Brian Foster <bfoster@redhat.com>
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
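
  A sketch of the fast-path behaviour described above, using toy types: keys whose non-blocking insert fails are skipped rather than aborting the loop, successfully flushed entries are marked by zeroing their journal_seq, and anything left over is handed to the journal-order slow path. All names and the insert callback are illustrative, not the bcachefs code:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct wb_fast_entry {
        uint64_t journal_seq;               /* 0 == already flushed */
        uint64_t keypos;
    };

    typedef bool (*try_insert_fn)(struct wb_fast_entry *, void *);

    /* Returns the number of entries still needing the slow path. */
    static size_t wb_flush_fast_path(struct wb_fast_entry *e, size_t nr,
                                     try_insert_fn try_insert, void *ctx)
    {
        size_t remaining = 0;

        for (size_t i = 0; i < nr; i++) {
            if (!e[i].journal_seq)
                continue;                   /* coalesced away or already done */

            if (try_insert(&e[i], ctx))
                e[i].journal_seq = 0;       /* zap seq: processed */
            else
                remaining++;                /* leave it for the slow path */
        }

        return remaining;
    }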
* bcachefs: Private error codes: ENOMEM (Kent Overstreet, 2023-10-22, 1 file, -1/+1)
  This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix for shared paths in write buffer flush (Kent Overstreet, 2023-10-22, 1 file, -0/+9)
  It's possible for bch2_write_buffer_flush_one() to end up with a shared path, if called from a context that already has a btree iterator pointing to a key being flushed. We have to be careful when that happens, since we can't clone a path that holds write locks.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: let __bch2_btree_insert() pass in flags (Daniel Hill, 2023-10-22, 1 file, -1/+1)
  This patch is prep work for the following patch.

  Signed-off-by: Daniel Hill <daniel@gluo.nz>
  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Btree write buffer (Kent Overstreet, 2023-10-22, 1 file, -0/+330)
  This adds a new method of doing btree updates - a straight write buffer, implemented as a flat fixed size array. This is only useful when we don't need to read from the btree in order to do the update, and when reading is infrequent - perfect for the LRU btree. This will make LRU btree updates fast enough that we'll be able to use it for persistently indexing buckets by fragmentation, which will be a massive boost to copygc performance.

  Changes:
  - A new btree_insert_type enum, for btree_insert_entries. Specifies btree, btree key cache, or btree write buffer.
  - bch2_trans_update_buffered(): updates via the btree write buffer don't need a btree path, so we need a new update path.
  - Transaction commit path changes: The update to the btree write buffer both mutates global state, and can fail if there isn't currently room. Therefore we do all write buffer updates in the transaction all at once, and also if it fails we have to revert filesystem usage counter changes. If there isn't room we flush the write buffer in the transaction commit error path and retry.
  - A new persistent option, for specifying the number of entries in the write buffer.

  Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
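
  A toy model of the flat, fixed-size write buffer introduced here: updates are appended without a btree path, and the caller must flush and retry when the buffer is full. Types and names are simplified stand-ins for illustration, not the bcachefs implementation:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct buffered_update {
        uint16_t btree_id;
        uint64_t pos;
        int64_t  val;
    };

    struct write_buffer {
        struct buffered_update *entries;
        size_t nr, size;                    /* size comes from a persistent option */
    };

    /* Append an update; returns false when the caller must flush and retry. */
    static bool trans_update_buffered(struct write_buffer *wb,
                                      uint16_t btree_id, uint64_t pos, int64_t val)
    {
        if (wb->nr == wb->size)
            return false;

        wb->entries[wb->nr++] = (struct buffered_update) {
            .btree_id = btree_id,
            .pos      = pos,
            .val      = val,
        };
        return true;
    }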