aboutsummaryrefslogtreecommitdiff
path: root/fs/bcachefs/recovery.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* bcachefs: Fix replay_now_at() assertKent Overstreet2024-08-221-1/+7
| | | | | | | | Journal replay, in the slowpath where we insert keys in journal order, was inserting keys in the wrong order; keys from early repair come last. Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: unlock_long() before resort in journal replayKent Overstreet2024-08-221-0/+1
| | | | | | Fix another SRCU splat - this one pretty harmless. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_insert() - add btree iter flagsAriel Miculas2024-07-141-1/+1
| | | | | | | | | | | | The commit 65bd44239727 ("bcachefs: bch2_btree_insert_trans() no longer specifies BTREE_ITER_cached") removes BTREE_ITER_cached from bch2_btree_insert_trans, which causes the update_inode function from bcachefs-tools to take a long time (~20s). Add an iter_flags parameter to bch2_btree_insert, so the users can specify iter update trigger flags, such as BTREE_ITER_cached. Signed-off-by: Ariel Miculas <ariel.miculas@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Convert gc to new accountingKent Overstreet2024-07-141-2/+1
| | | | | | | | | | Rewrite fsck/gc for the new accounting scheme. This adds a second set of in-memory accounting counters for gc to use; like with other parts of gc we run all trigger in TRIGGER_GC mode, then compare what we calculated to existing in-memory accounting at the end. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Delete journal-buf-sharded old style accountingKent Overstreet2024-07-141-19/+1
| | | | | | More deletion of dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill bch2_fs_usage_initialize()Kent Overstreet2024-07-141-2/+0
| | | | | | Deleting code for the old disk accounting scheme. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: dev_usage updated by new accountingKent Overstreet2024-07-141-17/+0
| | | | | | | | | | | | | Reading disk accounting now requires an eytzinger lookup (see: bch2_accounting_mem_read()), but the per-device counters are used frequently enough that we'd like to still be able to read them with just a percpu sum, as in the old code. This patch special cases the device counters; when we update in-memory accounting we also update the old style percpu counters if it's a deice counter update. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Disk space accounting rewriteKent Overstreet2024-07-141-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | Main part of the disk accounting rewrite. This is a wholesale rewrite of the existing disk space accounting, which relies on percepu counters that are sharded by journal buffer, and rolled up and added to each journal write. With the new scheme, every set of counters is a distinct key in the accounting btree; this fixes scaling limitations of the old scheme, where counters took up space in each journal entry and required multiple percpu counters. Now, in memory accounting requires a single set of percpu counters - not multiple for each in flight journal buffer - and in the future we'll probably also have counters that don't use in memory percpu counters, they're not strictly required. An accounting update is now a normal btree update, using the btree write buffer path. At transaction commit time, we apply accounting updates to the in memory counters, which are percpu counters indexed in an eytzinger tree by the accounting key. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: btree write buffer knows how to accumulate bch_accounting keysKent Overstreet2024-07-141-0/+3
| | | | | | | | | | | | Teach the btree write buffer how to accumulate accounting keys - instead of having the newer key overwrite the older key as we do with other updates, we need to add them together. Also, add a flag so that write buffer flush knows when journal replay is finished flushing accounting, and teach it to hold accounting keys until that flag is set. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Accumulate accounting keys in journal replayKent Overstreet2024-07-141-2/+70
| | | | | | | | | | | | | | | | | | | | | | | | Until accounting keys hit the btree, they are deltas, not new versions of the existing key; this means we have to teach journal replay to accumulate them. Additionally, the journal doesn't track precisely which entries have been flushed to the btree; it only tracks a range of entries that may possibly still need to be flushed. That means we need to compare accounting keys against the version in the btree and only flush updates that are newer. There's another wrinkle with the write buffer: if the write buffer starts flushing accounting keys before journal replay has finished flushing accounting keys, journal replay will see the version number from the new updates and updates from the journal will be lost. To avoid this, journal replay has to flush accounting keys first, and we'll be adding a flag so that write buffer flush knows to hold accounting keys until then. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: KEY_TYPE_accountingKent Overstreet2024-07-141-0/+1
| | | | | | | | | | | | | | New key type for the disk space accounting rewrite. - Holds a variable sized array of u64s (may be more than one for accounting e.g. compressed and uncompressed size, or buckets and sectors for a given data type) - Updates are deltas, not new versions of the key: this means updates to accounting can happen via the btree write buffer, which we'll be teaching to accumulate deltas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: metadata version bucket_stripe_sectorsKent Overstreet2024-07-141-0/+5
| | | | | | | | | | New on disk format version for bch_alloc->stripe_sectors and BCH_DATA_unstriped - accounting for unstriped data in stripe buckets. Upgrade/downgrade requires regenerating alloc info - but only if erasure coding is in use. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix a UAF after write_super()Kent Overstreet2024-06-211-2/+2
| | | | | | | | | write_super() may reallocate the superblock buffer - but bch_sb_field_ext was referencing it; don't use it after the write_super call. Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Check for invalid btree IDsKent Overstreet2024-06-191-1/+7
| | | | | | | | We can only handle btree IDs up to 62, since the btree id (plus the type for interior btree nodes) has to fit ito a 64 bit bitmask - check for invalid ones to avoid invalid shifts later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure we're RW before journallingKent Overstreet2024-05-221-1/+3
| | | | | Reported-by: syzbot+c60cd352aedb109528bf@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix shift overflow in btree_lost_data()Kent Overstreet2024-05-201-0/+3
| | | | | | Reported-by: syzbot+29f65db1a5fe427b5c56@syzkaller.appspotmail.com Fixes: 55936afe1107 ("bcachefs: Flag btrees with missing data") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: journal_replay_entry_early() checks for nonexistent deviceKent Overstreet2024-05-081-8/+11
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_member.last_journal_bucketKent Overstreet2024-05-081-0/+2
| | | | | | | | | | | On recovery from clean shutdown we don't typically read the journal, but we still want to avoid overwriting existing entries in the journal for list_journal debugging. Thus, add some fields to the member info section so we can remember where we left off. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: journal seq blacklist gc no longer has to walk btreeKent Overstreet2024-05-081-4/+3
| | | | | | | | | | | Since btree_ptr_v2, we no longer require the journal seq blacklist table for skipping blacklisted bsets (btree node entries); the pointer to a given node indicates how much data is present. Therefore there's no longer any need for journal seq blacklist gc to walk the btree - we can prune entries older than journal last_seq. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_trans_unlock() must always be followed by relock() or begin()Kent Overstreet2024-05-081-1/+2
| | | | | | | | We're about to add new asserts for btree_trans locking consistency, and part of that requires that aren't using the btree_trans while it's unlocked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: member helper cleanupsKent Overstreet2024-05-081-1/+1
| | | | | | | | | | | | | | Some renaming for better consistency bch2_member_exists -> bch2_member_alive bch2_dev_exists -> bch2_member_exists bch2_dev_exsits2 -> bch2_dev_exists bch_dev_locked -> bch2_dev_locked bch_dev_bkey_exists -> bch2_dev_bkey_exists new helper - bch2_dev_safe Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: iter/update/trigger/str_hash flag cleanupKent Overstreet2024-05-081-7/+7
| | | | | | | | | | | Combine iter/update/trigger/str_hash flags into a single enum, and x-macroize them for a to_text() function later. These flags are all for a specific iter/key/update context, so it makes sense to group them together - iter/update/trigger flags were already given distinct bits, this cleans up and unifies that handling. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Finish converting reconstruct_alloc to errors_silentKent Overstreet2024-05-081-0/+11
| | | | | | | with errors_silent, reconstruct_alloc no longer requires fsck and fix_errors to work Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't read journal just for fsckKent Overstreet2024-05-081-1/+1
| | | | | | | reading the journal can take a decent amount of time compared to the rest of fsck, let's only read it when required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Run upgrade/downgrade even in -o nochanges modeKent Overstreet2024-05-081-43/+41
| | | | | | We need to be able to test these paths in dry run mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: don't free error pointersKent Overstreet2024-05-061-1/+2
| | | | | Reported-by: syzbot+3333603f569fc2ef258c@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: make sure to release last journal pin in replayKent Overstreet2024-04-161-1/+4
| | | | | | | This fixes a deadlock when journal replay has many keys to insert that were from fsck, not the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't scan for btree nodes when we can reconstructKent Overstreet2024-04-091-14/+0
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Reconstruct missing snapshot nodesKent Overstreet2024-04-031-0/+1
| | | | | | | | When the snapshots btree is going, we'll have to delete huge amounts of data - unless we can reconstruct it by looking at the keys that refer to it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Flag btrees with missing dataKent Overstreet2024-04-031-0/+23
| | | | | | | We need this to know when we should attempt to reconstruct the snapshots btree Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Repair pass for scanning for btree nodesKent Overstreet2024-04-031-25/+29
| | | | | | | | | | | | | | | If a btree root or interior btree node goes bad, we're going to lose a lot of data, unless we can recover the nodes that it pointed to by scanning. Fortunately btree node headers are fully self describing, and additionally the magic number is xored with the filesytem UUID, so we can do so safely. This implements the scanning - next patch will rework topology repair to make use of the found nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()Kent Overstreet2024-04-031-2/+2
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch2_shoot_down_journal_keys()Kent Overstreet2024-04-031-10/+12
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Clear recovery_passes_required as they complete without errorsKent Overstreet2024-04-031-3/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Improve -o norecovery; opts.recovery_pass_limitKent Overstreet2024-03-311-6/+4
| | | | | | | | | | | | | | | | | | | | This adds opts.recovery_pass_limit, and redoes -o norecovery to make use of it; this fixes some issues with -o norecovery so it can be safely used for data recovery. Norecovery means "don't do journal replay"; it's an important data recovery tool when we're getting stuck in journal replay. When using it this way we need to make sure we don't free journal keys after startup, so we continue to overlay them: thus it needs to imply retain_recovery_info, as well as nochanges. recovery_pass_limit is an explicit option for telling recovery to exit after a specific recovery pass; this is a much cleaner way of implementing -o norecovery, as well as being a useful debug feature in its own right. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Ensure bch_sb_field_ext always existsKent Overstreet2024-03-311-17/+8
| | | | | | | | | This makes bch_sb_field_ext more consistent with the rest of -o nochanges - we don't want to be varying other codepaths based on -o nochanges, since it's used for testing in dry run mode; also fixes some potential null ptr derefs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Flush journal immediately after replay if we did early repairKent Overstreet2024-03-311-0/+20
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Split out recovery_passes.cKent Overstreet2024-03-311-244/+5
| | | | | | | | | | | We've grown a fair amount of code for managing recovery passes; tracking which ones we're running, which ones need to be run, and flagging in the superblock which ones need to be run on the next recovery. So it's worth splitting out into its own file, this code is pretty different from the code in recovery.c. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Don't corrupt journal keys gap buffer when dropping alloc infoKent Overstreet2024-03-171-1/+5
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: reconstruct_alloc cleanupKent Overstreet2024-03-131-13/+38
| | | | | | | | | Now that we've got the errors_silent mechanism, we don't have to check if the reconstruct_alloc option is set all over the place. Also - users no longer have to explicitly select fsck and fix_errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: split out ignore_blacklisted, ignore_not_dirtyKent Overstreet2024-03-131-3/+4
| | | | | | prep work for replaying the journal backwards Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: improve move_gap()Kent Overstreet2024-03-131-2/+1
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: journal_keys now uses darray helpersKent Overstreet2024-03-131-6/+2
| | | | | | nice bit of code cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Rename journal_keys.d -> journal_keys.dataKent Overstreet2024-03-131-5/+5
| | | | | | This will let us use some darray helpers in the next patch. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Kill more -EIO error codesKent Overstreet2024-03-131-1/+1
| | | | | | | | This converts -EIOs related to btree node errors to private error codes, which will help with some ongoing debugging by giving us better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix journal replay with unreadable btree rootsKent Overstreet2024-03-101-0/+11
| | | | | | | | | | | | When a btree root is unreadable, we still might be able to get some data back by replaying what's in the journal. Previously though, we got confused when journal replay would attempt to replay a key for a level that didn't exist. This adds bch2_btree_increase_depth(), so that journal replay can handle this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: journal_seq_blacklist_add() now handles entries being added out of ↵Kent Overstreet2024-03-101-1/+1
| | | | | | | | | | | | | | | order bch2_journal_seq_blacklist_add() was bugged when the new entry overlapped with multiple existing entries, and it also assumed new entries are being added in increasing order. This is true on any sane filesystem, but when trying to recover from very badly mangled filesystems we might end up with the journal sequence number rewinding vs. what the blacklist list knows about - easiest to just handle that here. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Fix check_version_upgrade()Kent Overstreet2024-02-131-5/+6
| | | | | | | | When also downgrading, check_version_upgrade() could pick a new version greater than the latest supported version. Fixes: Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: bch_fs_usage_baseKent Overstreet2024-01-211-1/+1
| | | | | | | Split out base filesystem usage into its own type; prep work for breaking up bch2_trans_fs_usage_apply(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* bcachefs: Restart recovery passes more reliablyKent Overstreet2024-01-051-1/+4
| | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>