- Mar 23, 2010
-
-
Sage Weil authored
The wait_unsafe_requests() helper dropped the mdsc mutex to wait for each request to complete, and then examined r_node to get the next request after retaking the lock. But the request completion removes the request from the tree, so r_node was always undefined at this point. Since it's a small race, it usually led to a valid request, but not always. The result was an occasional crash in rb_next() while dereferencing node->rb_left. Fix this by clearing the rb_node when removing the request from the request tree, and not walking off into the weeds when we are done waiting for a request. Since the request we waited on will _always_ be out of the request tree, take a ref on the next request, in the hopes that it won't be. But if it is, it's ok: we can start over from the beginning (and traverse over older read requests again). Signed-off-by:
Sage Weil <sage@newdream.net>
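A minimal sketch of the shape of the fix, assuming the fs/ceph naming (__unregister_request, r_node, request_tree); the bodies are illustrative, not the exact patch:

    static void __unregister_request(struct ceph_mds_client *mdsc,
                                     struct ceph_mds_request *req)
    {
            rb_erase(&req->r_node, &mdsc->request_tree);
            RB_CLEAR_NODE(&req->r_node);  /* waiters can now detect removal */
    }

    /*
     * In wait_unsafe_requests(): take a reference on the *next* request
     * before dropping the mutex; after waiting, if that request has also
     * left the tree (RB_EMPTY_NODE(&req->r_node)), restart the scan from
     * the beginning rather than calling rb_next() on a stale node.
     */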
-
Sage Weil authored
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release with MDS requests (e.g. setattr). We don't carry refs on most caps, so this code worked most of the time, but for setattr (utimes) we try to drop Fscr. This causes cap state to get slightly out of sync with reality, and may result in subsequent mds revoke messages getting ignored. Fix by only releasing unused caps. Signed-off-by:
Sage Weil <sage@newdream.net>
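In sketch form (ci and __ceph_caps_used() follow the fs/ceph code; the surrounding release logic is elided):

    int used = __ceph_caps_used(ci);

    /* only offer to release caps that are not in active use, so e.g.
     * setattr can no longer drop Fc while FILE_CACHE is being used */
    drop &= ~used;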
-
Sage Weil authored
Drop session mutex unconditionally in handle_cap_grant, and do the check_caps from the handle_cap_grant helper. This avoids using a magic return value. Also avoid using a flag variable in the IMPORT case and call check_caps at the appropriate point. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Passing a session pointer to ceph_check_caps() used to mean it would leave the session mutex locked. That wasn't always possible if it wasn't passed CHECK_CAPS_AUTHONLY. It could unlock the passed session and lock a different session's mutex, which was clearly wrong, and also emitted a warning when a racing CPU retook it and we did an unlock from the wrong context. This was only a problem when there was more than one MDS. First, make ceph_check_caps unconditionally drop the session mutex, so that it is free to lock other sessions as needed. Then adjust the one caller that passes in a session (handle_cap_grant) accordingly. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
If we don't have the exported cap, it's because we already released it. No need to WARN. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
This causes an oops when debug output is enabled and we kick an osd request with no current r_osd (sometime after an osd failure). Check the pointer before dereferencing. Signed-off-by:
Sage Weil <sage@newdream.net>
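The guard amounts to something like this (dout() and the field names follow the ceph osd_client code; the message text is illustrative):

    dout("kicking req %p tid %llu osd%d\n", req, req->r_tid,
         req->r_osd ? req->r_osd->o_osd : -1);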
-
Sage Weil authored
Previously we would decode state directly into our current ticket_handler. This is problematic if for some reason we fail to decode, because we end up with half new state and half old state. We are probably already in bad shape if we get an update we can't decode, but we may as well be tidy anyway. Decode into new_* temporaries and update the ticket_handler only on success. Signed-off-by:
Sage Weil <sage@newdream.net>
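A sketch of the decode-then-commit pattern; the struct and field names here are stand-ins for the real ticket_handler state, and the decoding itself is elided:

    static int process_one_ticket(struct ceph_x_ticket_handler *th,
                                  void **p, void *end)
    {
            u32 new_secret_id;
            struct ceph_buffer *new_ticket_blob;

            /* decode everything into the new_* temporaries first;
             * on any failure, return without touching th */

            /* commit only once the whole update decoded cleanly */
            th->secret_id = new_secret_id;
            ceph_buffer_put(th->ticket_blob);
            th->ticket_blob = new_ticket_blob;
            return 0;
    }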
-
- Mar 21, 2010
-
-
Sage Weil authored
Release the old ticket_blob buffer when we get an updated service ticket from the monitor. Previously these were getting leaked. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
The buffer size was incorrectly calculated for the ceph_x_encrypt() encapsulated ticket blob. Use a helper (with correct arithmetic) and BUG out if we were wrong. Signed-off-by:
Sage Weil <sage@newdream.net>
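A sketch of such a helper (the padding term is illustrative; the point is a single audited size computation, with a BUG_ON if the encrypted result ever exceeds it):

    static int ceph_x_encrypt_buflen(int ilen)
    {
            /* header + payload + worst-case cipher padding + length word */
            return sizeof(struct ceph_x_encrypt_header) + ilen + 16 +
                   sizeof(u32);
    }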
-
Sage Weil authored
We were failing to reconnect to services due to an old authenticator, even though we had the new ticket: we weren't properly retrying the connect handshake because we were calling an old/incorrect helper that left in_base_pos wrong. The result was a failure to reconnect to the OSD or MDS (with an authentication error) if the MDS restarted after the service had been up a few hours (long enough for the original authenticator to become invalid). This was only a problem when AUTH_X authentication was enabled. Now that the 'negotiate' and 'connect' stages are fully separated, use the prepare_read_connect() helper instead, and remove the obsolete one. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
When an inode was dropped while being migrated between two MDSs, i_cap_exporting_issued was non-zero, such that the issued caps were non-zero and __ceph_is_any_caps(ci) was true. This prevented the inode from being removed from the snap realm, even as it was dropped from the cache. Fix this by dropping any residual i_snap_realm ref in destroy_inode. Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be protected by realm->inodes_with_caps_lock. This bug would only have bitten us in a rare race with a realm split (during some snap creations). Signed-off-by:
Sage Weil <sage@newdream.net>
-
Sage Weil authored
Added an assertion, and cleared one case where the implemented caps were not following the issued caps. Signed-off-by:
Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by:
Sage Weil <sage@newdream.net>
-
- Mar 18, 2010
-
-
Chris Mason authored
This is used by the inode lookup ioctl to follow all the backrefs up to the subvol root. But the search being done would sometimes land one past the last item in the leaf instead of finding the backref. This changes the search to look for the highest possible backref and hop back one item. It also fixes a leaked path on failure to find the root. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
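Roughly, the fixed search looks like this (standard btrfs key/search idioms; ioctl plumbing and error handling are elided):

    key.objectid = ino;
    key.type = BTRFS_INODE_REF_KEY;
    key.offset = (u64)-1;             /* highest possible backref */

    ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
    if (ret < 0)
            goto out;                 /* 'out' releases the path */
    BUG_ON(ret == 0);                 /* offset (u64)-1 never matches */

    path->slots[0]--;                 /* hop back one item to the backref */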
-
Chris Mason authored
When a root id of 0 is sent to the inode lookup ioctl, it will use the root of the file we're ioctling and pass the root id back to userland along with the results. This allows userland to do searches based on that root later on. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
The search ioctl was skipping large items entirely (ones that are too big for the results buffer). This changes things to at least copy the item header so that we can send information about the item back to userland. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
The search ioctl was working well for finding tree roots, but using it for generic searches requires a few changes to how the keys are advanced. This treats the search control min fields for objectid, type and offset more like a key, where we drop the offset to zero once we bump the type, etc. The downside of this is that we are changing the min_type and min_offset fields during the search, and so the ioctl caller needs extra checks to make sure the keys in the result are the ones it wanted. This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make things more readable. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
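The key-advance logic reads roughly like this (mirroring the style of fs/btrfs/ioctl.c; the checks against the search-key maximums are trimmed):

    static int advance_key(struct btrfs_key *key)
    {
            if (key->offset < (u64)-1) {
                    key->offset++;
            } else if (key->type < (u8)-1) {
                    key->offset = 0;        /* drop offset on a type bump */
                    key->type++;
            } else if (key->objectid < (u64)-1) {
                    key->offset = 0;
                    key->type = 0;
                    key->objectid++;
            } else {
                    return 1;               /* key space exhausted */
            }
            return 0;
    }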
-
Akinobu Mita authored
Use bitmap_weight() instead of doing hweight32() for each u32 element in the page. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: Anton Altaparmakov <aia21@cantab.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
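The change is essentially this (a sketch; kaddr is assumed to point at an already-mapped page, and nr_used is a stand-in accumulator):

    unsigned int i, nr_used = 0;

    /* before: popcount each u32 by hand */
    for (i = 0; i < PAGE_SIZE / sizeof(u32); i++)
            nr_used += hweight32(kaddr[i]);

    /* after: one bitmap_weight() call over the whole page */
    nr_used = bitmap_weight((unsigned long *)kaddr,
                            PAGE_SIZE * BITS_PER_BYTE);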
-
Venkatesh Pallipadi authored
jffs2 uses rb_node = NULL; to zero rb_root. The problem with this is that 17d9ddc7 ("rbtree: Add support for augmented rbtrees") in the linux-next tree adds a new field to that struct which needs to be NULL as well. This patch uses RB_ROOT as the initializer so all of the relevant fields will be NULL'd. Signed-off-by:
Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Eric Paris <eparis@redhat.com> Acked-by:
David Woodhouse <dwmw2@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
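The difference, concretely:

    struct rb_root root;

    root.rb_node = NULL;   /* old: zeroes just the one field */
    root = RB_ROOT;        /* new: a full struct initializer, so any
                              fields added later (e.g. for augmented
                              rbtrees) are zeroed as well */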
-
- Mar 16, 2010
-
-
Dave Chinner authored
If we are doing a forced shutdown, we can get lots of noise about delalloc pages being discarded. This happens by design during a forced shutdown, so don't spam the logs with these messages. Signed-off-by:
Dave Chinner <dchinner@redhat.com> Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Alex Elder authored
Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: 95f8e302 From: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:43:09 +1100 Implement XFS's large buffer support with the new vmap APIs. See the vmap rewrite (db64fe02) for some numbers. The biggest improvement that comes from using the new APIs is avoiding the global KVA allocation lock on every call. Signed-off-by:
Nick Piggin <npiggin@suse.de> Reviewed-by:
Christoph Hellwig <hch@infradead.org> Signed-off-by:
Lachlan McIlroy <lachlan@sgi.com> The only modifications here were a minor reformat, plus making the patch apply given the new use of xfs_buf_is_vmapped(). Modified-by:
Alex Elder <aelder@sgi.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Alex Elder authored
Re-apply a commit that had been reverted due to regressions that have since been fixed. Original commit: d2859751 Author: Nick Piggin <npiggin@suse.de> Date: Tue, 6 Jan 2009 14:40:44 +1100 XFS's vmap batching simply defers a number (up to 64) of vunmaps, and keeps track of them in a list. To purge the batch, it just goes through the list and calls vunmap on each one. This is pretty poor: a global TLB flush is generally still performed on each vunmap, with the most expensive parts of the operation being the broadcast IPIs and locking involved in the SMP callouts, and the locking involved in the vmap management -- none of these are avoided by just batching up the calls. I'm actually surprised it ever made much difference. (Now that the lazy vmap allocator is upstream, this description is not quite right, but the vunmap batching still doesn't seem to do much). Rip all this logic out of XFS completely. I will improve vmap performance and scalability directly in a subsequent patch. Signed-off-by:
Nick Piggin <npiggin@suse.de> Reviewed-by:
Christoph Hellwig <hch@infradead.org> Signed-off-by:
Lachlan McIlroy <lachlan@sgi.com> The only change I made was to use the "new" xfs_buf_is_vmapped() function in a place it had been open-coded in the original. Modified-by:
Alex Elder <aelder@sgi.com> Signed-off-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Alex Elder <aelder@sgi.com>
-
Chris Mason authored
The space_info ioctl was using copy_to_user inside rcu_read_lock. This commit changes things to copy into a buffer first and then dump the result down to userland. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
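The shape of the fix (a sketch; the names approximate fs/btrfs/ioctl.c and error handling is trimmed). copy_to_user() may fault and sleep, which is not allowed inside an RCU read-side section, so gather into a kernel buffer first:

    rcu_read_lock();
    list_for_each_entry_rcu(info, &fs_info->space_info, list) {
            dest->flags = info->flags;            /* fill kernel buffer */
            dest->total_bytes = info->total_bytes;
            dest->used_bytes = info->bytes_used;
            dest++;
    }
    rcu_read_unlock();

    /* now safe to fault while copying to userland */
    if (copy_to_user(user_dest, buf, alloc_size))
            ret = -EFAULT;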
-
Sage Weil authored
Signed-off-by:
Sage Weil <sage@newdream.net> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Sage Weil authored
key->type is u8, not u64.
fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type
Signed-off-by:
Sage Weil <sage@newdream.net> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
- Mar 15, 2010
-
-
NeilBrown authored
bdi_unregister is called by nfs_put_super which is only called by generic_shutdown_super if ->s_root is not NULL. So if we error out in a circumstance where we called nfs_bdi_register (i.e. server != NULL) but have not set s_root, then we need to call bdi_unregister explicitly in nfs_get_sb and various other *_get_sb() functions. Signed-off-by:
NeilBrown <neilb@suse.de> Signed-off-by:
Trond Myklebust <Trond.Myklebust@netapp.com>
-
Dan Carpenter authored
I fixed the indent level. Signed-off-by:
Dan Carpenter <error27@gmail.com> Signed-off-by:
Steve French <sfrench@us.ibm.com>
-
Nicholas Piggin authored
__GFP_FS must be masked out; GFP_NOFS can't simply be or'd in. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
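In other words (mapping_gfp_mask() is just one common source of a gfp mask here):

    gfp_t ok  = mapping_gfp_mask(mapping) & ~__GFP_FS;   /* correct */

    /* wrong: GFP_NOFS is GFP_KERNEL *without* __GFP_FS, so or'ing it
     * in can never clear a __GFP_FS bit that is already set */
    gfp_t bad = mapping_gfp_mask(mapping) | GFP_NOFS;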
-
Chris Mason authored
After calling submit_bio, the bio can be freed at any time. The btrfs submission thread helper was checking the bio flags too late, which might not give the correct answer. When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen. Signed-off-by:
Chris Mason <chris.mason@oracle.com>
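The fix pattern is to sample whatever the submission thread needs from the bio before submitting it, e.g. (a sketch, using the submit_bio() signature of that era):

    /* after submit_bio() the bio may complete and be freed at any
     * time, so save the flags we'll look at into a local first */
    unsigned long rw = bio->bi_rw;

    submit_bio(rw, bio);

    /* from here on, consult the saved 'rw', never 'bio' itself */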
-
Xiao Guangrong authored
We can use btrfs_stack_device_id() to get dev_item->devid. Signed-off-by:
Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Akinobu Mita authored
Use memparse() instead of its own private implementation. Signed-off-by:
Akinobu Mita <akinobu.mita@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
df is a very loaded question in btrfs. This gives us a way to get the per-space usage information so we can tell exactly what is in use where. This will help us figure out ENOSPC problems, and help users better understand where their disk space is going. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
This patch just goes through and fixes everybody that does lock_extent() blah unlock_extent() to use lock_extent_bits() blah unlock_extent_cached(), and pass around an extent_state so we only have to do the searches once per function. This gives me about a 3 mb/s boost on my random write test. I have not converted some things, like the relocation and ioctls, since they aren't heavily used and the relocation stuff is in the middle of being re-written. I also changed clear_extent_bit() to only unset the cached state if we are clearing EXTENT_LOCKED and related stuff, so we can do things like

  lock_extent_bits()
  clear delalloc bits
  unlock_extent_cached()

without losing our cached state. I tested this thoroughly and turned on LEAK_DEBUG to make sure we weren't leaking extent states; everything worked out fine. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
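The converted pattern looks like this (a sketch; the signatures approximate the 2010-era extent_io API):

    struct extent_state *cached_state = NULL;

    lock_extent_bits(tree, start, end, 0, &cached_state, GFP_NOFS);

    /* e.g. clear delalloc bits while still holding the range lock */
    clear_extent_bit(tree, start, end, EXTENT_DELALLOC,
                     0, 0, &cached_state, GFP_NOFS);

    /* cached_state survives because EXTENT_LOCKED was not cleared */
    unlock_extent_cached(tree, start, end, &cached_state, GFP_NOFS);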
-
Josef Bacik authored
When finishing io we run btrfs_dec_test_ordered_pending, and then immediately run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that already, so we're searching twice when we don't have to. This patch lets us pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do complete io on that ordered extent we can just use the one we found then instead of having to do another btrfs_lookup_ordered_extent. This made my fio job with the other patch go from 24 mb/s to 29 mb/s. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
This patch makes us cache the extent state we find in find_delalloc_range, since we'll have to lock the extent later on in the function. This keeps us from re-searching for the range when we try to lock the extent. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
The ordered tree used to need a mutex, but currently all we use it for is to protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have less latency, 3445.138 ms instead of 3820.633 ms. Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Chris Mason authored
The endio is done in reverse order of the bio vectors. That means for a sequential read, the page submitted first will finish last in a bio. Considering we will do checksumming (making the cache hot) for every page, this introduces delay (and a chance to squeeze out cache that will be used soon) for pages submitted at the beginning. I don't observe an obvious performance difference with the patch below in my simple test, but it seems more natural to finish reads in the order they are submitted. Signed-off-by:
Shaohua Li <shaohua.li@intel.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Miao Xie authored
btrfs_mkdir() must jump to the place that ends the transaction after btrfs_find_free_objectid() fails; otherwise the transaction is never ended. Signed-off-by:
Miao Xie <miaox@cn.fujitsu.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
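In sketch form (label names are illustrative; the fix is simply retargeting the error goto so the transaction is ended):

    trans = btrfs_start_transaction(root, 1);

    err = btrfs_find_free_objectid(trans, root, dir->i_ino, &objectid);
    if (err)
            goto out_fail;   /* was a label that skipped
                                btrfs_end_transaction() */
    /* ... */
out_fail:
    btrfs_end_transaction(trans, root);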
-
Sage Weil authored
Flush any delalloc extents when we create a snapshot, so that recently written file data is always included in the snapshot. A later commit will add the ability to snapshot without the flush, but most people expect flushing. Signed-off-by:
Sage Weil <sage@newdream.net> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-
Josef Bacik authored
The way we report df usage is way confusing for everybody, including some other utilities (bacula for one). So this patch makes df a little bit more understandable. First we make used actually count the total amount of used space in all space infos. This will give us a real view of how much disk space is in use. Second, for blocks available, only count data space. This makes things like bacula work because it says 0 when you can no longer write any more data to the disk. I think this is a nice compromise, since you will end up with something like the following:

[root@alpha ~]# df -h
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root   148G   30G  111G  21% /
/dev/sda1                      194M  116M   68M  64% /boot
tmpfs                          985M   12K  985M   1% /dev/shm
/dev/mapper/VolGroup-LogVol02  145G  140G     0 100% /mnt/btrfs-test

Compare this with btrfsctl -i output:

[root@alpha btrfs-progs-unstable]# ./btrfsctl -i /mnt/btrfs-test/
Metadata, DUP: total=4.62GB, used=2.46GB
System, DUP: total=8.00MB, used=24.00KB
Data: total=134.80GB, used=134.80GB
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00
operation complete

This way we show that there is no more data space to be used, but we have another 5GB of space left for metadata. Thanks, Signed-off-by:
Josef Bacik <josef@redhat.com> Signed-off-by:
Chris Mason <chris.mason@oracle.com>
-