- Apr 05, 2014
-
-
Ilya Dryomov authored
Similar to osd weights, output primary affinity values on incremental osdmap updates. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com>
-
Ilya Dryomov authored
Reimplement ceph_calc_pg_primary() in terms of ceph_calc_pg_acting() and get rid of the now unused calc_pg_raw(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Respond to non-default primary_affinity values accordingly. (Primary affinity allows the admin to shift 'primary responsibility' away from specific osds, effectively shifting around the read side of the workload and whatever overhead is incurred by peering and writes by virtue of being the primary). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Change apply_temp() to override primary in the same way pg_temp overrides osd set. primary_temp overrides pg_temp primary too. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
In preparation for adding support for primary_temp, stop assuming primaryness: add a primary out parameter to ceph_calc_pg_acting() and change call sites accordingly. Primary is now specified separately from the order of osds in the set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Switch ceph_calc_pg_acting() to new helpers: pg_to_raw_osds(), raw_to_up_osds() and apply_temps(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
apply_temp() helper for applying various temporary mappings (at this point only pg_temp mappings) to the up set, therefore transforming it into an acting set. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
pg_to_raw_osds() helper for computing a raw (crush) set, which can contain non-existant and down osds. raw_to_up_osds() helper for pruning non-existant and down osds from the raw set, therefore transforming it into an up set, and determining up primary. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add two helpers to decode primary_affinity (full map, vector<u32>) and new_primary_affinity (inc map, map<u32, u32>) and switch to them. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add primary_affinity infrastructure. primary_affinity values are stored in an max_osd-sized array, hanging off ceph_osdmap, similar to a osd_weight array. Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add a common helper to decode both primary_temp (full map, map<pg_t, u32>) and new_primary_temp (inc map, same) and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Add primary_temp mappings infrastructure. struct ceph_pg_mapping is overloaded, primary_temp mappings are stored in an rb-tree, rooted at ceph_osdmap, in a manner similar to pg_temp mappings. Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'primary_temp <pgid> <osd>' per line, e.g: primary_temp 2.6 4 Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
In preparation for adding support for primary_temp mappings, generalize struct ceph_pg_mapping so it can hold mappings other than pg_temp. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Full and incremental osdmaps are structured identically and have identical headers. Add a helper to decode both "old" (16-bit version, v6) and "new" (8-bit struct_v+struct_compat+struct_len, v7) osdmap enconding headers and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Consolidate pg_temp (full map, map<pg_t, vector<u32>>) and new_pg_temp (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Use krealloc() instead of rolling our own. (krealloc() with a NULL first argument acts as a kmalloc()). Properly initalize the new array elements. This is needed to make future additions to osdmap easier. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Consolidate pools (full map, map<u64, pg_pool_t>) and new_pools (inc map, same) decoding logic into a common helper and switch to it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
To be in line with all the other osdmap decode helpers. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Sum up sizeof(...) results instead of (incorrectly) hard-coding the number of bytes, expressed in ints and longs. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Only version 6 of osdmap encoding is supported, anything other than version 6 results in an error and halts the decoding process. Checking if version is >= 5 is therefore bogus. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Follow osdmap_decode() and fix this by adding a special e_inval label to be used by all ceph_decode_* macros. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
The size of the memory area feeded to crush_decode() should be limited not only by osdmap end, but also by the crush map length. Also, drop unnecessary dout() (dout() in crush_decode() conveys the same info) and step past crush map only if it is decoded successfully. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Check length of osd_state, osd_weight and osd_addr arrays. They should all have exactly max_osd elements after the call to osdmap_set_max_osd(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
max_osd value is not covered by any ceph_decode_need(). Use a safe version of ceph_decode_* macro to decode it. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
The existing error handling scheme requires resetting err to -EINVAL prior to calling any ceph_decode_* macro. This is ugly and fragile, and there already are a few places where we would return 0 on error, due to a missing reset. Fix this by adding a special e_inval label to be used by all ceph_decode_* macros. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Split osdmap allocation and initialization into a separate function, ceph_osdmap_decode(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Dump osdmap in hex on both full and incremental decode errors, to make it easier to match the contents with error offset. dout() map epoch and max_osd value on success. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g: pg_temp 2.6 [2,3,4] Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
To save screen space in anticipation of more fields (e.g. primary affinity). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
To make it more readable and save screen space. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
This lets you adjust the vary_r tunable on a per-rule basis. Reflects ceph.git commit f944ccc20aee60a7d8da7e405ec75ad1cd449fac. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection. Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts. Note that this was done from the get-go for the new crush_choose_indep algorithm. This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD. Reflects ceph.git commit a8e6c9fbf88bad056dd05d3eb790e98a5e43451a. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
These two fields are misnomers; they are *retry* counts. Reflects ceph.git commit f17caba8ae0cad7b6f8f35e53e5f73b444696835. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Josh Durgin <josh.durgin@inktank.com>
-
Ilya Dryomov authored
Back in 27f4d1f6bc32c2ed7b2c5080cbd58b14df622607 we refactored the CRUSH code to allow adjustment of the retry counts on a per-pool basis. That commit had an off-by-one bug: the previous "tries" counter was a *retry* count, not a *try* count, but the new code was passing in 1 meaning there should be no retries. Fix the ftotal vs tries comparison to use < instead of <= to fix the problem. Note that the original code used <= here, which means the global "choose_total_tries" tunable is actually counting retries. Compensate for that by adding 1 in crush_do_rule when we pull the tunable into the local variable. This was noticed looking at output from a user provided osdmap. Unfortunately the map doesn't illustrate the change in mapping behavior and I haven't managed to construct one yet that does. Inspection of the crush debug output now aligns with prior versions, though. Reflects ceph.git commit 795704fd615f0b008dcc81aa088a859b2d075138. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Josh Durgin <josh.durgin@inktank.com>
-
Yan, Zheng authored
When there is no more data, ceph_msg_data_{pages,pagelist}_advance() should not move on to the next page. Signed-off-by:
Yan, Zheng <zheng.z.yan@intel.com>
-
- Apr 03, 2014
-
-
Ilya Dryomov authored
This is primarily for rbd's benefit and is supposed to combat fragmentation: "... knowing that rbd images have a 4m size, librbd can pass a hint that will let the osd do the xfs allocation size ioctl on new files so that they are allocated in 1m or 4m chunks. We've seen cases where users with rbd workloads have very high levels of fragmentation in xfs and this would mitigate that and probably have a pretty nice performance benefit." SETALLOCHINT is considered advisory, so our backwards compatibility mechanism here is to set FAILOK flag for all SETALLOCHINT ops. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Sage Weil <sage@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
Encode ceph_osd_op::flags field so that it gets sent over the wire. Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Sage Weil <sage@inktank.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Ilya Dryomov authored
With the addition of erasure coding support in the future, scratch variable-length array in crush_do_rule_ary() is going to grow to at least 200 bytes on average, on top of another 128 bytes consumed by rawosd/osd arrays in the call chain. Replace it with a buffer inside struct osdmap and a mutex. This shouldn't result in any contention, because all osd requests were already serialized by request_mutex at that point; the only unlocked caller was ceph_ioctl_get_dataloc(). Signed-off-by:
Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by:
Sage Weil <sage@inktank.com>
-
- Mar 28, 2014
-
-
Vlad Yasevich authored
Some drivers incorrectly assign vlan acceleration features to vlan_features thus causing issues for Q-in-Q vlan configurations. Warn the user of such cases. Signed-off-by:
Vlad Yasevich <vyasevic@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Vlad Yasevich authored
When the vlan filtering is enabled on the bridge, but the filter is not configured on the bridge device itself, running tcpdump on the bridge device will result in a an Oops with NULL pointer dereference. The reason is that br_pass_frame_up() will bypass the vlan check because promisc flag is set. It will then try to get the table pointer and process the packet based on the table. Since the table pointer is NULL, we oops. Catch this special condition in br_handle_vlan(). Reported-by:
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> CC: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by:
Vlad Yasevich <vyasevic@redhat.com> Acked-by:
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by:
David S. Miller <davem@davemloft.net>
-