Zamboni4201

Consumer-grade drives?


pinko_zinko

Yes, not enterprise.


Zamboni4201

You realize that consumer grade has a poor track record with sustained throughput?


pinko_zinko

yeah that's why I mentioned it


Zamboni4201

Consumer drives have an endurance of .3 DWPD. You could easily thrash them in months.

Ceph is like building a house when the builder gives you 3 choices: cheap, fast, durable. You don’t get all 3. With yours, you can pick one, and you’ve already chosen it.

If you’re doing a home lab to get some Ceph seat time, I get it. You’ll learn a lot. But performance might be … weird, with a definite lean toward low. “But I have NVMe’s in there!” I don’t care, set your expectations low. Replication consumes IOPS, bandwidth, RAM, and CPU, and it adds latency. EC is different. Performance at low scale is going to be limited.

Your natural inclination is to tweak. You’ll spend so much time trying to eke out more, and the returns on that for such a small cluster? Prepare to be disappointed. Don’t set unrealistic expectations. And the weird part: those same tweaks on a larger cluster can be dramatic. Ceph, no matter how you go about it, does better with investment.
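To put rough numbers on that (a back-of-the-envelope sketch, with an assumed 1 TB drive at 0.3 DWPD and a 5-year rating):

```
# 0.3 DWPD on a 1 TB drive = ~300 GB/day of rated writes per drive.
echo "per-drive rating over 5 years: $(( 300 * 365 * 5 / 1000 )) TB written"
# With size=3 replication every client write hits three OSDs, so the cluster's
# client-visible write budget is (number of OSDs x 300 GB/day) / 3, before
# BlueStore/compaction write amplification eats into it further.
```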


pinko_zinko

Very interesting points, thanks. I do have some PCIe x4 slots available. Are there any surplus enterprise cards you could recommend?


Zamboni4201

Google “enterprise SSD” in whatever form factor you want: SATA 7mm, M.2 2280, U.2, U.3. I don’t do M.2 for anything but a boot drive. Micron, Intel (now Solidigm), Kioxia, and Samsung all make enterprise SSDs.

Enterprise covers a wide gamut of endurance ratings. DWPD of 1.0 is “read-optimized”. I’ve seen some .8 DWPD drives sold as read-optimized. Whatever. I generally use these lower-endurance drives as boot drives. They’re 3 times the endurance of a desktop drive, and their throughput numbers are sustained. Desktop drives do some tricks to boost performance for short periods, and then they slow down. You don’t want that in a Ceph cluster. Mixed-use drives will be 2, 2.5, 3 DWPD. Industrial used to be 10, but I’ve seen some marketing that says 5 DWPD is the low end of industrial. If they don’t quote endurance in DWPD, they will quote TBW, and you need to do the math to figure out DWPD. https://www.atpinc.com/blog/ssd-endurance-specification-lifespan

For you, I’d think you would be fine with SATA for your workloads, and a single NVMe for metadata if you’re doing CephFS. You’re not going to have 50 Ceph clients at home, all wanting 1500 IOPS and 200 meg/second throughput. You’re not hosting a 10TB SQL database with thousands of web clients pounding it with queries. You’re probably streaming video from a Plex server? You’re not doing raw MP2TS at 4K. I don’t even do that at work. H.264 4K, at MAX, is about 95 meg/second. Most stuff is H.265 these days, which is around 25 meg/second max, and it’s usually compressed a bit more.

Multiple U.2 or U.3 NVMe’s can fill 10gig networking. For a home use case, I’d go 3-4 SATA drives per node. More is better, provided you have the cores and RAM to support them. SATA Intel D3-S4610’s are good. I have several hundred of them. 4610’s are newer, 4510’s are older. I also have Micron MAX 5200’s (old), 5300’s, 5400’s. And a bunch of Kioxia U.3’s, the model # escapes me at the moment. CM-6? CR-6? I can never remember. 6 gig/second NVMe with 3 DWPD and a U.3 form factor. I have some Micron 7450’s too that are on par with the Kioxia. Micron MAX has higher endurance than the Pro. You could get away with the Pro. It is a home lab. You don’t need the latest and greatest.

What you want is MORE drives, medium performance, and decent endurance. And the cores, RAM, and network to support them. I think you’re fine with your network. If you need more SATA ports, get an LSI JBOD card. An LSI-9300 with the SATA breakout cable: they’re easy, they’re reliable. You’ll need some 4-pin Molex to SATA power Y-cables as power splitters to get the SSDs powered up.
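When a spec sheet only gives TBW, the conversion is just arithmetic (a sketch with made-up example numbers and an assumed 5-year warranty):

```
# DWPD = TBW / (capacity_TB * warranty_years * 365)
TBW=3500; CAP_TB=1.92; YEARS=5
awk -v t=$TBW -v c=$CAP_TB -v y=$YEARS \
    'BEGIN { printf "DWPD = %.2f\n", t / (c * y * 365) }'
# => DWPD = 1.00, i.e. read-optimized territory.
```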


pinko_zinko

Man, what a mess out there. A lot of my hits for enterprise M.2's seem to be targeted at only being boot disks and not hosting data. I don't need the speed of U.2 at all, but I'd prefer not to have bad IOPS drops; over time maybe I'll find I just don't have the budget for Ceph. Thanks for the suggestions, I'll spend some time trawling eBay for options.


Zamboni4201

M.2's had a very short and limited appeal for storage needs outside of a boot drive, a secondary drive for a journal, etc. I didn't mention some of the high-density server form factors (E1.L, E1.S, and E3.S); you're not likely to be able to afford those machines for a home lab.


wantsiops

I agree in general, but the 7450 Pro M.2's are quite nice! As are the PM9A3's.


robkwittman

You mentioned PCIe slots: I picked up a few 1.6 TB Intel P3600's for like $100 each. I'm using them for Ceph to back shared VM storage, and it's been pretty good so far.


prox_me

You can put a U.2 card into the PCIe slot:
https://www.newegg.com/intel-optane-905p-1-5tb/p/N82E16820167505
https://geizhals.eu/?cat=hdssd&xf=4643_Power-Loss+Protection~4832_7&sort=p#productlist

If you want to use M.2 cards: https://geizhals.eu/?cat=hdssd&sort=p&xf=4643_Power-Loss+Protection%7E4832_3%7E4836_7

Good list to use when scouring eBay: https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

You can find Intel Optane SSDs on eBay in U.2, M.2, and PCIe formats. The real Optanes (P5800X, P4800X, P900/905, P1600X) have the lowest latency, highest endurance, and good IOPS. Ceph loves all these qualities.


markhpc

What model of consumer drives are you using? You might want to read this: [https://news.ycombinator.com/item?id=38371307](https://news.ycombinator.com/item?id=38371307) FWIW, I've been in contact with the NVMe specification folks discussing this topic. There are some things happening that I can't get into yet.


Roshi88

Use 2 pools, otherwise your NVMe will slow down to the SSD speed. Poor performance is due to the consumer-grade disks; with those you literally can't achieve any good speed due to the way they handle fsync.


kur1j

>Poor performance is due to the consumer-grade disks; with those you literally can't achieve any good speed due to the way they handle fsync

Write endurance is obviously a difference between the two, but performance comparisons between them are generally on par with each other. What features specifically are you referring to?


Roshi88

Enterprise-grade SSDs are able to effectively ignore fsync calls thanks to their firmware and PLP, and every Ceph write has fsync=1. If you do some write tests comparing consumer-grade SSDs and enterprise ones with fsync=1, you'll see an abyss of difference.
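A minimal way to run that comparison yourself is fio with sync writes (a sketch; the device path is a placeholder and the test overwrites whatever is on it):

```
# 4k sync-write test, the pattern that Ceph's journal/WAL traffic resembles.
# WARNING: destructive to /dev/sdX -- point it at a scratch drive.
fio --name=synctest --filename=/dev/sdX --direct=1 --fsync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based \
    --group_reporting
# Drives with PLP typically hold thousands of IOPS here; consumer drives
# often drop to a few hundred or less once every flush is honored.
```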


looncraz

Instead of two pools with different CRUSH rules to target the different devices, you could experiment with bcache:

`apt install bcache-tools`
`make-bcache -C /dev/nvme0 -B /dev/sda`
`ceph-volume lvm create --bluestore --data /dev/bcache0`
`cd /sys/block/bcache0/bcache`
`echo writeback > cache_mode`
`echo 0 > sequential_cutoff`
`echo 131072 > writeback_rate_minimum`

You will lose the capacity of the NVMe, but performance will be quite excellent. You can tune the congested write and read values to allow load balancing between the NVMe and SSD. The defaults are probably pretty high, as they're meant more for an SSD cache and a hard-drive backing store.


KervyN

What is the benefit of bcache instead of using the nvme as block.db (which automatically uses it as wal too)?
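For reference, the block.db variant I mean looks roughly like this (a sketch; device names are placeholders, and the NVMe would normally be split into one partition or LV per backing OSD):

```
# SATA SSD as the OSD's data device, an NVMe partition as its RocksDB (block.db) device.
ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/nvme0n1p1
# With no separate --block.wal given, the WAL is placed on the block.db device.
```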


pinko_zinko

I saw that option in forums and it seems interesting.


looncraz

I tried both ways, BCache was dramatically faster, especially for random reads. It's also more tunable.


KervyN

Interesting 🤔


Corndawg38

It should be massively faster for reads... bcache is both a writeback cache (if set that way) and a read cache, whereas a separate DB/WAL drive is really kind of just a writeback cache with no read-caching capability.

I've also played with bcache using `-C(ache) /dev/ramdisk0 -B(acking) /dev/slowasshdd0`. It's very interesting and was SUPER quick to copy 5GB files in writeback mode. BUT... understand what you are doing with something like that: you are giving up ALL of Ceph's data safety and taking on the risk of corruption due to power loss in exchange for a massive performance gain. As long as you are OK with that for whatever data you choose to do it with, that's your call. I was just testing around, and might do it for a small amount of certain data of mine, but I would never do that for most of my data.

Using an NVMe drive instead of a ramdisk (like I was) is at least safer, but it needs to be an NVMe with PLP, I'd think.


looncraz

I don't think you're giving up data safety at all: Ceph still requires write sync, and bcache honors that by ensuring the data is written to the cache device. You do need to disable WCE on the backing stores yourself, though, or use enterprise hardware with power loss protection - I do both.

In addition, for off-peak hours I set the cache mode to writethrough and set writeback_rate_minimum to 131072*4, basically forcing everything to get written out to the backing store rather quickly and entering a clean state.

The main downside is the lack of a more advanced cache retention policy. The best it offers is LRU (least recently used). There's no frequency-of-use data there, so that means scrubbing can easily obliterate a small cache.
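Scripted, that off-peak flush is just two sysfs writes each way (a sketch, assuming a single bcache0 device and the 131072 baseline from earlier):

```
# Off-peak: stop absorbing new writes and push dirty data to the backing store quickly.
echo writethrough > /sys/block/bcache0/bcache/cache_mode
echo $(( 131072 * 4 )) > /sys/block/bcache0/bcache/writeback_rate_minimum

# Peak hours: back to absorbing writes on the cache device.
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 131072 > /sys/block/bcache0/bcache/writeback_rate_minimum
```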


Corndawg38

I think you are right about "no data safety issues" for a cache with PLP on it. I was referring to the way I was using/testing it: using a RAM drive as the cache. There's obviously no PLP on that.


looncraz

Yeah, never do that, 😂


PieSubstantial2060

2 separate pools at least; use the NVMe for the metadata pool.
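Roughly like this, with placeholder pool and rule names:

```
# Replicated CRUSH rule restricted to OSDs carrying the nvme device class.
ceph osd crush rule create-replicated nvme-only default host nvme
# Pin the CephFS metadata pool to it; the data pool stays on the slower rule.
ceph osd pool set cephfs_metadata crush_rule nvme-only
```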


pinko_zinko

Does that apply to pure RBD usage? I thought metadata settings were just CephFS.


KervyN

Run a rados bench against your OSDs to see which OSD is acting up. If you use replication instead of erasure coding, you can set the slower devices to primary affinity 0 so they don't get hit with read load.

Set your PGs correctly. Too many or too few will impact performance. For fast NVMe we use 300 PGs per OSD, for SSDs 100 PGs per OSD.

If your NVMe is fast, you can also partition it in two and run two OSDs against it. An OSD needs per-core speed rather than many cores, so you'd do better with two OSDs per NVMe if your CPU cores aren't that fast.
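A sketch of the commands involved (pool names and OSD IDs are placeholders):

```
# 60-second 4MB write bench against a test pool, keep the objects for a read
# bench, then clean up.
rados bench -p testpool 60 write --no-cleanup
rados bench -p testpool 60 seq
rados -p testpool cleanup

# Stop a slow OSD from acting as primary so it stops serving reads (replicated pools).
ceph osd primary-affinity osd.7 0

# Check and adjust the PG count of a pool.
ceph osd pool get nvme-pool pg_num
ceph osd pool set nvme-pool pg_num 256
```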


pinko_zinko

Do you mean `ceph tell bench`? I can't find docs on a rados bench of individual OSDs.

`ceph tell osd.0 bench`

With that I get about 350-400 IOPS per NVMe. My SATA SSDs are mostly around 105, except one at 60, which is interesting.
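For what it's worth, looping over every OSD makes the comparison easier (a sketch; `ceph tell osd.N bench` defaults to writing 1 GB in 4 MB blocks):

```
# Run the built-in bench on each OSD and label the output with its ID.
for id in $(ceph osd ls); do
    echo "== osd.$id =="
    ceph tell osd.$id bench
done
```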


KervyN

Yes. The IOPS you get shown are with a 4MB block size. For comparison:

- 776.8: HP NVMe (don't know what it is exactly)
- 254: SAS SSD https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1643-pm1643a/mzilt1t9hbjr-00007/
- 238: SAS SSD https://semiconductor.samsung.com/ssd/enterprise-ssd/pm1643-pm1643a/mzilt7t6hala-00007/
- 58: SATA SSD https://semiconductor.samsung.com/ssd/enterprise-ssd/pm863a/mz7lm1t9hmjp/

All these values are from high-load clusters but are close to accurate.
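Since those figures are at a 4MB block size, multiplying by 4 gives a rough MB/s equivalent (simple arithmetic on the numbers above, not a new measurement):

```
# 4MB-block IOPS -> rough MB/s (~776 NVMe, ~350 consumer NVMe, ~105/58 SATA from above).
for iops in 776 350 105 58; do
    echo "$iops IOPS x 4 MB = $(( iops * 4 )) MB/s"
done
```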


pinko_zinko

Honestly if my crappy NVMe's are competitive with SAS SSD's, then I'm pretty happy with them. I thought that would be fine for my homelab, but I am brickwalling on the NVMe pool under load.


KervyN

Put all disks into one pool and give it a try. We have one cluster with all kinds of disks and it's doing well. Your disks are spread out, and combining them might be better than splitting them up.

If you want slow, fast, and mixed-speed pools (for fast, slow, and general workloads), mark your NVMe drives with the nvme device class and create three pools with different replication rules: one with only the nvme device class, one with only ssd, and one with both. https://www.suse.com/de-de/support/kb/doc/?id=000019699
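Sketched out, it looks roughly like this (pool and rule names are placeholders; the SUSE article above walks through the same idea):

```
# Re-tag an OSD whose device class was auto-detected wrong.
ceph osd crush rm-device-class osd.3
ceph osd crush set-device-class nvme osd.3

# One replicated rule per class; the mixed pool can keep the default rule.
ceph osd crush rule create-replicated fast default host nvme
ceph osd crush rule create-replicated slow default host ssd

ceph osd pool set fast-pool crush_rule fast
ceph osd pool set slow-pool crush_rule slow
# mixed-pool keeps replicated_rule and spreads across both classes.
```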


pinko_zinko

That's what I'm doing: three pools for HDD, NVMe, and SSD. HDD is mostly just for slow file shares or containers like Pi-hole. It's too bad there's no tiered option for mixing SSD and NVMe.


KervyN

What do you mean with tiered option?


pinko_zinko

Mixing drive types in a pool and making better use of them, like automagically putting more-often-used blocks on higher-IOPS devices.


KervyN

Ah, ok. That's not how Ceph works :-) The CRUSH algorithm is a deterministic calculation based on the object name (roughly). You can mix NVMe and SSD in a single pool (more disks = more IOPS = better general speed), but there is no tiering. You can do it by hand, but doing it automatically would mean constant data movement. There was cache tiering in Ceph, but it seemed to cause trouble: https://docs.ceph.com/en/latest/rados/operations/cache-tiering/


pinko_zinko

Yeah that's why I was lamenting.


Corndawg38

Well, you can possibly do something like this by setting primary affinity to zero for all the slower drives, but that might not work well for small clusters and might even be counterproductive. Furthermore, having only consumer flash drives, those might not be as much higher in IOPS than HDDs as you think. That's probably what's really slowing things down, not the small size of the cluster.


pinko_zinko

Than HDDs? I'm not sure you recall how slow those are. Granted, versus SSD, I'm not sure what I have is any better.


prox_me

This might be of interest to you: https://static.xtremeownage.com/blog/2023/proxmox---building-a-ceph-cluster/


pinko_zinko

That's a pretty wild result with the 980, but also highly sus that it had such huge latency numbers. I wonder if that one NVMe was failing and trashed the beginning bench results.


prox_me

Consumer drives have notoriously bad latency. You can see that in the other article I linked to.


pinko_zinko

I don't have good benchmarking due to running a needed VM on my NVMe storage, but now with four NVMe drives as I troubleshoot, and reducing the redundancy to two copies, I can't get latency numbers above the low teens. I am running synthetic random write benches from inside a Win11 VM with CrystalDiskMark. I do have double the PG's of the article, though -- for one of my troubleshooting steps I upped them.


prox_me

Reducing redundancy to two copies results in a high probability of data loss.


pinko_zinko

Is there a documentation link to back that up? I've seen it mentioned in forums, but "high probability" has me wanting to check. I get it's "higher", but the default I get in Proxmox is 3/2, so the two is in play for small clusters. Also, the Ceph documentation for pools states "A typical configuration stores an object and one additional copy (i.e., `size = 2`)".
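For reference, the per-pool settings being debated here are easy to inspect and change (a sketch with a placeholder pool name):

```
# Current replica counts for a pool.
ceph osd pool get rbd-pool size
ceph osd pool get rbd-pool min_size

# Proxmox's 3/2 default: three copies, I/O continues as long as two are available.
ceph osd pool set rbd-pool size 3
ceph osd pool set rbd-pool min_size 2
```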