1215 – Partitioning issues with 4k drives ("advanced format")

Summary: Partitioning issues with 4k drives ("advanced format")

Status:	RESOLVED FIXED

Alias:	None

Product:	Mageia
Classification:	Unclassified
Component:	Installer
Version:	Cauldron
Hardware:	All Linux

Priority:	Normal
Severity:	major
Target Milestone:	Mageia 3
Assignee:	Pascal Terjan
QA Contact:

URL:	https://ata.wiki.kernel.org/index.php...
Whiteboard:	(Mga2)
Keywords:	NEEDINFO

Duplicates (1):	2037
Depends on:
Blocks:	1994

Reported:	2011-05-08 18:10 CEST by Anssi Hannula
Modified:	2014-12-31 06:00 CET
CC List:	10 users

See Also:
Source RPM:	drakxtools
CVE:
Status comment:

Attachments
Quick fix to align partitions to start on 1MB boundaries (2.00 KB, patch) 2011-06-20 21:04 CEST, Anssi Hannula
WIP patch for non-512-byte logical sector sizes (34.10 KB, patch) 2011-06-20 21:11 CEST, Anssi Hannula

Description Anssi Hannula 2011-05-08 18:10:47 CEST

Partitions created by the installer/diskdrake are not aligned properly on 4K drives, causing very bad performance ( https://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues ).

This issue has been fixed in parted/fdisk, but not in diskdrake as it does all the partitioning itself. It is quite unfortunate as other major OSes and distros have handled this correctly for quite some time now.

The "quick fix" that makes most cases work and should be done in any case is to align partitions to 1MB boundaries by default.
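
As an illustration of the quick fix, here is a minimal Python sketch of rounding a proposed start sector up to the next 1 MiB boundary. It is not the diskdrake code (diskdrake is written in Perl); the helper name `align_up_to_mib` and the 512-byte default sector size are assumptions made for the example.

```python
MIB = 1024 * 1024

def align_up_to_mib(start_sector, sector_size=512):
    """Round a start sector up to the next 1 MiB boundary (hypothetical helper)."""
    sectors_per_mib = MIB // sector_size
    # Ceiling division, then scale back to a sector number.
    return -(-start_sector // sectors_per_mib) * sectors_per_mib

# Sector 63 is the classic cylinder-aligned start of the first partition.
print(align_up_to_mib(63))          # 2048
print(align_up_to_mib(2048))        # 2048, already aligned
print(align_up_to_mib(63, 4096))    # 256, on a drive with 4096-byte logical sectors
```

With 512-byte sectors, 1 MiB is 2048 sectors, so any start sector divisible by 2048 is also aligned for 4 KiB physical sectors.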

Additional lower-priority things to take into account are
- drives configured with a jumper that require off-by-one alignment of the
  partitions, and
- drives with logical non-512 sector sizes (mostly USB sticks as of now) where
  diskdrake currently creates partitions that are too large, as it assumes sector size 512.
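
To make the second point concrete, here is a small Python sketch of why a hard-coded 512-byte sector size produces oversized partitions. The function name `size_in_sectors` is hypothetical and the 1 GiB request is an arbitrary example value; none of this is taken from diskdrake itself.

```python
def size_in_sectors(size_bytes, logical_sector_size):
    """Convert a byte size into a sector count for the given logical sector size."""
    if size_bytes % logical_sector_size:
        raise ValueError("size is not a whole number of sectors")
    return size_bytes // logical_sector_size

requested = 1024 ** 3                          # user asks for 1 GiB
assumed_512 = size_in_sectors(requested, 512)  # count computed with the 512 assumption
correct = size_in_sectors(requested, 4096)     # count the 4096-byte drive actually needs

# The drive interprets the stored count in its own 4096-byte sectors,
# so the 512-based count describes a partition 8 times too large.
print(assumed_512 * 4096 // requested)         # 8
print(correct)                                 # 262144
```

On Linux the logical sector size of a device is normally exposed in /sys/block/<dev>/queue/logical_block_size, which is one place a tool could read it from instead of assuming 512.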

Related Mandriva bug reports:
https://qa.mandriva.com/show_bug.cgi?id=58071
https://qa.mandriva.com/show_bug.cgi?id=46774

Comment 1 Anssi Hannula 2011-06-20 21:04:25 CEST

Created attachment 594
Quick fix to align partitions to start on 1MB boundaries

Here's a small patch that changes diskdrake to align partitions to 1MB boundaries instead of involving head/cylinder boundaries. This is what other OSes (Linux/Windows) do, and fixes most cases where we don't align partitions properly.

Partition endings could probably also be optimized, but that is lower-priority because it doesn't affect performance like partition starting positions.

Pascal, could you maybe take a look in this one?

Comment 2 Anssi Hannula 2011-06-20 21:11:07 CEST

Created attachment 595
WIP patch for non-512-byte logical sector sizes

A full solution to all of the issues first requires us to handle non-512-byte logical sector sizes (some such USB drives apparently already exist, according to the Mandriva bug reports).

Unfortunately it is not as easy as it seems; the 512-byte sector assumptions exist everywhere drakx handles partitions, and that is a lot of code.

I began adding support for those, but I didn't get it done yet, and I don't think I'm going to continue it anytime soon, so attached is a WIP patch in case someone who wants to do the work finds it useful.

Thierry Vignaud 2011-07-29 06:37:00 CEST

Keywords: (none) => PATCH
CC: (none) => thierry.vignaud
Assignee: bugsquad => pterjan

Comment 3 Thierry Vignaud 2011-07-29 06:39:24 CEST

For the first patch: why don't you do the MB alignment in adjustStart()?
It would be cleaner and more readable.

Comment 4 Anssi Hannula 2011-07-29 14:07:33 CEST

Because raw::adjustStart() is overridden by gpt.pm, mac.pm, sun.pm.

Comment 5 Thierry Vignaud 2011-07-29 16:29:00 CEST

Then why not just create a helper rather than inlining it in adjustStartAndEnd()?

Comment 6 Thierry Vignaud 2011-08-01 14:51:56 CEST

Anssi, can you just commit your first patch?
That way we'll get basic testing in cauldron.

Comment 7 Thierry Vignaud 2011-08-01 15:11:12 CEST

*** Bug 2037 has been marked as a duplicate of this bug. ***

CC: (none) => magnus.mud

Comment 8 Thierry Vignaud 2011-08-04 19:13:49 CEST

*** Bug 2037 has been marked as a duplicate of this bug. ***

Comment 9 Anssi Hannula 2011-08-04 20:36:41 CEST

OK, I can make it a separate function then. I'll probably apply it in SVN tomorrow when I'm less tired.

Sorry for the delay.

AL13N 2011-08-04 21:49:58 CEST

CC: (none) => maarten.vanraes
Blocks: (none) => 1994

Comment 10 Anssi Hannula 2011-08-07 00:34:21 CEST

Done, r1844.

Comment 11 Thierry Vignaud 2011-10-04 17:38:26 CEST

About the second patch, it would be better to rename functions whose behavior changes (such as MB()) in order to catch every caller.
But it might just be better to revive the pixel/pterjan work on leveraging libparted (the main issue was parted behaving like fdisk on partition tables it doesn't parse well, showing a zeroed partition table).

Comment 12 Marja Van Waes 2012-01-08 13:55:21 CET

Pinging, because nothing has happened on this report for more than 3 months, and it still has the status NEW or REOPENED.


@ Pascal

Please set status to ASSIGNED if you think this bug was assigned correctly. If for work flow reasons you can't do that, then please put OK on the whiteboard instead.

CC: (none) => marja11

Comment 13 Manuel Hiebel 2012-03-08 16:25:56 CET

Hello, this one is resolved, no?

Comment 14 Manuel Hiebel 2012-04-25 22:33:26 CEST

Ping ?

Comment 15 Pascal Terjan 2012-05-17 16:29:40 CEST

(In reply to comment #11)
> But it might just be better to revive the pixel/pterjan work on leveraging
> libparted (the main issue was parted behaving like fdisk on partition tables it
> doesn't parse well, showing a zeroed partition table).

Not only

From what I remember, error handling in libparted is awful (you need to set an exception handler which will get called for any kind of error with a translated error string, and parse that string). The default handler displays the error on the console.

Comment 16 Marja Van Waes 2012-05-26 13:01:51 CEST

Hi,

This bug was filed against cauldron, but we do not have cauldron at the moment.

Please report whether this bug is still valid for Mageia 2.

Thanks :)

Cheers,
marja

Keywords: (none) => NEEDINFO

Comment 17 Marja Van Waes 2012-06-16 19:31:11 CEST

(In reply to comment #15)
> (In reply to comment #11)
> > But it might just be better to revive the pixel/pterjan work on leveraging
> > libparted (the main issue was parted behaving like fdisk on partition tables it
> > doesn't parse well, showing a zeroed partition table).
> 
> Not only
> 
> From what I remember, error handling in libparted is awful (you need to set an
> exception handler which will get called for any kind of error with a translated
> error string, and parse that string). Default handler displaying the error on
> the console.

That was on May 17th, so there is no chance this got fixed.

Keywords: NEEDINFO => (none)
Whiteboard: (none) => (Mga2)

Comment 18 Marja Van Waes 2012-07-06 15:03:16 CEST

Please look at the bottom of this mail to see whether you're the assignee of this bug, if you don't already know whether you are.


If you're the assignee:

We'd like to know for sure whether this bug was assigned correctly. Please change status to ASSIGNED if it is, or put OK on the whiteboard instead.

If you don't have a clue and don't see a way to find out, then please put NEEDHELP on the whiteboard.

Please assign back to Bug Squad or to the correct person to solve this bug if we were wrong to assign it to you, and explain why.

Thanks :)

**************************** 

@ the reporter and persons in the cc of this bug:

If you have any new information that wasn't given before (like this bug being valid for another version of Mageia, too, or it being solved) please tell us.

@ the reporter of this bug

If you didn't reply yet to a request for more information, please do so within two weeks from now.

Thanks all :-D

Comment 19 Richard Houser 2012-07-29 23:08:03 CEST

I just got a new SSD, and it is my understanding that the 25nm process chips like in the Vertex3 120GB line use a 2MB (rather than 1MB, like the quick-fix mentions) erase block size.  So, for me, I need alignment on 2MB boundaries.


LVM is already 4MB extents by default, so that's not an issue, but filesystems themselves also need to be tuned appropriately.

For ext4 (other modern filesystems work similarly), I need to use something like mkfs.ext4 -E stride=512 -E stripe-width=512 /dev/sdf1, but RAID users need similar inputs to avoid hitting unnecessary disks.  I don't see how this could all be easily calculated for the RAID case, so it might make more sense to allow setting these via an advanced tab or something.
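
For reference, the value 512 in that command follows from dividing the assumed 2MB erase block by the filesystem block size. The Python sketch below works through it, assuming a 4 KiB ext4 block size (not stated above) and using the mke2fs convention that the stripe width is the stride multiplied by the number of data-bearing disks. The function names are hypothetical.

```python
def ext4_stride(alignment_bytes, fs_block_size=4096):
    """Number of filesystem blocks per alignment unit (erase block or RAID chunk)."""
    if alignment_bytes % fs_block_size:
        raise ValueError("alignment is not a multiple of the filesystem block size")
    return alignment_bytes // fs_block_size

def ext4_stripe_width(stride, data_disks=1):
    """Stripe width in filesystem blocks: stride times the data-bearing disks."""
    return stride * data_disks

stride = ext4_stride(2 * 1024 * 1024)        # assumed 2 MiB erase block
width = ext4_stripe_width(stride)            # single SSD, so one "data disk"
print(f"mkfs.ext4 -E stride={stride} -E stripe-width={width} /dev/sdf1")

# Hypothetical RAID case: 512 KiB chunks on an array with 3 data-bearing disks.
raid_stride = ext4_stride(512 * 1024)
print(raid_stride, ext4_stripe_width(raid_stride, data_disks=3))
```

The RAID figures are made-up inputs; the point is only that both values can be derived once the chunk size and the number of data disks are known.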

CC: (none) => rick

Comment 20 Marja Van Waes 2012-08-16 07:46:55 CEST

Documentation team should write about this issue in the documentation about diskdrake (including in installer) if it didn't/doesn't get fixed.

Is there any news?

CC: (none) => doc-bugs
Target Milestone: --- => Mageia 3

Comment 21 Marja Van Waes 2012-12-14 11:06:43 CET

@ tmb

If this didn't get fixed yet, could gparted then be included on both LiveDVDs? (I didn't see it on the 3alpha3's)

Docteam started advising users who want to install on an SSD to use a tool like gparted for the partitioning part, and not all users will like a cli tool for that.

http://docteam.mageia.nl/installer/content/doPartitionDisks.html

Comment 22 Thierry Vignaud 2012-12-14 12:03:31 CET

BTW, I think this might break windows booting after resizing (two people reported to me windows failed to boot after resizing it for installing Mageia)

Comment 23 Marja Van Waes 2012-12-14 16:38:20 CET

(In reply to comment #22)
> BTW, I think this might break windows booting after resizing (two people
> reported to me windows failed to boot after resizing it for installing Mageia)

Increasing severity

Should priority be increased, too?

Whiteboard: (Mga2) => 3alpha3 (Mga2)

Marja Van Waes 2012-12-28 16:16:11 CET

Whiteboard: 3alpha3 (Mga2) => 3beta1 (Mga2)

Comment 24 Marja Van Waes 2012-12-30 13:15:25 CET

Thanks to https://forums.mageia.org/en/viewtopic.php?f=15&t=4097&p=29407#p29407, I learned I misunderstood this bug.

I had always thought that Anssi committed fixes, but that nothing was submitted because of some error handling issue.

Sorry about that.

Tbh, I don't understand what is left of this bug (some error handling issue - comment 15 - ...what impact does that have?)

And I don't understand tv's last comment either.

(In reply to comment #22)
> BTW, I think this might break windows booting after resizing (two people
> reported to me windows failed to boot after resizing it for installing Mageia)

What is "this" ?

I don't have a clear view of the current situation. Is there anything documentation team should say in installer diskdrake help or MCC diskdrake help? 

If so, what?

Comment 25 AL13N 2012-12-30 20:16:07 CET

hmm, i was under the impression it was fixed...

Comment 26 Derek Jennings 2013-01-15 20:23:59 CET

Here is what Mageia 3 beta 2 did on my last install to SSD

# fdisk -l -u /dev/sda

Disk /dev/sda: 128.0 GB, 128035676160 bytes, 250069680 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0007007a

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048    39166469    19582211   83  Linux
/dev/sda2        39169998   250067789   105448896    5  Extended
/dev/sda5        39170048    48387779     4608866   83  Linux
/dev/sda6        48390144    56564864     4087360+  82  Linux swap / Solaris
/dev/sda7        56567808   250067789    96749991   83  Linux

All start sectors are divisible by 8 and so they are correctly aligned to 2M boundaries.
The exception is the extended partition, but I think that does not matter. Correct?

CC: (none) => derekjenn

Comment 27 Marja Van Waes 2013-01-16 08:54:16 CET

Thanks, Derek :)

So the bug is fixed in cauldron (unless that exception for the extended partition /does/ matter), only maybe not for windows resizing.

If it is reported again, it would be good to have the output of
# fdisk -l -u /dev/sda
for two partitioned SSD disks: one where the resizing was done from within windows, and the other where it was done by diskdrake, and where starting windows failed.

Keywords: PATCH => NEEDINFO
Summary: Partitioning issues with 4k drives ("advanced format") => Resizing Windows partition on SSD might lead to unbootable windows.

Comment 28 Derek Jennings 2013-01-16 12:44:00 CET

Correction to comment 26

All the partitions' start sectors are divisible by 2048, so the partitions are aligned on 1MB boundaries.

This is OK for most but not all SSDs. See comment 19.
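
The divisibility check can be reproduced with a short Python sketch. The start sectors are copied from the fdisk listing in comment 26; the 512-byte logical sector size is the one fdisk reported there, and the helper name `is_aligned` is hypothetical.

```python
SECTOR = 512
MIB = 1024 * 1024

# Start sectors from the fdisk -l -u output in comment 26.
starts = {
    "sda1": 2048,
    "sda2": 39169998,   # the extended partition
    "sda5": 39170048,
    "sda6": 48390144,
    "sda7": 56567808,
}

def is_aligned(start_sector, boundary_bytes=MIB, sector_size=SECTOR):
    """True if the partition's byte offset is a multiple of the boundary."""
    return (start_sector * sector_size) % boundary_bytes == 0

for name, start in starts.items():
    print(name, "1MiB:", is_aligned(start), "2MiB:", is_aligned(start, 2 * MIB))
```

Every data partition passes the 1 MiB check and only the extended partition container fails it, while the 2 MiB check fails for sda1 and sda7.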

Comment 29 Marja Van Waes 2013-01-16 13:04:23 CET

(In reply to comment #28)
> Correction to comment 26
> 
> All the partitions  start sector are divisible by 2048 so the partitions are
> aligned on 1MB boundaries
> 
> This is OK for most but not all SSD's  See Comment 19

1024 KiB = 1 MiB
2048 KiB = 2 MiB

I'd say that is 2 MiB boundaries. What am I missing?
There is more than 2MiB between partitions (only sda5 starts close to the beginning of the Extended partition it is in)

About Windows, I just heard that Windows puts its Master File Table at the end of the Windows partition.
So if that partition is made smaller to free up space for Mageia, the MFT should somehow be moved to the new Windows partition end :/

Comment 30 Marja Van Waes 2013-01-16 13:05:58 CET

OOPS

You're right, it is not KiB, but only half a KiB

Thierry Vignaud 2013-01-16 21:20:18 CET

CC: (none) => anssi.hannula

Comment 31 Marja Van Waes 2013-01-16 21:56:58 CET

@ Anssi,

Sorry for making a mess of this bug (I don't feel too guilty, though, because I had asked for clarification in comment 24 :þ )

Feel free to revert the change I made to the summary of this bug, if that is better.

And about being clear: in comment 30 I talked about sector size.

Comment 32 Dave Hodgins 2013-04-05 04:49:08 CEST

(In reply to Richard Houser from comment #19)
> I just got a new SSD, and it is my understanding that the 25nm process chips
> like in the Vertex3 120GB line use a 2MB (rather than 1MB, like the
> quick-fix mentions) erase block size.  So, for me, I need alignment on 2MB
> boundaries.

Do you have a link to any documentation confirming that it's using
a 2MB erase block size?

All documentation I've found for ssd drives indicates they use from
16 to 512 kb erase blocks.  Newer hard drives are using 4kb
physical sector sizes.

The 1MB alignment was chosen because each of those erase block sizes, as well
as 512 byte and 4kb sectors, fits a whole number of times into a 1MB area.
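
That reasoning can be checked with a few lines of Python. The sketch assumes the erase block sizes in the 16 to 512 kb range are powers of two, which the comment does not state explicitly; the helper name `fits_evenly` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB

sector_sizes = [512, 4 * KIB]
# Assumed power-of-two erase block sizes between 16 KiB and 512 KiB.
erase_blocks = [n * KIB for n in (16, 32, 64, 128, 256, 512)]

def fits_evenly(alignment_bytes, unit_sizes):
    """True if the alignment is a whole multiple of every unit size."""
    return all(alignment_bytes % unit == 0 for unit in unit_sizes)

print(fits_evenly(MIB, sector_sizes + erase_blocks))   # True
# A 2 MiB erase block, as suggested in comment 19, would not fit:
print(fits_evenly(MIB, [2 * MIB]))                     # False
```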

CC: (none) => davidwhodgins

Comment 33 Richard Houser 2013-04-06 00:36:27 CEST

(In reply to Dave Hodgins from comment #32)
> (In reply to Richard Houser from comment #19)
> > I just got a new SSD, and it is my understanding that the 25nm process chips
> > like in the Vertex3 120GB line use a 2MB (rather than 1MB, like the
> > quick-fix mentions) erase block size.  So, for me, I need alignment on 2MB
> > boundaries.
> 
> Do you have a link to any documentation confirming that it's using
> a 2MB erase block size?
> 
> All documentation I've found for ssd drives indicate they use from
> 16 to 512 kb erase blocks.  Newer hard drives are using 4kb
> physical sector sizes.
> 
> The 1MB alignment was chosen, as all of the erase block sizes, and
> 512 byte, and 4kb sectors will all fit an exact number in a 1MB area.


Some manufacturers seem to be very tight-lipped regarding SSD erase block sizes, but here are a couple examples with the OCZ hardware.  I actually have the Agility3 unit as opposed to the Vertex3, btw.

http://superuser.com/questions/492084/is-alignment-to-erase-block-size-needed-for-modern-ssds

http://www.ocztechnologyforum.com/forum/showthread.php?95819-Block-Sizes-Vertex-3

Comment 34 Dave Hodgins 2013-04-07 00:37:59 CEST

I have an OCZ-AGILITY4.  hdparm -i shows
Logical  Sector size:                   512 bytes
Physical Sector size:                   512 bytes

The forum posts seem to be mixing up 2kb, 2mb, 2048 sectors (which
is 1MB), etc.

I'd really like to see an actual specification sheet confirming that
it's using a 2MB erase block size.

If the drive really does need 2MB alignment, anyone using gparted,
diskdrake, or windows 7, with the defaults, will have problems, as they
all use 1MB alignment by default.

Comment 35 Dave Hodgins 2013-04-07 01:13:23 CEST

(In reply to Richard Houser from comment #33)
> Some manufacturers seem to be very tight-lipped regarding SSD erase block
> sizes, but here are a couple examples with the OCZ hardware.  I actually
> have the Agility3 unit as opposed to the Vertex3, btw.

Can you run a test on it?

Create a partition starting at sector 2048 (aka 1MB alignment), and a second
partition starting at a sector that is a multiple of 4096 (aka 2MB alignment).

Create a test file in ram with
dd if=/dev/urandom of=/dev/shm/test.data bs=1M count=64

Then test the write speed with
time dd if=/dev/shm/test.data of=/path/to/partition/test.data
for both partitions and report the results here.

Comment 36 Richard Houser 2013-04-07 02:36:26 CEST

(In reply to Dave Hodgins from comment #34)
> I have an OCZ-AGILITY4.  hdparm -i shows
> Logical  Sector size:                   512 bytes
> Physical Sector size:                   512 bytes
> 
> The forum posts seem to be mixing up 2kb, 2mb, 2048 sectors (which
> is 1MB), etc.
> 
> I'd really like to see an actual specification sheet confirming that
> it's using a 2MB erase block size.
> 
> If the drive really does need 2MB alignment, anyone using gparted,
> diskdrake, or windows 7, with the defaults, will have problems, as they
> all use 1MB alignment by default.

Unfortunately, you can't trust the sector information coming from hdparm.  So many drives have been configured to lie for the sake of Windows XP, and someone had the bright idea to misreport the physical size, too.  For example, I have five different models of western digital hard drives with 4Kb physical sectors, and all but two report 512b to the machine.  I don't see why SSDs would be different.

I've tried a few times to get the official erase block size from OCZ, but their reps just gave me some BS about it being a trade secret, etc.

Comment 37 Richard Houser 2013-04-07 03:37:02 CEST

(In reply to Dave Hodgins from comment #35)
> (In reply to Richard Houser from comment #33)
> > Some manufacturers seem to be very tight-lipped regarding SSD erase block
> > sizes, but here are a couple examples with the OCZ hardware.  I actually
> > have the Agility3 unit as opposed to the Vertex3, btw.
> 
> Can you run a test on it?
> 
> Create a partition starting at sector 2048 (aka 1MB alignment), and a second
> partition starting at a sector that is a multiple of 4096 (aka 2MB
> alignment),
> 
> Create a test file in ram with
> dd if=/dev/urandom of=/dev/shm/test.data bs=1M count=64
> 
> Then test the write speed with
> time dd if=/dev/shm/test.data of=/path/to/partition/test.data
> for both partitions and report the results here.

I don't think that would accomplish anything.  Unlike a hard disk, there is the additional erase process to deal with.  So, normal writes happen at a page level, and erases happen at a higher level.  In this case, there seems to be an online consensus that the erase block for this SSD generation is 256 pages.

As long as the drive (not just the OS) has free space available, writes should happen at optimal speed anytime the write matches an even multiple of the page size.  So, 4KB writes (for example) should happen at the same speed in either partition.

To empirically test this (something I'm not prepared to do at the moment), I think this would work a bit better:

1.) Zero the entire drive.
2.) Pass the appropriate TRIM commands to the drive to let the hardware physically erase the data.  This should take a long time to run.
3.) Create the partitions you specified.
4.) Format each with the same filesystem supporting TRIM (ext4, for example).  However, you would need to ensure online TRIM is not enabled.
5.) Write a very large number of 2MB (needs to be exact) files to each partition (fill the partitions).
6.) Randomly select, and delete a small fraction of files for the first partition (exactly 10%, perhaps?).
7.) fsync
8.) Run a timed, batch trim operation on the drive.
9.) Repeat step 6, only for the other partition.  Delete the exact same number of files.
10.) fsync
11.) Run a timed, batch trim operation on the drive.
12.) Go back to step #5 a total of two more times, to make sure the numbers are consistent.


Analysis...

If the trim command runs a LOT faster for the 1MB alignment, I think that means the drive didn't actually complete the TRIM, due to those erase blocks still containing data.

If they took the same amount of time and were extremely fast, I think that means the trim failed for both cases and the alignment is too small.

If they took the same amount of time and were much slower, I think that means the trim succeeded for both cases, and thus either alignment works.




It all comes down to the page size.

If the page size is 2KB (from 2007 - ex. google "typical mlc page size" and pick the micron link), the erase block is 512KB.  If the page size is 8KB (like ours? - ex. http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed/2), we're actually looking at 2MB alignments.
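
The arithmetic behind these two cases can be written out as a small Python sketch. It assumes 256 pages per erase block, the figure quoted earlier in this comment from online consensus rather than from a vendor specification; the function name is hypothetical.

```python
KIB = 1024
PAGES_PER_ERASE_BLOCK = 256   # assumption: the online-consensus figure, not a spec value

def erase_block_bytes(page_size_bytes, pages=PAGES_PER_ERASE_BLOCK):
    """Erase block size is the page size times the pages per erase block."""
    return page_size_bytes * pages

for page_kib in (2, 8):
    block = erase_block_bytes(page_kib * KIB)
    print(f"{page_kib} KiB pages -> {block // KIB} KiB erase block")
```

With these inputs a 2 KiB page gives a 512 KiB erase block and an 8 KiB page gives 2048 KiB, i.e. 2 MiB.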

An article at http://www.anandtech.com/show/6388/intel-ssd-335-240gb-review references an in-development Intel chip that will use 4MB erase block sizes, btw.

Comment 38 Marja Van Waes 2013-04-08 11:33:27 CEST

Changing summary back to the original one and closing as fixed, sorry for having contributed to mixing issues in this report.

See https://ml.mageia.org/l/arc/doc-discuss/2013-04/msg00037.html

(click this link again after confirming you're not a spammer)

I'll try to open the enhancement request for diskdrake to allow the user to specify the alignment to use, later today (please ping me if I forget)

(In reply to Thierry Vignaud from comment #22)
> BTW, I think this might break windows booting after resizing (two people
> reported to me windows failed to boot after resizing it for installing
> Mageia)

@ Thierry,

if you can get more information about this, can you then please open a new bug report for it?

Status: NEW => RESOLVED
Resolution: (none) => FIXED
Summary: Resizing Windows partition on SSD might lead to unbootable windows. => Partitioning issues with 4k drives ("advanced format")
Whiteboard: 3beta1 (Mga2) => (Mga2)

Comment 39 Richard Houser 2014-12-31 06:00:59 CET

I realize this bug is now closed, but I found references to the new 19nm process keeping the same 2MiB settings, so I'm including them here in case alignment needs to be revisited later.

http://www.anandtech.com/show/4284/sandisktoshiba-take-back-the-crown-with-a-different-kind-of-nand

I'll verify the default partitioning when 5-beta2 drops, and open a new bug with reference if required.
