CT320 Storage
CT 320: Network and System Administration
Colorado State University
Computer Science Department
Original slides from Dr. James Walden at Northern Kentucky University.
Topics
- Disk interfaces
- Disk components
- Performance
- Reliability
- RAID
- Adding a disk
- Logical volumes
- Filesystems
Disk Interfaces
- SCSI
- Standard interface for servers
- IDE, EIDE
- Historical interface for PCs
- SATA
- Serial ATA standard on PCs
- Fibre Channel
- High bandwidth, SCSI or ATA
- USB
- Fast enough for slow devices on PCs
SCSI
- Small Computer Systems Interface
- Pronounced “scuzzy”
- Fast, reliable, expensive
- A bus, not a simple PC to device interface
- Each device has a target # ranging 0–7 or 0–15.
- Devices can communicate directly without CPU
- Many versions
- Original: SCSI-1 (1986), 5 MB/s
- Current: SCSI-3 (2003), 640 MB/s
- Serial Attached SCSI (SAS)
- Up to 128 devices
- Up to 750 MB/s full duplex
IDE
- Integrated Drive Electronics / AT attachment
- Slower, less reliable, cheap
- Only allows 2 devices per interface
- ATAPI standard added removable devices
- Many versions
- Original: IDE / ATA (1984) — 16.7 MB/s
- Current: Ultra-ATA/167 (2010) — 167MB/s
- Serial ATA
- One device per port (up to 15 with a port multiplier)
- Original: SATA Revision 1.0 — 150 MB/s
- Current: SATA Revision 3.0 — 600 MB/s
SATA vs. SCSI
- SCSI offers better performance/scale
- Faster bus
- Faster hard drives (up to 15,000rpm)
- Lower CPU usage
- Better handling of multiple requests
- SATA often best for workstations
- Convergence
- SATA2 and SAS converging on a single standard
Hard Drive Components
- Actuator
- Moves arm across disk to read/write data.
- Arm has multiple read/write heads (often 2 per platter).
- Spindle Motor
- Spins platters ~7200 rpm
- Rotational speed determines rotational delay
Hard Drive Components
- Platters
- Rigid substrate material
- Thin magnetic material coating stores data
- Coating type determines density: 1.34 Tbit/in² in 2015
- Cache
- hard disk: 8–256MB cache, SSD: 4GB cache
- Reliability: write-through vs. write-back
- Write-through: write to both cache & disk
- Write-back (alias write-behind): write to cache, posting to disk later, as needed
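The difference between the two cache policies can be sketched by counting disk writes. This is a toy model, not how any real drive firmware works: the function name disk_writes and the block numbers are made up for illustration, and write-back is assumed to flush each dirty block exactly once.

```shell
#!/usr/bin/env bash
# Count the disk writes each policy issues for a sequence of block writes.
# Arguments: policy (through|back), then block numbers in write order.
disk_writes() {
    local policy=$1; shift
    local writes=0 b
    local dirty=" "                       # space-delimited set of dirty blocks
    for b in "$@"; do
        if [ "$policy" = through ]; then
            writes=$(( writes + 1 ))      # write-through: cache and disk every time
        else
            case $dirty in
                *" $b "*) ;;              # block is already dirty in the cache
                *) dirty="$dirty$b " ;;   # write-back: only mark the block dirty
            esac
        fi
    done
    if [ "$policy" = back ]; then
        # Flush: each dirty block is posted to disk once.
        for b in $dirty; do writes=$(( writes + 1 )); done
    fi
    echo "$writes"
}

disk_writes through 7 7 7 3   # 4
disk_writes back    7 7 7 3   # 2
```

Write-back saves disk writes when the same block is rewritten, but anything still dirty in the cache is lost if power fails before the flush.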
Disk Information: hdparm
# hdparm -i /dev/sda1
/dev/sda1:
Model=Hitachi HTS543216L9A300, FwRev=FB2OC40C, SerialNo=081107FB2232LCHTGKLA
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
BuffType=DualPortCache, BuffSize=7114kB, MaxMultSect=16, MultSect=16
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=312581808
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=yes: mode=0x80 (128) WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7
* signifies the current active mode
Disk Performance
- Seek Time
- Time to move head to desired track (~9 ms)
- Rotational Delay
- Time until head over desired block (avg ~4.2 ms at 7200 rpm; a full rotation takes ~8.3 ms)
- Latency
- Seek Time + Rotational Delay
- Throughput
- Data transfer rate (~1Gb/s)
- Affected by both rotational speed & data density
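The rotational delay and latency figures can be checked with a little arithmetic. A minimal sketch, assuming the average rotational delay is half a revolution; the function names rot_delay_ms and latency_ms are hypothetical, and awk is used because bash has no floating point.

```shell
#!/usr/bin/env bash
# Average rotational delay in ms: half of one revolution at the given rpm.
rot_delay_ms() {
    awk -v rpm="$1" 'BEGIN { printf "%.2f\n", (60000 / rpm) / 2 }'
}

# Latency in ms = seek time + rotational delay.
latency_ms() {
    awk -v seek="$1" -v rot="$2" 'BEGIN { printf "%.2f\n", seek + rot }'
}

rot_delay_ms 7200                      # 4.17
rot_delay_ms 15000                     # 2.00
latency_ms 9 "$(rot_delay_ms 7200)"    # 13.17
```

At 7200 rpm a full revolution takes about 8.33 ms, so the average wait is about 4.17 ms.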
Latency vs. Throughput
- Which is more important?
- Depends on the type of application
- Sequential access (Throughput)
- Multimedia on a single workstation
- Random access (Latency)
- Content on most web servers
- How to improve performance
- Faster disks
- Disk Caching
- More spindles (disks)
- More disk controllers
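A rough way to see why the access pattern matters: compare reading the same data sequentially and as small random reads. This is a back-of-the-envelope sketch that assumes every random read pays the full average latency and that transfer time for a small block is negligible. The function names are hypothetical; the 13.17 ms latency assumes a 9 ms seek plus half a revolution at 7200 rpm, and 57 MB/s is the buffered read rate from the hdparm sample.

```shell
#!/usr/bin/env bash
# Random reads per second if each read costs one full latency (ms).
random_iops() {
    awk -v lat="$1" 'BEGIN { printf "%.0f\n", 1000 / lat }'
}

# Seconds to stream size_mb at rate_mb_s.
seq_seconds() {
    awk -v size="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", size / rate }'
}

# Seconds to perform count random reads at lat ms each.
random_seconds() {
    awk -v n="$1" -v lat="$2" 'BEGIN { printf "%.2f\n", n * lat / 1000 }'
}

random_iops 13.17             # 76
seq_seconds 100 57            # 1.75
random_seconds 25600 13.17    # 337.15  (100 MiB as 25600 reads of 4 KiB)
```

Under these assumptions the same 100 MB takes under two seconds sequentially and over five minutes as random 4 KiB reads, which is why random-access workloads benefit from caching and more spindles.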
Disk Performance: hdparm
# hdparm -tT /dev/sda1
/dev/sda1:
Timing cached reads: 1256 MB in 2.00 seconds = 627.87 MB/sec
Timing buffered disk reads: 172 MB in 3.02 seconds = 57.00 MB/sec
Reliability
- MTBF
- Mean Time Between Failures (>100,000 hours)
- Real failure curves
- Early phase: high failure rate from defects
- Constant failure rate phase: MTBF valid
- Wearout phase: high failure rate from wear
- Failures more likely on traumatic events
- Systems often wear out before MTBF
- However, disk drives still crash!
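An MTBF figure is easier to interpret as a yearly failure probability. A minimal sketch, valid only under the constant failure rate assumption (the middle phase of the failure curve), where the chance of failing within t hours is 1 - exp(-t/MTBF). The function name afr_percent is hypothetical.

```shell
#!/usr/bin/env bash
# Annualized failure rate (percent) for a drive with the given MTBF in hours,
# assuming a constant failure rate and 8760 powered-on hours per year.
afr_percent() {
    awk -v mtbf="$1" 'BEGIN { printf "%.2f\n", (1 - exp(-8760 / mtbf)) * 100 }'
}

afr_percent 100000     # 8.39
afr_percent 1000000    # 0.87
```

So a 100,000-hour MTBF does not mean a drive lasts 11 years; it means roughly 8% of such drives would be expected to fail in a given year.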
RAID
- Redundant Array of Independent Disks (originally “Inexpensive Disks”)
- Can be implemented in hardware or software.
- Hardware RAID controllers:
- Support caching
- Higher capacity
- Higher reliability
- Better throughput
RAID Levels
- RAID 0: Striped evenly for performance
- MTBF = (average MTBF)/# disks
- JBOD: Concatenated for capacity
- Or does it mean no RAIDing at all?
- Only data on bad disk is lost, no performance effect
- RAID 1: Mirrored for reliability
- Every write goes to each disk of set
- Seek time halved as reads split between disks
- RAID 0 + 1: Striped + mirrored
- RAID 5: Striped with parity
- Block striping, not disk striping
- Can lose one disk of set without losing data.
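Two of the points above lend themselves to a quick calculation: the RAID 0 MTBF formula, and how RAID 5 parity lets one lost block be rebuilt. The sketch below uses XOR parity on single bytes; the function names and byte values are made up for illustration, and a real array works on whole blocks and rotates parity across disks.

```shell
#!/usr/bin/env bash
# RAID 0: any disk failure loses the set, so MTBF = (disk MTBF) / (# disks).
raid0_mtbf() {
    echo $(( $1 / $2 ))
}

# XOR all arguments together and print the result as a hex byte.
xor_all() {
    local acc=0 v
    for v in "$@"; do acc=$(( acc ^ v )); done
    printf '0x%02X\n' "$acc"
}

raid0_mtbf 100000 4                 # 25000

# Parity for three data bytes, one per disk.
parity=$(xor_all 0xA5 0x3C 0x0F)
echo "$parity"                      # 0x96

# The disk holding 0x3C fails: XOR the survivors with the parity.
xor_all 0xA5 0x0F "$parity"         # 0x3C
```

Because XOR is its own inverse, the same operation computes the parity and rebuilds a missing block, which is why RAID 5 survives exactly one disk failure per set.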
Redundancy
Adding a Disk
- Install new hardware
- Verify disk recognized by BIOS.
- Find the device name:
cat /proc/partitions
df -h
- Partition:
fdisk /dev/sdb
or gparted /dev/sdb
- Create filesystem:
mkfs -v -t ext4 /dev/sdb1
- Add to /etc/fstab:
/dev/sdb1 /proj ext4 defaults 0 2
- Mount all disks:
mount -a
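Editing /etc/fstab by hand is the step most likely to go wrong, so it can help to script it. A minimal sketch, assuming a helper named add_fstab_entry (hypothetical) that appends an entry only if the mount point is not already listed. It is demonstrated on a scratch file rather than the real /etc/fstab.

```shell
#!/usr/bin/env bash
# Append an entry to an fstab-style file unless the mount point is already there.
# Arguments: file device mountpoint fstype
add_fstab_entry() {
    local file=$1 dev=$2 mnt=$3 type=$4
    # awk exits 0 if a non-comment line already uses this mount point.
    if awk -v m="$mnt" '$1 !~ /^#/ && $2 == m { found = 1 } END { exit !found }' "$file"; then
        echo "already present: $mnt" >&2
        return 1
    fi
    printf '%s %s %s defaults 0 2\n' "$dev" "$mnt" "$type" >> "$file"
}

scratch=$(mktemp)                               # stand-in for /etc/fstab
add_fstab_entry "$scratch" /dev/sdb1 /proj ext4
cat "$scratch"                                  # /dev/sdb1 /proj ext4 defaults 0 2
rm -f "$scratch"
```

On a real system you would run this as root against /etc/fstab after making a backup copy, then run mount -a to check that the new entry mounts cleanly.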
When don’t you need a filesystem?
- Swap space
mkswap -v /dev/sdb1
- Server applications (use their own)
- Oracle (OCFS)
- VMware Server (VMFS)
Logical Volumes
- What are logical volumes?
- Appear to user as a physical volume.
- But can span multiple partitions and/or disks.
- Why logical volumes?
- Aggregate disks for performance/reliability.
- Grow and shrink logical volumes on the fly.
- Move logical volumes between physical devices.
- Replace volumes without interrupting service.
LVM
Logical Volume Manager
LVM Components
- Logical Volume Group (LVG)
- Set of physical volumes (partitions or disks.)
- May be divided into logical volumes (LVs.)
- LVs made up of fixed sized logical extents
- Each LE is 4MB.
- Physical extents are the same size.
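Since LVs are built from whole extents, a requested size is rounded up to a multiple of the extent size. A minimal sketch of that arithmetic, assuming the 4 MB extent size given above; the function name extents_for is hypothetical.

```shell
#!/usr/bin/env bash
# Number of logical extents needed for an LV.
# Arguments: size in MiB, extent size in MiB (defaults to 4).
extents_for() {
    local size_mib=$1 le_mib=${2:-4}
    echo $(( (size_mib + le_mib - 1) / le_mib ))   # round up to whole extents
}

extents_for $(( 100 * 1024 ))   # 100 GiB -> 25600 extents
extents_for 10                  # 10 MiB  -> 3 extents (12 MiB allocated)
```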
Mapping Modes
- Linear Mapping
- LVs assigned to contiguous areas of PV space.
- Striped Mapping
- LEs interleaved across PVs to improve performance.
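The two mapping modes can be sketched as index arithmetic from a logical extent number to a (physical volume, physical extent) pair. This is a simplified model that follows the description above and interleaves whole extents; the function names are hypothetical, and real LVM striping may interleave at a smaller stripe size.

```shell
#!/usr/bin/env bash
# Linear: fill one PV completely, then move to the next.
# Arguments: logical extent number, physical extents per PV.
linear_map() {
    local le=$1 per_pv=$2
    echo "pv$(( le / per_pv )) pe$(( le % per_pv ))"
}

# Striped: deal logical extents round-robin across the PVs.
# Arguments: logical extent number, number of PVs.
striped_map() {
    local le=$1 npv=$2
    echo "pv$(( le % npv )) pe$(( le / npv ))"
}

linear_map 5 4     # pv1 pe1
striped_map 5 2    # pv1 pe2
```

With striping, consecutive extents land on different disks, so a large sequential read keeps several spindles busy at once.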
Setting up an LVG and LV
- Initialize physical volumes
# pvcreate /dev/hda1
# pvcreate /dev/hdb1
- Initialize a volume group
vgcreate nku_proj /dev/hda1 /dev/hdb1
- Use vgextend to add more PVs later.
- Create a logical volume
lvcreate -n nku1 --size 100G nku_proj
- Create a filesystem
mkfs -v -t ext3 /dev/nku_proj/nku1
Extending a LV
- Set the LV size to 120G
lvextend -L120G /dev/nku_proj/nku1
- Or grow it by 20G
lvextend -L+20G /dev/nku_proj/nku1
- Expand the filesystem without unmounting
resize2fs /dev/nku_proj/nku1
df -h
Swap
- Can use swapfile instead of swap partition
dd if=/dev/zero of=/swapfile bs=1024k count=512
chmod 600 /swapfile
mkswap /swapfile
- Enable swap
swapon /swapfile
swapon /dev/sda2
- Disable swap
swapoff /swapfile
swapoff /dev/sda2
- Check swap resource usage
cat /proc/swaps
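The output of /proc/swaps is a small table with one header line, and sizes in KiB, so it is easy to total. A minimal sketch, assuming a helper named swap_summary (hypothetical); the sample values below are made up for illustration and are not from a real system.

```shell
#!/usr/bin/env bash
# Total the Size and Used columns (KiB) of a /proc/swaps-style table on stdin.
swap_summary() {
    awk 'NR > 1 { size += $3; used += $4 }
         END { printf "total=%d KiB used=%d KiB\n", size, used }'
}

# Illustrative sample in the /proc/swaps layout.
swap_summary <<'EOF'
Filename     Type       Size     Used  Priority
/dev/sda2    partition  2097148  1024  -2
/swapfile    file       524284   0     -3
EOF
# total=2621432 KiB used=1024 KiB
```

On a live Linux system the same function can be run as swap_summary < /proc/swaps.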
Filesystems
- ext2
- Old Linux non-fragmenting fast filesystem
- Can be converted to ext3 by adding a journal:
tune2fs -j /dev/sda1
- ext3
- Journaling “eliminates” need for fsck
- ext4
- Current common Linux filesystem
- Big files (16TB)
- Extents (range of contiguous physical blocks)
- 34-bit seconds + nanosecond timestamps
2038
$ perl -wle 'print 0x7fffffff'
2147483647
$ date -d 'january 1, 1970 + 2147483647 seconds'
Tue Jan 19 03:14:07 MST 2038
You probably don’t remember January 1st, 2000,
but you sure will remember January 19, 2038.
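The same limit can be derived in the shell, working in UTC so that the result does not depend on the local time zone. A minimal sketch, assuming GNU date (the -d @SECONDS form is a GNU extension); the function name epoch_to_utc is hypothetical.

```shell
#!/usr/bin/env bash
# Largest value of a signed 32-bit seconds counter.
max32=$(( (1 << 31) - 1 ))
echo "$max32"                  # 2147483647

# Convert seconds since the epoch to a UTC timestamp (GNU date assumed).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 0                 # 1970-01-01 00:00:00 UTC
epoch_to_utc "$max32"          # 2038-01-19 03:14:07 UTC
```

One second after that moment, a signed 32-bit counter overflows, which is why ext4 widened its timestamps.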
Other Filesystems
- tmpfs, ramfs: all in memory
- vfat, ntfs: Windows
- exFAT: flash drive (spreads out the work)
- hfs: Mac OS
- procfs: /proc
- cramfs, squashfs: Read-only compressed file systems
- ISO9660: CD-ROM & DVD-ROM disks
- UDF: CD-RW & DVD-RW
Mounting
# mount /dev/sda1 /mnt
# df -h /mnt
- Automatic mounting
- Unmount
umount /dev/sda1
- Cannot unmount a volume in use.
fstab
# /etc/fstab: static file system information.
#
# <file system>                            <mount point>  <type>  <options>  <dump>  <pass>
UUID=77f85028-b4c1-4439-be1c-5a3ba7f59dd1  /              ext3    defaults   0       1
LABEL=windows                              /win           vfat    user,rw    0       0
/dev/hdc8                                  /home          ext3    defaults   0       2
/dev/hdc7                                  none           swap    sw         0       0
proc                                       /proc          proc    defaults   0       0
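The last column, pass, controls boot-time checking: 0 means the filesystem is never checked, 1 is for the root filesystem, and 2 is for everything checked after it. A minimal sketch that lists mount points in that order, assuming a helper named fsck_order (hypothetical) that takes the path of an fstab-style file.

```shell
#!/usr/bin/env bash
# Print mount points in fsck order: entries with pass 0 are skipped,
# the rest are sorted by the numeric pass field (column 6).
# Argument: path to an fstab-style file.
fsck_order() {
    awk '$1 !~ /^#/ && NF >= 6 && $6 > 0 { print $6, $2 }' "$1" |
        sort -n |
        awk '{ print $2 }'
}

# Example: fsck_order /etc/fstab
```

For the sample file above this would print / and then /home; the vfat, swap, and proc entries have pass 0 and are skipped.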
/etc/fstab first field
The first field of an /etc/fstab entry can be:
- An actual device, e.g., /dev/sda1
Fails if you add another disk and the order changes.
- UUID=whatever
Robust, but cryptic. Obtain via blkid.
- LABEL=whatever
Robust and self-documenting. Obtain via blkid; set via mkfs.type, e2label, fatlabel, etc.
- Ignored for pseudo-filesystems such as proc or tmpfs.
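The forms the first field can take are easy to tell apart by prefix. A minimal sketch using a shell case statement; the function name classify_fs_spec and its category names are hypothetical.

```shell
#!/usr/bin/env bash
# Classify the first field of an fstab entry by its prefix.
classify_fs_spec() {
    case $1 in
        UUID=*)  echo uuid ;;     # robust, but cryptic
        LABEL=*) echo label ;;    # robust and self-documenting
        /dev/*)  echo device ;;   # breaks if device order changes
        *)       echo other ;;    # e.g. proc or tmpfs, where the field is ignored
    esac
}

classify_fs_spec UUID=77f85028-b4c1-4439-be1c-5a3ba7f59dd1   # uuid
classify_fs_spec LABEL=windows                               # label
classify_fs_spec /dev/hdc8                                   # device
classify_fs_spec proc                                        # other
```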
fsck: check + repair fs
- Filesystem corruption sources
- Power failure
- System crash
- Types of corruption
- Unreferenced inodes.
- Bad superblocks.
- Unused data blocks not recorded in block maps.
- Data blocks listed as free that are used in files.
- fsck can fix these and more
- Asks user to make more complex decisions.
- Stores unfixable files in lost+found
- And where is this lost+found, precisely?
Lots of filesystem flavors
$ cd /sbin
$ ls -F mkfs.*
mkfs.cramfs* mkfs.ext3* mkfs.fat* mkfs.msdos@ mkfs.vfat@
mkfs.ext2* mkfs.ext4* mkfs.minix* mkfs.ntfs@ mkfs.xfs*
$ ls -F fsck.*
fsck.cramfs* fsck.ext3* fsck.fat* fsck.msdos@ fsck.vfat@
fsck.ext2* fsck.ext4* fsck.minix* fsck.ntfs@ fsck.xfs*
$ ls -F mount.*
mount.cifs* mount.glusterfs* mount.nfs4@ mount.ntfs-fuse@
mount.fuse* mount.lowntfs-3g@ mount.ntfs@ mount.smb3@
mount.fuse3* mount.nfs* mount.ntfs-3g@