Chapter 8: Storage
CT 320: Network and System Administration
Colorado State University
Computer Science Department
Original slides from Dr. James Walden at Northern Kentucky University.
Topics
- Disk interfaces
- Disk components
- Performance
- Reliability
- RAID
- Adding a disk
- Logical volumes
- Filesystems
Disk Interfaces
- SCSI
- Standard interface for servers
- IDE, EIDE
- Historical interface for PCs
- SATA
- Serial ATA standard on PCs
- Fibre Channel
- High bandwidth, SCSI or ATA
- USB
- Fast enough for slow devices on PCs
SCSI
- Small Computer Systems Interface
- Pronounced "scuzzy"
- Fast, reliable, expensive
- A bus, not a simple PC to device interface
- Each device has a target # ranging 0–7 or 0–15.
- Devices can communicate directly without CPU
- Many versions
- Original: SCSI-1 (1979) — 5MB/s
- Current: SCSI-3 (2003) — 640MB/s
- Serial Attached SCSI (SAS)
- Up to 128 devices
- Up to 750 MB/s full duplex
IDE
- Integrated Drive Electronics / AT attachment
- Slower, less reliable, cheap
- Only allows 2 devices per interface
- ATAPI standard added removable devices
- Many versions
- Original: IDE / ATA (1984) — 16.7 MB/s
- Current: Ultra-ATA/167 (2010) — 167MB/s
- Serial ATA
- Up to 128 devices
- Original: SATA Revision 1.0 — 150 MB/s
- Current: SATA Revision 3.0 — 600 MB/s
SATA vs. SCSI
- SCSI offers better performance/scale
- Faster bus
- Faster hard drives (up to 15,000rpm)
- Lower CPU usage
- Better handling of multiple requests
- SATA often best for workstations
- Convergence
- SATA2 and SAS converging on a single standard
Hard Drive Components
- Actuator
- Moves arm across disk to read/write data.
- Arm has multiple read/write heads (often 2 per platter).
- Platters
- Rigid substrate material
- Thin magnetic material coating stores data
- Coating type determines density: Gbits/in²
- Spindle Motor
- Spins platters from 3600–15,000 rpm
- Speed determines rotational latency
- Cache
- 2–16MB of cache memory
- Reliability: write-back vs. write-through
Disk Information: hdparm
# hdparm -i /dev/hde
/dev/hde:
Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WDWMA8C4533667
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version:
* signifies the current active mode
Disk Performance
- Seek Time
- Time to move head to desired track (3–8 ms)
- Rotational Delay
- Time until head over desired block (up to 8.3 ms, one full rotation, at 7200 rpm; about 4.2 ms on average)
- Latency
- Seek Time + Rotational Delay
- Throughput
- Data transfer rate (up to 600 MB/s)
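The latency formula above can be worked through numerically. This is a minimal sketch, not a benchmark: the function name disk_latency_ms and the sample seek times are assumptions chosen for illustration, and the rotational delay is taken as half a revolution on average.

```shell
# Average access latency = seek time + average rotational delay.
# usage: disk_latency_ms RPM SEEK_MS
disk_latency_ms() {
    awk -v rpm="$1" -v seek="$2" 'BEGIN {
        rotation_ms = 60000 / rpm       # time for one full revolution
        avg_rot_ms  = rotation_ms / 2   # on average the block is half a turn away
        printf "%.2f\n", seek + avg_rot_ms
    }'
}

disk_latency_ms 7200 8     # desktop-class disk, slow seek
disk_latency_ms 15000 3    # server-class disk, fast seek
```

The rotation term shrinks only with spindle speed, which is why 15,000 rpm drives matter for random-access workloads.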
Latency vs. Throughput
- Which is more important?
- Depends on the type of application
- Sequential access (Throughput)
- Multimedia on a single workstation
- Random access (Latency)
- Content on most web servers
- How to improve performance
- Faster disks
- Disk Caching
- More spindles (disks)
- More disk controllers
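Why more spindles help random access can be estimated on the back of an envelope. The sketch below assumes an idealized model in which requests spread evenly across disks and each request costs one full access latency; random_iops is a hypothetical helper name, and real arrays will fall short of these numbers.

```shell
# Rough random-I/O capacity: each disk completes 1000/latency_ms requests
# per second, and (ideally) independent spindles add up.
# usage: random_iops LATENCY_MS NDISKS
random_iops() {
    awk -v lat="$1" -v n="$2" 'BEGIN { printf "%.0f\n", n * 1000 / lat }'
}

random_iops 10 1    # one disk at 10 ms per request
random_iops 10 4    # four such disks working in parallel
```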
Disk Performance: hdparm
# hdparm -tT /dev/hde
/dev/hde:
Timing cached reads:
876 MB in 2.00 seconds = 437.41 MB/sec
Timing buffered disk reads:
88 MB in 3.08 seconds = 28.60 MB/sec
Reliability
- MTBF
- Mean Time Between Failures (>100,000 hours)
- Real failure curves
- Early phase: high failure rate from defects
- Constant failure rate phase: MTBF valid
- Wearout phase: high failure rate from wear
- Failures more likely on traumatic events
- Systems often wear out before MTBF
- However, disk drives still crash!
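An MTBF figure is easier to interpret as a yearly failure probability. The sketch below assumes the constant failure rate phase (an exponential failure model) and 8760 powered-on hours per year; afr_percent is a hypothetical helper name, not a standard tool.

```shell
# Annualized failure rate (percent) for a given MTBF in hours,
# valid only while the failure rate is constant.
# usage: afr_percent MTBF_HOURS
afr_percent() {
    awk -v mtbf="$1" 'BEGIN { printf "%.2f\n", (1 - exp(-8760 / mtbf)) * 100 }'
}

afr_percent 100000
afr_percent 1000000
```

Even a 100,000 hour MTBF implies that several disks in a hundred may fail each year, which is one reason large installations see failures regularly.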
RAID
- Redundant Array of Inexpensive Disks (also expanded as Redundant Array of Independent Disks)
- Can be implemented in hardware or software.
- Hardware RAID controllers:
- Support caching
- Higher capacity
- Higher reliability
- Better throughput
RAID Levels
- RAID 0: Striped evenly for performance
- MTBF = (average MTBF)/# disks
- JBOD: Concatenated for capacity
- Or does it mean no RAIDing at all?
- Only data on bad disk is lost, no performance effect
- RAID 1: Mirrored for reliability
- Every write goes to each disk of set
- Seek time halved as reads split between disks
- RAID 0 + 1: Striped + mirrored
- RAID 5: Striped with parity
- Block striping, not disk striping
- Can lose one disk of set without losing data.
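The trade-offs between levels can be compared with simple arithmetic. This is a sketch under stated assumptions: all disks in a set are the same size, the RAID 0 MTBF uses the slide's division rule, and the function names raid0_mtbf and raid_usable are made up for illustration.

```shell
# MTBF of a RAID 0 set: any single disk failure loses the whole array.
# usage: raid0_mtbf DISK_MTBF_HOURS NDISKS
raid0_mtbf() {
    echo $(( $1 / $2 ))
}

# Usable capacity for NDISKS equal disks of DISK_SIZE (any unit).
# usage: raid_usable LEVEL NDISKS DISK_SIZE   (LEVEL: 0, jbod, 1, 01, 5)
raid_usable() {
    level=$1 n=$2 size=$3
    case "$level" in
        0|jbod) echo $(( n * size )) ;;         # no redundancy
        1)      echo "$size" ;;                 # every disk holds a full copy
        01)     echo $(( n * size / 2 )) ;;     # half the disks are mirrors
        5)      echo $(( (n - 1) * size )) ;;   # one disk's worth of parity
        *)      echo "unknown level: $level" >&2; return 1 ;;
    esac
}

raid0_mtbf 100000 4     # hours
raid_usable 5 4 500     # four 500 GB disks in RAID 5
```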
Adding a Disk
- Install new hardware
- Verify disk recognized by BIOS.
- Boot
- Verify device exists in /dev
- Partition:
fdisk /dev/sdb
- Create filesystem:
mkfs -v -t ext4 /dev/sdb1
- Add to /etc/fstab: /dev/sdb1 /proj ext4 defaults 0 2
- Mount all disks:
mount -a
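The steps above can be collected into a checklist generator. This sketch only prints the commands and never runs them, since partitioning and mkfs are destructive. The name add_disk_plan is hypothetical, the mkdir step is an addition not on the slide, and the "first partition is DISK followed by 1" naming is an assumption that does not hold for every device type.

```shell
# Print (do not execute) the commands for bringing a new disk into service.
# usage: add_disk_plan DISK MOUNTPOINT FSTYPE
add_disk_plan() {
    disk=$1 mnt=$2 fstype=$3
    part="${disk}1"     # assumed name of the first partition
    echo "fdisk $disk"
    echo "mkfs -v -t $fstype $part"
    echo "mkdir -p $mnt"
    echo "echo '$part $mnt $fstype defaults 0 2' >> /etc/fstab"
    echo "mount -a"
}

add_disk_plan /dev/sdb /proj ext4
```

Reviewing the printed plan before typing the commands helps catch a wrong device name while it is still harmless.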
When don’t you need a filesystem?
- Swap partitions
mkswap -v /dev/sdb1
- Server applications (use their own)
- Oracle (OCFS)
- VMware Server (VMFS)
Logical Volumes
- What are logical volumes?
- Appear to user as a physical volume.
- But can span multiple partitions and/or disks.
- Why logical volumes?
- Aggregate disks for performance/reliability.
- Grow and shrink logical volumes on the fly.
- Move logical volumes between physical devices.
- Replace volumes without interrupting service.
LVM
Logical Volume Manager
LVM Components
- Logical Volume Group (LVG)
- Set of physical volumes (partitions or disks.)
- May be divided into logical volumes (LVs.)
- LVs made up of fixed sized logical extents
- Each LE is 4MB by default.
- Physical extents are the same size.
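Because LVs are built from fixed-size extents, an LV size translates directly into an extent count. A minimal sketch, assuming the 4MB extent size from the slide as the default; extents_for is a hypothetical helper name.

```shell
# Number of logical extents needed for an LV of SIZE_GB gigabytes.
# usage: extents_for SIZE_GB [EXTENT_MB]
extents_for() {
    size_gb=$1
    extent_mb=${2:-4}       # default extent size: 4MB
    echo $(( size_gb * 1024 / extent_mb ))
}

extents_for 100        # a 100G LV with 4MB extents
extents_for 100 8      # the same LV with 8MB extents
```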
Mapping Modes
- Linear Mapping
- LVs assigned to contiguous areas of PV space.
- Striped Mapping
- LEs interleaved across PVs to improve performance.
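The difference between the two modes shows up in where a given logical extent lands. The sketch below follows the slide's simplified model (whole extents interleaved, equal-sized PVs); it is not how any particular LVM release lays out stripes, and map_le is a hypothetical name.

```shell
# Map logical extent number LE to a physical volume and physical extent.
# usage: map_le MODE LE NPVS PE_PER_PV   (MODE: linear or striped)
map_le() {
    mode=$1 le=$2 npvs=$3 pe_per_pv=$4
    case "$mode" in
        linear)  pv=$(( le / pe_per_pv )); pe=$(( le % pe_per_pv )) ;;  # fill one PV, then the next
        striped) pv=$(( le % npvs ));      pe=$(( le / npvs )) ;;       # round-robin across PVs
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "PV$pv PE$pe"
}

map_le linear  5 2 100
map_le striped 5 2 100
```

With striping, consecutive extents sit on different disks, so a sequential read keeps both spindles busy.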
Setting up an LVG and LV
- Initialize physical volumes
# pvcreate /dev/hda1
# pvcreate /dev/hdb1
- Initialize a volume group
vgcreate nku_proj /dev/hda1 /dev/hdb1
- Use vgextend to add more PVs later.
- Create a logical volume
lvcreate -n nku1 --size 100G nku_proj
- Create a filesystem on it
mkfs -v -t ext3 /dev/nku_proj/nku1
Extending a LV
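- Set the LV size to 120G, or grow it by 20G (run one, not both)
lvextend -L120G /dev/nku_proj/nku1
lvextend -L+20G /dev/nku_proj/nku1
- Expand the filesystem without unmounting (resize2fs replaces the obsolete ext2online)
resize2fs /dev/nku_proj/nku1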
df -h
Swap
- Can use swapfile instead of swap partition
dd if=/dev/zero of=/swapfile bs=1024k count=512
chmod 600 /swapfile
mkswap /swapfile
- Enable swap
swapon /swapfile
swapon /dev/sda2
- Disable swap
swapoff /swapfile
swapoff /dev/sda2
- Check swap resource usage
cat /proc/swaps
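The /proc/swaps table can be totalled with a short filter. This is a sketch that assumes the usual five-column layout (Filename, Type, Size, Used, Priority) with sizes in KiB; swap_summary is a hypothetical helper name.

```shell
# Sum the Size and Used columns of /proc/swaps-style input on stdin.
swap_summary() {
    awk 'NR > 1 { size += $3; used += $4 }    # skip the header line
         END    { printf "%d KiB total, %d KiB used\n", size, used }'
}

# Only meaningful on Linux, so guard the live call.
if [ -r /proc/swaps ]; then
    swap_summary < /proc/swaps
fi
```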
Filesystems
- ext2
- Old Linux non-fragmenting fast filesystem
- Can be converted to ext3 by adding a journal:
tune2fs -j /dev/sda1
- ext3
- Journaling “eliminates” need for fsck
- ext4
- Current common Linux filesystem
- Big files (16TB)
- Extents (range of contiguous physical blocks)
- 34-bit seconds + nanosecond timestamps
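The effect of widening the seconds field can be checked with integer arithmetic. A minimal sketch, assuming a shell with 64-bit arithmetic and an average year of 365.25 days; span_years is a hypothetical name.

```shell
# Whole years representable by a seconds counter of the given width.
# usage: span_years BITS
span_years() {
    echo $(( (1 << $1) / 31557600 ))    # 31557600 s = 365.25 days
}

span_years 32    # classic 32-bit timestamp
span_years 34    # ext4's 34-bit seconds field
```

Two extra bits quadruple the representable span.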
Other Filesystems
- tmpfs, ramfs: all in memory
- vfat, ntfs: Windows
- exFAT: flash drive (spreads out the work)
- hfs: Mac OS
- procfs: /proc
- cramfs, squashfs: Read-only compressed file systems
- ISO9660: CD-ROM & DVD-ROM disks
- UDF: CD-RW & DVD-RW
Mounting
# mount /dev/sda1 /mnt
# df -h /mnt
- Automatic mounting
- Unmount
umount /dev/sda1
- Cannot unmount a volume in use.
fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>   <options>  <dump> <pass>
proc            /proc           proc     defaults   0      0
/dev/hdc1       /               ext3     defaults   0      1
/dev/hdc5       /win            vfat     user,rw    0      0
/dev/hdc7       none            swap     sw         0      0
/dev/hdc8       /var            ext3     defaults   0      2
/dev/hdc9       /home           ext3     defaults   0      2
/dev/hda        /media/cdrom0   iso9660  ro,user    0      0
/dev/fd0        /media/floppy0  auto     rw,user    0      0
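The sixth field (pass) controls which filesystems are checked at boot and in what order: 0 means never, 1 is for the root filesystem, 2 for everything else. A sketch of a filter that lists them, assuming whitespace-separated six-field entries; fsck_order is a hypothetical name.

```shell
# List "pass mountpoint" for fstab entries that get checked at boot,
# reading fstab-format text on stdin.
fsck_order() {
    awk 'NF >= 6 && $1 !~ /^#/ && $6 > 0 { print $6, $2 }' |
        sort -k1,1n -k2
}

# Only run against the live file where it exists.
if [ -r /etc/fstab ]; then
    fsck_order < /etc/fstab
fi
```

For the listing above, this would report / first, then /home and /var.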
/etc/fstab first field
The first field of an /etc/fstab entry can be:
- /dev/sda1
Fails if you add another disk and the order changes.
- UUID=77f85028-b4c1-4439-be1c-5a3ba7f59dd1
Robust, but cryptic. Obtain via blkid.
- LABEL=JackHomeDir
Robust and self-documenting. Obtain via blkid; set via e2label.
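The three forms are easy to tell apart by prefix. A small sketch that classifies a first-field value; spec_kind is a hypothetical helper, and the sample values are the ones from this slide.

```shell
# Classify the first field of an fstab entry.
# usage: spec_kind FIELD
spec_kind() {
    case "$1" in
        UUID=*)  echo uuid ;;
        LABEL=*) echo label ;;
        /dev/*)  echo device ;;
        *)       echo other ;;      # e.g. proc, none, network shares
    esac
}

for spec in /dev/sda1 UUID=77f85028-b4c1-4439-be1c-5a3ba7f59dd1 LABEL=JackHomeDir; do
    echo "$spec: $(spec_kind "$spec")"
done
```

A check like this could flag fstab entries that still depend on device ordering.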
fsck: check + repair fs
- Filesystem corruption sources
- Power failure
- System crash
- Types of corruption
- Unreferenced inodes.
- Bad superblocks.
- Unused data blocks not recorded in block maps.
- Data blocks listed as free that are used in files.
- fsck can fix these and more
- Asks user to make more complex decisions.
- Stores unfixable files in lost+found
- And where is this lost+found, precisely?
Lots of filesystem flavors
# cd /sbin
# ls mkfs*
mkfs mkfs.ext2 mkfs.ext4dev mkfs.minix mkfs.reiserfs
mkfs.btrfs mkfs.ext3 mkfs.fat mkfs.msdos mkfs.vfat
mkfs.cramfs mkfs.ext4 mkfs.hfsplus mkfs.ntfs mkfs.xfs
# ls fsck*
fsck fsck.ext3 fsck.fat fsck.minix fsck.reiserfs
fsck.btrfs fsck.ext4 fsck.hfs fsck.msdos fsck.vfat
fsck.cramfs fsck.ext4dev fsck.hfsplus fsck.ntfs fsck.xfs
fsck.ext2
# ls mount*
mount.cifs mount.glusterfs mount.nfs mount.ntfs mount.ntfs-fuse
mount.fuse mount.lowntfs-3 mount.nfs4 mount.ntfs-3g mountstats