How to Build a ProxMox Hypervisor Cluster with NAS Disk

After I struggled to recover a moderately important VM on one of my home lab servers running generic CentOS libvirt, a colleague suggested I investigate ProxMox as a replacement for libvirt since it offers some replication and clustering features. The test was quick and I was very impressed with the features available in the community edition. It took maybe 15-30 minutes to install and get my first VM running. I quickly rolled ProxMox out on my other two lab servers and started experimenting with replication and migration of VMs between the ProxMox cluster nodes.

The recurring pain I was experiencing with VM hosts centered primarily on failed disks, both HDD and SSD, along with a rare processor failure. I had already decided to invest a significant amount of money in a commercial NAS (one of the major failures was an unrecoverable TrueNAS VM holding some archive files). Although a single QNAP or Synology NAS device would introduce a single point of failure for all the ProxMox hosts, I decided to start with one and see whether I could later justify the cost of a redundant QNAP. More on that in another article.

The current architecture of my lab environment now looks like this:

Figure 1 – ProxMox Storage Architecture

To reduce complexity, I chose to set up ProxMox for replication of VM guests and to allow live migration, but not to implement HA clustering yet. To support this configuration, the QNAP NAS device is configured to advertise a number of iSCSI LUNs, each with a dedicated iSCSI target hosted on the QNAP NAS system. Through trial and error testing I decided to configure four (4) 250GB LUNs for each ProxMox host. All four (4) of those LUNs are added to a new ZFS zpool, making 1TB of storage available to each ProxMox host. Since this iteration of the design does not use shared cluster-aware storage, each host has its own dedicated 1TB ZFS pool; the pool is named the same (zfs-iscsi1) on every host to facilitate replication from one ProxMox host to another. For higher performance requirements, I also employ a single SSD on each host, placed into a ZFS pool (zfs-ssd1) that is likewise named the same on each host.

A couple of notes on architecture vulnerabilities. Each ProxMox host should have dual local disks to allow ZFS RAID1 mirroring. I chose to have only a single SSD in each host to start with and to tolerate a local disk failure – replication will be running on critical VMs to limit the loss in the case of a local SSD failure. Any VM that cannot tolerate any disk failure will only use the iSCSI disks.

Setup ProxMox Host and Add ProxMox Hosts to Cluster

  • Server configuration: 2x 1TB HDD, 1x 512GB SSD
  • Download ProxMox install ISO image and burn to USB

Boot into the ProxMox installer

Assuming the new host has dual disks that can be mirrored, choose Advanced for the boot disk and select ZFS RAID1 – this will allow you to select the two disks to be mirrored

Follow the installation prompts and sign in on the console after the system reboots

  • Setup the local SSD as ZFS pool “zfs-ssd1”

Use lsblk to list the attached local disks and identify the SSD

 lsblk

sda      8:0    0 931.5G  0 disk 
├─sda1   8:1    0  1007K  0 part 
├─sda2   8:2    0     1G  0 part 
└─sda3   8:3    0 930.5G  0 part 
sdb      8:16   0 476.9G  0 disk 
sdc      8:32   0 931.5G  0 disk 
├─sdc1   8:33   0  1007K  0 part 
├─sdc2   8:34   0     1G  0 part 
└─sdc3   8:35   0 930.5G  0 part 

Clear any existing disk label and create an empty GPT

 sgdisk --zap-all /dev/sdb
 sgdisk --clear --mbrtogpt /dev/sdb

Create ZFS pool with the SSD

 zpool create zfs-ssd1 /dev/sdb
 zpool list

NAME       SIZE  ALLOC   FREE  ...   FRAG    CAP  DEDUP    HEALTH
rpool      928G  4.44G   924G          0%     0%  1.00x    ONLINE
zfs-ssd1   476G   109G   367G          1%    22%  1.00x    ONLINE

Update /etc/pve/storage.cfg and ensure the ProxMox host is listed as a node for the zfs-ssd1 pool. The initial entry lists only the first node; when another ProxMox host is added, the new host is added to the nodes list.

zfspool: zfs-ssd1
	pool zfs-ssd1
	content images,rootdir
	mountpoint /zfs-ssd1
	nodes lab2,lab1,lab3

Note that the /etc/pve files are maintained in a cluster-wide filesystem, so any edit made on one host is reflected on all other ProxMox cluster nodes.

  • Configure QNAP iSCSI targets with attached LUNs
      • Configure network adapter on ProxMox host for the direct connection to QNAP, ensure MTU is set to 9000 and speed to 2.5Gb
      • Setup iSCSI daemon and disks for creation of zfs-iscsi1 ZFS pool

      Update /etc/iscsi/iscsid.conf to set up automatic start and CHAP credentials

       cp /etc/iscsi/iscsid.conf /etc/iscsi/iscsid.conf.orig
      
      node.startup = automatic
      node.session.auth.authmethod = CHAP
      node.session.auth.username = qnapuser
      node.session.auth.password = hUXxhsYUvLQAR
      
       chmod o-rwx /etc/iscsi/iscsid.conf
       systemctl restart iscsid
       systemctl restart open-iscsi

      Validate the connection to the QNAP: ensure no sessions exist, then run discovery of the published iSCSI targets. Be sure to use the high speed interface address of the QNAP.

       iscsiadm -m session -P 3
      
      No active sessions
      
       iscsiadm -m discovery -t sendtargets -p 10.3.1.80:3260
      
      10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4
      10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4
      10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4
      10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4
      10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4
      10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4
      10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4
      10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4

      In the output of the discovery it appears there are two sets of targets. This is due to multiple network adapters under Network Portal on the QNAP being included in the targets. We will use the high speed address (10.3.1.80) for all the iscsiadm commands.
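The portal filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original setup: the function names are made up for the example, and the sample text is a shortened copy of the discovery output shown earlier. It only builds the login command strings; it does not run them.

```python
# Shortened copy of the sendtargets discovery output shown above.
SAMPLE_DISCOVERY = """\
10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4
10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4
10.3.1.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4
10.3.5.80:3260,1 iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4
"""


def parse_discovery(text):
    """Return (portal, iqn) pairs from sendtargets discovery output."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        portal_with_tag, iqn = parts
        # "10.3.1.80:3260,1" -> "10.3.1.80:3260" (drop the portal group tag)
        portal = portal_with_tag.split(",", 1)[0]
        records.append((portal, iqn))
    return records


def login_commands(records, portal_ip):
    """Build one iscsiadm login command per target on the chosen portal."""
    commands = []
    for portal, iqn in records:
        host = portal.rsplit(":", 1)[0]
        if host == portal_ip:
            commands.append(f"iscsiadm -m node -T {iqn} -p {portal} -l")
    return commands


if __name__ == "__main__":
    # Keep only the targets advertised on the high speed interface.
    for cmd in login_commands(parse_discovery(SAMPLE_DISCOVERY), "10.3.1.80"):
        print(cmd)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the output can be reviewed before being pasted into a shell on the ProxMox host.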

      Log in to each iSCSI target

       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4 -p 10.3.1.80:3260 -l
      Logging in to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4, portal: 10.3.1.80,3260]
      Login to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4, portal: 10.3.1.80,3260] successful.
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4 -p 10.3.1.80:3260 -l
      Logging in to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4, portal: 10.3.1.80,3260]
      Login to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4, portal: 10.3.1.80,3260] successful.
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4 -p 10.3.1.80:3260 -l
      Logging in to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4, portal: 10.3.1.80,3260]
      Login to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4, portal: 10.3.1.80,3260] successful.
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4 -p 10.3.1.80:3260 -l
      Logging in to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4, portal: 10.3.1.80,3260]
      Login to [iface: default, target: iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4, portal: 10.3.1.80,3260] successful.

      Verify iSCSI disks were attached

       lsblk
      NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
      sda      8:0    0 931.5G  0 disk 
      ├─sda1   8:1    0  1007K  0 part 
      ├─sda2   8:2    0     1G  0 part 
      └─sda3   8:3    0 930.5G  0 part 
      sdb      8:16   0 476.9G  0 disk 
      ├─sdb1   8:17   0 476.9G  0 part 
      └─sdb9   8:25   0     8M  0 part 
      sdc      8:32   0 931.5G  0 disk 
      ├─sdc1   8:33   0  1007K  0 part 
      ├─sdc2   8:34   0     1G  0 part 
      └─sdc3   8:35   0 930.5G  0 part 
      sdd      8:48   0   250G  0 disk 
      sde      8:64   0   250G  0 disk 
      sdf      8:80   0   250G  0 disk 
      sdg      8:96   0   250G  0 disk
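Device letters such as sdd through sdg are assigned when the disks attach and may differ between hosts or boots, so it can help to pick the new LUNs by size and lack of partitions rather than by name. The sketch below assumes the JSON shape produced by `lsblk -J -b -o NAME,SIZE,TYPE`; the function name and the sample data are hypothetical and only mirror the listing above.

```python
import json


def unpartitioned_disks(lsblk_json, size_bytes):
    """Return /dev paths of whole disks of the given size with no partitions."""
    devices = json.loads(lsblk_json)["blockdevices"]
    return [
        "/dev/" + dev["name"]
        for dev in devices
        if dev.get("type") == "disk"
        and int(dev["size"]) == size_bytes   # size may be a number or a string
        and not dev.get("children")          # partitions appear as children
    ]


GIB = 1024 ** 3

# Hypothetical sample mirroring the lsblk listing above (sizes in bytes).
SAMPLE = json.dumps({
    "blockdevices": [
        {"name": "sda", "size": 1000204886016, "type": "disk",
         "children": [{"name": "sda1", "size": 1031168, "type": "part"}]},
        {"name": "sdb", "size": 512110190592, "type": "disk",
         "children": [{"name": "sdb1", "size": 512100000000, "type": "part"}]},
        {"name": "sdd", "size": 250 * GIB, "type": "disk"},
        {"name": "sde", "size": 250 * GIB, "type": "disk"},
        {"name": "sdf", "size": 250 * GIB, "type": "disk"},
        {"name": "sdg", "size": 250 * GIB, "type": "disk"},
    ]
})

if __name__ == "__main__":
    print(unpartitioned_disks(SAMPLE, 250 * GIB))
```

On a live host the JSON text would come from running lsblk (for example through `subprocess.run` with `capture_output=True`) instead of the sample string.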

      Create GPT label on new disks

       sgdisk --zap-all /dev/sdd
       sgdisk --clear --mbrtogpt /dev/sdd
       sgdisk --zap-all /dev/sde
       sgdisk --clear --mbrtogpt /dev/sde
       sgdisk --zap-all /dev/sdf
       sgdisk --clear --mbrtogpt /dev/sdf
       sgdisk --zap-all /dev/sdg
       sgdisk --clear --mbrtogpt /dev/sdg

      Create ZFS pool for iSCSI disks

       zpool create zfs-iscsi1 /dev/sdd /dev/sde /dev/sdf /dev/sdg
       zpool list
      
      NAME       SIZE  ALLOC   FREE  ...   FRAG    CAP  DEDUP    HEALTH
      rpool      928G  4.44G   924G          0%     0%  1.00x    ONLINE
      zfs-ssd1   476G   109G   367G          1%    22%  1.00x    ONLINE
      zfs-iscsi1 992G   113G   879G          0%    11%  1.00x    ONLINE

      Set up automatic login on boot for iSCSI disks

       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-0.5748c4 -p 10.3.1.80 -o update -n node.startup -v automatic
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-1.5748c4 -p 10.3.1.80 -o update -n node.startup -v automatic
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-2.5748c4 -p 10.3.1.80 -o update -n node.startup -v automatic
      
       iscsiadm -m node -T iqn.2005-04.com.qnap:ts-873a:iscsi.lab1-3.5748c4 -p 10.3.1.80 -o update -n node.startup -v automatic

      Update /etc/pve/storage.cfg so the zfs-iscsi1 ZFS pool shows up in the ProxMox GUI. The initial entry lists only the first node; when another ProxMox host is added, the new host is added to the nodes list.

      zfspool: zfs-iscsi1
      	pool zfs-iscsi1
      	content images,rootdir
      	mountpoint /zfs-iscsi1
      	nodes lab2,lab1,lab3

      Next I will cover the configuration of VMs for disk replication across one or more ProxMox hosts.

      Building an Irrigation Power Controller

      To complement the Raspberry Pi based Garden Controller I’ve designed and built, I decided to move the irrigation valve and pump control into a separate project and circuit board in order to make it more generically applicable. My intent is to make this Power Controller useful for any designer that has a computer capable of I2C communication.

      Since this new Power Controller will be a generic controller that uses I2C communication, provides 24VAC for irrigation valves and also controls external power relays for two pumps, it needs to provide all its own power rather than relying only on a 3.3v feed from the I2C host. I decided this will be the initial feature set:

      • Overall
          • Control of and power for five (5) standard 24VAC irrigation valves
          • Control of two (2) external 12VDC relays to drive 120VAC pumps (or any supply voltage controlled by the 12VDC relays)
          • Three (3) general purpose I/O lines, programmable for input or output
          • Optional AC frequency sense output to host computer
      • Inputs
          • 1x 120VAC xxA
          • 1x 3.3v I2C control bus (SCL, SDA)
          • 3x 5v generic I/O lines
      • Outputs
          • 5VDC 200mA max external
          • 12VDC 26mA max external
          • 5x 24VAC direct drive for valves
          • 2x 12VDC external relay for pump control
          • 3.3v interrupt line to host computer for AC frequency sensing

      3D PCB View

      PowerController v2.4.1 Board

      Design and Prototype

      I used KiCad schematic and PCB design tools to begin building a dedicated power controller board. I found there was a trade-off between having all of the switching and power functions on a single board versus moving the 120VAC switching off board to dedicated power relays. Given the potentially high AC current required for switching pumps, I found using external relays allowed me to substantially reduce the PCB size and the width of traces. At the same time, using a transformer that could supply 24VAC directly to the irrigation valves would enable a compact design with minimal external components to the controller board. To offer maximum flexibility, I chose to add the ability to use a small number of the GPIO lines for either input or output along with a modest amount of external 3.3v, 5v and 12v power. The GPIO controller I chose was the MCP23017, which includes a built in I2C interface and two banks of 8 individually addressable I/O lines.

      Because the best valve relay choice and most of the external sensors I wanted to use operate at 5v, I chose to drive the MCP23017 at 5v, although it seems possible to drive it at 3.3v as well. For the I2C and interrupt line interfaces to the host I included 3.3v – 5v level converters.

      v2.4.1 Schematic

      For each of the relays we don’t need more than x mA so simple NPN 2N2222 transistors can be used to switch the relay coil voltage. Despite opting to use through-hole components versus surface mount so this board is easier to make for any hobbyists who use this design, I did want to reduce the component count and footprint where possible. Two areas include any pull-up or limiting resistors and all the kick-back diodes on all the switching transistors. DIP and SIP packages reduce both component count and board real estate.

      Verification that the kick-back diodes are working as intended

      Power Supply Design

      Since the Power Controller needs to provide power for the irrigation valves and external pump control relays, I started with the valve requirements: 24VAC must be available to drive up to five (5) valves. I chose to use professional grade RainBird x valves since I want reliable operation given the extensive irrigation I’m choosing to install. Given the in-rush and holding current required for each valve, the Hammond 183K24 transformer will suit given its 56VA maximum power rating.

      Power Supply Specifications

      • Hammond 183K24 transformer 56VA max, 24VAC 2.33A max, 12VAC 4.66A max
      • 5VDC 400mA max 2W (200mA internal max, 200mA external max)
      • 12VDC  167mA max 2W (140mA internal max, 26mA external max)
      • 24VAC  1.5A at 5 valves, leaves 0.83A
      • Valve requirement is 0.3A in-rush, 0.23A holding current (5 valves = 1.5A in-rush, 1.15A holding current)
      • Valve power 24VAC x 1.5A = 36W (27.6W holding)
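The 24VAC figures in this list can be reproduced with a short calculation. The sketch below only restates the numbers given above (56VA transformer, 0.3A in-rush and 0.23A holding current per valve); the function and key names are made up for the example.

```python
TRANSFORMER_VA = 56.0     # Hammond 183K24 rating from the list above
VALVE_VOLTAGE = 24.0      # VAC supplied to the valves
VALVE_INRUSH_A = 0.30     # in-rush current per valve
VALVE_HOLDING_A = 0.23    # holding current per valve


def valve_budget(valves):
    """Return current and power figures for a number of active valves."""
    max_current = TRANSFORMER_VA / VALVE_VOLTAGE
    inrush = valves * VALVE_INRUSH_A
    holding = valves * VALVE_HOLDING_A
    return {
        "max_current_a": round(max_current, 2),
        "inrush_a": round(inrush, 2),
        "holding_a": round(holding, 2),
        "headroom_a": round(max_current - inrush, 2),
        "inrush_w": round(inrush * VALVE_VOLTAGE, 1),
        "holding_w": round(holding * VALVE_VOLTAGE, 1),
    }


if __name__ == "__main__":
    for name, value in valve_budget(5).items():
        print(f"{name}: {value}")
```

Calling `valve_budget(2)` gives the corresponding figures for two active valves.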

      Output (Internal)

      5VDC 400mA max

        ..  each relay 40mA .. 200mA all on x 5 VDC = 1W

        ..  200mA max external supply x 5 VDC = 1W (2W total)

      Output (External)

      5x  24VAC xx A – direct drive for 24VAC valves

      2x  12VDC xx A – indirect drive for external 12VDC 120VAC relays

      1x  3.3v AC sense interrupt line

      3x  5v I/O lines programmable input or output

      External 120VAC relays

      Tnisesm 2PCS Power Relay DC12V Coil, 30A SPDT(1NO 1NC) 120 VAC with Flange Mounting and 10 Quick Connect Terminals Wires Mini Relay NT90-DC12V-10X

        ..  70mA per coil .. 140mA x 12 VDC = 1.68W

      Maximum power dissipation on LM340 without heatsink at 50C .. 2W

      Other power dissipation guides: 10W enclosed, 20W vented with no heat sinks

      Recommend no more than two (2) valves active simultaneously to limit dissipation in the enclosure that houses the PowerController.  Pumps can be run simultaneously with valves due to low current draw.

      Printed Circuit Board Design

      I imported the schematic design into the KiCad PCB Editor and was quickly able to set up board dimensions, component layout and solder zones. I chose a two layer board to keep costs down, even though I chose to use a PCB manufacturer stateside. OSHPark is a great organization to work with that produces top quality boards. General design guidelines I followed include assigning hot or neutral to each of the two layers as well as north-south or east-west paths. Even following those guidelines I did need to use a small number of vias to route a path across a layer’s traces.

      Trace view of printed circuit board

      I relied heavily on the KiCad PCB Editor 3D view feature to help validate component placement on the circuit board. Most of the components were sourced from Mouser Electronics, which has links to component symbols and footprints usable with most electronic design software (EDS) packages including KiCad. Where there wasn’t a component library available, you can typically request that a part library be created. SamacSys will generate a zip file that contains the symbol, footprint and even a 3D model.

      Assembly

      Assembly of the board started with all the low profile components to make it easier to lay flat and hold the components in place while soldering.

      Testing

      Once all the low power components were installed I ran some simple bench tests by feeding 3.3v and 5v to the board via the I/O block intended to provide power output once the board is completed. Only the passive components were plugged into their respective sockets so voltage tests and relay tests could be done.

      Both Valve 1 and Pump 1 were successfully activated via the manual test buttons. Since there wasn’t any high voltage or on-board power yet, I determined the Valve 1 test was successful by running a resistance check from V1+ terminal screw to a 24VAC FUSED pad and V1- terminal screw to a LV NEUT pad. All the other terminal screws were open/off to both their respective 24VAC and LV NEUT feeds.

      Pump 1 activated by push button successfully provided 12VDC to the Pump 1 screw terminals.

      Active Component Testing

      I removed all power and installed the MCP23017 IC, then connected the 3V3, SDA, SCL and GND terminal block pins to the test Raspberry Pi. I also ran 5VDC from the Raspberry Pi to the 5VDC I/O terminal block since the power supply components are not soldered in yet.

      Powered on the Raspberry Pi and saw the 3V3 and 5V LEDs light up. Ran an i2cdetect to test connection to the PowerController:
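A similar presence check can be done from Python. The sketch below is an illustration only: it probes the eight addresses an MCP23017 can be strapped to (0x20 to 0x27) by attempting a one byte read, and it uses a small stand-in bus class so it runs without hardware. On a Raspberry Pi the stand-in would be replaced by `smbus2.SMBus(1)`, assuming the third-party smbus2 package is installed and the board is on bus 1.

```python
class FakeBus:
    """Stand-in for a real I2C bus so the sketch runs without hardware."""

    def __init__(self, present):
        self.present = set(present)

    def read_byte(self, address):
        # A real bus raises OSError when no device acknowledges the address.
        if address not in self.present:
            raise OSError("no acknowledge from address 0x%02x" % address)
        return 0


def probe(bus, address):
    """Return True if a device acknowledges a read at the given address."""
    try:
        bus.read_byte(address)
        return True
    except OSError:
        return False


def scan(bus, addresses=range(0x20, 0x28)):
    """Probe the MCP23017 address range and return the addresses that answer."""
    return [addr for addr in addresses if probe(bus, addr)]


if __name__ == "__main__":
    # Pretend a board is strapped to 0x24, the control script's default address.
    found = scan(FakeBus({0x24}))
    print([hex(addr) for addr in found])
```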

      Power Controller Software

      I chose to use the MCP23017 integrated GPIO controller due to the inclusive and compact hardware (simple single 28 pin DIP) and the extensive Python libraries available to simplify coding for status and control.

      At github.com/allenpomeroy/PowerController I have developed a Python based control script powercontroller2.py

      powercontroller2.py
      --i2caddress: I2C address of the PowerController board, default 0x24
      --testcount: Number of test cycles to run, default 3
      --testontime: On time for tests (sec), default 1
      --syslog: Send syslog status messages
      --testofftime: Off time for tests (sec), default 1
      --verbose: Print progress messages
      --relay: ['valve1', 'valve2', 'valve3', 'valve4', 'valve5', 'pump1', 'pump2', 'test', 'all']
          Name of relay to operate on
      --action: ['on', 'off']
          Action to perform on relay. Note relay 'all' can only accept action 'off'
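For readers who want to see how an option set like this maps onto Python's argparse, here is a minimal sketch. It is not the code from the repository: the program name, helper names and the choice to accept the address in hex or decimal are assumptions made for illustration. Only the option names, defaults and the rule about relay 'all' come from the list above.

```python
import argparse

RELAYS = ["valve1", "valve2", "valve3", "valve4", "valve5",
          "pump1", "pump2", "test", "all"]


def int_auto(text):
    """Parse an integer given as decimal or with a 0x prefix."""
    return int(text, 0)


def build_parser():
    # "powercontroller_sketch" is a placeholder name for this example.
    parser = argparse.ArgumentParser(prog="powercontroller_sketch")
    parser.add_argument("--i2caddress", type=int_auto, default=0x24,
                        help="I2C address of the PowerController board")
    parser.add_argument("--testcount", type=int, default=3,
                        help="number of test cycles to run")
    parser.add_argument("--testontime", type=float, default=1,
                        help="on time for tests (sec)")
    parser.add_argument("--testofftime", type=float, default=1,
                        help="off time for tests (sec)")
    parser.add_argument("--syslog", action="store_true",
                        help="send syslog status messages")
    parser.add_argument("--verbose", action="store_true",
                        help="print progress messages")
    parser.add_argument("--relay", choices=RELAYS,
                        help="name of relay to operate on")
    parser.add_argument("--action", choices=["on", "off"],
                        help="action to perform on relay")
    return parser


def parse_args(argv=None):
    """Parse arguments and enforce that relay 'all' is never switched on."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.relay == "all" and args.action == "on":
        parser.error("relay 'all' can only accept action 'off'")
    return args


if __name__ == "__main__":
    print(parse_args(["--relay", "valve1", "--action", "on", "--verbose"]))
```

Checking the 'all' rule after parsing keeps the restriction in one place and lets argparse report it with the normal usage message.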

      While there is no limitation in the control script to prohibit multiple relays being engaged at the same time, it is recommended to have no more than two (2) valves active simultaneously to limit heat dissipation in the enclosure that houses the PowerController.  Both pumps can be run simultaneously with valves due to the low current draw of the control relays.

      Nepal Trek Spring 2022 Tsum Valley

      My brother John and I have joined a group of fellow trekkers that are undertaking a charity trip to the Tsum Valley in Nepal.

      Help for the Tsum Valley

      Our friends at the Compassion Project have visited Nepal and the Tsum Valley many times and have written books on the wonderful people and locale.  Coming to understand the challenges of training and retaining health care workers and teachers in the Tsum area resulted in our group deciding to not only do a spring 2022 trek through the Tsum Valley but also to raise funds to improve both health care and education opportunities.  Lacking viable healthcare and education forces Tsum residents to travel to larger towns or even to Kathmandu.  The travel is long, expensive and has long lasting negative effects on the Tsum community.  We have an opportunity to provide substantial help and improve the conditions of these Tsum residents.

      We are raising funding for two primary goals:  Improve healthcare and also education opportunities for the Tsum Valley.  Specifically, healthcare funds go to: (1) purchasing medicine and medical supplies, and (2) nurse’s salary.  Education funds go to: (1) teacher’s salary, (2) school supplies, and most important (3) our hot lunch program where every student gets a healthy hot lunch at school.  It is difficult for the Tsum valley to recruit and retain both healthcare workers and teachers due to the rural nature.  This fundraising effort will encourage education of Tsum locals that wish to remain/return to Tsum, helping to improve the community.

      We have set up a fundraising site on CanadaHelps to direct funds to the Compassion Project.

      https://www.canadahelps.org/en/pages/nepal-trek-spring-2022/

      Annual Compassion Health Expenses (USD)
      Services
      Staff (2x Health workers, 1x Office)     $9,525
      Medicine                                 $4,500
      Office                                   $1,300
      Transportation                           $1,500
      Monastery Care Taker                     $4,150
      Total                                   $20,975

      Annual Compassion Education Expenses (USD)
      Teacher                                  $2,253
      Hot Lunch Program                        $7,040
        (352/yr * 18 Students + staff)
      Cook / Grounds Manager                   $1,950
      School Supplies                          $1,000
      Total                                   $12,243
      Compassion Project Education
      Compassion Project Medical

      Fund raising goal: $8,000 Current funds raised: $2,850

      About Tsum Valley

      Tsum Valley is a sacred Tibetan Buddhist region and one of the hidden gems of Nepal. With the stunning backdrop of the Sringi, Ganesh and Buddha Himal mountain ranges, this serene valley is rich in ancient art, culture and religion. It is home to unique and important monasteries, and its trails are lined with artistic chortens and mani walls made of stone slabs inscribed with Buddhist prayers.

      Trek Picture Gallery

      Trek Timeline

      Trek Day 1

      Trek Day -1

      We tried very hard to stay up and awake until at least 9:30 so we could adjust and were moderately successful. Didn’t wake up until 2am then realized it was unexpectedly quiet and peaceful. No street noise, no crowds, no air conditioners. Turns out we were on the far side of the Stupa from main Kathmandu and surrounded by the monastery that also houses the local Kathmandu Lama, so it was very peaceful. Helps when you can look on the Lama’s residence from your hotel!

      Hit a local Momo (local appetizer like a dumpling) cafe for a quick lunch bite before visiting the Durbar Square next to several temples and the Palace Museum.

      The temples and the palace sustained significant damage in the 2015 earthquake and several countries have provided funding and technical expertise to help repair the damage to these ancient buildings. Repair of the ancient intricately carved wood columns is being done by carefully removing the damaged sections and replacing them with pieces hand carved to match the original parts.

      Trek Day -2

      The 31 hours of travel from Texas to Kathmandu was not as rough as I expected. Qatar Airways was a refreshing experience even though I have used business class internationally before. Doha Hamad International Airport was spacious and modern. The lounge was a pleasant upgrade and made the six hour layover go quickly. After a chaotic but simple immigration process we were met by Tanzin and Tashi within the huge throng of arrivals, piled into the car and headed off into the morning Kathmandu traffic.


      The hotel is close to the largest Stupa temple in Kathmandu. After walking around the Boudha Stupa we got a car to Patan to see the temples and palace. Watching our skilled driver navigate and negotiate the massive traffic was almost as fun as seeing the temples. In the palace there was a display of a couple dozen pictures of the various valleys and the devastating impact of climate change.

      Trek -1 Week

      • Final pack! We ship out this week. Starting to get real. Looking forward to getting to Kathmandu. Even though we’ll be there a couple days ahead of the crew, we have lots lined up to see.

      Trek -4 Weeks

      • Get final vaccines and obsess over equipment

      Trek -5 Weeks

      • Finalize equipment and trial pack
      • Realize I’m probably not training enough

      Trek -3 Months

      https://www.cleverhiker.com/blog/nepal-backpacking-gear-checklist-teahouse-trekking

      • Purchase most of my equipment (go REI and thank goodness for member dividends!)
      • Start pack and hill training
      • Obtain CDC recommended vaccinations

      Trek -4 Months

      • Made the decision to join the Compassion Project Nepal Trek Spring 2022
      • Purchase airfare
      • Initial video conference calls with the organizer and trek group (10 of us)

      Quotes

      Quotes that I’ve collected:

      • Rule 2: Stop getting distracted by things that have nothing to do with your goals. – Robert Downey Jr.
      • In God we trust. All others we verify. – US Air Force
      • Making a decision to live congruently with your values is not quitting. – Tony Robbins
      • Close some doors not because of pride, incapacity or arrogance, but simply because they no longer lead somewhere. – Paulo Coelho
      • Patience. Trust. Knowledge. Wisdom. Balance. Kanji tattoos I have and attributes I strived for during my MSc degree … and in life currently. – Allen Pomeroy
      • Workaround: Don’t pound on the mouse like a wild monkey. – Sun Microsystems bug ticket
      • Definition of a security consultant: One who won’t pass the buck … because he or she will refuse to accept it to begin with. – Jeff N
      • Securing an environment of Windows platforms from abuse – external or internal – is akin to trying to install sprinklers in a fireworks factory where smoking on the job is permitted. – Gene Spafford (to organizers of a workshop on insider misuse)
      • Security is always excessive until it’s not enough. – Robbie Sinclair, Head of Security, Country Energy, NSW Australia
      • Our greatest glory is not in never failing, but in rising every time we fall. – Confucius
      • Always tell the truth. Then you don’t have to remember anything. – Mark Twain

      Executing an Effective Security Program

      In today’s global Internet connected and reliant IT environment, the compromise of corporate networks is a fact. Defense in depth is still an important design pattern, but organizations with even relatively mature capabilities are relying on detection since prevention is simply not enough anymore. Whereas several years ago we used to speak about preventing attacks on externally facing applications through coding flaws that lead to SQL Injection and buffer overflow, successful attackers have now moved on to the weakest link: users. Compromise of user credentials now comprises 96% of the successful attacks on organizations. Why go through the brute force and difficult path of application compromise when the attackers can simply conduct a successful spear phishing attack on individuals in the organization?

      This is where advanced detection comes in. User and Entity Behavior Analysis leads to high quality alerts regarding anomalous behavior exhibited by accounts where the user has been successfully compromised. The same detection capability exists for users that are exceeding their authority, typically classed as Insider Threat. The machine learning can also detect systems (entities) that are behaving in a way that is antithetical to their normal behavior. Think of Point of Sale or healthcare Internet of Things devices that have been compromised, where there aren’t specific user identities that can be used to profile normal behavior.

      Whatever technologies are deployed, the foundation must be a sound information security program that puts policies, standards, guidelines and procedures in place that authorize and support the controls. The Security, Cyber, and IA Professionals (SCIAP.org) group have pulled together a concise document that outlines how to build an Effective Security Program.

      Installation notes for ArcSight ESM 6.9.1 on CentOS 7.1

      Installation of HPE ArcSight Enterprise Security Manager (ESM) 6.9.1 on CentOS 7.1 is substantially easier with engineering adding a “pre-installation” setup script to this version.  For a smooth installation, there are still a few steps we need to take .. outlined below.

      1. Base install of CentOS 7.1, minimal packages but add Compatibility Libraries. Be sure you use the CentOS-7-x86_64-Minimal-1503-01.iso revision since more recent releases of CentOS have other quirks that may make the ESM install or execution fail. Ensure /tmp has at least 5GB of free space and /opt/arcsight has at least 50GB of usable space – I’d suggest going with at least:
        • /boot – 500MB
        • / – 8GB+
        • swap – 6GB+
        • /opt – 85GB+
      2. Ensure some needed (and helpful) utilities are installed, since the minimal distribution does not include these and unfortunately the ESM install script just assumes they are there .. if they aren’t, the install will eventually fail.
        • yum install -y bind-utils pciutils tzdata zip unzip
        • Edit /etc/selinux/config and disable (or set to permissive) .. the CORR storage engine install will fail with “enforcing” mode of SElinux.  I’ll update this at some point with how to leave SElinux in enforcing mode.
        • Disable the netfilter firewall (again, at some point I’ll update this with the rules needed to leave netfilter enabled).
        • systemctl disable firewalld;  systemctl mask firewalld
        • Install and configure NTP
        • yum install -y ntpdate ntp
        • (optionally edit /etc/ntp.conf to select the NTP servers you want your new ESM system to use)
        • systemctl enable ntpd; systemctl start ntpd
        • Edit /etc/rsyslog.conf and enable forwarding of syslog events to your friendly neighborhood syslog SmartConnector (optional, but otherwise how do you monitor your ESM installation?) .. you can typically just uncomment the log handling statements at the bottom of the file and fill in your syslog SmartConnector hostname or IP address. Note the forward statement I use only has a single at sign – indicating UDP versus TCP designated by two at signs:
        • $ActionQueueFileName fwdRule1 # unique name prefix for spool files
          $ActionQueueMaxDiskSpace 1g   # 1gb space limit (use as much as possible)
          $ActionQueueSaveOnShutdown on # save messages to disk on shutdown
          $ActionQueueType LinkedList   # run asynchronously
          $ActionResumeRetryCount -1    # infinite retries if host is down
          # remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
          #*.* @@remote-host:514
          *.* @10.10.10.5:514
        • Restart rsyslog after updating the conf file
        • systemctl restart rsyslog
        • Optionally add some packages that support troubleshooting or other non-ESM functions you run on the ESM server, such as system monitoring
        • yum install -y mailx tcpdump
      3. Untar the ESM distribution tar ball, ensure the files are owned by the “arcsight” user, then run the Tools/prepare_system.sh to adjust the maximum open files and other requirements that we used to manually update in previous releases.  NOTE: in 6.9.1 there are some previous “shadow” requirements that are now enforced (eg. you don’t get to change) .. such as the application owner account must be “arcsight”, the installation directory must be “/opt/arcsight”.  The “prepare_system.sh” script will check to see if there already is an “arcsight” user and if not, will create it.  I usually manually create all the common users on my various systems since I want them to have the same uid / gid across all my systems.
      4. Run the Tools/prepare_system.sh script as “root” user
        • cd Tools
        • ./prepare_system.sh
      5. Run the ESM install as the “arcsight” user
        • ./ArcSightESMSuite.bin
      6. Download content from the HPE ArcSight Marketplace at https://saas.hpe.com/marketplace/arcsight
      7. Install your ESM 6.9.1 console on Windows, Linux or Mac OS X .. although the web interface is much richer in the last couple releases, you’ll still need to use the console for content creation and editing.
      8. Optionally extend the session timeout period for the web interface.  There still isn’t an easy setting to do this in the GUI, so get into command line on your ESM server and edit or add the following lines .. which indicate the timeout period in seconds.  The default is around five (5) minutes. You should be able to edit these configuration files as the “arcsight” user, but I typically restart the services as “root”.
        • Edit /opt/arcsight/manager/config/server.properties
        • service.session.timeout=28800
        • Edit /opt/arcsight/logger/userdata/logger/user/logger/logger.properties
        • server.search.timeout=28800
        • Restart the ESM services .. I typically run this as “root”
        • /etc/init.d/arcsight_services stop
        • /etc/init.d/arcsight_services start
      9. Optionally configure the manager to display a static banner at the top of each console interface so you can have multiple consoles open and know what manager each is connected to (cool!):
        • Edit /opt/arcsight/manager/config/server.properties and add server.staticbanner.* properties (backgroundcolor, textcolor, text). Both backgroundcolor and textcolor take black, blue, cyan, gray, green, magenta, orange, pink, red, white, yellow as acceptable arguments. Text is the identifier you would like that manager to display, such as “super-awesome-production-box”
        • server.staticbanner.textcolor=green
        • server.staticbanner.backgroundcolor=black
        • server.staticbanner.text=esm691
        • Restart the ESM manager service .. I typically run this as “root”
        • /etc/init.d/arcsight_services stop manager
        • /etc/init.d/arcsight_services start manager
      10. If you are going to install any SmartConnectors on the system hosting your Enterprise Security Manager, check out my post regarding required libraries for CentOS and RedHat, before you try to run the Linux SmartConnector install. This includes any Model Import Connectors (MIC) or forwarding connectors (SuperConnectors).
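
For step 9, a mistyped color name is easy to miss until after the manager has been restarted. The following is a minimal Python sketch, not part of ESM, that checks the colors against the list given in step 9 and prints the three property lines to paste into server.properties. The function name banner_properties is my own invention for illustration.

```python
# Colors listed in step 9 as acceptable for backgroundcolor and textcolor.
ALLOWED_COLORS = {
    "black", "blue", "cyan", "gray", "green", "magenta",
    "orange", "pink", "red", "white", "yellow",
}


def banner_properties(text, textcolor, backgroundcolor):
    """Return the three server.staticbanner.* lines for server.properties."""
    for name, color in (("textcolor", textcolor),
                        ("backgroundcolor", backgroundcolor)):
        if color not in ALLOWED_COLORS:
            raise ValueError("unsupported %s: %r" % (name, color))
    return [
        "server.staticbanner.textcolor=%s" % textcolor,
        "server.staticbanner.backgroundcolor=%s" % backgroundcolor,
        "server.staticbanner.text=%s" % text,
    ]


if __name__ == "__main__":
    # Same values as the example in step 9.
    print("\n".join(banner_properties("esm691", "green", "black")))
```

The helper only builds the lines; editing the file and restarting the manager service remain manual steps as described above.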

      BlockSync Project

      Welcome to the BlockSync Project

      This project aims to offer an efficient way to provide mutual protection from actors deemed malicious that attack Internet facing servers. The result will be an open source set of communication tools that use established protocols for high speed and light weight transmission of attacker information to a variable number of targets (unicasting to a possibly large number of hosts).

      Background

      There are many open source firewall technologies in widespread use, most based on either packet filter (pf) or netfilter (iptables). There is much technology that provides network clustering (for example, OpenBSD’s CARP and pfsync; netfilter; corosync and pacemaker), however it’s difficult for disparate (loosely coupled) servers to communicate the identity of attackers in real time to a trusted community of (tightly coupled) peers. Servers or firewalls that use state-table replication techniques, such as pfsync or netfilter, have a (near) real-time view of pass/block decisions other members have made. There needs to be a mechanism for loosely coupled servers to share block decisions in a similar fashion.

      Our goal is to create an open source tool for those of us that have multiple Internet facing servers to crowd source information that will block attackers via the firewall technology of choice (OpenBSD/FreeBSD pf/pfSense, iptables, others).
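
To make the "firewall technology of choice" idea concrete, here is a rough Python sketch of how one shared block decision might be turned into a command for either pf or iptables. This is not BlockSync code (the project files are not public); the function name block_command and the pf table name "blocksync" are hypothetical, and the sketch only builds the command line rather than running it.

```python
import ipaddress


def block_command(backend, attacker_ip, pf_table="blocksync"):
    """Build (but do not run) the argv that blocks attacker_ip on one backend."""
    ip = ipaddress.ip_address(attacker_ip)  # raises ValueError on bad input
    if backend == "pf":
        # Add the address to a pf table; this assumes a block rule that
        # references the (hypothetical) table already exists in pf.conf.
        return ["pfctl", "-t", pf_table, "-T", "add", str(ip)]
    if backend == "iptables":
        # IPv6 addresses need ip6tables rather than iptables.
        tool = "ip6tables" if ip.version == 6 else "iptables"
        # Insert a DROP rule for the source address at the top of INPUT.
        return [tool, "-I", "INPUT", "-s", str(ip), "-j", "DROP"]
    raise ValueError("unknown firewall backend: %r" % backend)
```

Keeping the attacker identity separate from the firewall-specific command is what would let loosely coupled peers running different firewalls act on the same notice.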

      Project Page

      All project files are still private for now, but when we publish to GitHub or SourceForge, this section will be updated.

      Funding

      We have published a GoFundMe page to acquire more lab equipment here at gofundme.com/BlockSync

      Using the ArcSight ESM Console to Create Replay Files

      HP ArcSight Enterprise Security Manager (ESM) has some built-in capabilities to generate event files suitable for use with the ArcSight Test SmartConnector.  These replay files can be used to test functioning of new ESM content (Dashboards, Datamonitors, Filters, Rules, Queries, Trends, Reports, etc).  The Test connector has some very powerful features including the ability to replay the captured data as is, or to update the date/time stamp on each event to make the data appear as current versus historical data.  The Test connector can also run multiple replay files into its configured destinations simultaneously and at a variable rate suitable to support initial content development as well as high speed, high volume performance testing.

      Preparing to Generate Replay File

      There are multiple ways to generate replay files, but in this post we will focus on use of the ESM console application software to generate the replay file from selected events already existing in the ESM instance.  In order to constrain the events to a selected subset, we need to have a filter prepared to choose the appropriate events.

      1-ReplayFileGen  2-ReplayFileGen

      For this example, a filter named router4 will be used, where it simply selects all events that have been generated by device name router4 or device address 10.20.1.27
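
The logic of the router4 filter is a simple OR of two conditions. As an illustration only (ESM filters are built in the console, not in Python), the same condition can be sketched as a predicate over event dictionaries. The key names deviceHostName and deviceAddress are assumptions chosen for the example.

```python
def matches_router4(event):
    """True when the event came from device name router4 or address 10.20.1.27."""
    return (
        event.get("deviceHostName") == "router4"
        or event.get("deviceAddress") == "10.20.1.27"
    )


# Made-up sample events: only the first two satisfy the filter.
events = [
    {"deviceHostName": "router4", "deviceAddress": "10.20.1.27"},
    {"deviceHostName": "unknown", "deviceAddress": "10.20.1.27"},
    {"deviceHostName": "router9", "deviceAddress": "10.20.1.99"},
]
selected = [e for e in events if matches_router4(e)]
```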

      Generating the Replay File

      On the workstation or system where the ESM Console software is installed, start the replay file generator with a replayfilegen argument to the arcsight script in the bin directory.  If the console is installed on Linux or Mac OS X, simply use ./arcsight replayfilegen as the command.

      0-ReplayFileGen

      When the replayfilegen tool starts, it will display a GUI that allows the user to select the target filename to be generated, the timeframe to query and the filter to select the event data.

      3-ReplayFileGen

      Note that a relative time frame may be specified by using relative start and end time operators – these will calculate the absolute time frames needed.
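
The idea behind relative operators is plain date arithmetic: an offset from "now" is resolved into absolute start and end times when the query runs. A minimal Python sketch of that calculation follows. The function name resolve_window is hypothetical and this is not the tool's own operator syntax.

```python
from datetime import datetime, timedelta


def resolve_window(now, start_offset, end_offset=timedelta(0)):
    """Turn offsets measured back from `now` into absolute (start, end) times."""
    start = now - start_offset
    end = now - end_offset
    if start >= end:
        raise ValueError("start of the window must be earlier than its end")
    return start, end


if __name__ == "__main__":
    # "The last two hours" resolved against the current clock.
    start, end = resolve_window(datetime.now(), timedelta(hours=2))
    print(start.isoformat(), end.isoformat())
```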

      4-ReplayFileGen

      Once the collection has started, there will be a progress display showing the generation of the replay file.

      5-ReplayFileGen

      Deploying and Using the Replay File

      Now that the replay file has been generated, the user can simply copy the file to the current directory of the Test SmartConnector. There can be multiple replay files in the current directory and all will be displayed when the Test connector GUI starts.

      6-ReplayFileGen

      The user can select which replay files are to be read and events forwarded to the Test connector destinations.  Any or all of the replay files may be selected, making the Test connector ideal for assisting in content development for multiple use cases.

      7-ReplayFileGen

      Once the desired replay files are selected, the events will be replayed to the configured destinations at the rate specified by the user, as soon as the Continue button is pressed.

      8-ReplayFileGen

      The Test connector will run through all the event data in each selected replay file and stop. By default there will only be one pass through the data files and no event data is altered. ESM Manager Receipt Time will show the current date/time; however, the original timestamps will be present in the event data.  The event rate can be changed dynamically while the replay is in progress, so for example, some basic event data could be played to the destinations for some time, then the user could raise the event rate substantially to speed the event ingest to the destinations.  This is useful for testing use cases where there may be denial of service or worm outbreak detection that is sensitive to event rates.
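
To see what a mid-replay rate change does to event timing, here is a small Python sketch that computes when each event would be sent. It is only a model of the behavior described above, not how the Test connector is implemented. The name send_offsets and the (event index, events per second) input format are assumptions.

```python
def send_offsets(event_count, rate_changes):
    """Return the seconds-from-start at which each event would be sent.

    rate_changes is a list of (event_index, events_per_second) pairs saying
    from which event onward a new rate applies; index 0 must be present.
    """
    changes = dict(rate_changes)
    offsets = []
    elapsed = 0.0
    rate = None
    for index in range(event_count):
        rate = changes.get(index, rate)  # pick up a new rate if one starts here
        if rate is None or rate <= 0:
            raise ValueError("a positive rate must be set starting at event 0")
        offsets.append(elapsed)
        elapsed += 1.0 / rate            # gap until the next event
    return offsets
```

For example, send_offsets(4, [(0, 1), (2, 4)]) spaces the first events one second apart and then tightens the gap to a quarter second, which is the kind of burst a rate-sensitive rule should react to.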

      There are many run-time options that can be set for the Test Connector, including the ability to loop on the replay files, replay the event data with current time stamps and other event handling options.
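
Replaying with current time stamps amounts to shifting every event by the same amount so that the relative spacing is preserved. A minimal Python sketch of that idea, assuming the newest event should land on the current time (the connector's actual behavior may differ), with a hypothetical function name:

```python
from datetime import datetime


def shift_to_now(timestamps, now):
    """Shift timestamps so the newest equals `now`, keeping the gaps between events."""
    if not timestamps:
        return []
    delta = now - max(timestamps)
    return [ts + delta for ts in timestamps]


if __name__ == "__main__":
    original = [datetime(2015, 3, 1, 8, 0, 0), datetime(2015, 3, 1, 8, 0, 30)]
    for ts in shift_to_now(original, datetime.now()):
        print(ts.isoformat())
```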

      ESM ActiveList Import Script

      <shamelessly copied from Konrad Kaczkowski’s post on iRock>

      ESM Active List Import script – arc_import_al.py

      Version 20

      Active List import script (PYTHON) – Version 0.6

      !!!!! THIS SCRIPT DOES NOT VALIDATE CORRECTNESS OF IMPORTED CSV !!!!!

      Fixed special character encoding in active list import over XML (tested on symantec GIN source adv_ip URLs)

      Symbol   Description                               ArcSight Active List MAP in XML
      "        Double quotes (or speech marks)           &quot;
      &        Ampersand                                 \A
      ,        Comma                                     \C
      <        Less than (or open angled bracket)        \L
      >        Greater than (or close angled bracket)    \G
      \        Backslash                                 \\
      |        Vertical bar                              \|
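
The mapping above can be applied one character at a time, which avoids escaping an already escaped backslash twice. This is a minimal Python sketch of the idea and not the code from arc_import_al.py. It assumes the Comma row refers to the "," character, and the names AL_ESCAPES and escape_al_value are hypothetical.

```python
# Escape map taken from the table above; the comma row is assumed to
# refer to the "," character.
AL_ESCAPES = {
    '"': "&quot;",
    "&": "\\A",
    ",": "\\C",
    "<": "\\L",
    ">": "\\G",
    "\\": "\\\\",
    "|": "\\|",
}


def escape_al_value(value):
    """Escape one Active List field value, one character at a time."""
    return "".join(AL_ESCAPES.get(ch, ch) for ch in value)


if __name__ == "__main__":
    print(escape_al_value('http://example.test/a?x=1&y=<2>,"z"'))
```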

       

      Fixed removal of temporary files from the /tmp directory; a huge Active List could otherwise use all of the /tmp space

      Fixed verification of access to archive.log [ tree = ElementTree.parse(TEMP_FILE) …  IOError: [Errno 2] No such file or directory: ‘/tmp/AL_IN_ESM_INVALID’ ]

      Fixed TEMP_FILE access verification; if there are no write rights, a new value is generated for TEMP_FILE

      Things to add:

      • check capacity of Active List and compare to import file
      • check activelist.max_capacity and activelist.max_columns from server.properties
      • check activelist.max_capacity and activelist.max_columns from server.default.properties

      THIS SCRIPT HAS BEEN BETA TESTED on RedHat 6.5 with Python 2.6

      Test scenario at the end of post

      How does it work:

      • check if the import csv file exists
      • check connectivity with ESM (validate that it is available, that the password is correct and that the account is not locked)
      • check if the Active List exists on ESM  [ use /opt/arcsight/manager/bin/arcsight archive -action export command ]
      • check if the number of columns in the Active List is the same as the number of columns in the csv file
      • prepare xml file/files to import
      • import xml file   [ use /opt/arcsight/manager/bin/arcsight archive -action import command ]
      • if a syslog server is specified, send CEF events to the syslog server
      • if option -c was set, delete successfully imported files; otherwise rename them to *.xml.done
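
The column check and the split into import files can be pictured with a short Python sketch. It is written for Python 3 and is not the original script (which targets Python 2.6). The function name batch_rows is hypothetical, and checking the column count on every row is my own assumption about how strict the check is.

```python
import csv
import io


def batch_rows(csv_text, expected_columns, rows_per_batch):
    """Yield lists of at most rows_per_batch rows, checking the column count."""
    batch = []
    reader = csv.reader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            raise ValueError("line %d has %d columns, expected %d"
                             % (line_no, len(row), expected_columns))
        batch.append(row)
        if len(batch) == rows_per_batch:
            yield batch   # one batch corresponds to one xml file to import
            batch = []
    if batch:
        yield batch       # final, possibly shorter, batch
```

Each yielded batch would become one xml file, with the batch size playing the role of the -r parameter.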

      Execution:

      ./arc_import_al.py -r 20 -l "/All Active Lists/BCC/al_IP" -f /opt/asset_import/al_IP.csv -m ManagerName -u UserName -p UserPass -s 10.0.1.33 -P 514 -d -c

      where parameters are:

      REQUIRED

      -r 10               [ number of rows per single import ]
      -l Active List      [ active list full URI in format "/All Active Lists/customer/malware" ]
      -f filename         [ if the filename contains a space, put the filename in quotes ]
      -m ESM manager      [ HP ArcSight ESM manager FQDN ]
      -u ESM user         [ HP ArcSight ESM import user ]

      OPTIONAL

      -p ESM user pass  [ HP ArcSight ESM user password ]
      -s Syslog Server    [ Syslog server ]
      -P Syslog Port       [ Syslog server port ]
      -c                          [ clean (delete) imported files ]
      -d                          [ debugging – display detailed information from processing ]

      ADDITIONAL PARAMETERS

      -h  [ help ]
      -v  [ version ]
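
For readers who want to wrap or re-implement the same interface, the options above map naturally onto Python's argparse module. This is a sketch in Python 3 under my own assumptions and not the parser used inside arc_import_al.py. The dest names are invented, and the version string is taken from the "Version 0.6" line earlier in the post.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options of the script."""
    parser = argparse.ArgumentParser(prog="arc_import_al.py")
    # Required parameters
    parser.add_argument("-r", dest="rows", type=int, required=True,
                        help="number of rows per single import")
    parser.add_argument("-l", dest="active_list", required=True,
                        help="full Active List URI")
    parser.add_argument("-f", dest="filename", required=True,
                        help="csv file to import")
    parser.add_argument("-m", dest="manager", required=True,
                        help="ESM manager FQDN")
    parser.add_argument("-u", dest="user", required=True,
                        help="ESM import user")
    # Optional parameters
    parser.add_argument("-p", dest="password", help="ESM user password")
    parser.add_argument("-s", dest="syslog_server", help="syslog server")
    parser.add_argument("-P", dest="syslog_port", type=int,
                        help="syslog server port")
    parser.add_argument("-c", dest="clean", action="store_true",
                        help="delete imported files")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="display detailed processing information")
    # -h (help) is provided by argparse itself.
    parser.add_argument("-v", action="version", version="%(prog)s 0.6")
    return parser
```

Note that -p and -P differ only by case, so the two must be kept distinct when calling the script.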

       

      # Possible reconfiguration options:
      #
      # Location where the xml files for import are stored: line 66
      # export_dlobal_dir = “/opt/asset_import/active list
      #
      # Device interface name: line 89
      # CEF_dvc = get_ip('eth0')

       

      Test scenarios

      Test scenario 1:

      – Active List 1 [ size: 400000, columns: 4, Type: Event-based ]
      Import rows: 331776
      Batch size ( -r ) : 100000
      Time of import :
      – processing time: 20 s
      – importing: 4 x 12 s

      Test scenario 2:

      – Active List 2 [ size: 1200000, columns: 1, Type: Field-based ]
      Import rows: 1100000
      Batch size ( -r ) : 200000
      Time of import :
      – processing time: 95 s
      – importing: 6 x 45 s

      When the batch size [ -r ] was set to 300k, the import failed.
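
The number of import runs in both scenarios is simply the row count divided by the batch size, rounded up. The short Python sketch below reproduces the "4 x" and "6 x" figures and adds a rough total. The total assumes the processing step happens once and the batches are imported one after another, which is my reading of the numbers rather than something the post states.

```python
import math


def import_estimate(rows, batch_size, processing_s, per_batch_s):
    """Return (number of batches, estimated total seconds) for one import."""
    batches = math.ceil(rows / batch_size)
    return batches, processing_s + batches * per_batch_s


if __name__ == "__main__":
    print(import_estimate(331776, 100000, 20, 12))    # (4, 68)
    print(import_estimate(1100000, 200000, 95, 45))   # (6, 365)
```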

      Below is the ESM Active Channel view:

      ESM ActiveChannel

      Download arc_import_al.py

      How To Increase ArcSight ESM Command Center GUI Timeout

      In the appliance versions of most ArcSight products, there is the ability to set the user session timeout period. Typically this defaults to somewhere between five (5) and 15 minutes – good for a default but incredibly annoying for any real user.  In ArcSight Enterprise Security Manager (ESM), there is no such GUI configuration that allows modification of the user session timeout – so this is what has worked for me:

      Set ArcSight Command Center (ACC) timeout greater than 900 seconds (15 minutes) – set to 28800 seconds (8 hours)
      vi /opt/arcsight/manager/config/server.properties
      service.session.timeout=28800
      /sbin/service arcsight_services stop all
      /sbin/service arcsight_services start all

      Default is 600 seconds = 10 minutes.

      In 6.5, 6.5.1 and 6.8 you also need to add the following for the Logger interface in ESM:

      vi /opt/arcsight/logger/userdata/logger/user/logger/logger.properties
      server.search.timeout=28800
      /sbin/service arcsight_services stop all
      /sbin/service arcsight_services start all

      Default is 600 seconds = 10 minutes.
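
If you would rather script the edit than use vi, the change is "replace the line if the key exists, otherwise append it". Here is a minimal Python sketch of that logic working on the file contents as a string. The function name set_property is hypothetical, reading and writing the file is left to you, and you should back up the properties file first.

```python
def set_property(properties_text, key, value):
    """Return properties_text with key set to value, appending it if absent."""
    new_line = "%s=%s" % (key, value)
    lines = properties_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Compare only the part before the first "="; commented-out lines
        # (starting with "#") never match and are left untouched.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

The same helper covers both service.session.timeout in server.properties and server.search.timeout in logger.properties; the service restart is still required afterwards.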

      Yes, eight (8) hours may seem like a long time, so choose what is appropriate for your site.  🙂