reactiveLabs

Speeding Up Logstash Data Enrichment With Memcached

I was working on a feature to enrich some log events with extra data. In this case, the session setup event has most of the info (client, server, etc) and the subsequent events only contain info about the event and the session ID. I wanted to de-normalize all of the session info so I could facet along those dimensions in Kibana.

My first attempt used the elasticsearch filter plugin in Logstash to look up the session setup event:

elasticsearch {
  id => "populate-session-fields"
  hosts => ["notarealhost.reactivelabs.com:9200"]
  index => "index-name-*"
  query_template => "/etc/logstash/get-sessioninfo-query-template.json"
  fields => {
    "host_ip" => "host_ip"
    "server_ip" => "server_ip"
  }
}
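
For reference, here is roughly what a template like /etc/logstash/get-sessioninfo-query-template.json could look like. This is an illustrative sketch rather than the actual file: the session_id field name and the match on the setup event's message are assumptions about this particular log source.

{
  "size": 1,
  "sort": [ { "@timestamp": { "order": "desc" } } ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "session_id": "%{session_id}" } },
        { "match": { "message": "SessionEvent" } }
      ]
    }
  }
}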

Hmm, it works but is fairly slow at ~150 events per second. The Logstash pipeline analyzer shows the elasticsearch filter easily taking several orders of magnitude longer than the next slowest filter. My second try used memcached to set/get the session info.

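# Build a cache key that is unique per log source + session ID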
fingerprint {
  id => "fingerprint-source-sessionid"
  source => ["source", "session_id"]
  concatenate_sources => true
  target => "memcache_fingerprint"
  method => "SHA256"
  key => "MyProjectName"
  base64encode => true
}

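# For events other than the session setup event, fetch the cached session info first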
if [message] !~ "SessionEvent" {

  memcached {
    id => "get-memcache-sessioninfo"
    get => {
      "%{memcache_fingerprint}-host_ip" => "host_ip"
      "%{memcache_fingerprint}-server_ip" => "server_ip"
    }
  }

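  # Cache miss: fall back to the Elasticsearch lookup and cache the result for later events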
  if ![host_ip] {
    elasticsearch {
      id => "populate-session-fields"
      hosts => ["notarealhost.reactivelabs.com:9200"]
      index => "index-name-*"
      query_template => "/etc/logstash/get-sessioninfo-query-template.json"
      fields => {
        "host_ip" => "host_ip"
        "server_ip" => "server_ip"
      }
    }

    memcached {
      id => "set-memcache-sessioninfo-notfound"
      ttl => 3600
      set => {
        "host_ip" => "%{memcache_fingerprint}-host_ip"
        "server_ip" => "%{memcache_fingerprint}-server_ip"
      }
    }
  }
}
else {
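  # Session setup events carry the full session info, so pre-populate the cache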
  memcached {
    id => "set-memcache-sessioninfo-prepop"
    ttl => 3600
    set => {
      "host_ip" => "%{memcache_fingerprint}-host_ip"
      "server_ip" => "%{memcache_fingerprint}-server_ip"
    }
  }
}

Boom! Events are processing 10x faster (~1500 events per second) with minimal changes.

A couple of tips for working with this plugin:

  • Don’t use a field reference in the namespace config option; the plugin will use it verbatim and not dereference it.
  • If the key name is getting resolved incorrectly, you will start getting weird results where the wrong session info is used. You can dump the keys from memcached if you are trying to figure out if the plugin is using the name you expect.
#connect to memcached
telnet localhost 11211
#get a list of slabs
stats items
#pick one of the numbered slabs (ex: 2) and dump the first 100 items
stats cachedump 2 100

Note: Elastic recently released Logstash 6.6.0 and touts the memcached filter plugin as a new feature. Don’t call it a comeback, it’s been around for more than a year!

How to Tell if Your PERC H730 Is About to Grenade a VSAN Host

The PERC H730 has been supported by VSAN for many releases but has had a rocky road. There have been many firmware/driver revisions to fix the bugs in its implementation of pass-through HBA mode, and some of these have even been pulled by Dell/VMware after release because of critical bugs. The latest release, 25.5.5.005, seems to fix most of these. We have a fairly large estate of VSAN clusters using the H730 which we were in the process of updating, so many were still running on older firmware/driver combos.

One nasty bug that can crop up in the older revisions causes the controller to lock up, which leads VSAN to issue controller resets and eventually declare all disk groups on the host degraded. This then forces a resync of all data that was on the host, exposing you to data loss if there is a further failure and you are running FTT=1 on any VMs. The fix is simple: reboot the affected host and wait for the resync. But I’d rather avoid the whole situation in the first place. After this had happened a few times, I did a deep dive into the ESXi logs and figured out an early warning system for this bug.

Symptoms:

  • Disks begin being reset periodically. You can check for these in the DRAC Lifecycle log or the ESXi logs. Note that they do NOT show up in the regular DRAC system log, and furthermore the Lifecycle log claims these are normal events and nothing to be concerned about.

ESXi log strings to look for:

  • Fast Path Status Updates
  • Online Disk Reset – ODR is by far the most reliable indicator (see the example check after this list)
  • 0x10c
  • 0x10d
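
If you want to spot-check a host by hand rather than waiting for an alert, grepping the ESXi logs for these strings works. This is a minimal sketch, assuming the messages land in vmkernel.log on your build; adjust the path and strings for your environment.

# run in the ESXi shell: count Online Disk Reset (ODR) events in the live vmkernel log
grep -c "Online Disk Reset" /var/log/vmkernel.log
# the status codes above are also worth counting
grep -cE "0x10c|0x10d" /var/log/vmkernel.log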

We set up alerts in Log Insight, but something like Graylog would also work great for this. The alerts trigger per host when ODR events exceed a threshold of 40 in one hour or 20 over six hours. You may need to adjust for your hosts' particular number of disks/disk groups. With the warning system in place, you will usually get alerts several hours before a disk controller reset happens. This gives you plenty of time to evacuate the host of VMs, put it into maintenance mode, and reboot it. Note that if you need to do a full evacuation of VSAN data from the host, this is probably not enough time, but the short risk window introduced by the reboot is much better than the long risk window caused by the controller reset.

I wholly recommend (and Dell/VMware concur) avoiding the H730 for VSAN deployments on Dell nodes if possible and using the HBA330 instead.

KDDI vs UCOM

This year at work, we were growing and running out of seats to put people in. We found some space nearby and began fitting it out. Part of this process was figuring out how we were going to connect the main office and the expansion office.

Despite being literally next door, we were unable to run a direct fiber cable between the buildings. Microwave (Wi-Fi) was not an option either, as there was no good line of sight and building management wouldn’t allow antennas on the roof. So we were left with buying some kind of service to connect the two buildings. We set about looking for services with bandwidth between 10 and 100Mbps and latency of less than 25ms.

A note on vendor pricing: for a lot of these vendors you can’t really use the pricing they give on the website. That’s basically a “fake price” to prevent competitors from underpricing them. If you bring in the vendor to do a presentation, they will basically give you a 30-40% discount in exchange for a firmer sales lead. If you give some budgetary estimates, they may be able to reduce it further.

Types of service

Optical fiber

We found a few companies that could provide a direct fiber connection between the buildings and their NOC but the price was very high. HardEther seemed to have much lower pricing than most other options of this type. They could give you a 1Gbps line for around 180,000JPY a month. Another option is KVH.

MetroE / Wide Area Ethernet

Most of the major telcos (Softbank, KDDI, NTT) offer this type of service. They usually offer different plans, like guaranteed bandwidth, guarantee with burst, or best effort. Unfortunately, even the plans with a small amount of bandwidth were usually more expensive than the optical fiber.

VPN over Internet

Again, most of the major telcos offer this service, as well as some smaller telcos (CTC, NTTPC). Basically you get a local loop using FLETS or AU Hikari and they run a VPN tunnel over it with their equipment. They can give it to you as a layer 2 switched link or a layer 3 routed link. Bandwidth/latency are strictly best effort, but the price is much lower than the other options.

Service Selection

After evaluating each of the options, we decided to go with the KDDI VPN service. All of the competing plans were very similar, and the deciding factor for KDDI was that they published latency figures and could bundle the NTT line cost into one invoice. The VPN router KDDI provided was a Fujitsu Si-R G100, capable of several hundred megabits per second of VPN throughput.

We also decided to get an Internet line as a backup in case the KDDI line went down or had problems. There are a lot of options for Internet in Tokyo, but most of them use a FLETS line as the local loop. Since the KDDI line was already using FLETS, getting another provider on the same FLETS network as a backup wouldn’t make much sense. As our main office was already using UCOM for Internet access and we had had a good experience with them, we decided to get a 200Mbps line from them.

Results

The KDDI line was tested after putting it in and seemed to meet the performance metrics we wanted. We saw latency of around 10ms with some temporary spikes to 70ms, and throughput between 30-70Mbps. However, a few weeks after users moved into the new office, we started getting complaints about the speed of the network. Looking at the KDDI line stats over time in Smokeping, it had low latency and good throughput during off hours, but during business hours latency would shoot up, bandwidth would drop, and there was even some packet loss. If you have experience working with Windows file servers or other CIFS appliances, you know that CIFS traffic does NOT deal with high latency or packet loss very gracefully.
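
As a rough illustration of the kind of sanity check you can run on a new line (not necessarily the exact tooling we used; the host names below are placeholders), a long-running ping plus an iperf3 run between the two offices will surface both the latency spikes and the throughput ceiling:

# latency: run for a while during business hours and look at the max/stddev
ping -c 1000 gateway.mainoffice.example.com
# throughput: iperf3 server in one office, client in the other
iperf3 -s                                      # on a host in the main office
iperf3 -c iperf.mainoffice.example.com -t 60   # from the expansion office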

We tried troubleshooting with KDDI but eventually came to the conclusion it was just oversubscription of either the KDDI or FLETS network. We decided to try switching over to using VPN over the UCOM lines and got much better results. Latency was consistently under 5ms, most of the time it was less than 1ms. Bandwidth seemed to peak at around 40Mbps; we might have been able to go higher with a different router setup but this ended up being enough for our needs.

Using VPN over the UCOM lines has resulted in consistently lower latency and higher bandwidth than the KDDI line, and I recommend this approach over the carrier-run VPN solutions. If you have the in-house expertise/equipment to run VPN over UCOM lines, great! If you don’t, you can pay UCOM a bit more and they will rent you the equipment and manage it for you.


Screenmonkey

Recently at work we had a big anniversary party with many clients attending. Management wanted to show a few different videos during the event along with some slides during a speech. They also wanted it to look professional without anything showing during the transitions. I found a software package called Screen Monkey that worked well for this. It lets you control everything from the laptop screen while outputting videos/slides to the external screen.

Presentation solutions

What are some solutions for presentations?

Amateur hour

I consider these methods appropriate only for small internal meetings.

  • Duplicate your screen with the external monitor.
    Everyone sees everything, probably the worst solution unless you’re only showing one thing and have some time beforehand to prep.

  • Extend your screen, do prep work on the laptop screen, and drag windows over to the external screen.
    A little better than the above solution but still not professional looking.

  • Embed your videos in PowerPoint or Acrobat and risk the wrath of the gods.
    I have seen many people try this only to crash and burn horribly. If you do go this route, let me know so I can bring the marshmallows.

Professional

I consider these methods appropriate for anything involving large groups of people, especially clients.

Hardware

Separate video sources (usually laptops) and some sort of external video switcher. If you will have lots of sources that won’t be under your control, like hosting a conference, go this route.

Pros:

  • Much more professional looking.
  • Might be easier/simpler to operate.
  • You are more flexible with input sources.
    • If Joe Blow walks up with his OpenOffice Impress PowerPoint, just hook up his laptop.

Cons:

  • More complexity, more equipment, more money, more setup time.

Software

Screenmonkey or some other kind of software switcher. Compared to the hardware solution above, Screenmonkey could be a great tool for conference speakers if they have to handle transitions in their presentations.

Pros:

  • Much more professional looking.
  • Doesn’t need another laptop and external switcher.
  • Lots of features. Layers, transitions, effects, scheduling, etc.
  • Free! There is a paid version but the additional features seem to be mostly for external controllers.
  • You can save/package up presentations for later use or copying around.

Cons:

  • The interface for setting up presentations can be a little clunky. An hour of playing around should get you familiar with it.
  • Doesn’t seem to be updated that often, maybe once every year or so.
  • Check that your file formats are supported.
    • Video formats supported by ffdshow (which covers most of them) work.
    • Some formats aren’t supported at all, like PDF.
    • Some image formats, like TIFF, don’t show up in the file browser but they will load if you change the file type to All Files.

More info

Requirements: .NET 4.0 and K-Lite Codec Pack Full. K-Lite needs to be the full version with ffdshow installed or videos won’t work.

They have a website where you can download it, as well as some YouTube clips that can give you a feel for it.

Screenmonkey.co.uk

Youtube channel

Quality Alerts

If you have to deal with any kind of monitoring/alerting system, something you should be concerned about is alert quality. This is because either the people getting the bad alerts are getting on your case about it or you’re the person getting the bad alerts.

Why does alert quality matter?

When that pager goes off in the hour of the wolf, the responsible person should wake smelling a whiff of brimstone and shiver like a Ringwraith has just passed, because bad things have happened, are happening, or are about to happen. But that only works with quality alerts; otherwise people just learn to ignore them.

What are quality alerts?

Alerts that are:

  • Actionable. Something can be done about the condition that generated the alert. This may not always mean that it can be fixed right away.
  • Targeted. The person who gets the alert is the person who can do something about it.
  • Informative. The alert should contain enough information to be understood on its own, without opening another console. Many SCOM management packs are fairly bad at this, especially the ones that say “See alert context for more details”.
  • Non-noisy. The alert should not cry wolf all the time. There is a trade-off between how quickly you get alerted to a bad condition and how noisy the alerts are for “normal” transient conditions.

How can we improve alert quality?

One way would be to implement an alert quality reporting framework. Some attributes we would want in this would be:

  • Simple. No extra clients/websites to login to.
  • Quick. The minimum required response is good or bad, with additional detail possible.
  • Network agnostic. Alert recipients should be able to respond from wherever they got the alert, whether it’s their iPhone via a text message or Outlook via corporate e-mail.

A simple way to do this in SCOM would be to write a webservice and inject a link with the alert parameters in a URL. Stay tuned for more on this.
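
As a rough sketch of what that could look like (the endpoint and parameter names here are made up for illustration, not an actual implementation), the injected link would just be a GET against a small feedback service, so recipients can respond from a phone browser or by clicking the link in Outlook:

# hypothetical feedback link; host and parameters are placeholders
curl "https://alertfeedback.example.com/rate?alertId=12345&verdict=bad&comment=too+noisy"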

Introduction

Greetings, gentle readers. In this blog I will be writing about programming, software, and other topics of interest.