TelekomCloud DevOps team

Happy New Year

2019-01-02T11:13:00+00:00

We start the year with cleanup and refresh.

An automated OpenStack deployment for development and test systems

2014-05-07T16:00:00+00:00

When I joined Deutsche Telekom two years ago, I had to share a common reference test system with everyone in the room, including all operators and developers. This was quite troublesome when you wanted to test new ideas without interfering with anyone, while also making sure that your experiments would not break things and make your colleagues angry.

Figure 1: a local integration test system for experimenting with new features

Like any development process, this calls for a local integration test system. It must support developers editing and debugging OpenStack on the fly, as well as operators or package managers testing a release. It’s also nice to be able to reset the test system after dirty changes and provision it again as fast as possible. This post introduces such a system, which is now available upstream in our repository.

1. Overview

Figure 2: deployment of OpenStack by vagrant

Vagrant is responsible for bringing the VMs up (step 1), setting up host-only networks within VirtualBox (step 2) and installing base packages. From there, there are two ways to deploy OpenStack, depending on your needs (step 5). For development purposes, OpenStack is deployed by devstack. For testing release packages, puppet is used. The two deployments are configurable in a global file. Their git modules are authenticated (in case your company requires SSL for the connection) and dropped into the VM for deployment (step 4).

In my personal use case, I always need to switch between the two deployments: devstack for coding and puppet for testing packages. Switching between the two is supported, and the previous deployment is kept safe, separate and reusable.

    # set environment in config.file, i.e puppet or devstack:
    env: puppet

1.1 Networking

At that time I only found projects that deploy all OpenStack components in one VM. This did not satisfy our needs because an all-in-one deployment does not reflect the behavior of the GRE data network between different OpenStack components. Figure 2 above shows a multi-node setup: the control, compute and neutron nodes, along with the three host-only networks for management, GRE data and public traffic, are brought up automatically.

    # supports multi nodes to enable/disable
    $ vagrant status
    Cache enabled: /home/tri/git/vagrant_openstack/cache
    Working on environment deployed by puppet branch icehouse...
    compute2.pp               disabled in config.yaml
    neutron2.pp               disabled in config.yaml
    Current machine states:
    puppet                    not created (virtualbox)
    control.pp                not created (virtualbox)
    compute1.pp               not created (virtualbox)
    neutron1.pp               not created (virtualbox)

In such a testing environment, we also need to test the floating IPs of the VMs over the public network, because it would be extremely boring if VMs booted by nova could not connect to the Internet. For this reason, figure 3 shows how packets from inside the neutron node go out and back. Packets coming from br-tun and br-int go to br-ex on the neutron node, are forwarded to the NAT interface (vboxnet0) and are SNATed so that the replies can find their way back.

Figure 3: SNAT for testing floating ips

1.2 Storage

For a simple nova volume setup, iSCSI is chosen by default. The VBoxManage command [3] is very useful here, as it lets vagrant create a virtual storage and attach it to the control node. For those interested in the details, the VBoxManage commands are as follows:

    # create virtual storage
    VBoxManage createhd --filename <vdi> --size <cinder_storage_size>
    # and attach it to a vm
    VBoxManage storageattach <vm_id> --storagectl "SATA Controller" --port 1 \
        --device 0 --type hdd --medium <vdi>

The system also formats the new virtual storage and creates a volume group cinder-volumes for cinder. It is also worth mentioning that, in order to keep the data of the two deployment environments separate, a separate virtual storage is created for each environment.
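As a rough illustration of the per-environment separation, the two VBoxManage calls above could be assembled as shown below. This is only a sketch: the function name, the file naming scheme and the example values are assumptions and not the code in our repository. It only builds and prints the commands and does not call VirtualBox.

```python
import shlex


def storage_commands(vm_id, env, size_mb):
    """Build the two VBoxManage calls for one environment (not executed)."""
    # Hypothetical naming scheme: one disk image per VM and environment.
    vdi = f"{vm_id}_{env}_cinder.vdi"
    create = ["VBoxManage", "createhd", "--filename", vdi, "--size", str(size_mb)]
    attach = [
        "VBoxManage", "storageattach", vm_id,
        "--storagectl", "SATA Controller",
        "--port", "1", "--device", "0",
        "--type", "hdd", "--medium", vdi,
    ]
    return [create, attach]


# One separate virtual storage per deployment environment.
for env in ("puppet", "devstack"):
    for cmd in storage_commands("control", env, 2048):
        print(shlex.join(cmd))
```

Because the environment name is part of the image file name, switching between puppet and devstack never touches the other environment's volume data.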

2. Deployment environments

2.1 puppet

A puppetmaster VM is brought up with puppetdb installed. It pulls manifests from a configurable git repository into a directory inside the VM and uses these manifests to deploy OpenStack on the other VMs. By default the manifests in [4] are provided as an example to try out the new Icehouse release with the ML2 plugin and l2 population. You can also provide your own manifests by configuring a puppet repository and which site.pp to use for the node definitions:

    puppet_giturl: git@your.repository.git
    puppet_branch: your_branch
    puppet_site_pp: manifests/your_site.pp

2.2 devstack

I like deployments where the provisioning script is provided directly inside the VM. For this reason, a puppet master is not necessary for the devstack deployment. Instead, devstack is cloned and set up directly inside all VMs. It is also configured to use the pip repository of OpenStack [3]. Pydev is also included in the VMs to support remote debugging from the host machine.

Figure 4: Remote debugging with MySQL Workbench & Eclipse

3. Performance boost

One issue is the long deployment time, especially if you have a slow connection or the connection drops in the middle of the deployment. So I tried out every small possibility to reduce the time needed.

3.1 Caching

When a VM is destroyed and brought up again, it must download all packages from scratch. A simple caching solution is implemented which cuts the deployment time by about half. A second deployment is even faster, since all packages and the glance image are already cached, so Internet access is not necessary.

Caching is supported for both environments: all .deb packages installed by puppet, as well as all pip packages installed by devstack, are cached and shared between VMs. The tables below give an idea of how much time can be saved when bringing up the machines with the cache enabled, measured with a fairly fast Internet connection (4 Mbit/s download) and 1 CPU and 1024 MB of RAM per VM.

a) Puppet deployment in secs

Nodes	no cache	with cache
Control	312	227
Compute	110	83
Neutron	109	62
Total	532	230

saving: about 5 min

b) Devstack deployment in secs

Nodes	no cache	with cache
Control	764	655
Compute	764	341
Neutron	224	208
Total	1754	660

saving: about 18 min
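The savings quoted under the two tables follow directly from the "Total" rows. A minimal sketch of the arithmetic, using only the totals from the tables (the helper name is ours):

```python
# Totals in seconds, copied from the "Total" rows of the two tables.
totals = {
    "puppet": {"no_cache": 532, "with_cache": 230},
    "devstack": {"no_cache": 1754, "with_cache": 660},
}


def saving(no_cache, with_cache):
    """Return (seconds saved, minutes saved, fraction of time saved)."""
    saved = no_cache - with_cache
    return saved, saved / 60, saved / no_cache


for env, t in totals.items():
    secs, mins, frac = saving(t["no_cache"], t["with_cache"])
    print(f"{env}: {secs} s saved ({mins:.1f} min, {frac:.0%} less)")
```

This gives 302 s (about 5 min) for puppet and 1094 s (about 18 min) for devstack.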

To test a custom package, simply replace it in the cache folder and bring up new VMs.

3.2 Customize vagrant box

To reduce the vagrant up time, a vagrant box is customized with packages pre-installed. The box is based on precise64 and has packages such as VBox Guest Additions 4.3.8, puppet, dnsmasq, r10k, vim, git, rubygems, msgpack and lvm2 pre-installed. All empty space in the box is zeroed out and all logs are wiped to keep its size as small as possible (378 MB) for distribution on Vagrant Cloud. This again cuts about 70 secs from each VM start (from 79 secs to 8 secs).

saving: about 4.6 min (4 VMs x 70 secs)

REFERENCES

Ceph Performance Analysis: fio and RBD

2014-02-26T22:42:00+00:00

With this blog post we want to share insights into how the Platform Engineering team for the Business Marketplace at Deutsche Telekom AG analyzed a Ceph performance issue. Ceph is used for both block storage and object storage in our cloud production platform.

Background

While the most common IO workload patterns of web applications were not causing issues on our Ceph clusters, serving databases or other IO-demanding applications with high IOPS requirements (with 8K or 4K blocksize) turned out to be more challenging.

Recently, we got a report from a colleague discussing performance regressions on our Ceph cluster for rados block devices. We were presented results of dd if=/dev/zero of=/dev/vdx bs=1k count=100M.

We were not very happy about getting a report with blocksize of 1k, synthetic sequential writes and /dev/zero as input. But it turned out that even with 4k or 8k, we didn’t get great numbers.

The cluster at that time was dealing fine with block sizes of 32k and higher. But smaller block sizes indeed resulted in a performance regression compared to an older generation of our Ceph cluster deployment.

Analysis

First we reproduced the issue with fio inside an OpenStack instance with a Cinder-attached RBD, even with pure random write. Sequential and random reads of the entire RBD were not affected inside the VM.

Also, the average results of rados bench were not perfect but not bad either - let’s say they were OK. But they were not a good indicator of whether it was a pure Ceph problem or maybe something else within our network.

Initially we spent some time analyzing the traffic librbd was seeing with tcpdump and systemtap scripts. Indeed we found situations where the send buffer of the VM host got full while performing 4K random write stress loads inside a guest. (For this we used the example systemtap script: sk_stream_wait_memory.stp)

But this only happened on very intense 4k write loads inside the VM.

Was the IO pattern a sane one to test? Or was this a corner case?

We decided to look for the right tool to measure detailed latency and IOPS, one that is able to reproduce the same load pattern. Bonus point: replaying real-life workloads that come close to the production workload - so we can tune for the right workload (and not for dd bs=4k).

Here the challenge started, since we were looking for a tool that is able to generate the same load on each of the following layers:

inside the VM guest (to test the RBD QEMU driver, which is using librbd)
on the VM host (using the same code: librbd. We hesitated to use the RBD kernel module to not miss potential issues inside librbd - if any)
on the Ceph storage node: testing the ceph-osd code (FileStore implementation. Covering OSD-disk and Journal-disk writes)
on the Ceph storage node: testing the filesystem (XFS) with the format options and mount options in use
on the Ceph storage node: testing the dmcrypt block device (yep, we do this.)
on the Ceph storage node: testing the block device directly (through storage/RAID controller. RAID0, write-through)

Otherwise we would have had to use different tools that might produce different workloads, which could lead to different results per test on different layers.

THE right tool: `fio`

fio was pretty much the perfect match for our cases - it was only missing the capability to talk to Ceph RBD and the Ceph internal FileStore directly.

Fortunately, fio supports various IO engines. So we decided to add support for librbd and for the Ceph internal FileStore implementation, in order to have an artificial OSD process to benchmark the OSD implementation via fio.

`fio` `librbd` ioengine support

With the latest master you can get fio librbd support and test your Ceph RBD cluster with IO patterns of your choice, using nearly all of the fio functionality (some features are not yet supported by the RBD engine, which is not fio’s fault. Patches are welcome!). All you need is to install the librbd development package and its dependencies (e.g. librbd-dev on Debian/Ubuntu or librbd-devel on RPM-based distributions) or have the library and its headers at the designated location.

$ git clone git://git.kernel.dk/fio.git
$ cd fio
$ ./configure
[...]
Rados Block Device engine     yes
[...]
$ make

First run of `fio` with the `rbd` engine

The rbd engine will read ceph.conf from the default location of your Ceph build.

A valid RBD client configuration in ceph.conf is required. Authentication and key handling also need to be done via ceph.conf. If ceph -s is working on the designated RBD client (e.g. OpenStack compute node / VM host), the rbd engine is nearly good to go.

One preparation step left: You need to create a test rbd in advance. WARNING: do not use existing RBDs which might hold valuable data!

rbd -p rbd create --size 2048 fio_test

Now one can use the rbd engine job file shipped with fio:

./fio examples/rbd.fio

The example rbd.fio in detail looks like this:

######################################################################
# Example test for the RBD engine.
#
# Runs a 4k random write test agains a RBD via librbd
#
# NOTE: Make sure you have either a RBD named 'fio_test' or change
#       the rbdname parameter.
######################################################################
[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=0    # mandatory
rw=randwrite
bs=4k

[rbd_iodepth32]
iodepth=32

This will perform a 100% random write test across the entire RBD size (which is determined via librbd), as Ceph user admin, using the Ceph pool rbd (default) and the newly created, empty RBD fio_test, with a 4k blocksize and an iodepth of 32 (number of IO requests in flight). The engine makes use of asynchronous IO.
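To compare several block sizes, such as the 4k, 8k and 32k discussed above, variants of this job file can be generated instead of edited by hand. The following is a minimal sketch under our own assumptions: the helper name and the chosen list of block sizes are hypothetical, and only options from the example job above are used.

```python
def rbd_job(bs, iodepth=32, rbdname="fio_test"):
    """Return the text of an fio job file for the rbd engine."""
    lines = [
        "[global]",
        "ioengine=rbd",
        "clientname=admin",
        "pool=rbd",
        f"rbdname={rbdname}",
        "invalidate=0",  # mandatory for the rbd engine for now
        "rw=randwrite",
        f"bs={bs}",
        "",
        f"[rbd_iodepth{iodepth}]",
        f"iodepth={iodepth}",
    ]
    return "\n".join(lines) + "\n"


# Print one job definition per block size of interest.
for bs in ("4k", "8k", "32k"):
    print(rbd_job(bs))
```

Each generated text can be saved to its own file and passed to ./fio in the same way as examples/rbd.fio.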

Current implementation limits:

invalidate=0 is mandatory for now. Without it the engine simply fails.
The rbd engine will not clean up once the test is done. The given RBD is filled up after a complete test run. (We use this to run prefilled tests right now, and recreate the RBD if required.)

Results

Here are some carefully selected results from one of our development environments, taken while investigating the actual performance issue.

The following plot shows the original situation we were facing: the result of the fio example RBD job run against a 2GB RBD which was initially empty:

After analysis on the Ceph OSD nodes with fio FileStore implementation and more analysis and tuning we managed to get this result:

More detailed analysis, results and configuration/setup details will follow in the next posts on the TelekomCloud Blog.

Conclusion

fio is THE flexible IO tester - now even for Ceph RBD tests!

Outlook

We are looking forward to going into more detail about our Ceph RBD cluster performance analysis in the next post.

In case you are at the Ceph Day tomorrow in Frankfurt, look out for Danny to get some more insight into our efforts around Ceph and fio here at Deutsche Telekom.

Acknowledgements

First of all, we want to thank Jens Axboe and all fio contributors for this awesome swiss-knife-IO-tool! Secondly, we thank Wolfgang Schulze and his Inktank Service team and Inktank colleagues for their intense support - especially when it turned out pretty early in the analysis, that Ceph was not causing the issue, still they teamed up with us to figure out what was going on.

References

http://git.kernel.dk/?p=fio.git;a=summary

New Ways of Tempest Stress Testing

2013-09-11T00:00:00+00:00

Overview

There are many stress test frameworks for OpenStack that are all pretty similar in nature. They follow fixed scenarios and fork many worker processes. Often they are difficult to enhance, since they were written for a single purpose.

With the blueprint stress-test, the community of Tempest developers focused on building a single and very flexible stress test framework inside of Tempest.

A Stress Test is not a Test Domain

In the past, stress tests had their own area inside Tempest. New tests were introduced mainly as clones of an existing API test, a scenario test or a mixture of both. But do stress tests really have their own testing domain?

Two main purposes of stress tests are obvious:

Having a framework to find/reproduce race conditions
Having a framework to simulate real-life load with a mixture of load profiles

Those two topics are already covered by Tempest tests today: grouping API tests enables us to detect race conditions, and with scenario tests we can simulate load profiles.

Tempest Stress Test Core

The core of the stress test framework is quite simple: It’s responsible for forking worker processes and summarizing results. How many processes should be forked is configurable in a JSON configuration file, which also can provide multiple arguments for each stress test.

Integration of Existing Tests

In order to stop duplicating code, we started to integrate existing Tempest tests into the stress test framework. A wrapper was built to call any kind of unit test and make it available to the framework. With that, it’s easy to group existing tests. Here is an example of how this is done for two unit tests:

[{"action": "tempest.stress.actions.unit_test.UnitTest",
  "threads": 8,
  "kwargs": {"test_method": "tempest.cli.simple_read_only.test_glance.SimpleReadOnlyGlanceClientTest.test_glance_fake_action",
             "class_setup_per": "process"}
 },
 {"action": "tempest.stress.actions.unit_test.UnitTest",
  "threads": 8,
  "kwargs": {"test_method": "tempest.api.volume.test_volumes_actions.VolumesActionsTest.test_attach_detach_volume_to_instance",
             "class_setup_per": "process"}
 }
]

This will fork in total 16 processes that will conduct glance and cinder stress testing.
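The total of 16 can be checked by summing the "threads" values of all actions. The sketch below assumes the configuration is written as a JSON list with one object per action. The helper name is ours and not part of Tempest.

```python
import json

# The same two actions as in the example, one JSON object per action.
config_text = """
[
  {"action": "tempest.stress.actions.unit_test.UnitTest",
   "threads": 8,
   "kwargs": {"test_method": "tempest.cli.simple_read_only.test_glance.SimpleReadOnlyGlanceClientTest.test_glance_fake_action",
              "class_setup_per": "process"}},
  {"action": "tempest.stress.actions.unit_test.UnitTest",
   "threads": 8,
   "kwargs": {"test_method": "tempest.api.volume.test_volumes_actions.VolumesActionsTest.test_attach_detach_volume_to_instance",
              "class_setup_per": "process"}}
]
"""


def total_workers(text):
    """Sum the "threads" value of every action in the configuration."""
    return sum(entry["threads"] for entry in json.loads(text))


print(total_workers(config_text))
```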

The Stress Test Discovery

Test discovery is done like this:

Instead of manually adding existing tests to the stress test framework, the next logical step was to introduce a decorator. It allows test developers to decide if a test is made available to the stress test framework. Or, to be more precise, it marks tests as being meaningful stress tests. The decorator can be used like this:

@stresstest(class_setup_per='process')
@attr(type='smoke')
def test_attach_detach_volume_to_instance(self):
    ...  # test body unchanged

It can simply get added to any existing unit test or used as the only purpose for a test. Internally it is based on the existing mechanism of attribute discovery of testtools and adds the attribute stress. It has one mandatory parameter class_setup_per, since it must be decided when the setUpClass function should be called: For every process, for every action or just globally. This depends on the content of the setUpClass and must be decided by the developer. In many cases a call on a per process level is sufficient.
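To make the mechanism more concrete, here is a simplified, self-contained model of attribute-based discovery. It is not the Tempest implementation: both decorators are stand-ins, and the attribute names `attrs` and `class_setup_per` on the function, the `discover` helper and the test `test_list_volumes` are assumptions made for illustration only.

```python
def stresstest(class_setup_per):
    """Simplified stand-in: tag a test as a stress test."""
    def decorator(func):
        # Hypothetical storage; Tempest builds on testtools' attributes.
        func.attrs = getattr(func, "attrs", set()) | {"stress"}
        func.class_setup_per = class_setup_per
        return func
    return decorator


def attr(type):
    """Simplified stand-in for an attribute decorator (smoke, gate, ...)."""
    def decorator(func):
        func.attrs = getattr(func, "attrs", set()) | {type}
        return func
    return decorator


class VolumesActionsTest:
    @stresstest(class_setup_per="process")
    @attr(type="smoke")
    def test_attach_detach_volume_to_instance(self):
        pass

    @attr(type="smoke")
    def test_list_volumes(self):  # hypothetical test, not marked as stress test
        pass


def discover(cls, extra_filter=None):
    """Return names of methods tagged 'stress', optionally filtered further."""
    found = []
    for name, member in vars(cls).items():
        attrs = getattr(member, "attrs", set())
        if "stress" in attrs and (extra_filter is None or extra_filter in attrs):
            found.append(name)
    return found


print(discover(VolumesActionsTest))
print(discover(VolumesActionsTest, extra_filter="gate"))
```

Only the decorated test is found, and an additional filter such as gate narrows the selection further.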

All existing test attributes like smoke or gate can also be used as a filter within the discovery function.

Which Tests are Good Candidates?

In fact, it’s often easier to identify test cases that aren’t good candidates:

Negative tests
Single unit test functions that cover only one small aspect (like listing volumes)
Tests that interfere with each other (like changing quotas)

All other tests might be interesting candidates to be integrated into and used by the stress test framework. So please feel free to identify new cases and contribute them to OpenStack/Tempest.

OpenStack Networking High Availability concept

2013-06-10T19:42:00+00:00

Getting OpenStack highly available has been a hot topic for us at Deutsche Telekom AG. With the upstream OpenStack High Availability documentation, it’s already well documented how to configure supporting services like MySQL and RabbitMQ in highly available setups.

A challenging area with regards to high availability has been OpenStack Networking. For our OpenStack Grizzly-based cloud, we have made great progress with our solution that we would like to share: how to implement OpenStack Networking (L3 agent) highly available in active/active mode without pacemaker, using traditional routing/balancing functionality.

Our primary goal is to keep VMs available and reachable from the Internet over a redundant network. I’ll focus on the high-level idea today and post later about the detailed implementation.

Let’s assume we have OpenStack up and running, including OpenStack Networking with the Open vSwitch plug-in. The traffic will usually flow from Internet -> Network node -> Open vSwitch -> full-mesh GRE tunnel -> Compute node -> Open vSwitch on Compute -> VM:

Next, we need a second Network node with Open vSwitch and GRE tunnels to connect to the same Compute nodes:

OpenStack Networking gained an important feature in Grizzly, which allows us to schedule to multiple network nodes.

Now we can create a VM on a compute node with two NICs and assign two different IP addresses to them. We need to make sure to use IPs that are part of each of the subnets that are mapped to the network nodes / L3 agents.

With this, we now have multiple paths out of the VM. In order to avoid any kind of routing problems you need to configure PBR (policy-based routing) inside the VM. (Details will be provided in the next post.)

The goal is to make sure that packets arriving on interface ethX are replied to or sent back via the same interface. This requires two routing tables and two default routes, one for each interface.
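Until the detailed post is available, the idea of one routing table and one default route per interface can be sketched as follows. This is an illustration under stated assumptions and not our actual configuration: the addresses come from the documentation ranges, the table numbers 101 and 102 are arbitrary, and the helper only prints standard iproute2 commands without running them.

```python
def pbr_commands(interfaces):
    """Build iproute2 commands: one routing table and default route per NIC."""
    commands = []
    for table, nic in enumerate(interfaces, start=101):
        # The default route of this table leaves via the NIC's own gateway.
        commands.append(
            f"ip route add default via {nic['gateway']} dev {nic['dev']} table {table}"
        )
        # Traffic sourced from this NIC's address is looked up in that table.
        commands.append(f"ip rule add from {nic['address']} table {table}")
    return commands


# Hypothetical addresses, one NIC per subnet / network node.
nics = [
    {"dev": "eth0", "address": "192.0.2.10", "gateway": "192.0.2.1"},
    {"dev": "eth1", "address": "198.51.100.10", "gateway": "198.51.100.1"},
]
for line in pbr_commands(nics):
    print(line)
```

With rules like these, a reply always uses the routing table that belongs to the address the request was sent to, so it leaves through the interface it arrived on.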

Having done this, we now have multiple paths to the same VM, with different IP addresses.

In case one of the network nodes (L3 agents) is not available, the GRE tunnel is not up, or for any other reason you cannot access the VM via the first Network node, you can still reach the same VM via the second Network node, but with a different IP address.
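The selection logic behind this failover is simple: use the address behind the first network node whose path is up. A minimal sketch, with hypothetical node names and addresses and a health check that is passed in by the caller:

```python
def reachable_address(paths, is_up):
    """Return the VM address behind the first network node whose path is up."""
    for node, address in paths:
        if is_up(node):
            return address
    return None  # no path to the VM is available


# Hypothetical mapping: one VM address per network node.
paths = [
    ("network-node-1", "192.0.2.10"),
    ("network-node-2", "198.51.100.10"),
]
down = {"network-node-1"}  # pretend the first node has failed
print(reachable_address(paths, lambda node: node not in down))
```

In this example the first node is marked as down, so the address behind the second node is chosen.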

Finally, we need to set up a load balancer in front of the Network nodes to manage the IP swapping and to hide any changes to the IP addresses from the public network.

Summarizing, we now have the networking nodes working in active/active mode and the traffic will be load-balanced, which also allows us to double the throughput. The traffic flow now looks like this: Internet -> load balancer -> Network node 1 or 2 -> Open vSwitch -> full-mesh GRE tunnel -> Compute node -> Open vSwitch -> VM:

The redundancy of the load balancer is out of scope for this post. You’ll have to choose the load balancer that best meets your requirements. It could be a hardware appliance or software-based.

In case you don’t want to implement a load balancer in front of the network nodes, but still want to keep the VM highly available, you need to take care of the IP address changes yourself. One option would be to use the IP SLA feature of Cisco routers for this. It can monitor the availability of the path to the VM and immediately switch to the next path with NAT in case one path is not available:

I’m looking forward to your comments and questions!

Hello World!

2013-06-10T18:42:00+00:00

Did you know Deutsche Telekom AG has been using and developing OpenStack-based clouds for the last 1.5 years? We first publicly talked about our OpenStack efforts at CeBIT 2012 in March 2012 and at the OpenStack Folsom Design Summit in San Francisco. A lot has happened since then! While OpenStack Essex was not fully ready to meet our requirements, we have been in production with OpenStack Folsom for a while now. Currently our Cloud Development and Operations team is busy preparing the launch of our OpenStack Grizzly-based cloud.

With the creation of this team blog, we want to share our ideas and findings in and around OpenStack and discuss them with the community. We are looking forward to the conversation!