Jan 162018

Scalability, bottleneck detection and simulation/predictive analysis are some of the core requirements for the News Orchestrator DIA. The DICE Simulation tool promises that it can perform a performance assessment of a Storm based DIA that would allow the prediction of the behaviour of the system prior to the deployment on a production cloud environment. The News Orchestrator engineers are often spending much time and effort in order to configure and adapt the topology configuration according to the target runtime execution context. Introducing a tool that can perform such a demanding task efficiently would clearly increase the developer’s productivity and also facilitate their testing needs.

In order to let the Simulation tool assess the performance of the News Orchestrator DIA, the developers have first to create and properly annotate UML profiled models. Specifically, those models refer to the activity and deployment diagrams of DIA which are stereotyped using the DICE Storm profile, as can be seen in the figures that follows. Those UML models reflect the execution workflow on a cluster of Storm nodes/workers.

Figure 1: Activity diagram representing the News Orchestrator workflow.

Figure 2: Deployment diagram representing the News Orchestrator execution cluster.

The values assigned to the attributes of Storm stereotypes refer to:

  • The level of parallelism (number of execution threads)
  • The spouts/bolts average execution time (milliseconds)
  • The weight for the communication streams between the spouts and bolts
  • The grouping policies for the communication streams between the spouts and bolts

The weight refers to the number of messages that the bolt that follows consumes for generating a new tuple. The parallelism, weight and grouping parameters reflect the actual configuration of the News Orchestrator, while the average execution times have been monitored on an actual deployment of the DIA on a Storm cluster, as it is shown in the Table 1. Finally, the deployment parameters refer to the number of available cores on each Storm cluster worker machine.

For some of the bolts, determining the appropriate grouping is a bit tricky since those bolts send their output tuples to other bolts with some probability. The linking of this kind of bolts is referred as a ‘field’ stream. The values of the probabilities are determined after observing the distribution of messages in the tasks receiving a ‘field’ stream. For example, in case of the Minhash Rolling Counter bolt it was observed that the messages exchanged are equally distributed between the instances of that bolt due to the uniform distribution that follows the hash function that it performs. Given that there are 4 instances in total then a probability of 25% was set for each one.

The corresponding validation KPI of the Simulation tool refers to a prediction error of the utilization of bolts less than 30%. The comparison should be performed between the results that the Simulation tool has generated, as it is shown in Figure 3 and those monitored by the metrics interface that Storm framework exposes, at it is shown in Table 1. In this way, the predicted/simulated results were validated against the actual results that were monitored by deploying the News Orchestrator topology on a Storm cluster of 4 nodes running on the cloud.

Figure 3: Simulation tool predictions on utilization of News Orchestrator bolts.

Table 1: Storm profile attributes.

The results show that for some of the bolts (approximately more than half) the prediction error is indeed very small, less than 10%, predicting quite accurately the capacity of the bolts. What is interesting is what are the characteristics of the bolts that the Simulation tool fails to predict their capacity, or at least with an acceptable precision. There is a tricky part in the workflow of the bolts of the News Orchestrator topology: some bolts are not constantly consuming and analysing social items but instead their execution is triggered periodically whenever a predefined time window expires (that time window reflects how frequently the News Orchestrator engineers want to detect news topics, which is configured at 5 minutes). So, it was proved that for the topic detection and minhash clustering bolts the average execution time was hard to be computed, affecting thus the precision of the Simulation tool results. Additionally, some of those bolts are by nature of low utilization mostly due to the limited time they are executed compared to the total experiment time. In those cases, is also observed a deviation between the Simulation tool results and the actual monitored values that surpasses the threshold of 30% regarding the predictions.

For the ATC engineers, having a tool like the DICE Simulation tool in the stack of the tools that help them to optimize and evaluate the News Orchestrator DIA is a significant advantage. Every new feature that is added in the News Orchestrator DIA may result in an undesired imbalance with regards to the performance of the system. Being able to validate the performance impact on the DIA prior to the actual deployment on a cloud infrastructure gives the flexibility to fine tune the topology configuration in advance and take corrective actions (i.e. scaling by increasing bolts parallelism) without wasting resources (costs and efforts) that would be otherwise required by an actual deployment on the cloud.

Vasilis Papanikolaou, Athens Technology Center
José I. Requeno, Universidad de Zaragoza

Sorry, the comment form is closed at this time.