Investigations of the RTEMS Release Notes Generator

The current state of the ReleaseNotesGenerator (RNG) code is that it dynamically fetches all the tickets from Trac and processes them locally to generate a Markdown version of the tickets. Fetching the tickets is quite slow, and in this post we will explore how the generator works and investigate its performance issues.

Currently, RNG consists of three main components:

  • Trac API interfacing utilities for ticket fetching
  • High-level reports generation utilities
  • Markdown format generator

For this year's GSoC project, we aim to fix the Markdown generator, provide another generation interface for reStructuredText, and internally generate proper PDF release notes. The repository that will hold the code throughout the summer is here.

Operation

Every change in the RTEMS project should have an issue and be assigned to a milestone and a version number. The release notes generator, first of all, fetches metadata about all the tickets relevant to a provided milestone.

Ticket fetching

t = tickets.tickets(milestone_id=args.milestone_id)
t.load()
tickets_stats = t.tickets

The tickets.load() method is responsible for all the fetching logic. Taking a closer look:

    def load(self):
        # Read entire trac table as DictReader (iterator)
        tickets_dict_iter = self._get_tickets_table_as_dict()
        self._pre_process_tickets_stats()
        # Parse ticket data
        for ticket in tickets_dict_iter:
            print('processing ticket {t} ...'.format(t=ticket['id']))
            self.tickets['tickets'][ticket['id']] \
                = self._parse_ticket_data(ticket)
        self._post_process_ticket_stats()

First, a CSV iterator is built that holds data skeletons of all the tickets for the provided milestone. It is computed using the Trac query API. For example, issuing a request to the following URL: https://devel.rtems.org/query?col=id&col=summary&milestone=4.11.3&format=csv fetches a CSV-formatted id and summary for all tickets related to milestone 4.11.3.
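
A minimal sketch of how such a query URL could be built and the resulting CSV consumed. The helper names here (`build_query_url`, `fetch_tickets_csv`) are hypothetical, not part of RNG:

```python
import csv
import io
from urllib.parse import urlencode
from urllib.request import urlopen

TRAC_BASE = 'https://devel.rtems.org/query'

def build_query_url(milestone, cols):
    """Build a Trac query URL that returns the given columns as CSV."""
    # Trac accepts repeated 'col' parameters, one per requested column.
    params = [('col', c) for c in cols] + [
        ('milestone', milestone),
        ('format', 'csv'),
    ]
    return TRAC_BASE + '?' + urlencode(params)

def fetch_tickets_csv(milestone, cols=('id', 'summary')):
    """Fetch the milestone query and return a csv.DictReader over it."""
    with urlopen(build_query_url(milestone, cols)) as resp:  # network call
        text = resp.read().decode('utf-8')
    return csv.DictReader(io.StringIO(text))
```

For the milestone above, `build_query_url('4.11.3', ('id', 'summary'))` produces exactly the example URL.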

RNG initially fetches the following attributes for all tickets:

Ticket ID, Summary, Milestone, Owner, Type, Status, Priority, Component, Version, Severity, Resolution, Time, Change time, Blocked by, Blocking, Reporter, Keywords, CC

After that, for each ticket, we call _parse_ticket_data(), which is mainly responsible for two tasks:

  • Categorizing the ticket by status, owner, type, priority, component, severity, reporter, and version.
  • Fetching further, more specific data for the ticket (comments, description, and attachments)
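
The categorization task can be sketched as simple counting over the already-fetched attributes. This is a simplified stand-in for what `_parse_ticket_data()` does, not the actual RNG code:

```python
from collections import defaultdict

# Attributes used to bucket tickets, matching the list above.
CATEGORIES = ('status', 'owner', 'type', 'priority',
              'component', 'severity', 'reporter', 'version')

def categorize(tickets):
    """Count tickets per value of each categorization attribute."""
    by_category = {cat: defaultdict(int) for cat in CATEGORIES}
    for ticket in tickets:
        for cat in CATEGORIES:
            by_category[cat][ticket.get(cat, 'unknown')] += 1
    return by_category
```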

After processing all tickets individually, _post_process_ticket_stats() is called, which finalizes the statistics by computing percentages for each count in the collected categorization.
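
The percentage step can be sketched as follows. This is a simplified stand-in for `_post_process_ticket_stats()`, with a hypothetical function name:

```python
def add_percentages(counts):
    """Turn raw per-value counts into (count, percentage) pairs."""
    total = sum(counts.values()) or 1  # avoid division by zero
    return {value: (count, round(100.0 * count / total, 2))
            for value, count in counts.items()}
```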

Markdown generation

At this stage, all needed information about the tickets is stored and various important statistics are computed. Hence, we are ready to generate the Markdown reports.

    md = markdown.markdown()
    reports.gen_overall_progress(tickets_stats['overall_progress'], md)
    reports.gen_tickets_stats_by_category(tickets_stats['by_category'], md)
    reports.gen_tickets_summary(tickets_stats['tickets'], md)
    reports.gen_individual_tickets_info(tickets_stats['tickets'], md)

The reports module provides a unified interface through which any generator can produce format-specific sections of the release notes file. Here, a markdown generator instance is passed as a parameter.
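
The idea can be sketched as duck typing over a common generator interface: the report functions only call `gen_*` methods, so any object implementing them works. The class and method names below are simplified illustrations, not the actual RNG modules:

```python
class MarkdownGen:
    """Minimal Markdown-flavoured generator (simplified sketch)."""
    def __init__(self):
        self.lines = []

    def gen_header(self, text, level=1):
        self.lines.append('#' * level + ' ' + text)

    def gen_line(self, text):
        self.lines.append(text)

    def content(self):
        return '\n'.join(self.lines)

def gen_overall_progress(progress, gen):
    """Report function: talks only to the generator interface."""
    gen.gen_header('Overall Progress')
    for key, value in progress.items():
        gen.gen_line('- {k}: {v}'.format(k=key, v=value))
```

A reStructuredText generator would then only need to implement the same `gen_*` methods to reuse every report function unchanged.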

Currently, the four main sections of the release notes report are the following.

Section name            | Description
Overall progress        | How many tickets are in the report, and how many of them are closed and in progress
By Category             | Categorizes the tickets by various attributes (owner, type, priority, etc.)
Summary                 | Includes the summary of all tickets
Individual tickets info | Description, comments, and attachments (if any) for individual tickets

Performance investigation

Having shown an overview of how RNG operates, let's investigate its performance. Running the generator, it is immediately apparent that "processing tickets" is extremely slow. So let's run the generator under cProfile to find out where it spends most of its running time. Initially, one might think that tickets are processed sequentially, and hence that some form of parallelism could process more than one ticket at a time, so let's profile the code.

python -m pstats ./profile_output 
Welcome to the profile statistics browser.
./profile_output% sort tottime
./profile_output% stats
Wed Jun 15 19:15:05 2022    ./profile_output

         184181 function calls (182827 primitive calls) in 209.035 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      101   77.293    0.765   77.296    0.765 {built-in method _socket.getaddrinfo}
      101   63.721    0.631   63.721    0.631 {method 'do_handshake' of '_ssl._SSLSocket' objects}
      114   35.334    0.310   35.334    0.310 {method 'read' of '_ssl._SSLSocket' objects}
      101   31.987    0.317   31.987    0.317 {method 'connect' of '_socket.socket' objects}
      722    0.044    0.000    0.044    0.000 {built-in method __new__ of type object at 0x910fa0}
     1747    0.025    0.000    0.090    0.000 tickets.py:147(_remove_tags)
     1758    0.024    0.000    0.024    0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
      695    0.024    0.000    0.025    0.000 markdown.py:40(gen_line)
     1758    0.024    0.000    0.049    0.000 /usr/lib/python3.9/xml/etree/ElementTree.py:1334(XML)
       50    0.023    0.000    0.885    0.018 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
      101    0.014    0.000    0.026    0.000 /usr/lib/python3.9/ssl.py:1298(_real_close)
      101    0.014    0.000    0.014    0.000 {method 'write' of '_ssl._SSLSocket' objects}
      323    0.013    0.000    0.024    0.000 markdown.py:60(gen_table)
...

This shows that the majority of the running time is spent establishing connections to Trac and fetching the data.

Using this graph visualized by SnakeViz, we can see that _parse_ticket_rss() and _parse_ticket_csv() cause the most overhead. The call to _parse_ticket_csv() for individual tickets is only useful because it fetches and parses the description of a ticket. We already get all the other "meta" data we need by the milestone identifier before delving into individual tickets. We could save a substantial amount of time by eliminating the call to _parse_ticket_csv(). However, Trac doesn't allow fetching the description of a ticket with the filtering query used and described in the Operation section. Hence, I think the current code will suffice for now in terms of performance, and I will update it if there turns out to be a way to fetch all the metadata for a ticket at once in the preprocessing stage.
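
If the per-ticket requests do remain necessary, the connection overhead dominating the profile could still be amortized by fetching tickets concurrently. A sketch of that idea with a thread pool follows; `fetch_ticket` is a placeholder for the real per-ticket HTTP request, not RNG code:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_ticket(ticket_id):
    # Placeholder: in RNG this would issue the per-ticket HTTP request
    # (the part of _parse_ticket_csv() that pulls the description).
    return {'id': ticket_id, 'description': 'ticket #{0}'.format(ticket_id)}

def fetch_all(ticket_ids, max_workers=8):
    """Fetch tickets in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_ticket, ticket_ids))
```

Reusing a single keep-alive HTTP session across requests would likewise cut the repeated TLS handshake and getaddrinfo costs that dominate the profile above.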