Investigations of the RTEMS Release Notes Generator
The ReleaseNotesGenerator (RNG) currently fetches all the tickets from Trac dynamically and processes them locally to generate a Markdown rendering of the tickets. Fetching the tickets is quite slow, so in this post we will explore how the generator works and investigate its performance issues.
Currently, RNG consists of three main components:
- Trac API interfacing utilities for ticket fetching
- High-level reports generation utilities
- Markdown format generator
For this year's GSoC project, we aim to fix the Markdown generator, provide another generation interface for reStructuredText, and internally generate proper PDF release notes. The repository that will hold the code throughout the summer is here.
Operation
Every change in the RTEMS project should have an issue and be assigned to a milestone and a version number. The release notes generator first fetches metadata about all the tickets that belong to a provided milestone.
Ticket fetching
t = tickets.tickets(milestone_id=args.milestone_id)
t.load()
tickets_stats = t.tickets
The tickets.load() function is responsible for all the fetching logic. Taking a closer look:
def load(self):
    # Read entire trac table as DictReader (iterator)
    tickets_dict_iter = self._get_tickets_table_as_dict()
    self._pre_process_tickets_stats()
    # Parse ticket data
    for ticket in tickets_dict_iter:
        print('processing ticket {t} ...'.format(t=ticket['id']))
        self.tickets['tickets'][ticket['id']] \
            = self._parse_ticket_data(ticket)
    self._post_process_ticket_stats()
First, load() obtains a CSV iterator that holds skeleton data for all the tickets of the provided milestone. This data comes from the Trac query API. For example, a request to the URL https://devel.rtems.org/query?col=id&col=summary&milestone=4.11.3&format=csv fetches the id and summary, in CSV format, of all tickets related to the milestone 4.11.3.
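A minimal sketch of how such a query URL could be built and its CSV response parsed with the standard library. The helper names build_query_url and parse_tickets_csv are hypothetical and are not taken from the RNG code, and the sample response is made up so that the example runs without network access.

```python
import csv
import io
from urllib.parse import urlencode

TRAC_QUERY_URL = "https://devel.rtems.org/query"


def build_query_url(milestone, columns):
    """Build a Trac query URL that asks for the given columns as CSV."""
    # A list of pairs lets the "col" key repeat once per requested column.
    params = [("col", column) for column in columns]
    params += [("milestone", milestone), ("format", "csv")]
    return TRAC_QUERY_URL + "?" + urlencode(params)


def parse_tickets_csv(text):
    """Turn CSV text into a list of dicts, one dict per ticket."""
    return list(csv.DictReader(io.StringIO(text)))


url = build_query_url("4.11.3", ["id", "summary"])

# Made-up sample response, used here instead of a real network request.
sample = "id,summary\n1,First ticket\n2,Second ticket\n"
rows = parse_tickets_csv(sample)
print(url)
print(rows[0]["summary"])
```

In a real run, the text passed to parse_tickets_csv would be the decoded body of the HTTP response for that URL.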
Initially, RNG fetches the following attributes for every ticket:
Ticket ID, Summary, Milestone, Owner, Type, Status, Priority, Component, Version, Severity, Resolution, Time, Change time, Blocked by, Blocking, Reporter, Keywords, CC
After that, for each ticket, we call _parse_ticket_data(), which is mainly responsible for two tasks:
- Categorizing the ticket by status, owner, type, priority, component, severity, reporter, and version.
- Fetching further, more specific data for the ticket (comments, description, and attachments)
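The categorization task can be pictured as counting tickets per value of each attribute. The following is only a sketch under assumptions: the function name categorize, the "unknown" fallback, and the sample tickets are hypothetical, and the real _parse_ticket_data() may organize its data differently.

```python
from collections import defaultdict

# Attributes named in the list above.
CATEGORIES = ("status", "owner", "type", "priority",
              "component", "severity", "reporter", "version")


def categorize(tickets):
    """Count tickets per value for each category attribute."""
    by_category = {name: defaultdict(int) for name in CATEGORIES}
    for ticket in tickets:
        for name in CATEGORIES:
            # Missing or empty attributes are grouped under "unknown".
            value = ticket.get(name) or "unknown"
            by_category[name][value] += 1
    return by_category


# Made-up tickets for illustration only.
tickets = [
    {"id": "1", "status": "closed", "owner": "alice", "type": "defect"},
    {"id": "2", "status": "closed", "owner": "bob", "type": "enhancement"},
    {"id": "3", "status": "new", "owner": "alice", "type": "defect"},
]
counts = categorize(tickets)
print(dict(counts["status"]))
```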
After processing all tickets individually, _post_process_ticket_stats() is called, which finalizes the statistics by computing total percentages for all numbers in the collected categorization.
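As a rough illustration of this finalization step, the sketch below turns raw counts for one category into counts with percentages. The function name add_percentages and the output layout are assumptions made for this example, not the structure used by _post_process_ticket_stats().

```python
def add_percentages(counts):
    """Attach a percentage of the total to each count in one category."""
    total = sum(counts.values())
    return {
        value: {
            "count": n,
            # Guard against an empty category to avoid dividing by zero.
            "percent": round(100.0 * n / total, 1) if total else 0.0,
        }
        for value, n in counts.items()
    }


# Made-up counts for the "status" category.
status_counts = {"closed": 3, "new": 1}
stats = add_percentages(status_counts)
print(stats)
```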
Markdown generation
At this stage, all needed information about the tickets is stored and various important statistics are computed. Hence, we are ready to generate the Markdown reports.
md = markdown.markdown()
reports.gen_overall_progress(tickets_stats['overall_progress'], md)
reports.gen_tickets_stats_by_category(tickets_stats['by_category'], md)
reports.gen_tickets_summary(tickets_stats['tickets'], md)
reports.gen_individual_tickets_info(tickets_stats['tickets'], md)
The reports module provides a unified interface through which any generator can produce format-specific sections of the release notes file. Here, an instance of the markdown generator class is passed as a parameter.
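To show why passing the generator as a parameter is useful, here is a simplified sketch in which the same report function drives two different formats. The class names, the gen_heading() method, and the shape of the progress dictionary are assumptions for illustration; only gen_line() is a name that also appears in the real markdown.py, and its real signature may differ.

```python
class MarkdownGenerator:
    """Collects Markdown lines (hypothetical, simplified)."""

    def __init__(self):
        self.lines = []

    def gen_heading(self, text, level=1):
        self.lines += ["#" * level + " " + text, ""]

    def gen_line(self, text):
        self.lines += [text, ""]

    def render(self):
        return "\n".join(self.lines)


class RstGenerator(MarkdownGenerator):
    """Same interface, but headings use reStructuredText underlines."""

    UNDERLINES = "=-~"

    def gen_heading(self, text, level=1):
        underline = self.UNDERLINES[level - 1] * len(text)
        self.lines += [text, underline, ""]


def gen_overall_progress(progress, generator):
    """Format-independent report section; the generator decides the syntax."""
    generator.gen_heading("Overall progress", level=1)
    generator.gen_line("Total tickets: {}".format(progress["total"]))
    generator.gen_line("Closed tickets: {}".format(progress["closed"]))


progress = {"total": 4, "closed": 3}  # made-up numbers

md = MarkdownGenerator()
gen_overall_progress(progress, md)
md_text = md.render()

rst = RstGenerator()
gen_overall_progress(progress, rst)
rst_text = rst.render()
print(rst_text)
```

With this kind of separation, adding a reStructuredText output would mostly mean adding another generator class rather than touching the report functions.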
Currently, the four main sections of the release notes report are the following.
Section name | Description |
Overall progress | How many tickets are in the report, and how many of them are closed and in progress |
By Category | Categorizing the tickets by various attributes (owner, type, priority, etc) |
Summary | Includes the summary of all tickets |
Individual tickets info | Description, comments, and attachments (if any) for individual tickets |
Performance investigation
Now that we have an overview of how RNG operates, let's investigate its performance. When running the generator, it is obvious that the "processing ticket" step is extremely slow. Initially, one would think that the cause is that tickets are processed sequentially, and hence that we could use some form of parallelism to process more than one ticket at a time. To find out where the generator actually spends most of its running time, let's run it with cProfile attached.
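A minimal sketch of capturing and sorting such a profile with the standard library. The functions fake_fetch and process_tickets are hypothetical stand-ins for the generator's real work, so the numbers it prints are unrelated to the measurements in this post.

```python
import cProfile
import io
import pstats
import time


def fake_fetch():
    """Stand-in for one slow network request (hypothetical)."""
    time.sleep(0.01)


def process_tickets(n):
    """Stand-in for the per-ticket processing loop (hypothetical)."""
    for _ in range(n):
        fake_fetch()


# Profile only the region of interest.
profiler = cProfile.Profile()
profiler.enable()
process_tickets(5)
profiler.disable()

# Sort by internal time (tottime) and print the top entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats(5)
report = buffer.getvalue()
print(report)
```

Alternatively, running a script under python -m cProfile -o profile_output writes the raw statistics to a file, which can then be browsed with python -m pstats.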
python -m pstats ./profile_output
Welcome to the profile statistics browser.
./profile_output% sort tottime
./profile_output% stats
Wed Jun 15 19:15:05 2022 ./profile_output
184181 function calls (182827 primitive calls) in 209.035 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
101 77.293 0.765 77.296 0.765 {built-in method _socket.getaddrinfo}
101 63.721 0.631 63.721 0.631 {method 'do_handshake' of '_ssl._SSLSocket' objects}
114 35.334 0.310 35.334 0.310 {method 'read' of '_ssl._SSLSocket' objects}
101 31.987 0.317 31.987 0.317 {method 'connect' of '_socket.socket' objects}
722 0.044 0.000 0.044 0.000 {built-in method __new__ of type object at 0x910fa0}
1747 0.025 0.000 0.090 0.000 tickets.py:147(_remove_tags)
1758 0.024 0.000 0.024 0.000 {method 'feed' of 'xml.etree.ElementTree.XMLParser' objects}
695 0.024 0.000 0.025 0.000 markdown.py:40(gen_line)
1758 0.024 0.000 0.049 0.000 /usr/lib/python3.9/xml/etree/ElementTree.py:1334(XML)
50 0.023 0.000 0.885 0.018 {method '_parse_whole' of 'xml.etree.ElementTree.XMLParser' objects}
101 0.014 0.000 0.026 0.000 /usr/lib/python3.9/ssl.py:1298(_real_close)
101 0.014 0.000 0.014 0.000 {method 'write' of '_ssl._SSLSocket' objects}
323 0.013 0.000 0.024 0.000 markdown.py:60(gen_table)
...
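This shows that nearly all of the 209 seconds is spent on networking. Name resolution, connection setup, and TLS handshakes for 101 separate connections to Trac account for about 173 seconds, and reading the responses accounts for about 35 more. The local parsing and Markdown generation are negligible in comparison.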
Visualizing the profile with SnakeViz shows that _parse_ticket_rss() and _parse_ticket_csv() cause the most overhead. The call to _parse_ticket_csv() for individual tickets is only useful because it fetches and parses the description of a ticket. We already get all the other metadata we need through the milestone query before delving into individual tickets, so we could save a substantial amount of time by eliminating the call to _parse_ticket_csv(). However, Trac doesn't allow fetching the description of a ticket with the filtering query described in the Operation section. Hence, I think the current code will suffice for now in terms of performance, and I will update it if there turns out to be a way to fetch all the data for a ticket at once in the preprocessing stage.
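For reference, the parallelism idea mentioned at the start of this investigation could be sketched with a thread pool, since the time is dominated by waiting on the network. This is not what RNG does, and the post does not adopt it. The names fetch_all and fake_fetch_one are hypothetical, and the fake fetch function stands in for a real per-ticket request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(ticket_ids, fetch_one, max_workers=8):
    """Fetch many tickets concurrently; return a dict keyed by ticket id."""
    ticket_ids = list(ticket_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so ids and results line up.
        results = pool.map(fetch_one, ticket_ids)
        return dict(zip(ticket_ids, results))


def fake_fetch_one(ticket_id):
    """Hypothetical stand-in for fetching one ticket over the network."""
    return {"id": ticket_id, "description": "description of " + ticket_id}


fetched = fetch_all(["1", "2", "3"], fake_fetch_one)
print(fetched["2"]["description"])
```

Whether this would help in practice depends on how many concurrent requests the Trac server tolerates, which is not measured here.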