Using GEGL's "Mono Mixer" with Python

Posted by Felix Schwarz on 22 March 2019

A song of praise for GEGL, gobject-introspection and Fedora

Background: red filter

This week I was finally able to tackle a problem that had been at the back of my mind for quite a while: one of my customers regularly has to extract text from scanned prescriptions issued under Germany's statutory health insurance. In principle this is a relatively simple OCR task.

However, the blank form already contains some text, such as the labels of the individual fields. The text added by the doctor or the pharmacy may well be printed on top of that existing text. To make things easier for the OCR, the blank form's text is printed in shades of red, while doctors and pharmacists print in black.

A red filter can therefore remove all predefined text of the form, so that the OCR only sees the actual doctor's or pharmacy's print. The result is an image that contains only shades of gray. So far this has been done directly by the customer's "hardware" (special-purpose scanners), which use well-tuned algorithms for this.

However, as always, there are a few drawbacks:

The red filter also discards information, which makes any manual rework that may be needed more difficult.
In some cases smaller scanners are used which have no red filter at all.
The most important point for me, however, is the inflexibility of the hardware filter: the doctor's print and the pharmacy's print can differ considerably because they come from two different printers. Sometimes one of the prints is barely visible (nearly exhausted ink ribbons or toner cartridges) while the other is strong and clear. Depending on the threshold, the red filter may remove the faint print completely. If the threshold is set too low, however, too much of the blank form's text remains. So I would like to apply different thresholds to specific areas of the color scan (possibly also depending on the OCR results).
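The idea of region-specific thresholds can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it works on an already grayscale image stored as nested lists, and the function name, the region format and the example values are all hypothetical. As in the text above, a higher threshold removes more (including faint print), a lower one keeps more.

```python
def binarize_regions(gray, regions, default_threshold=128):
    """Return a black/white copy of a grayscale image.

    gray:    list of rows, each a list of values 0 (black) .. 255 (white)
    regions: list of (x0, y0, x1, y1, threshold); x1/y1 are exclusive and
             later regions override earlier ones where they overlap

    A pixel is kept as print (0) if its darkness (255 - value) reaches the
    threshold which applies at that position, otherwise it becomes paper (255).
    """
    height, width = len(gray), len(gray[0])
    thresholds = [[default_threshold] * width for _ in range(height)]
    for x0, y0, x1, y1, threshold in regions:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                thresholds[y][x] = threshold
    return [
        [0 if (255 - gray[y][x]) >= thresholds[y][x] else 255
         for x in range(width)]
        for y in range(height)
    ]


# One row: strong print, faint print, paper, faint print (made-up values)
scan = [[40, 150, 240, 150]]
# Lower the threshold only for the right half, where the print is faint
result = binarize_regions(scan, [(2, 0, 4, 1, 80)])
print(result)  # [[0, 255, 255, 0]]
```

With the default threshold the faint print in the left half is lost, while the lower threshold in the right half keeps it. That is exactly the kind of per-area control a fixed hardware filter cannot offer.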

Years ago I wrote a primitive filter myself. It usually works "well enough", but it is very slow and hard to parametrize.

GEGL Mono Mixer

GEGL ("Generic Graphics Library") is an image processing library that was originally developed to provide graphics operations for GIMP. However, GEGL is in no way tied to GIMP and can be used completely independently of it.

In particular, GEGL provides an operation called Mono Mixer ("Monochrome channel mixer"). This operation is exactly the red filter I was looking for.
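To see why a monochrome channel mixer can act as a red filter, here is a minimal pure-Python sketch of the underlying idea: each gray value is a weighted sum of the red, green and blue channels. This is a simplification for illustration and not GEGL's actual implementation; the function name and the pixel values are made up.

```python
def mono_mix(pixel, red=1.0, green=0.0, blue=0.0):
    """Weighted sum of the RGB channels of one pixel, clamped to 0..255."""
    r, g, b = pixel
    value = red * r + green * g + blue * b
    return max(0, min(255, round(value)))


# Hypothetical 8-bit sample colors
red_ink = (220, 60, 50)      # text of the blank form
black_ink = (30, 30, 30)     # doctor's/pharmacy's print
paper = (245, 245, 240)

for name, pixel in [("red ink", red_ink), ("black ink", black_ink), ("paper", paper)]:
    # using only the red channel, red ink becomes almost as bright as paper
    print(name, mono_mix(pixel), mono_mix(pixel, red=4))
```

Red ink has a high red value, so if only the red channel is used it ends up close to the paper's brightness and effectively vanishes, while black ink stays dark. A larger red weight pushes light pixels into pure white even more aggressively.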

However, GEGL is mostly written in C, while the OCR is otherwise driven from Python. Fortunately, the great gobject-introspection library was developed within the GNOME ecosystem. It makes it possible to call suitably annotated C code from a wide range of languages such as Python or JavaScript without having to write special C bindings.

The biggest drawback, in my view, is that there is only little documentation for GEGL. So the following section may be interesting for others as well :-)

Code: red filter with GEGL and Python

#!/usr/bin/env python3
# Licensed under the Creative Commons Zero v1.0 Universal
# https://creativecommons.org/publicdomain/zero/1.0/
# SPDX-License-Identifier: CC0-1.0
import gi
gi.require_version('Gegl', '0.4')
from gi.repository import Gegl

Gegl.init()
# the parent node holds the whole processing graph
graph = Gegl.Node()

# source: load the color scan
gegl_img = graph.create_child('gegl:load')
gegl_img.set_property('path', 'input.jpg')

# filter: mix the RGB channels into a single gray channel
colorfilter = graph.create_child('gegl:mono-mixer')
colorfilter.set_property('preserve-luminosity', False)
colorfilter.set_property('red', 4)
colorfilter.set_property('green', 0)
colorfilter.set_property('blue', 0)
gegl_img.link(colorfilter)

# sink: write the filtered image
sink = graph.create_child('gegl:jpg-save')
sink.set_property('path', 'output.jpg')
colorfilter.link(sink)

# processing the sink pulls the image through the whole graph
sink.process()
# not calling "Gegl.exit()" to suppress some warnings, see also
# - GEGL issue 142: "GEGL warning und buffer leak after loading a file"
#     https://gitlab.gnome.org/GNOME/gegl/issues/142
# Gegl.exit()

Results

The color filter above (with adjusted RGB parameters) gives me excellent results. The new code is also about 4-5x faster and much easier to parametrize, so the recognition rate is better even for faint print.

I also want to praise Fedora once more at this point: all required packages are installed and ready to use with a simple dnf install gegl04 python3-gobject-base. Fedora 29 currently (March 2019) ships the fairly recent GEGL version 0.4.12 (released in October 2018). The latest upstream version is 0.4.14, which was released only three weeks ago (on 1 March 2019).

Ascendos – A New Hope?

Posted by Felix Schwarz on 24 July 2011

Last week I came across a new Open Source project which looks very promising to me (and at the same time got almost no media coverage yet). Therefore I decided to blog about this right now: Ascendos.

Ascendos will be a free rebuild of Red Hat Enterprise Linux (RHEL), technically very similar to CentOS.

I started to write a bit of technical background on what is so interesting about RHEL, but it became too long, so you'll find that in a separate blog post.

Why Ascendos is so exciting to me:

The CentOS project has severe management problems and there are no signs of even acknowledging the problem.
From the mission statement on the web site it looks like Ascendos will work in a way that I always wanted from CentOS.
At least one core member of Scientific Linux participates in Ascendos (even though on personal time) and Scientific Linux is generally more open about their "secret sauce".

So it looks like the time has finally come for a serious CentOS contender. Currently it's all vaporware but I'm hoping for the best.

The Downfall of CentOS

Posted by Felix Schwarz on 24 July 2011

What features made CentOS successful

As you likely know, CentOS is a free rebuild of Red Hat Enterprise Linux (RHEL). I wrote a post in praise of RHEL which pretty much explains why RHEL is great.

I don't think it'll surprise you that there are a couple of RHEL rebuilds:

These rebuilds ("clones" as they are sometimes named) generally take the source rpms published by Red Hat, remove Red Hat's branding (required by Red Hat legal) and recompile the packages. The end result is usually free to use.

However, even "just" recompiling is quite some work, even if it can be done by individuals with enough time and skill. CentOS gained a lot of users quickly and always had a couple (about 2-6) of core maintainers who did a lot of work. As a consequence, the creators of Whitebox Linux and Tao Linux (both were RHEL clones as well) recommended switching to CentOS when they discontinued their work.

CentOS always had the major advantage that they promised the same support period as RHEL and strived for maximum compatibility with RHEL. CentOS did not add additional packages or patches unless it was unavoidable (e.g. adding yum to update a system in CentOS 4 instead of using the Red Hat Network).

And last but not least, CentOS had a community behind it, so the project was present at all major Open Source conferences and trade fairs, which might have increased its popularity as well.

Chronicles of a Downfall

However, even with all that popularity, problems began to mount. CentOS' downfall did not start with a major catastrophe but was more of a slow but steady process. During that time the CentOS core team members failed to react appropriately, and so things got worse.

Dag Wieers leaving

In June 2009 Dag Wieers left the CentOS project because of his level of frustration with the CentOS development team. This in itself didn't have a big impact on the project. However it was a clear warning sign. When a highly skilled, well known contributor leaves a project not for the lack of time but because they are not satisfied with the project itself, it means that there is something seriously wrong.

Dag is an experienced RPM packager who has been providing additional RPMs for Fedora and CentOS for a long time (he's also one of the guys behind RPM Fusion). From personal experience I can say that he's a nice guy, easy to deal with and for sure someone with good 'social' skills.

Now while there were no changes in the project, some of the problems Dag was frustrated about became public just a month later…

Disrupted communication within the CentOS core team

CentOS was founded by Lance Davis in 2004. However Lance stopped contributing a few years afterwards. This resulted in an exceptional situation of an Open Letter to Lance Davis in July 2009.

The public learned that Lance was the sole owner of the CentOS.org domain as well as the CentOS Paypal account. For several years he did not even disclose to other core developers how much money was available and the actual project ("project" being the people doing marketing and engineering) did not benefit at all from monetary donations.

Lance answered neither email nor calls, even from the inner circle of CentOS developers. Tim Verhoeven even wrote: "We tried to contact Lance multiple times over a period of more then a year."

Ralph Angenent, another CentOS core member, wrote: "This means that the project depends on one person in too many ways. Add to that a person who doesn’t answer calls, isn't available as meetings, doesn't publish things he promised to do - we have a problem. And this is unacceptable. We as a project have to be more transparent. And this is one of the things blocking this."

As many users mentioned this is unacceptable and not only a failure of Lance but IMHO also a project failure because keeping quiet for long means that problems are just brushed under the carpet.

Dag Wieers' comment on the whole story is also an interesting read; he disclosed that Lance's position and absence were among the things which drove him to leave the CentOS project.

The end of the story was that Lance showed up in an IRC meeting two days later and the issues were resolved. However, at least to me as an outsider, transparency did not really get better…

Release delays and missing security updates

Naturally, RHEL clones can publish their updates only after Red Hat. For CentOS, regular updates are done in quite a timely manner. However, every few months Red Hat releases 'minor updates' (similar to Windows service packs) like 5.1, 5.2, … Many CentOS users pay a lot of attention to the time it takes CentOS to get their release out, as the update also contains security fixes (comparisons with release dates of Scientific Linux are misleading as the latter project continues to publish security fixes even when the full update is not yet released).

Wikipedia has the complete release history with the slack time relative to RHEL. Generally you can see that the slack time increases over time both for CentOS 4 and 5, which means CentOS became slower to follow up on Red Hat (the quick release of CentOS 4.9 is likely because Red Hat doesn't change much anymore for RHEL 4.x at this point in the release cycle).

This culminated in a delay of 85 days to release CentOS 5.6 which is almost 3 months without security updates for a big install base! How is that for an enterprise (CentOS) distribution?

Actually Karanbir Singh mentioned that critical/important security fixes will be published for CentOS 5.5 as updates but in that case he failed by his own standards:

Firefox and Thunderbird: CVE-2011-0051, CVE-2011-0053, CVE-2011-0053, CVE-2011-0055, CVE-2011-0056, CVE-2011-0057, CVE-2011-0058, CVE-2011-0061, CVE-2011-0062
libtiff: CVE-2011-0192, CVE-2011-0282
krb5: CVE-2011-0281
glibc: CVE-2011-0536
…

As you can see on the CentOS archive there were no updates after the release of RHEL 5.6 (13.01.2011). There was even a LWN article about that: CentOS 5, RHEL 5.6, and security updates.

Compared to that the 242 days of delay to publish CentOS 6.0 is not that significant because there were no existing CentOS 6 users waiting for security updates. Though the enormous time tells something about CentOS' capacities.

Lack of Collaboration

Red Hat publishes their source RPMs, which is really nice because it makes rebuilding RHEL possible. However, rebuilding is still not trivial because of Red Hat's packaging errors (missing build requirements, for example) and non-obvious build system requirements. Scientific Linux documented the rebuild problems they encountered. Also, rebuilding from scratch (and keeping compatibility with RHEL) requires following a specific, undocumented build order.

But despite that, core members just deny that there is any "secret sauce" (1, 2, 3, 4, …).

Questions about privately rebuilding RHEL are frowned upon on the CentOS devel mailing list (for example 1) or people are put off with superficial answers (1, 2).

In the end there is a lot of fear (of unofficial rebuilds) and of giving away their own position:

Red Hat did not tell me how to build it. (...) Why should I tell someone how to build a replacement OS to CentOS.

(Johnny Hughes). FUD tactics are also used as excuses not to publish things.

There is one statement which summarizes the philosophy pretty well:

Our goal is not a reproducable system so YOU can build software, it is for US to produce software. If you are looking for a distribution that teaches YOU to build things, get Gentoo or Linux From Scratch.

Uh, wait. I always thought that sharing knowledge and enabling people to do things on their own was at the very heart of the Free Software Movement? Well, this brings us to the next problem…

CentOS is only a community of users

Many people criticizing the CentOS core team because of their attitude towards the community don't understand how Karanbir and Johnny see the CentOS 'community'.

Karanbir, January 2011:

what do you think is your association with CentOS ? do you use it ? if so, you are a part of the community already. None ever said it was BUILT by a bunch of random driveby community members. Its built for a community and around a community.

Johnny:

Community because all the QA team, the forum moderators, the graphics team, the mailing lists and the wiki are all members of the community providing help. (…) CentOS is for the community ... it is not BUILT buy the community.

As you can see, even though CentOS is officially the 'Community Enterprise OS', the term 'community' only means a community of mere users.

"My way or the highway" attitude - failure to understand the essence of voluntary contributions

Karanbir Singh, November 2010:

Lots of people will argue that open source works in a way where people do what they want to do, so you cant tell them what needs doing - and they will do what they want, when they want. Its what many imagine is the 'fun' in the open source way. Fortunately, or unfortunately we dont have that luxury. What comes down the pipe needs to be addressed, sometimes its what we want to do - and sometimes its what needs doing because that's the issue on hand. The process we have in place is mostly finite, with a specified origin and a specified delivery expectation. We need to join those dots. And if people don't want to help with that joining-the-dots effort, they are never going to be a part of the process.

So when people imply that there are lots of potential-contributers who would want to get involved and help etc : What fantasy world are they looking at ? I, or one, would like to get in on that action please.

To me this post illustrates a severe misunderstanding of how voluntary contributions work. The first two sentences are mostly correct. However, the conclusions he draws from this ("if people don't want to help with that joining-the-dots effort, they are never going to be a part of the process.") are just shortsighted.

The essence of voluntary work is motivation. Potential contributors will have all kinds of motivation when they first try to help. It is crucial to allow them to work how they want initially. Over time some will stick with the project and develop a sense of duty, so they also care about non-fun stuff. Others will contribute only one small thing, but even the little things add up once they are shared.

However, if you set a process in stone and throw some more or less boring tasks over the fence without much explanation (as Karanbir did initially), it's very unlikely to attract new contributors. So in a way Karanbir is right: there are not many potential contributors – for his definition of a potential contributor as someone who is willing to do even boring tasks without much questioning.

This ties in "nicely" with an attitude that many people offering help want to do their own thing/don't care for "the project".

That attitude is absolutely killing. Getting new volunteers in a free software project is usually quite hard. The project members have to work hard (in a social sense, not technically!) to constantly attract new developers. Any obstacles in that process will stop people from contributing very quickly.

So it does not come as a surprise that very few attempts are being made to speed up the release process with the help of outsiders (1) and that there are still significant delays in the release process.

At the same time I find it interesting that Dag doubts that QA is the main bottleneck. Which in turn means getting more people involved in actually building CentOS might be a good thing…

Also CentOS chose a way of working which requires extra work to pass on knowledge about the build process. As Johnny wrote even in case of a missing build requirement they don't modify the source rpm.

While I certainly like that CentOS tries not to change the upstream sources, I think this is a clear exaggeration: in the end the missing package is used in the build one way or another. The alternative (adding the missing BuildRequires line which Red Hat should've added) is quicker (no need to change the build root/restart the build process), passes the knowledge on to others (as it is recorded in the source rpm) and looks just like the right approach to the problem. Of course filing bugs in the Red Hat Bugzilla should always be done.

Another limiting factor is that CentOS is very strict about not publishing anything before they are sure that all branding was removed (Karanbir's statement) unlike Scientific Linux which also publishes betas and the like.

A good example for the lack of understanding is a conversation between Douglas McClendon and Karanbir Singh where Karanbir does not accept any of Douglas' more significant proposals related to feedback and generally dismisses most of his ideas pretty strictly ("if you believe that, you are about as far away from CentOS as can be.", "No, that's a serious waste of time and effort.", "Whats so hard to understand about that ?").

Analysis of the failures

I think the current situation can be attributed to the three following project failures:

Showing the secret sauce only to core members: Details of the development process are completely opaque. How packages are rebuilt and which special handholding is required for which package is not documented because of an attitude of "this process works, we need help elsewhere". Many people and companies are interested in this (so they are potential contributors) but the project doesn't grab them.
No strategic community building: There is no strategic effort to recruit new team members constantly which is required to keep a project alive in the long run.
Lack of trust and fear of uncontrolled work: You have to complete predefined QA tasks or something similar before you can even start thinking about fixing (maybe very small) issues. You can't even work on your own and just submit the results because you lack information from the very beginning unless you are a CentOS team member.

Is this the End of CentOS?

Now even with all my ranting about the CentOS project I don't think it is dead already. Certainly some users switched while waiting for CentOS 5.6/6.0 (mostly to Scientific Linux, I think). The bad news got some media coverage which has some impact on CentOS' image, but the install base is still huge. There are still people representing CentOS at trade fairs and conferences. Karanbir Singh is still doing package rebuilding.

So the project is as alive as a year ago. Its weaknesses are just more obvious than before.

The most important thing keeping CentOS strong at the moment is that there is no real competitor right now. What is needed is a real community-based, open and transparent RHEL clone which does not add any features (like CentOS, even better if there is no such thing as "CentOS extras").

Even with such a contender CentOS will remain an important distribution as long as the key people still invest time to maintain CentOS. So currently it's a bit premature to talk about a 'downfall' of CentOS but there is for sure an eclipse…

Why you should consider RHEL+clones for your Linux machines

Posted by Felix Schwarz on 20 July 2011

If you're not familiar with the Red Hat Linux world, you might ask why you should bother at all looking at this enterprisy Linux. After all Ubuntu is for sure the most successful desktop Linux distribution which even has 'long-term' support (LTS releases). Alternatively Debian might be seen as a stable distribution without too much commercial influence.

Well, Red Hat Enterprise Linux beats these alternatives single-handedly in my use cases:

7 years of security support for every major version (4.x, 5.x, 6.x) for all packages
This is already a significant advantage over Ubuntu's LTS for me because Ubuntu's LTS does not really have "long-term" support:

Packages from the universe/multiverse repositories don't have any long-term support, sometimes they are more like some quickly created packages without a community of packagers maintaining them.
Desktop packages are only supported for three years, afterwards you're left alone.
And last but not least it's only 5 years support even for server packages.

I run a couple of Linux servers and even today (July 2011) I'm running 3-4 CentOS 4 boxes. CentOS 4 was initially released in 2005 and these servers have now been running for more than six years without problems.

Unlike with Debian installs, I did not have to upgrade my install (with the need to adapt configurations because of new major versions being incompatible with the previously installed versions). All I had to do was install the updates coming in through yum.

Red Hat invests a significant amount of time to update all the old components (RHEL 4 comes with Linux kernel 2.6.9) in order to minimize disruptions. This means I can plan very well when I do a major upgrade of my servers. Currently I'm planning to retire the old CentOS 4 boxes, but I had more than two years for the actual migration, which really helps!

Fedora as a Technology Preview
At the same time I can run Fedora on my desktop or on servers which really need the "latest and greatest" technology. As RHEL is based on (older) Fedora releases the administration is very much the same. I can create private RPM packages for Fedora and reuse them with minimal effort on RHEL/CentOS.

Fedora EPEL provides all other packages
Fedora EPEL is a Fedora subproject which provides more high-quality RPM packages for RHEL/CentOS in a transparent, reliable way. It is based on Fedora's standards, policies, and technologies so there are a lot of good packages which are even updated constantly. Usually it's quite easy to jump in and fix things and with co-maintainers it's not so much of a problem if an individual packager has no time anymore to update packages.

The high quality of these packages is also demonstrated by the fact that Red Hat often blesses EPEL packages as 'official' and distributes them as part of RHEL.

Red Hat/Fedora really take packaging seriously
It's a small point but comes in handy more often than most people think:

Each RPM has a changelog section where each packaging change is listed.
Usually packages have only the absolutely required dependencies with subpackages for specialized functionality which requires additional packages. On the other hand I found Debian sometimes to be too granular which makes it hard just installing some meaningful functionality.
Package installation/updating can be done completely unattended. Fedora's packaging policy mandates that there is no user interaction. This might be bad for a "Desktop installer"-style installation with initial configuration but it ensures that you can set up systems completely unattended.
Services are not restarted automatically after updates. I was burned once on Suse Enterprise Linux 9 when I just updated some packages while I waited for all users to log off so I could start the actual maintenance. Unfortunately several services were restarted automatically, which caused TCP connections to reset, which in turn caused data loss for some users.

Red Hat and Fedora take Free Software and community seriously
There is very little "secret sauce", most of the stuff is done in the open. I don't remember Red Hat dictating big changes; new features are introduced through Fedora and people with enough merits can influence these decisions. Most (all?) software from Red Hat is released under a free software license.

Also Red Hat provides source RPMs publicly and does not make any attempts to prevent free "clones" like CentOS and Scientific Linux: merging all kernel patches into one big patch is not a GPL violation and does not affect the community clones but rather Oracle as a commercial free-rider.

So that's why Red Hat earns more money every year even if they 'just' publish a Linux distribution.

MediaCore – an extensible video platform

Posted by Quentin Theuret on 6 March 2011

Over the last five years, video on the Internet has become more and more popular. Obviously, YouTube holds a big share of this market and offers users a lot of features. However, many companies, associations and organizations want to host their videos on a dedicated platform. The reasons I hear most often are:

The pages must integrate with the company's main site (styles, integration)
Showing ads to earn money
More flexibility: the ability to add custom features anywhere

This is an unofficial translation of my English blog post, kindly contributed by Quentin Theuret.

About MediaCore

MediaCore is an open source video and podcast hosting software, first released in January 2010. The distribution package already contains many interesting features:

Embedding videos from YouTube, Vimeo or Google Video
Podcast support
A sophisticated storage layer (e.g. hosting media in Amazon's S3 cloud)
Social network integration (sharing on Facebook, Twitter, etc.)

Most of the development is done by Simple Station, a Canadian design studio. However, development happens in the open, with a GitHub repository, a forum and a mailing list for developers.

One thing you should keep in mind about MediaCore is that, as of today, it is not a "turnkey solution for everyone". Some commonly requested features are not implemented, such as:

automatic video encoding (e.g. transcoding uploaded videos into a format suitable for web viewing)
a permission system, i.e. restricting who may upload and defining which (paying?) users can watch a video
support for in-video advertising

Also, installation on cheap web space without shell access is somewhat difficult, sometimes even impossible. Most of these problems will be fixed by plugins after the release of MediaCore 0.9 (mid-March 2011) or will be addressed in future versions. They are simply not there today.

Customization

Until now, customizing MediaCore always required changes to the source code, which made updates harder because you had to re-apply your modifications to every new version. With the arrival of version 0.9, MediaCore supports plugins, so you can do a lot of customization without modifying MediaCore's source code. Examples of what plugins can do:

Change the overall theme arbitrarily by overriding templates
Add new players
Trigger custom code on certain events (e.g. "new video uploaded")

With MediaCore 0.9 you can develop your own modifications as plugins, which makes it easier to update the MediaCore core without having to touch your code. See the next section for more technical details.

Plugins for MediaCore

Currently (6 March 2011) there is no public documentation on how to write a MediaCore plugin, so I decided to share my experience with you. However, this is not a complete guide for newcomers. I assume that you are familiar with the basics of Python.

Generally, plugins are developed using Python (for the server-side code), Genshi (for the templates) and the usual web technologies (HTML, CSS, JavaScript) for the user interface. For the Python side you may also want to read the Pylons home page, as Pylons is the framework MediaCore is built on. There is also The Definitive Guide to Pylons (online version) by James Gardner.

Structure

First, have a look at the structure of a plugin (you can download this skeleton):

"sample" is the name of your Python module; you can change it to any valid Python name.
"mediacore_plugin.py" is an arbitrary name, but I use it as a central entry point where I import all my code.
If your plugin requires static resources such as JavaScript, style sheets or images, you need a directory named "public" inside "sample" (or whatever you named your module). Obviously the static resources can have arbitrary names; I simply added a CSS file called "sample-style.css".
If you want to add new pages to your MediaCore, create a directory "templates" inside "sample" (or whatever you named your module). Put your Genshi templates in there, for example "my-page.html", which is just a sample.

Every directory of your plugin must contain an (empty) __init__.py file (this is due to the behavior of the Python tools setuptools/zipimport). In the root folder you must have a "setup.py" file.
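The layout described above can be recreated with a small script. This is only a sketch based on my reading of the description: the file names come from the text, all files are created empty, and the helper name create_skeleton is made up.

```python
from pathlib import Path
import tempfile

# File names as described in the text; the contents are left empty here.
SKELETON = [
    "setup.py",
    "sample/__init__.py",
    "sample/mediacore_plugin.py",
    "sample/public/__init__.py",
    "sample/public/sample-style.css",
    "sample/templates/__init__.py",
    "sample/templates/my-page.html",
]


def create_skeleton(root):
    """Create the (empty) plugin files below root and return their relative paths."""
    root = Path(root)
    for relative in SKELETON:
        path = root / relative
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return sorted(p.relative_to(root).as_posix()
                  for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # write into a throwaway directory and list what was created
    with tempfile.TemporaryDirectory() as tmp:
        for name in create_skeleton(tmp):
            print(name)
```

Pass a real target directory instead of the temporary one to get a starting point for your own plugin, then rename "sample" as needed.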

Example: adding a new page

To add a new page you need a controller, a specific "exposed" method in the controller, and the template. Have a look at this example:

# File: sample/mediacore_plugin.py
# ------------------------------------------------------------
# -*- coding: UTF-8 -*-
from mediacore.lib.base import BaseController
from mediacore.lib.decorators import expose
class MyController(BaseController):
    @expose('sample/my_page.html')
    def my_page(self, tag_name, **kwargs):
        # do your backend work here
        # …
        return {'tag_name': tag_name}
# ------------------------------------------------------------
__controller__ = MyController

If you use this example (and fill in the missing pieces) in your plugin, your new page will be available at http://<mediacore>/sample/my_page?tag_name=Something. Note that the "sample" part of your URL depends on your settings in setup.py (see the section on installation below).

In your template you can reference your static resources using ${h.url_for('/sample/public/<filename>')}.

If I remember correctly, each plugin can have only one controller. However, a controller can have several exposed methods.

Example: adding a new player

MediaCore supports different types of players, most notably HTML5 players (using the <video> tag and JavaScript) and Flash players. The two important classes to look at are mediacore.lib.players.AbstractHTML5Player and mediacore.lib.players.AbstractFlashPlayer.

I will show you an example of how to build a custom player based on FlowPlayer:

# File: sample/player.py
from mediacore.lib.players import FlowPlayer
class MyPlayer(FlowPlayer):
    name = u'my_player'
    display_name = u'Custom Player based on Flowplayer'
    # you could override flashvars(), for example, to add some player options

Next you have to register your class by adding this to your mediacore_plugin.py:

# …
from mediacore.lib.players import AbstractFlashPlayer
from sample.player import MyPlayer
AbstractFlashPlayer.register(MyPlayer)
# …

If you installed your extension (see "Installing your extension" below) after initializing MediaCore's main database (paster setup-app development.ini), you have to register your player via SQL:

INSERT INTO players (name, enabled, priority, created_on, modified_on, data) SELECT 'my_player', false, COALESCE(MAX(priority), 0) + 1, now(), now(), '{}' FROM players;

Now your player should be visible in the administration settings (section "Players"); you simply have to enable it. All videos will then use this player (if possible).
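A note on the priority value: SQL databases generally do not accept an aggregate such as max(priority) inside a VALUES list, so "highest priority plus one" has to come from a SELECT. The following sketch shows the idea with Python's built-in sqlite3 module. The table is a simplified, hypothetical stand-in for MediaCore's real players table (no timestamp columns), and register_player is a made-up helper name.

```python
import sqlite3


def register_player(conn, name):
    """Insert a disabled player whose priority is one above the current maximum."""
    conn.execute(
        "INSERT INTO players (name, enabled, priority, data) "
        "SELECT ?, 0, COALESCE(MAX(priority), 0) + 1, '{}' FROM players",
        (name,),
    )


conn = sqlite3.connect(':memory:')
# Simplified stand-in schema, for illustration only.
conn.execute(
    "CREATE TABLE players (name TEXT, enabled INTEGER, priority INTEGER, data TEXT)"
)
register_player(conn, 'first_player')
register_player(conn, 'my_player')
rows = conn.execute("SELECT name, priority FROM players ORDER BY priority").fetchall()
print(rows)
```

COALESCE covers the case of an empty table, where MAX(priority) is NULL.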

Installing your extension

MediaCore uses setuptools to find extensions. All extensions found are enabled automatically; there is no way to disable an extension (for now).

That is why your extension needs a setup.py file which looks (roughly) like this:

#!/usr/bin/env python
from setuptools import setup, find_packages
setup(
      name='SamplePlugin',
      version='1.0',
      author='Felix Schwarz',
      author_email='info@schwarz.eu',
      license='GPL v3 or later',
      packages=find_packages(),
      entry_points = {
          'mediacore.plugin': [
              'sample = sample.mediacore_plugin',
          ],
      }
)

Besides some meta information, the interesting part is the 'mediacore.plugin' entry point:

The first part ('sample = ...') defines the URL prefix under which your new pages/resources are found (http://<mediacore>/sample/…).
In addition, this is the line that tells MediaCore where to look for the extension (sample/mediacore_plugin.py). This file is executed when MediaCore loads the extension, and you can use the __controller__ attribute there to add new pages.
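To make the two roles of that entry point line more tangible, here is a small sketch which splits the line by hand. It only illustrates the naming convention described above; it is not how setuptools or MediaCore actually process entry points, and parse_entry_point is a hypothetical helper.

```python
def parse_entry_point(spec):
    """Split 'name = package.module' into the name and the module path."""
    name, _, module_path = spec.partition('=')
    return name.strip(), module_path.strip()


url_prefix, module_path = parse_entry_point('sample = sample.mediacore_plugin')
# The name on the left becomes the URL prefix, the dotted path on the right
# points to the file that is executed when the extension is loaded.
plugin_file = module_path.replace('.', '/') + '.py'
print('/%s/' % url_prefix, plugin_file)
```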

Before you can use your new extension, you need to install it (don't forget to activate your virtualenv first!):

python setup.py install

For development you will probably want to use 'develop' instead of 'install', so that you don't have to reinstall your extension every time you change the code.

This is my rough overview of MediaCore extensions. I only skipped one topic: events. That is because I think the API's design makes them practically useless.

I hope you enjoyed this write-up. If you find any errors, please contact me by email. Also remember that you can download the basic skeleton.

My contributions and customizations

I have contributed a few patches to MediaCore, for example:

Basic internationalization (i18n) infrastructure and a large part of the German translation
Python 2.6 support
Better diagnostics for extension problems

These patches were created while I was adapting MediaCore to my clients' needs.

Examples of commercial customizations

Although I don't share the customizations I made for my clients, I would like to present two examples to give you an idea of what has been done:

A German NGO launched a campaign during the 2010 Football World Cup in South Africa. The goal was to have a web platform where users (schools in particular) could upload videos and images. This required a German-language site and the ability to manage images in a MediaCore instance.

A commercial portal hosts videos with MediaCore, but MediaCore did not allow displaying ads in the videos using OpenX. I developed a custom MediaCore player which displays a list of ads. I also developed a video carousel which can be embedded in other sites (including the ads). All of this functionality is contained in extensions, so MediaCore can still be updated easily.

Do you have questions?

As always, if you have questions about MediaCore, please have a look at the user forum. If you have questions about commercial support, remote installation or customizations, just send me an email at info@schwarz.eu (English or German only).

Credits

This translation was kindly contributed by Quentin Theuret (www.quentin-theuret.net). Thank you very much, Quentin.

Free Software is no silver bullet for niche markets

Posted by Felix Schwarz on 25 Dec. 2008

Many people think of solid applications that cause little trouble when they talk about Free Software. Most Free Software is given away for free, everyone can extend it as (s)he likes (presuming the necessary knowledge) and there is no marketing department which pushes for premature 'final' releases. Administrators often like that Free Software does not require any license management (if you just use it) and that bug reports are openly shared because of the absence of company politics which may try to maintain a zero-defect illusion even if the product is buggy as hell.

But the article's headline is 'Free Software is no silver bullet for niche markets' so you probably already expect that I will give you some counter-examples to the clichés above.

Pipe dreams meet real life

So today's example is Lx-Office (Wikipedia, sorry no English version available). Lx-Office is a web-based ERP and CRM system written in Perl and PHP and distributed under the GNU General Public License. It includes a module for financial accounting in companies. The ERP/accounting module of Lx-Office originated in SQL-Ledger (Wikipedia). After the fork from SQL-Ledger in 2003, Lx-Office was modified to meet German regulatory requirements.

In Germany, the majority of all tax advisors (approximately 80% according to Financial Times Germany) use software produced by a German cooperative named DATEV (Wikipedia). This software is really the heart of a tax advisor's services: different versions support almost everything from financial accounting and tax calculation to payroll accounting.

So you can imagine that for serious accounting software, data exchange with tax advisors is really a must-have. Though DATEV offers only proprietary software (which only runs on MS Windows), they sell an SDK which contains the complete specification of the KNE format ("Postversandformat"). With that format you can share accounting information with your tax advisor (e.g. the routine accounting is done in-house but the tax advisor assembles the balance sheet at the end of the year).

The KNE format is an 'old-style' binary file format with a long track record (if you include the predecessor, the OBE format which was introduced in the late '80s). When I wrote libkne, a pure Python library to create and parse these binary KNE files, I tested my implementation against several real-world implementations. However, when I tried the DATEV export on Lx-Office's demo server I was very surprised to see that the exported files were not valid according to the specification!

Export bug known for several years

It turned out that the problem is caused by differing date string formats. The export module DATEV.pm transforms a string which is presumed to contain a date ('25.12.2008', which is 2008-12-25 in ISO notation) into the KNE binary date format ('251208' - yes, the KNE format is not y2k-safe). As soon as you use a different date format (like '12/25/2008'), the export just puts in the date string '12/25/2008', completely ignoring the size restriction (6 bytes) and the format specification ('DDMMYY').

What really puzzled me was that this bug has been known for 2 1/2 years and nothing has happened so far. In June 2006 there was a post in the user forum which explained the problem and even spotted the bug in the source code. Only thirty minutes later, another forum reader filed a bug report with a reference to the post. Six months later, a developer requested more information from the submitter (which looks stupid to me, as all necessary information was given) but no further action was taken until now (December 2008). That means that the DATEV export of Lx-Office has been non-functional for 2 1/2 years if you don't use one specific date format!

Looking at the source

As always, we can have a look at the source to reveal the issue:

# Code taken from SL/DATEV.pm (r3482)
# https://lx-office.linet-services.de/svn/trunk/unstable/SL/DATEV.pm
# Licensed under the GPL v2+
sub datetofour {
  $main::lxdebug->enter_sub();
  my ($date, $six) = @_;
  ($day, $month, $year) = split(/\./, $date);
  if ($day =~ /^0/) {
    $day = substr($day, 1, 1);
  }
  if (length($month) < 2) {
    $month = "0" . $month;
  }
  if (length($year) > 2) {
    $year = substr($year, -2, 2);
  }
  if ($six) {
    $date = $day . $month . $year;
  } else {
    $date = $day . $month;
  }
  $main::lxdebug->leave_sub();
  return $date;
}

So the datetofour function takes a date string $date and a boolean flag $six and returns the KNE date string. The problem is that the function assumes one fixed date format in the split call near the top of the snippet (line 481 in the original file as of r3482): 'split(/\./, $date)'. But the date format can be configured by the user, so the function should query the configuration for the format to use.
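To illustrate what "query the configuration for the format" could look like, here is a minimal sketch in Python (not Perl, and not the Lx-Office code). The format names in DATE_FORMATS and the function to_kne_date are assumptions made up for this example. The sketch always emits a zero-padded day, as the 'DDMMYY' specification suggests, which is not exactly what the Perl function above does with leading zeros.

```python
from datetime import datetime

# Hypothetical mapping from a user-configured display format to a
# strptime pattern; the keys are invented for this example.
DATE_FORMATS = {
    'dd.mm.yyyy': '%d.%m.%Y',
    'mm/dd/yyyy': '%m/%d/%Y',
    'yyyy-mm-dd': '%Y-%m-%d',
}


def to_kne_date(date_string, configured_format, six=True):
    """Parse date_string according to the configured format and
    return it as 'DDMMYY' (or 'DDMM' if six is false)."""
    parsed = datetime.strptime(date_string, DATE_FORMATS[configured_format])
    return parsed.strftime('%d%m%y' if six else '%d%m')


print(to_kne_date('25.12.2008', 'dd.mm.yyyy'))
print(to_kne_date('12/25/2008', 'mm/dd/yyyy'))
```

Because the string is parsed into a real date first, a string which does not match the configured format raises a ValueError instead of silently ending up in the export file.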

Conclusions

I presented one example of a Free Software application which has a serious known bug that has not been fixed for more than 2 1/2 years (though it is probably only a matter of one or two lines). The project itself is backed by two small companies and the software is used by quite a lot of people (Sourceforge counted 4468 downloads for Lx-Office ERP beta1 between 2008-08-12 and 2008-12-25). The project follows a traditional 'open source' development approach (public bug tracker and source control repository, community support through web forums) and uses a well-known license (GPL 2+). So this is clearly an example of a 'real' Free Software project.

Nevertheless the project failed to fix a simple bug, so users cannot use functionality which is quite important in many enterprise use cases if you consider the extensive German regulatory policies. Obviously this functionality is not important enough for most users of Lx-Office. Probably (speculation!) many of them are in the 'low budget' segment which does not regularly exchange financial data with their tax advisors.

Admittedly financial accounting is really a niche use case (highly dependent on the country you're operating in, laws and policies change on a regular basis) so the number of users and developers is quite small (compared to other software like Linux, Mozilla Firefox or OpenOffice). But it still gives evidence for my initial thesis that Free Software is not necessarily a solution for you. While Free Software in very common use cases can outperform its proprietary competitors (e.g. Apache HTTP), you should evaluate Free Software in niche markets very carefully.

Brief remark: 'Batteries included' is the right approach for class libraries

While I described above which code caused the bug on a technical level, the real (technical) problem for me lies deeper: In Perl 5 there is no built-in data type to represent dates. So the Lx-Office developers had to fall back on using strings to pass dates around. If Perl shipped a module for date representation and manipulation, the bug would probably never have occurred because things like date formatting would only influence the visual presentation.

So I think that on a class library level Python's approach of batteries included is the right one. A decent(tm) programming language needs to ship modules for date manipulation (even a less-than-optimal API like in Python is better than a completely missing one). Just having add-on modules for that is not sufficient because not all developers use them and it's hard to use multiple independently developed third-party modules with each other because they probably will use incompatible date representations...

How to make input validation really complicated

Posted by Felix Schwarz on 29 Nov. 2008

Thanks to Vera Djuraskovic there is a Serbo-Croatian translation of this article available.

In every web application you need to validate your input data very carefully. Data validation is a very common task, so it's no surprise that there are several validation libraries in Python (like formencode) which include validators for common tasks. However, Trac does not integrate any of these libraries so every plugin developer has to write their own validation code.

Now look at how complicated it can get to check whether a string contains only the characters a-z, digits, hyphens ('-') and underscores ('_'):

# Only alphanumeric characters (and [-_]) allowed for custom fieldname
# Note: This is not pretty, but it works... Anyone have an easier way of checking ???
f_name, f_type = customfield[Key.NAME], customfield[Key.TYPE]
match = re.search("[a-z0-9-_]+", f_name)
length_matched_name = 0
if match != None:
    match_span = match.span()
    length_matched_name = match_span[1] - match_span[0]
namelen = len(f_name)
if (length_matched_name != namelen):
    raise TracError("Only alphanumeric characters allowed for custom field name (a-z or 0-9 or -_).")

Please note how deep the author dug into Python's re library to find the span() method. He first looks for an acceptable substring, computes the position of this substring, derives the length of the substring from that and checks whether it equals the length of the whole string.

At least the author had some doubts whether his solution was the most elegant one (see the second line of the snippet above). A simpler method of checking could be:

# Only alphanumeric characters (and [-_]) allowed for custom fieldname
f_name, f_type = customfield[Key.NAME], customfield[Key.TYPE]
match = re.search("([a-z0-9-_]+)", f_name)
if (match == None) or (match.group(1) != f_name):
    raise TracError("Only alphanumeric characters allowed for custom field name (a-z or 0-9 or -_).")

So now we got rid of all the index stuff. But still, you can make it much simpler than that by properly using regular expressions:

# Only alphanumeric characters (and [-_]) allowed for custom fieldname
f_name, f_type = customfield[Key.NAME], customfield[Key.TYPE]
if re.search('^[a-z0-9-_]+$', f_name) == None:
    raise TracError("Only alphanumeric characters allowed for custom field name (a-z or 0-9 or -_).")

Of course there is re.match but I try to avoid it due to personal issues with the method - I produced some bugs when using re.match previously.
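For readers on a current Python: re.fullmatch (available since Python 3.4, so it did not exist when this was written) expresses "the whole string must match" directly. The sketch below is a standalone illustration, not code from the plugin; is_valid_field_name is a hypothetical helper. One difference from the '^...$' variant is worth knowing: '$' also matches just before a trailing newline, so 'abc\n' would pass there but is rejected here.

```python
import re

# The hyphen is placed last in the character class so that it is
# unambiguously a literal character.
FIELD_NAME_PATTERN = re.compile(r'[a-z0-9_-]+')


def is_valid_field_name(name):
    """Return True if name consists only of a-z, 0-9, '-' and '_'."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None


for candidate in ('my_field-1', 'My Field', '', 'abc\n'):
    print(repr(candidate), is_valid_field_name(candidate))
```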

Wondering which software ships this mess? This code is part of a plugin which helps you to manage custom fields for your Trac tickets (CustomFieldAdminPlugin). The overall code quality of the module is also quite poor, so if you'd like to spend some time learning refactoring, this is a good place to start.

TurboMail 3.0 Alpha1 released

Posted by Felix Schwarz on 16 Nov. 2008

For quite a while no new versions of TurboMail were released and indeed TurboMail 2 just worked well for many users. But now there is an alpha release of the upcoming TurboMail 3.0 which is nearly a complete rewrite of TurboMail 2. With the new code structure we were able to add some nice features:

No strong coupling to TurboGears anymore (use TurboMail in command line scripts if you like!)
Different delivery methods like SMTP, local mailbox etc.
Different delivery strategies (queuing and immediate delivery)

To get more people involved, we decided to release an alpha version. That means the most important functionality is already in, but some things are still missing. Furthermore there are some small known bugs (sending pre-formatted messages won't work) but we can work on that later. Besides that, I have been using this alpha (with some custom patches) on production servers for several months and it works really great so far.

Download

Source
Python 2.3 (not tested yet!)
Python 2.4
Python 2.5
Python 2.6: sorry, no such version available here.

Documentation

I tried to build comprehensive documentation for every feature. This is still work in progress, but at least some docs are there:

HTML
PDF

If you have questions, patches or just want to rant about the alpha, please use the mailing list. The roadmap page shows the missing tasks. I'm currently writing a Python MTA to get real SMTP tests in again.

Migrating data from MS Access to MySQL easily

Posted by Felix Schwarz on 22 June 2008

MySQL AB: If we are too slow, your bug must be invalid

Most people will know the MySQL database. MySQL AB (recently bought by Sun) is the company behind that open source database and pays most developers working on the code.

Some years ago MySQL AB released a really nice tool which they named MySQL Migration Toolkit. It's a Java GUI application which can be used to copy data from other databases (like Microsoft Access) to a MySQL database. The tool even generates a lua script file so that you can automate the migration process afterwards.

In a recent project I was hired to build a replacement for a custom Access application. Part of the work was to create a new database schema. Of course the old data must be migrated so that it isn't lost. The chosen production database was PostgreSQL due to its solid track record when it comes to SQL features like transactions, referential integrity and all this boring stuff.

I figured it would be much easier for me to first convert the data from the old database to an open source database and do the structure conversion in a second step with custom tools.

This was when the Migration Toolkit came into play: It copied the data from Access to MySQL and I wrote some scripts to get that data into Postgres. It worked really great after a very short time and everyone was happy. Sure there were some initial problems (like Migration Toolkit won't work with Java 1.6 but only 1.5) but these were solved quickly.

Some weeks later I discovered that some data in my new database was different from the Access data: In Access the value was 0.075, MySQL stored 0.08. This happened for all columns which were specified as "Single" with automatic precision in MS Access. The MySQL bugtracker revealed that others had the same problem: Access singles are converted to MySQL 7,2 double.

Unfortunately this bug was marked as 'invalid' when I found it two years after its creation (because the MySQL developer investigating it did not read the bug description correctly so he was not able to reproduce it). Bugs are far from unusual and so I was quite confident that I could jump in and help fix it. However, the bug was closed and therefore I was sure that no MySQL developer would ever look at it again, and their bug tracker would not let me reopen the bug as a normal user.

I pinged another MySQL dev directly (Susanne Ebrecht) and she reopened the bug :-) I followed her advice and tried to reproduce the problem with newer versions of MS Access. Clearly, the bug was still there...

Unfortunately, no MySQL developer had enough time to look at the report again and so an automatic process closed the bug again:

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Maybe I'm just too stupid but I was not able to re-open it. Maybe only the original reporter can. But the main problem is the crappy configuration of the MySQL bug tracker, which closes bugs marked as 'need more info' after 30 days, which is quite short. Of course you need some sort of 'timeout' in a bug tracker because otherwise you get swamped with useless bug reports which no one is able to reproduce and which stay open forever.

Other bug trackers implement that better: If a bug has the NEEDINFO status in Bugzilla there is a checkbox "I am providing the requested information for this bug." and afterwards the status is being set back to normal.

The problem with MySQL was that I provided all necessary information and the bug was closed only because MySQL AB worked too slowly!

This is an example of how not to handle bug reports! If you don't have enough time to triage all bugs in a short period of time, then don't set the timeout to 30 days!

Which left me alone with the rounding problem... Luckily I was able to use some kind of "dirty hack" to work around the whole issue: The problem is that the column types in the created MySQL database schema are too imprecise. The rounding itself does not occur until the real data is copied, which happens at a later stage.

So I extended the lua migration script to call a Python script of mine. This script just changes all columns with the type 'double(7,2)' to 'DECIMAL(7,4)'. And this fixed the rounding errors for me. Of course it would be nicer if a proper fix was included in the Migration Toolkit itself but this seems to be out of reach...
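Why the extra decimal places help can be shown with Python's decimal module. This is only a sketch of the effect of the column scale: it does not reproduce how MySQL rounds binary floating point values internally, and store_with_scale is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP


def store_with_scale(value, scale):
    """Round a decimal string to `scale` fractional digits, similar to
    what a column with a fixed number of decimal places has to do."""
    quantum = Decimal(10) ** -scale  # scale=2 gives Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


print(store_with_scale('0.075', 2))  # two decimal places, the third digit is lost
print(store_with_scale('0.075', 4))  # four decimal places keep the value intact
```

With two decimal places the value from Access becomes 0.08, which is exactly the difference I noticed; with four decimal places it survives as 0.0750.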

Maybe my workaround can save you some trouble or time so I put my dirty hack online. Help yourself!

My lua adaptation:

(...)
  -- create database objects online
  print("Create database objects.")
  grtM.callFunction(
    grtV.toLua(targetConn.modules.TransformationModule), "executeSqlStatements",
    {targetConn, "global::/migration/targetCatalog", "global::/migration/creationLog"}
  )
  grt.exitOnError("The SQL create script cannot be executed.")
end
-- added this line to execute my Python script
os.execute("C:\\Python24\\python.exe \"c:\\migration\\fix_access_single_migration.py\"")
-- ------------------------------------------------------------------
-- checkpoint 5
-- Bulk data transfer
(...)

And this is my Python script (fix_access_single_migration.py):

#!/usr/bin/env python
# License: MIT license
# Copyright (c) 2008 Felix Schwarz <felix schwarz AT oss schwarz eu>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import MySQLdb
hostname = 'localhost'
user = 'root'
password = ''
db = ''
def get_all_tables(cursor):
    tables = []
    cursor.execute('show tables')
    rows = cursor.fetchall()
    for row in rows:
        tables.append(row[0])
    return tables
def find_double_fields(cursor, tables):
    problematic_fields = []
    for table in tables:
        cursor.execute(u'show fields from `%s`' % table)
        rows = cursor.fetchall()
        for row in rows:
            col_name = row[0]
            col_type = row[1]
            if col_type == u"double(7,2)":
                problematic_fields.append({"table": table, "col_name": col_name})
    return problematic_fields
def fix_fields(cursor, problems):
    for problem in problems:
        table = problem["table"]
        col_name = problem["col_name"]
        # Widen the column so that four decimal places survive the data copy.
        sql_template = u'ALTER TABLE `%s` CHANGE COLUMN `%s` `%s` DECIMAL(7,4)'
        cursor.execute(sql_template % (table, col_name, col_name))
conn = MySQLdb.connect(host=hostname, user=user,
                       passwd=password, db=db,
                       use_unicode=True, charset="cp1250")
conn.begin()
cursor = conn.cursor()
tables = get_all_tables(cursor)
problems = find_double_fields(cursor, tables)
fix_fields(cursor, problems)
conn.commit()
conn.close()