ADVICE: running Gallery2 (g2) in a non-UTF-8 environment

nepto

Joined: 2004-07-06
Posts: 22
Posted: Sun, 2005-01-09 18:15

Gallery2 is cutting-edge technology. It mostly uses current, recommended technologies, such as built-in UTF-8 support.

However, someone may want to run Gallery2 integrated into a non-UTF-8 website. In that case, some methods need to be altered. See the following patch:

--- GalleryUtilities.class.ori  2005-01-09 19:39:58.000000000 +0100
+++ GalleryUtilities.class      2005-01-09 19:39:53.000000000 +0100
@@ -856,7 +856,9 @@
             */
 
            /* Convert UTF-8 to Unicode entities */
-           $value = GalleryUtilities::utf8ToUnicodeEntities($value);
+// Disabled by Nepto [2005-01-09]
+// Also two htmlentities() calls below were changed to htmlspecialchars()
+//         $value = GalleryUtilities::utf8ToUnicodeEntities($value);
            /* Sanitize the rest of the contents
             */
            if ($value) {
@@ -885,12 +887,12 @@
                if (strlen($chunk)==1) {
                    $rawText .= $chunk;
                } else {
-                   $cookedText .= htmlentities($rawText, ENT_QUOTES);
+                   $cookedText .= htmlspecialchars($rawText, ENT_QUOTES);
                    $cookedText .= $chunk;
                    $rawText = '';
                }
            }
-           $cookedText .= htmlentities($rawText, ENT_QUOTES);
+           $cookedText .= htmlspecialchars($rawText, ENT_QUOTES);
        } else {
            return $text;
        }

It is easy to see why the utf8ToUnicodeEntities() call is disabled. The htmlentities() calls were also replaced with htmlspecialchars(), because my gallery runs in ISO-8859-2 encoding, which is not supported by htmlentities(). For ISO-8859-1 encoding this second change could probably be skipped.
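To illustrate the difference between the two functions, here is a minimal sketch. It is not part of the patch. The charset argument 'ISO-8859-1' is passed explicitly as an assumption, to mimic the default behaviour of PHP at the time; current PHP versions default to UTF-8.

```php
<?php
// "\xB9" is the single byte for "š" in ISO-8859-2; the same byte is "¹" in ISO-8859-1.
$latin2Text = "\xB9 <b>";

// htmlentities() reads the byte as ISO-8859-1 and rewrites it as the wrong character.
$viaEntities = htmlentities($latin2Text, ENT_QUOTES, 'ISO-8859-1');

// htmlspecialchars() escapes only & < > " ' and passes every other byte through.
$viaSpecialChars = htmlspecialchars($latin2Text, ENT_QUOTES, 'ISO-8859-1');

echo $viaEntities, "\n";                             // &sup1; &lt;b&gt;
var_dump($viaSpecialChars === "\xB9 &lt;b&gt;");     // bool(true)
```

With htmlspecialchars() the ISO-8859-2 byte reaches the page unchanged, so a page served as ISO-8859-2 still shows "š".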

I run this on http://nepto.sk/gallery2/ and it works well.

Hope this helps someone...

Nepto

 
floridave

Joined: 2003-12-22
Posts: 27300
Posted: Sun, 2005-01-09 19:16

Moved to G2 Support

 
bharat

Joined: 2002-05-21
Posts: 7994
Posted: Mon, 2005-01-10 05:02

That makes sense to me. I'm not sure whether we want to go in this direction or not, though. We're kind of hoping that we can start pushing the standard towards UTF8 entirely by not supporting anything else except for this :-) I'll hang onto your patch, though...

 
baschny

Joined: 2003-01-04
Posts: 328
Posted: Thu, 2005-01-13 09:16

nepto, makes sense to me too.

Some thoughts:

First, I think we don't need htmlentities() anyway. We use htmlentities() to sanitize user input (as shown above), but it defaults to encoding ISO-8859-1 characters into HTML entities, which is not really what we have (we have UTF-8 input). But since we ran utf8ToUnicodeEntities() before, only straight ASCII is left when we reach htmlentities(), so replacing it with htmlspecialchars() seems the most logical consequence and shouldn't break anything (as in nepto's patch).
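A quick way to check the ASCII claim, as a sketch rather than anything from the Gallery2 codebase: feed every 7-bit ASCII byte through both functions and compare. The explicit 'ISO-8859-1' argument is an assumption that stands in for the old PHP default.

```php
<?php
// Build a string containing every 7-bit ASCII byte (0x00 to 0x7F).
$ascii = '';
for ($byte = 0; $byte < 128; $byte++) {
    $ascii .= chr($byte);
}

// For pure ASCII input both functions only have & < > " ' to escape.
$sameResult = htmlentities($ascii, ENT_QUOTES, 'ISO-8859-1')
    === htmlspecialchars($ascii, ENT_QUOTES, 'ISO-8859-1');

var_dump($sameResult);
```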

As for not using utf8ToUnicodeEntities(), I am not sure if we rely on having UTF-8-Entities (ASCII) when finished sanitizing the user input (e.g. when writing it into the database). It seems that nepto's patch will write ISO-8859-2 characters that came from the user directly into the database. Couldn't that be a problem somewhere else?

If we instead really just want UTF-8-Entities in the database, we should convert user-input to UTF-8 in our user-input sanitize routine: And this can only be done if the embedder supplies us with the information on what charset he is using on the embedding page (which the browser normally will use when submitting the form, despite some curious and buggy behaviours of some browsers on that matter, see this link).

Now it's your turn... :)

 
jmullan

Joined: 2002-07-28
Posts: 974
Posted: Thu, 2005-01-13 10:08

If we REALLY want to support other character sets, then I would recommend adding a source charset argument to sanitizeInputValues. We could then call a charset conversion to UTF-8, do our entity magic, and leave the user with database- and HTML-safe data. Then, even if they change to another charset on their website, their data won't need to be scrubbed again.
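A rough sketch of that idea follows. The function name sanitizeWithCharset() and its signature are hypothetical and not part of Gallery2, and it assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper (not part of Gallery2): sanitize form input that
 * arrived in $sourceCharset so the result is plain ASCII with HTML entities.
 */
function sanitizeWithCharset(string $value, string $sourceCharset): string
{
    // 1. Normalize the input to UTF-8.
    $utf8 = mb_convert_encoding($value, 'UTF-8', $sourceCharset);

    // 2. Escape the HTML-significant characters & < > " '.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // 3. Replace every character above U+007F with a numeric entity.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

// "\xB9" is "š" in ISO-8859-2, which is U+0161 (decimal 353).
echo sanitizeWithCharset("\xB9 & co", 'ISO-8859-2'), "\n"; // &#353; &amp; co
```

Because the result is pure ASCII, it displays the same no matter which charset the embedding page declares.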

However, I must voice my concerns again with using a charset other than UTF-8.

Anyway, you will have to change more stuff in gallery2 to support other character sets. Your translation must be in the new character set. Any other translations that you enable should be converted to that character set. That isn't possible with most character sets, but if you're willing to try...

Maybe I'm just wildly rambling because it is 4am... ;)

 
baschny

Joined: 2003-01-04
Posts: 328
Posted: Thu, 2005-01-13 11:32

jmullan, smartHtmlEntries is the one that calls htmlentities(), and this is where nepto's patch is applied, so I guess he is using a rather new codebase. I agree with changing it to use htmlspecialchars(), as this method is currently only called by the user-input sanitization. See my comments on that above.

The only problem with having the embedding application use another charset is on user-input from FORMs: All other elements that we print out are HTML-entities, so it doesn't matter which encoding the page uses. So no other stuff needs to be changed in Gallery2, if we convert the user-input to UTF-8-Entities, like we do now: We just have to know the submitted encoding, and the embedder needs to tell gallery of that, which is what I suggested in my comments.

So I think we agree? :)

 
jmullan

Joined: 2002-07-28
Posts: 974
Posted: Thu, 2005-01-13 19:35

Okay, I took that part out, but the rest still applies.

Just so we're clear on terminology, an HTML entity starts with an ampersand and is either a keyword or a numeric representation of a character value. These are NOT UTF-8 entities, but you could say that they are Unicode HTML entities, because the numeric entities (whether hexadecimal or decimal) refer to the Unicode character space.

UTF-8 applies to the multi-byte encoding of characters.
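A small illustration of the distinction, using "š" (U+0161) as an arbitrary example character:

```php
<?php
$utf8Bytes = "\xC5\xA1";   // "š" (U+0161) stored as two UTF-8 bytes

// Decimal and hexadecimal numeric entities name the same Unicode code point.
$fromDecimal = html_entity_decode('&#353;', ENT_QUOTES, 'UTF-8');
$fromHex     = html_entity_decode('&#x161;', ENT_QUOTES, 'UTF-8');

var_dump($fromDecimal === $utf8Bytes); // bool(true)
var_dump($fromHex === $utf8Bytes);     // bool(true)

// The entity is six ASCII bytes; the UTF-8 encoding is two non-ASCII bytes.
echo strlen('&#353;'), ' vs ', strlen($utf8Bytes), "\n"; // 6 vs 2
```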

This doesn't make a huge difference, code-wise, but it means that we need an intermediate step to go from ISO-8859-2 to UTF-8 before we can convert to entities, unless we reuse the PHP-based converters in the migration module, because they go straight to unambiguous HTML entities.

Not all of the output that we produce is entities: translations are in UTF-8, but use multi-byte characters rather than entities, unless there is a conversion in there somewhere that I am not aware of. We could probably write something that would do this conversion in advance so that the translation files only contain ASCII and HTML entities.

So, to support another character set, we need to:
1. Convert all UTF-8 output to HTML entities in the templates and translation files
2. Convert all non-UTF-8 input to UTF-8, and then to HTML entities (possibly in one step)

We should change to htmlspecialchars() now, whatever we do. I did this in my dev install, but I won't check it in until I get home and can run some unit tests. ;)

 
ogardarsson

Joined: 2005-09-27
Posts: 2
Posted: Tue, 2005-09-27 13:44

Can someone tell me where nepto's patch is supposed to be put?

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Tue, 2005-09-27 14:46

1. the patch is scrambled: funnily enough, some characters were replaced by html entities. this must have happened in our HTML markup -> Gallery markup change in the forums. the file to patch is modules/core/classes/GalleryUtilities.class
2. the patch is not complete. i have the feeling a lot of other changes would be required too, and some things have changed since spring 2005.

 
Haplo

Joined: 2004-03-29
Posts: 82
Posted: Thu, 2005-09-29 17:10

anyone still looking into this?
i have been trying to make phpBB (with embedded gallery) behave utf-8, and it sort of will, but it all requires too much modding and makes it confusing.

This sounds interesting. It must be possible to convert the output tpl data "back" to non-utf8 in the last stage, so as not to break all the other embed apps out there.

What g2 does internally is one thing (invoking gettext localization and more), but there should be some final output options :/

so this patch is broken?

also, how is it possible for g2 to scramble all the embedding cms contents? the cms header still has

Quote:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

and i see the lang specific chars in the source code. still all pages with gallery blocks present somehow will not output these the non-utf8 way...

even though the visible META says "iso-8859-1" the browser will use utf-8 coding.

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Thu, 2005-09-29 18:03

- we've discussed this utf-8 requirement issue and how to offer non utf-8 output a few weeks ago in #gallery
- we've also come to the conclusion that we should still handle everything in utf-8 internally and add an output filter *somehow*, run all output through a convert charset function
- but this would also mean that all input is in a non-utf-8 charset, so we'd have to convert all browser input to utf-8 too
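A minimal sketch of such an output filter. The function name convertOutputCharset() is hypothetical and not part of G2, and the mbstring extension is assumed. Characters that do not exist in the target charset are replaced by mbstring's substitute character, so a real filter would probably want to turn those into numeric entities first.

```php
<?php
/**
 * Hypothetical output filter: converts a UTF-8 page buffer to the
 * charset used by the embedding site.
 */
function convertOutputCharset(string $buffer, string $targetCharset): string
{
    return mb_convert_encoding($buffer, $targetCharset, 'UTF-8');
}

// Possible use with output buffering (left commented out here):
// ob_start(function ($buffer) { return convertOutputCharset($buffer, 'ISO-8859-2'); });

// "š" is the two bytes C5 A1 in UTF-8 and the single byte B9 in ISO-8859-2.
var_dump(convertOutputCharset("\xC5\xA1", 'ISO-8859-2') === "\xB9"); // bool(true)
```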

given the huge todo list, this isn't a high priority for us. if someone understands the issue and is willing to fix G2, please submit us your patch and we'll make it prime-time ready.

 
Haplo

Joined: 2004-03-29
Posts: 82
Posted: Thu, 2005-09-29 18:05

ok, i understand your position

 
Haplo

Joined: 2004-03-29
Posts: 82
Posted: Thu, 2005-09-29 23:02

Still i would like to stress the need to make G2 non-utf-8 aware. I believe your great efforts in making G2 support embedding and localization would be "wasted" if this "filtering back and forth" to unicode is not solved. Many cms packages out there are not yet utf-8 based...and even though G2 runs fine standalone i think its full potential will be appreciated most as a key component of widely used cms systems.

G2 is impressive, feature-rich and truly beautiful in design, so being strictly utf-8 only seems somewhat awkward. I respect your ambitious TO-DO list, and realize tweaking for 3rd party products may feel low priority, but i would really appreciate it if this got more attention, and i'm convinced it would benefit G2 also.

I'm sitting here with a fully mxBB (www.mx-system.com) integrated G2, synchronizing users/groups on the fly, featuring both a main G2 block and an imageBlock, with all parameters integrated in the mxBB cache and user interface, and ready to invite all mxBB users to the wonders of G2 (shipped as a mxBB module, not modding a single G2 file), but i still cannot use it since mxBB is not unicode (phpBB based).

Sorry to insist, but this is high priority to me ;)

 
Haplo

Joined: 2004-03-29
Posts: 82
Posted: Fri, 2005-09-30 14:28

I have been scanning loads of phpbb posts related to utf-8, and in short you cannot make phpbb truly utf-8 aware. There have been serious efforts made by pichirichi (read article), but it's merely a temp solution with known shortcomings. Further, phpBB does not support mysql 4.1 or later (in which true utf-8 support was added). so this rules out phpBB as an embed candidate :(

Never mind, i understand you great guys will add non-utf-8 support sooner or later, so i will not request further. For now some sort of ad hoc solution will be used ;)

:-)

 
Haplo

Joined: 2004-03-29
Posts: 82
Posted: Fri, 2005-09-30 14:30

well, maybe one more request.

It's no "big" deal if g2 is utf-8 for the main gallery block, since such a block will likely span an entire cms page anyway, thus not interfering with too many other cms blocks.

But the major issue is the imageBlock (random, latest), since this block is intended to be placed on alternative pages - in between lots of similar blocks with dynamic contents - where admin has minimum control of what chars are used. BUT this block has no input methods, only output, so using a "utfToIsoChars()" shouldn't be too hard ;)

i will experiment for a while to somehow send the header from the GalleryEmbed class, before sending it in GalleryTranslator class

Quote:
/* Set the appropriate charset in our HTTP header */
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

and only when using the imageBlock. That at least will solve this for now...

For example:
In the embed app ImageBlock script:

Quote:
//
// If this is not the main gallery page itself, do NOT send utf-8 header
// Note: this is a temp solution to avoid weird characters for mxBB if the imageBlock is used
//
/* Set the appropriate charset in our HTTP header */
if (!headers_sent() && $g2_page_id != $page_id) {
    header('Content-Type: text/html; charset=' . $lang['ENCODING']);
}

//
// Hook up with ImageAlbum
//
$g2data = GalleryEmbed::getImageBlock($getImageBlockargs);
$bodyHtml = GalleryUtilities::utf8ToUnicodeEntities($g2data[1]);

if the embed app internal encoding is defined by $lang['ENCODING']

 
nepto

Joined: 2004-07-06
Posts: 22
Posted: Sun, 2005-10-23 05:39

I probably did not make this clear in my original post, but this is a hack, not an actual patch. It is not supposed to be integrated into G2. G2 uses UTF-8 and that is great. There should just be a possibility to use G2 in a non-UTF-8 environment. That is what the hack is supposed to do.

My personal problem is that I have no time to convert my whole http://nepto.sk/ website to UTF-8. But one day I will do so, and then I will welcome the built-in G2 UTF-8 support.

Take care guys, you have done great work since the beginning of 2005!

 
valiant

Joined: 2003-01-04
Posts: 32509
Posted: Sun, 2005-10-23 12:47