PHP preg_split

Take some HTML, any HTML, submitted through a form…

Sample HTML:

<h2>
<a href="index.html">
<font color="#0000ff"><i>Open</i></font><font color="#000084">BSD</font>
</a>
<font color="#e00000">macppc</font>
</h2>
<hr>
<p>

OpenBSD/macppc runs on the PowerPC-based Macintosh systems from the
<i>``New World''</i> family, i.e. all Apple computers from the iMac to current
models. It does not run on any <a href="#unsup">unsupported models</a>.
<p>

A mailing list dedicated to the OpenBSD/macppc port is available at
<u><font color="#23238e">ppc@openbsd.org</font></u>.
To join the OpenBSD/macppc mailing list, send a message body of <b>"subscribe
ppc"</b> to
<a href="mailto:majordomo@openbsd.org">majordomo@openbsd.org</a>.
Please be sure to check our <a href="mail.html">mailing list policy</a> before
subscribing.

<br clear=all>
<hr>

<h3 id="history"><font color="#0000e0"><strong>History:</strong></font></h3>

<p>
The OpenBSD/macppc port started as OpenBSD/powerpc, and was initially
focused on Motorola computers with Open Firmware, and VI Power4e boards.
This port was eventually thrown away after OpenBSD 2.5 was released.
As a result there was no OpenBSD/powerpc port for the 2.6 and 2.7 releases.
In the meantime, a new port was started, focusing on Apple hardware, and
based on code from NetBSD/macppc, and after a lot of work from Dale Rahn,
OpenBSD 2.8 was released with a powerpc port.
As work on the port continued, it was renamed to OpenBSD/macppc for 3.0.
Support for the 64-bit G5 (running in 32-bit mode) was added in OpenBSD 3.9.

<hr>

(ohhh, the shameless plug…) :stuck_out_tongue:

The POST variable is filtered like this, in PHP:

$thtml = filter_var( $_POST['thtml'], FILTER_SANITIZE_FULL_SPECIAL_CHARS );

The $thtml variable is indeed of type 'string', as the output below confirms…

Next, I use the following PHP code:

$array = preg_split('#<(/?)([^>]*)>#', $thtml, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); 
var_dump($array);

The output of var_dump() is this:

array(1) { [0]=> string(1882) "<h2> <a href="index.html"> <font color="#0000ff"><i>Open</i></font><font color="#000084">BSD</font> </a> <font color="#e00000">macppc</font> </h2> <hr> <p> OpenBSD/macppc runs on the PowerPC-based Macintosh systems from the <i>``New World''</i> family, i.e. all Apple computers from the iMac to current models. It does not run on any <a href="#unsup">unsupported models</a>. <p> A mailing list dedicated to the OpenBSD/macppc port is available at <u><font color="#23238e">ppc@openbsd.org</font></u>. To join the OpenBSD/macppc mailing list, send a message body of <b>"subscribe ppc"</b> to <a href="mailto:majordomo@openbsd.org">majordomo@openbsd.org</a>. Please be sure to check our <a href="mail.html">mailing list policy</a> before subscribing. <br clear=all> <hr> <h3 id="history"><font color="#0000e0"><strong>History:</strong></font></h3> <p> The OpenBSD/macppc port started as OpenBSD/powerpc, and was initially focused on Motorola computers with Open Firmware, and VI Power4e boards. This port was eventually thrown away after OpenBSD 2.5 was released. As a result there was no OpenBSD/powerpc port for the 2.6 and 2.7 releases. In the meantime, a new port was started, focusing on Apple hardware, and based on code from NetBSD/macppc, and after a lot of work from Dale Rahn, OpenBSD 2.8 was released with a powerpc port. As work on the port continued, it was renamed to OpenBSD/macppc for 3.0. Support for the 64-bit G5 (running in 32-bit mode) was added in OpenBSD 3.9. <hr>" } 

I thought I had understood that with the pattern <(/?)([^>]*)>, the string would be split into an array at every tag found in the posted HTML!
Apparently not!
As you can see, my array has only one key!
What gives?

Only Chuck Norris can parse XML/HTML with regexes!
https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

Have a look at the DOMDocument class.

Thanks…
And above all, thanks for the clichés!

Anyway, even with SimpleDom, or by working on the DOM directly, I'm getting nowhere!

And it's annoying me… because I don't understand!
Grrrr

That's to be expected: the input HTML is non-conforming (unclosed tags, non-standard tags, etc.). You won't get anywhere whatever tool you use. If you want to isolate one specific node or node value, a regex can do the job. If, however, you want to extract every node and its value, you'll first have to clean up the HTML and then work recursively with DOMDocument.

I ran the original code through Tidy and fixed the non-conforming tags by hand.

<?php
$thtml = <<<HEREDOC
<html>
<h2>Titre</h2>
<p>OpenBSD/macppc runs on the PowerPC-based Macintosh systems from
the <i>``New World''</i> family, i.e. all Apple computers from the
iMac to current models. It does not run on any <a href=
"#unsup">unsupported models</a>.</p>

<p>A mailing list dedicated to the OpenBSD/macppc port is available
at <u><font color="#23238E">ppc@openbsd.org</font></u>. To join the
OpenBSD/macppc mailing list, send a message body of <b>"subscribe
ppc"</b> to <a href=
"mailto:majordomo@openbsd.org">majordomo@openbsd.org</a>. Please be
sure to check our <a href="mail.html">mailing list policy</a>
before subscribing.</p>

<h3 id="history"><font color=
"#0000E0"><strong>History:</strong></font></h3>

<p>The OpenBSD/macppc port started as OpenBSD/powerpc, and was
initially focused on Motorola computers with Open Firmware, and VI
Power4e boards. This port was eventually thrown away after OpenBSD
2.5 was released. As a result there was no OpenBSD/powerpc port for
the 2.6 and 2.7 releases. In the meantime, a new port was started,
focusing on Apple hardware, and based on code from NetBSD/macppc,
and after a lot of work from Dale Rahn, OpenBSD 2.8 was released
with a powerpc port. As work on the port continued, it was renamed
to OpenBSD/macppc for 3.0. Support for the 64-bit G5 (running in
32-bit mode) was added in OpenBSD 3.9.</p>
</html>
HEREDOC;

// Parse the cleaned-up markup as XML; loadXML() requires well-formed input.
$dom = new DOMDocument;
$dom->loadXML($thtml);
$firstChild = $dom->firstChild;
scanSource($firstChild);

// Walk the tree depth-first, descending into element nodes only.
function scanSource(DOMNode $node, $level = 0) {
  processNode($node, $level);
  if ($node->hasChildNodes()) {
    $children = $node->childNodes;
    foreach ($children as $child) {
      if ($child->nodeType == XML_ELEMENT_NODE) {
        scanSource($child, $level + 1);
      }
    }
  }
}

// Print one element: indentation by depth, tag name, then its text content.
function processNode(DOMNode $node, $level) {
  if ($node->nodeType == XML_ELEMENT_NODE) {
    printf("%s[%s] %s<br>", str_repeat("--- ", $level), $node->nodeName, $node->nodeValue);
  }
}
?>
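
For comparison, here is a minimal sketch of the same recursive walk using DOMDocument::loadHTML(), which is generally more forgiving of tag soup than loadXML(). The sample string and the collectTagNames() helper are hypothetical, made up for illustration; how libxml repairs a given broken document can vary, so treat this as an experiment rather than a guaranteed fix.

```php
<?php
// Hypothetical tag-soup input: two unclosed <p> tags and a bare <hr>.
$html = '<h2><a href="index.html">OpenBSD</a></h2><hr><p>First paragraph<p>Second paragraph';

// Collect parser warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: recursively collect the name of every element node.
function collectTagNames(DOMNode $node, array &$names)
{
    if ($node->nodeType !== XML_ELEMENT_NODE) {
        return; // skip text nodes, comments, etc.
    }
    $names[] = $node->nodeName;
    foreach ($node->childNodes as $child) {
        collectTagNames($child, $names);
    }
}

$names = [];
collectTagNames($dom->documentElement, $names);

// Print the element names in document order.
echo implode(' ', $names), "\n";
```

loadHTML() wraps the fragment in html and body elements, so those names show up in the list as well.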

Thank you, we'll talk about it again a bit later.
Because it's not that simple…
But indeed, it must be a formatting problem :wink:

Another lead: rereading your code above, I notice that you're applying the regex pattern to HTML entities. An <a> tag is turned into &lt;a&gt; by PHP's filter_var() function, so the pattern won't find what it's looking for.
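
A small sketch of that effect, assuming a short made-up input string instead of the full page: the same pattern and flags yield a single chunk on the sanitized string, and split at every tag once the entities are decoded again. Note that decoding undoes the sanitization, so this only illustrates the cause and is not a recommendation for handling untrusted input.

```php
<?php
// Hypothetical short input standing in for the posted HTML.
$raw     = '<p>Hello <b>world</b></p>';
$pattern = '#<(/?)([^>]*)>#';
$flags   = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// Same filter as in the question: < and > become &lt; and &gt;.
$escaped      = filter_var($raw, FILTER_SANITIZE_FULL_SPECIAL_CHARS);
$partsEscaped = preg_split($pattern, $escaped, -1, $flags);

// Decoding the entities restores the angle brackets the pattern looks for.
$decoded      = htmlspecialchars_decode($escaped, ENT_QUOTES);
$partsDecoded = preg_split($pattern, $decoded, -1, $flags);

var_dump(count($partsEscaped)); // int(1): no "<" left to match
print_r($partsDecoded);         // tag names, slashes and text as separate items
```

With PREG_SPLIT_DELIM_CAPTURE, the two capture groups (the optional slash and the tag contents) are returned as array items alongside the text between tags; PREG_SPLIT_NO_EMPTY drops the empty ones.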

No, on that point, please go back over the different filter types and you'll see that I used the right one…
Only certain characters are converted to HTML entities :wink:

Then we must not have the same PHP version (5.6.30 here)… or the same reading of the docs. What does this give you on the command line:

$ php -r 'echo filter_var("<a>", FILTER_SANITIZE_FULL_SPECIAL_CHARS );'
&lt;a&gt;

YES, ok, the angle brackets are among the characters that get converted!!!

Well spotted :wink:

$ php-7.0 -r 'echo filter_var("<a>", FILTER_SANITIZE_FULL_SPECIAL_CHARS );'
&lt;a&gt;

I don't know whether you've solved your problem, but in PHP, when you put HTML strings into variables, you need to write your PHP like this :slight_smile:

<?php
$maVariable = 'some text, some HTML, etc., anything but PHP'.$une_variable_php.'more text, etc.';
?>

In short, you need to concatenate, and open and close the PHP tags properly.
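
As a rough illustration of that advice, here is a runnable sketch; the variable names and the sample markup are invented for the example. It builds the same HTML string once by concatenation with single quotes and once by interpolation in double quotes.

```php
<?php
// Hypothetical value to embed in the markup.
$name = 'OpenBSD/macppc';

// Single-quoted pieces joined with the . operator: nothing inside is parsed.
$concatenated = '<h2 class="title">' . $name . '</h2>';

// Double-quoted string: the variable is interpolated, inner quotes are escaped.
$interpolated = "<h2 class=\"title\">{$name}</h2>";

var_dump($concatenated === $interpolated); // bool(true)
```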

Good luck.

PS: And encode the page in UTF-8.