PerlMonks  

Using a config file in my regexp script.

by Tricky (Sexton)
on Sep 17, 2003 at 10:26 UTC ( [id://292090]=perlquestion )

Tricky has asked for the wisdom of the Perl Monks concerning the following question:

Hi brothers,

Having put together the regexps to whack HTML tags/style attributes in my test page, I'm looking into implementing a config file, rather than hard-wiring my code.

I've never used configuration files before, and have only recently come to understand their precise meaning. Sad huh ;)?! Here's a specification of my code (I'll include the script and the HTML below, for your perusal):

  • open and read-in an HTML file from hard-drive
  • slurp it into an array, to loop through
  • a 'master' subroutine calls all other subs, in-turn
  • subs strip tags, introduce and modify in-line styles
  • open an output filehandle and write the changes back to the HTML source file on-disk

I know that the HTML parser modules are better, I'm just exploring this approach for my MSc. Any ideas on how I could use a config file here?

The Perl code...

#!/usr/bin/perl
# subsread2.plx

package HTMLMods;

=head1 DESCRIPTION

Alternative to subread.plx - no control flows, just a 'master' sub which
calls each sub to perform the HTML tag/attribute stripping/alteration,
then returning the results.

This application groups ALL the regexps into a single unit:

1. The HTML file is opened and inserted into an array (try this with a scalar too!)
2. A master subroutine is called, which calls other subs to perform HTML reformatting tasks
3. Each HTML reformatting sub completes its respective operations on the HTML file
4. Reformatted array is printed in DOS window.
5. OR write changes back to HTML source file.

=head2 ALTERNATIVE FILE OPENING CODE

  my $path = "E:/Documents and Settings/Richard Lamb/My Documents/HTML";
  open (INFILE, "$path/test1InLineCSS.html") or die ("$!: Can't open this file");

=head3 BACKREFS TO REMEDY ENTITY VALUE CHANGE PROBLEM?

=cut

use warnings;
use diagnostics;
use strict;

# Declare and initialise variables.
my @htmlFile;

# Open HTML test file and read into array.
open (INFILE, "E:/Documents and Settings/Richard Lamb/My Documents/HTML/test1InLineCSS.html")
  or die ("$!: Can't open this file.\n");
@htmlFile = <INFILE>;
close (INFILE);

sub masterCall {
  scrapUnderlineTags();
  scrapBoldTags();
  scrapItalicsTags();
  scrapEmphasiseTags();
  changeFontStyle();
  changeFontSize();
  changeFontColour();
  changeBackColour();
  addTextIndent();
  addWordSpacing();
  addLetterSpacing();
  scrapImageTag();
}

masterCall();

# Subroutine definitions

# Removes underline tags in array
sub scrapUnderlineTags {
  # iterates through each element (i.e. HTML line) in array
  foreach my $line (@htmlFile) {
    $line =~ s/<\/u>//ig;  # case insensitivity and global search for pattern.
    $line =~ s/<u>//ig;
  }
}

# Removes bold tags in array
sub scrapBoldTags {
  foreach my $line (@htmlFile) {
    $line =~ s/<\/?b>//ig;
    $line =~ s/<\/?big>//ig;
    $line =~ s/<\/?strong>//ig;
    $line =~ s/font-weight:\s?bold;?//ig;
  }
}

# Removes italics tags in array
sub scrapItalicsTags {
  foreach my $line (@htmlFile) {
    $line =~ s/<\/?i>//ig;
  }
}

# Remove emphasise tags in array
sub scrapEmphasiseTags {
  foreach my $line (@htmlFile) {
    $line =~ s/<\/?em>//ig;
  }
}

# Change font styles within in-line styles
sub changeFontStyle {
  foreach my $line (@htmlFile) {
    $line =~ s/font-family:\s?Times;/font-family: Arial;/ig;
  }
}

# Change font size within in-line styles
sub changeFontSize {
  foreach my $line (@htmlFile) {
    $line =~ s/font-size:\s?[0-9]{2}pt;?/font-size: 14pt/ig;
  }
}

# Change font colour within in-line styles
sub changeFontColour {
  foreach my $line (@htmlFile) {
    # Negative lookbehind skips "background-color:"; a character class
    # such as [^background-] only excludes single characters.
    $line =~ s/(?<!background-)color:\s?#(?:[0-9a-f]{6}|[0-9a-f]{3});?/color: #000000;/ig;
  }
}

# Changes background colour attributes in array
sub changeBackColour {
  foreach my $line (@htmlFile) {
    $line =~ s/background-color:\s?#(?:[0-9a-f]{6}|[0-9a-f]{3});?/background-color: #FFFFFF/ig;
  }
}

# Inserts text indent entities within in-line styles
sub addTextIndent {
  foreach my $line (@htmlFile) {
    $line =~ s/(<h[1-6]\sstyle=.*)">/$1; text-indent: 10px">/ig;
    $line =~ s/(<li\sstyle=.*)">/$1; text-indent: 10px">/ig;
    $line =~ s/(<p\sstyle=.*)">/$1; text-indent: 10px">/ig;
  }
}

# Inserts word spacing entities within in-line styles
sub addWordSpacing {
  foreach my $line (@htmlFile) {
    $line =~ s/(<h[1-6]\sstyle=.*)">/$1; word-spacing: 30px">/ig;
    $line =~ s/(<p\sstyle=.*)">/$1; word-spacing: 10px">/ig;
    $line =~ s/(<li\sstyle=.*)">/$1; word-spacing: 10px">/ig;
  }
}

# Inserts letter spacing entities within in-line styles
sub addLetterSpacing {
  foreach my $line (@htmlFile) {
    $line =~ s/(<h[1-6]\s+style=.*)">/$1; letter-spacing: 3px">/ig;
    $line =~ s/(<li\sstyle=.*)">/$1; letter-spacing: 2px">/ig;
    $line =~ s/(<p\sstyle=.*)">/$1; letter-spacing: 2px">/ig;
  }
}

# Removes image tag in array
sub scrapImageTag {
  foreach my $line (@htmlFile) {
    $line =~ s/<IMG\s+(.*)>//ig;
  }
}

# Print array to DOS window
sub printHTML {
  for my $i (0..@htmlFile-1) {
    print $htmlFile[$i];
  }
}

# Replacing original file with reformatted file!
open (OUTFILE, ">E:/Documents and Settings/Richard Lamb/My Documents/HTML/test1InLineCSS.html")
  or die ("$!: Can't rewrite the HTML file.\n");
print OUTFILE @htmlFile;
close (OUTFILE);

# printHTML(); # sub called to print array in DOS window

And the HTML source code...

<!DOCTYPE html PUBLIC "-//W3C//DTD html 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
  <title>My First Page</title>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-1"/>
</head>
<body style="background-color: #8DCB41">
  <h1 style="color: #648DC7; text-align: center; font-family: Times; font-weight: bold">
    <i>Hello folks...This is my <big>first</big> page!</i>
  </h1>
  <h4 style="color: #648DC7; font-family: Times">
    <form>
      <u>Username:</u> <input type="text" name="user"> <br>
      <u>Password:</u> <input type="password" name="password">
    </form>
  </h4>
  <p style="font-family: Times; font-size: 10pt">Note that when you type characters in a password field, the browser displays asterisks or bullets instead of the characters.</p>
  <img src="images/dicky.jpg" height="350" width="350" title="Dickybloke!" style="text-align: center">
  <hr>
  <h4 style="color: #648DC7; font-family: Times">My kinda places...</h4>
  <ul type="square">
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.bbc.co.uk/manchester/"><i>Manchester</i></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.bbc.co.uk/leeds/"><i>Leeds</i></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.bbc.co.uk/london/"><i>London</i></a></li>
  </ul><hr>
  <h2 style="color: #2D73B9; text-align: center; font-family: Times">Ferocious Felines!</h2>
  <img src="images/pissedkitty.jpg" height="300" width="300" alt="A picture of a &quot;very&quot; upset kitten!">
  <img src="images/pissedkitty.jpg" height="300" width="300" alt="A picture of a &quot;very&quot; upset kitten!">
  <img src="images/pissedkitty.jpg" height="300" width="300" alt="A picture of a &quot;very&quot; upset kitten!">
  <hr>
  <h4 style="color: #497FBF; font-family: Times">Places to visit and go back to...</h4>
  <ul type="square">
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://english.firenze.net/"><em>Florence</em></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.timeout.com/prague/"><em>Prague</em></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.canada.com/vancouver/"><em>Vancouver</em></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://metromix.chicagotribune.com/"><em>Chicago</em></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.sanfrancisco.com/"><em>San Fransisco</em></a></li>
    <li style="font-family: Times; font-size: 10pt">
      <a href="http://www.ny.com/"><em>New York</em></a></li>
  </ul>
  <hr>
  <h2 style="color: #497FBF; text-align: center; font-family: Times"><u><b>A Brief History to the Future summary:</b></u></h2>
  <p style="color: #648DC7; font-family: Times; font-size: 10pt">The <b><u>Internet</u></b> is the most <strong>remarkable</strong> achievement of humankind since the pyramids.
    The millennium from now, historians will look back at it and marvelled at the people equipped with such conduct tools succeeded in creating such a leviathan.
    Yet even as the Net pervades our lives, we begin to take it for granted. Many have lost the capacity for wonder.
    Most of us have no idea where the Interet came from, how it works, or who created it and why.
    And even fewer have any idea what it means for society and future.</p>
  <p style="color: #648DC7; font-family: Times; font-size: 10pt"><i>John Naughton</i> has written a warm and passionate book whose heroes and the visionaries laid the foundations of postmodern world.
    A Brief History of the Future celebrates the engineers and scientists who implemented their dreams in hardware and software and explains the values and ideas that drove them.
    Although its subject seems technical, the book in fact is a highly personal account.
    The author writes about the Net and way Nick Hornby writes about soccer-as part of life, and as a key influence on his own voyage from solitary child to establish academic and writer.</p>
  <p style="color: #497FBF; font-family: Times; font-size: 10pt"><i><u>A Brief History of the Future</u></i> is an intimate celebration of vision and al truism, ingenuity and determination, and above all of the power of ideas transform the world.</p>
  <p style="color: #497FBF; font-family: Times; font-size: 10pt">John Naughton is an academic and a journalit. He teaches at the Open University and has written an award-winning weekly column for the Observer for more than ten years.
    He lives in Cambridge, England, and is a fellow of Wolfson College at the University of Cambridge.</p>
  <br>
  <h4 style="color: #497FBF; font-family: Times">Link</h4>
  <ul type="square">
    <li style="font-size: 10pt; font-family: Times">
      <a href="http://www.briefhistory.com/pages/bh-index.htm">A Brief History of the Future</a></li>
  </ul>
  <hr>
  <h2 style="color: #497FBF; font-family: Times; text-align: center">Rockerfellers - NOT!</h2>
  <img src="images/oldgrooves.jpg" height="450" width="450" alt="Old-time groovers!" style="text-align: center; border: 3px">
  <p style="font-family: Times; font-size: 10pt">
    <a href="http://www.bbc.co.uk/">This text<a/> will take you to the BBC!!
  </p>
  <p style="font-family: Times; font-size: 10pt">
    <a href="http://www.programmersheaven.com/">This text</a> is a link to a developer's page.This text is a link to a developer's page.
  </p>
  <p style="font-family: Times; font-size: 10pt">
    You can also use an image as a link:
    <a href="http://www.google.com/">
      <img src="images/thankyou.gif" height="75" width="75" alt="Google search">
    </a>
  </p>
  <p style="font-family: Times; font-size: 10pt">
    <a href="mailto:dikymintos@hotmail.com subject="Hello%20to%20me!">Send mail to Mintosville</a>
  </p>
  <h4 style="color: #497FBF; font-family: Times">Table headers:</h4>
  <table style="margin-left:50px; border: 4px; border-color: #000000">
    <tr>
      <th>Name</th>
      <th>Telephone</th>
      <th>Address</th>
    </tr>
    <tr>
      <td>Dicky Mintos</td>
      <td>0161 2363736</td>
      <td>Flat 23, Lockes Yard, Manchester</td>
    </tr>
  </table>
  <br><br>
  <form action="mailto:dickymintos@hotmail.com" method="post" enctype="text/plain">
    <h3 style="color: #6F559D; font-family: Times">This form sends an e-mail to Trixter...</h3>
    Name:<br>
    <input type="text" name="name" value="yourname" size="20">
    <br>
    Mail:<br>
    <input type="text" name="mail" value="yourmail" size="20">
    <br>
    Comment:<br>
    <input type="text" name="comment" value="yourcomment" size="40">
    <br><br>
    <input type="submit" value="Send">
    <input type="reset" vale="Reset">
  </form>
</body>
</html>

Re: Using a config file in my regexp script.
by jeffa (Bishop) on Sep 17, 2003 at 16:08 UTC
    Use a Config module! I really like Config::General myself. At this point, the easiest pieces of data assignment you could abstract out of your code are PATH and FILE - create a config file like so:
    PATH = "E:/path/to/html/file"
    FILE = "test1InLineCSS.html"
    And here is code to read it and open the file:
    use Config::General;

    my $conf   = Config::General->new("foo.conf");
    my %config = $conf->getall;

    my $filename = join('/', $config{PATH}, $config{FILE});
    open INFILE, $filename or die "Can't open $filename: $!";
    You could also devise a scheme to load your regexes with the config file ... but hold the press right there. I for one am really getting tired shouting "Please use an HTML parser for this!"
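For anyone curious what "loading your regexes with the config file" might look like, here is a minimal sketch using only core Perl. The rule format (one "pattern => replacement" per line) and the names load_rules and apply_rules are assumptions made up for this illustration; they are not part of Config::General or of the original script. The rules are held in a string here so the sketch runs on its own; in practice they would be read from a file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical rules text; in practice this would be read from a config file.
# Assumed format: one rule per line, "pattern => replacement".
my $rules_text = join "\n",
    '# strip underline and bold tags',
    '</?u> =>',
    '</?b> =>',
    'font-family:\s?Times; => font-family: Arial;',
    '';

# Turn the rules text into a list of [compiled regex, replacement] pairs.
sub load_rules {
    my ($text) = @_;
    my @rules;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($pattern, $replacement) = split /\s*=>\s*/, $line, 2;
        $replacement = '' unless defined $replacement;
        push @rules, [ qr/$pattern/i, $replacement ];
    }
    return @rules;
}

# Apply every rule to one line; the replacement is treated as literal text.
sub apply_rules {
    my ($line, @rules) = @_;
    for my $rule (@rules) {
        my ($regex, $replacement) = @$rule;
        $line =~ s/$regex/$replacement/g;
    }
    return $line;
}

my @rules = load_rules($rules_text);
print apply_rules(qq{<p style="font-family: Times;"><u><b>Hi</b></u></p>\n}, @rules);
```

This moves the patterns out of the code, but it keeps every weakness of matching HTML with regexes, which is the point of the rest of this reply.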

    Please use an HTML parser for this!

    You state that "I know that the HTML parser modules are better..." No. You don't know that the HTML parser modules are better, you haven't used one yet! We keep telling you that they are better but you keep on trucking with your array of HTML lines. (And what happens when tags are split across lines? Your array solution falls apart!)

    You also stated that you are "...exploring this approach for my MSc." What?!? There is nothing "Masters" about "parsing" HTML contained in an array with regular expressions. (UPDATE: i should have said "directly parsing with regexes" - subtle difference) No, that is very UNDERgraduate, my friend. Still don't believe me? Read on.

    Your current method does this:
    • slurp the HTML file into an array
    • loop over that entire array to change X
    • loop again over that entire array to change Y
    • loop yet again over that entire array to change Z
    • loop again and again and again ...
    That is extremely inefficient. Now. Compare this with how an HTML parser works:
    • get a file handle on the HTML file
    • parse tags ... one at a time
    • for each tag parsed, pass the tag contents off to a handler (a subroutine) if one exists
    • for each handler called, modify this tag and/or its text/attributes and return the modified result
    In other words, you loop across the HTML file ONCE! Not 12 times like you have in your code.
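The handler idea can be sketched without any parser module. In the illustration below the token list is written by hand and merely stands in for what a real HTML parser would emit; the token layout and the names %handlers, render and process are assumptions for this sketch, not the API of any particular module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written stand-ins for the tokens a real HTML parser would emit.
my @tokens = (
    { type => 'start', tag => 'p', attr => { style => 'font-family: Times' } },
    { type => 'start', tag => 'u', attr => {} },
    { type => 'text',  text => 'Internet' },
    { type => 'end',   tag => 'u' },
    { type => 'end',   tag => 'p' },
);

# One handler per tag of interest. A handler returns the (possibly modified)
# token, or nothing at all to drop the token.
my %handlers = (
    u => sub { return },    # drop <u> and </u>
    p => sub {
        my ($token) = @_;
        $token->{attr}{style} =~ s/Times/Arial/ if $token->{type} eq 'start';
        return $token;
    },
);

# Turn a token back into HTML text.
sub render {
    my ($token) = @_;
    return $token->{text}     if $token->{type} eq 'text';
    return "</$token->{tag}>" if $token->{type} eq 'end';
    my $attr = join '',
        map { qq{ $_="$token->{attr}{$_}"} } sort keys %{ $token->{attr} };
    return "<$token->{tag}$attr>";
}

# Single pass over the tokens: each token is looked at exactly once.
sub process {
    my @out;
    for my $token (@_) {
        my $handler = $token->{type} eq 'text' ? undef : $handlers{ $token->{tag} };
        my ($result) = $handler ? $handler->($token) : ($token);
        push @out, render($result) if $result;
    }
    return join '', @out;
}

print process(@tokens), "\n";
```

Adding another change means adding another entry to %handlers, not another loop over the whole document.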

    Now. Because i am really crazy, here is a rewrite of your code. Maybe this will finally convince you to get on the right track. Maybe. ;)

    Note that i do not write back to the original file.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    

      In regards to jeffa's final suggestion, here is a utility that will strip out your inline style attributes, put them in an external stylesheet, and output a new HTML file with the style attributes removed and the new CSS file linked in.

      All with HTML::TokeParser::Simple

      Some work to beautify the output or make it operate over multiple files is left as an exercise for the reader.

      #!/opt/perl/bin/perl -w
      use strict;
      use warnings;
      use HTML::TokeParser::Simple;
      use IO::File;

      my ( $htmlInfile, $htmlOutfile, $cssOutfile ) = @ARGV;
      $htmlOutfile ||= 'out.html';
      $cssOutfile  ||= 'out.css';

      my $htmlFile = new IO::File "> $htmlOutfile"
          or die "Can't open $htmlOutfile for writing: $!\n";
      my $cssFile = new IO::File "> $cssOutfile"
          or die "Can't open $cssOutfile for writing: $!\n";

      my $parser = HTML::TokeParser::Simple->new($htmlInfile);
      my %styles;

      while (my $token = $parser->get_token) {
          # link in our new css file
          if ($token->is_end_tag('/head') ){
              $htmlFile->print( "<link rel='stylesheet' type='text/css' href='$cssOutfile'>\n" );
          }

          # find and remove inline style definitions
          if ($token->is_start_tag) {
              my $tag  = $token->return_tag;
              my $attr = $token->return_attr;

              # If we have an inline style attribute get the value
              # then delete it from the tag
              if ( defined $attr->{style} ){
                  $styles{ $attr->{style} }{$tag} += 1;
                  $token->delete_attr('style');
              }
          }
          $htmlFile->print( $token->as_is );
      }

      # print our collected styles into the css file
      while (my ( $style, $tags ) = each %styles ){
          #li, p { font-family: Times; font-size: 10pt }
          $cssFile->print( join( ', ', sort keys( %$tags ) ) . " { $style }\n");
      }
      --
      Clayton
        This does not handle fonts?
Re: Using a config file in my regexp script.
by Abigail-II (Bishop) on Sep 17, 2003 at 13:15 UTC
    • open and read-in an HTML file from hard-drive
    • slurp it into an array, to loop through

    And you do this project for some kind of degree? Surely by now you should know that HTML tags can be stretched over more than one line? Please do use an HTML parser instead of using a bunch of regexes. But I think that you were told so in some previous threads as well.

    Abigail

Re: Using a config file in my regexp script.
by rupesh (Hermit) on Sep 17, 2003 at 12:41 UTC

    Are you doing the above only for a single run? If that is so, then you should go ahead and change the html manually, which will save you time as well as provide more flexibility. If that is not the case, then scripts written in perl would very much solve the purpose.

    slurp it into an array, to loop through

    You do not have to do that, just for the purpose of looping. File looping constructs are very flexible and solve the purpose very well.
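A minimal sketch of such a file loop, assuming the work can be done one line at a time. The sub name strip_underline is made up for the example, and in-memory filehandles stand in for the real input and output files so it runs without anything on disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read from $in one line at a time, edit it, and write it to $out.
# Only the current line is held in memory.
sub strip_underline {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        $line =~ s{</?u>}{}ig;
        print {$out} $line;
    }
    return;
}

# Demo on in-memory filehandles so the sketch runs without any files on disk.
my $html   = "<u>Username:</u>\n<u>Password:</u>\n";
my $result = '';
open my $in,  '<', \$html   or die "Can't open input: $!";
open my $out, '>', \$result or die "Can't open output: $!";
strip_underline($in, $out);
close $in;
close $out;
print $result;
```

This avoids the array, but it is still line-based, so a tag that spans two lines is missed just as before.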

    a 'master' subroutine calls all other subs, in-turn

    What do the other subs do? Are the subs built into the script itself, or are they dynamically generated?

    I know that the HTML parser modules are better

    Yes, you are right. HTML parsers are simpler and they make your code easier to understand. Try the link for HTML Parsers. Apart from that, you can also find other modules related to HTML and XML parsers over here.

    Happy scripting!!!

    Did you ever notice that when you blow in a dog's face, it gets mad at you but when you take him on a car ride, he sticks his head out the window and likes it?
Re: Using a config file in my regexp script.
by herveus (Prior) on Sep 17, 2003 at 16:25 UTC
    Howdy!

    First off, jeffa++!

    So, Tricky, you say you are investigating the use of regexp to parse HTML. Have you yet considered the outcome of that investigation to be "don't do that"? You seem fixated on finding a contrary answer despite the cogent arguments repeatedly adduced against the use of regexen.

    Are you really that dense?

    I don't know whether you came up with the question to investigate or if you had it thrust upon you. In either case, you need to step way back and reevaluate whether or not you *can* do what you propose. It's not a sign of stupidity to discern that the proposed task just cannot be done. It is one to stubbornly persist even after it has been shown to be not doable.

    Which shoe fits you?

    yours,
    Michael

      Did I miss something? I'm going out on a limb here as a rather new Monk, however I really haven't come across a reckless flame such as your post.

      And second, I'm sure it's possible to do what he wants with regexen. It would be ugly! The proper and appropriate question here is 'why?'

      In short, please keep your comments of being 'dense' and 'stupid' and 'stubborn' to yourself.

      Feel free to downvote me, although I enjoy reading these boards to increase my knowledge and understanding of perl, not to tell people they are stupid for trying something without even explaining it.
        Howdy!

        Perhaps you've missed the ongoing saga as Tricky asked numerous questions that have led to this thread. He has been given ample reasons why his approach was fatally flawed, but has insisted that he had to go forward anyway. Kind of like the Light Brigade...

        If you were not aware of the history ex-thread, you certainly can be forgiven for your reaction.

        However, note several things. I only *asked* if he was as dense as it appeared. Subtle, but (in my opinion) important. On my use of "stubborn" and "stupid", I tried to avoid absolute assertions about the character of Tricky, giving him some wiggle room to save face.

        On the substance of your note: the full range of HTML markup is sufficiently complex in the right ways as to make parsing it by regular expressions impossible. A full-blown parser is required. I'm not able to explain it in gory details, but I recommend looking over Tricky's past posts to see how this whole meta-thread has developed.

        yours,
        Michael

Re: Using a config file in my regexp script.
by jdtoronto (Prior) on Sep 17, 2003 at 15:25 UTC
    I don't really understand what the config file is for? But yes, there are any number of ways of maintaining a config. But why? There are any number of HTML parser modules out there - as you should have read in the responses to other threads you participated in.

    With the possibility that HTML tags will run over multiple lines and all the other variability in html (like unbalanced cases of chars in tags, missing closing tags etc. etc) you will soon find that a simple regex solution will fall apart.
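Two of those failure modes can be reproduced with the image-stripping substitution from the question. This is only a small demonstration; the sample lines are made up, and the sub name scrap_image_tag is a stand-in for the original scrapImageTag loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The image-stripping substitution from the question, applied line by line.
sub scrap_image_tag {
    my @lines = @_;
    s/<IMG\s+(.*)>//ig for @lines;
    return join '', @lines;
}

# Case 1: the tag is split across two lines, so no single line matches.
my $split = scrap_image_tag(
    qq{<img src="a.jpg"\n},
    qq{     alt="kitten">\n},
);

# Case 2: greedy .* runs to the last ">" on the line and eats the link too.
my $greedy = scrap_image_tag(
    qq{<img src="a.jpg"> <a href="x.html">keep me</a>\n},
);

print "split tag survives:\n$split";
print "greedy match leaves only a newline\n" if $greedy eq "\n";
```

In the first case the tag is left untouched; in the second the link after the image is deleted along with it.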

    Look at the links, spend a day or two going through the modules and I think you will soon understand WHY they were written in the first place.

    jdtoronto speaks with the voice of bitter experience!

Re: Using a config file in my regexp script.
by krisahoch (Deacon) on Sep 17, 2003 at 20:36 UTC

    Tricky

    Use an HTML parser or, as jeffa indicated, CSS.

    Kristofer Hoch

Node Type: perlquestion [id://292090]
Approved by broquaint
Front-paged by broquaint