PerlMonks  

Re^2: Begginer's question: If loops one after the other. Is that code correct? (updated)

by predrag (Scribe)
on Jan 10, 2017 at 15:24 UTC


in reply to Re: Begginer's question: If loops one after the other. Is that code correct? (updated)
in thread Begginer's question: If loops one after the other. Is that code correct?

Thanks a lot to everyone. I had already seen PerlMonks as a friendly place, but I am really more than pleasantly surprised by such fast answers and a warm welcome. Apart from Perl and a little bash scripting I have no programming background; long ago I learned some Fortran, but I have forgotten most of it. So it was pretty strange for me to see that starting with Perl is not as difficult for a beginner as some people on the web write, although of course I also see how complex and powerful Perl is. Anyway, I have become addicted to doing something in Perl every day...

I will remember the terms and the other suggestions, and in any case I will try to combine the first and last blocks as you suggested.

I understand that nobody can help more seriously without seeing more of the code, but to begin with, the most important thing for me was whether it is OK or not.

The code I sent is simplified because I wanted to concentrate on just one question. It is part of a script of around 160 lines that converts an HTML page from the Serbian Latin alphabet to the Cyrillic alphabet. The script works well and gives the desired result (only for simple HTML pages). The task of these if blocks in the foreach loop is to separate the HTML code from the page content that will be converted. I know about the Parser module; I have already installed it and tried some simple examples, but I didn't see how it could help me in this case, so I went back to my first solution.

But, as with my question in this node, although it gives a good result, I would like to hear comments on the converter code, even if it is maybe a ridiculous solution. I wonder what the most appropriate way to do that would be: start a new node or continue in this one. The whole code would probably be too much, so I could explain my approach in words only, or give some details of the code. I don't know anyone with whom I could talk about Perl, so today is a big change for me.


Replies are listed 'Best First'.
Re^3: Begginer's question: If loops one after the other. Is that code correct?
by haukex (Archbishop) on Jan 10, 2017 at 16:30 UTC

    Hi predrag,

    It sounds like the trickiest part of your current solution is probably figuring out whether you're in some part of the HTML code or whether you're in the text, since obviously tags shouldn't be converted to Cyrillic. Unfortunately, parsing HTML is a pretty difficult task (a humorous post about the topic). So I'd like to encourage you to look at one of the parser modules again.

    Two classic modules are HTML::Parser and HTML::TreeBuilder, but there are several others, such as Mojo::DOM. If the input is always XHTML, there's XML::Twig and many more XML-based modules. These modules generally break down the HTML into its structure, including elements (<tags>) with their attributes, comments, or text. Some of the modules then represent the HTML as a Document Object Model (DOM), which is also worth reading a little about. It sounds like you only want to operate on text, and maybe on some elements' attributes (such as title="..." attributes).

    Operating only on text is relatively easy: for example, in a HTML::Parser solution, you could register a handler on the text event, which does the appropriate conversions, and register a default handler which just outputs everything else unchanged:

        use warnings;
        use strict;
        use HTML::Parser;

        my $p = HTML::Parser->new( api_version => 3, unbroken_text => 1 );
        $p->handler(text => sub {
                my ($text) = @_;
                # ### Your filter here ###
                $text=~s/foo/bar/g;
                print $text;
            }, 'text');
        $p->handler(default => sub { print shift; }, 'text');

        my $infile = '/tmp/in.html';
        my $outfile = '/tmp/out.html';
        open my $out, '>', $outfile or die "open $outfile: $!";
        # "select" redirects the "print"s
        my $previous = select $out;
        $p->parse_file($infile);
        close $out;
        select $previous;
        print "$infile -> $outfile\n";

    Operating on attributes will require you to handle opening elements (tags) as well. Note also that the same basic principle I described above applies to the other modules: they all break the HTML down into its components, so that you can operate on only the textual parts, leaving the others unchanged.

    BTW, have you seen Lingua::Translit?

    Hope this helps,
    -- Hauke D

      Thanks a lot, Hauke D. Yes, that was the problem I had to resolve: position. No, I didn't know about Lingua::Translit.

      When converting into Cyrillic, it is important to know that not all Cyrillic letters have a one-to-one correspondence with Latin letters; a few of them correspond to two Latin letters.
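To illustrate the two-letter correspondences mentioned above, here is a minimal sketch that tries the two-letter combinations (lj, nj, dž) before the single letters by putting the longest keys first in a regex alternation. The names `%lower`, `%map` and `to_cyrillic` are made up for this example and are not taken from the script in this thread. It also assumes that every "lj", "nj" and "dž" is a single letter, which is not true for a few words (for example "injekcija" or "nadživeti"), so a real converter would probably need an exception list.

```perl
use strict;
use warnings;
use utf8;                               # this source contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');

# Lowercase Serbian Latin => Cyrillic, including the three digraphs.
my %lower = (
    'a' => 'а', 'b' => 'б', 'v' => 'в', 'g' => 'г', 'd' => 'д',
    'đ' => 'ђ', 'e' => 'е', 'ž' => 'ж', 'z' => 'з', 'i' => 'и',
    'j' => 'ј', 'k' => 'к', 'l' => 'л', 'lj' => 'љ', 'm' => 'м',
    'n' => 'н', 'nj' => 'њ', 'o' => 'о', 'p' => 'п', 'r' => 'р',
    's' => 'с', 't' => 'т', 'ć' => 'ћ', 'u' => 'у', 'f' => 'ф',
    'h' => 'х', 'c' => 'ц', 'č' => 'ч', 'dž' => 'џ', 'š' => 'ш',
);

# Derive the capitalised forms: "Lj" and "LJ" both become the capital letter.
my %map;
for my $latin (keys %lower) {
    my $cyr = $lower{$latin};
    $map{$latin}         = $cyr;
    $map{ucfirst $latin} = uc $cyr;
    $map{uc $latin}      = uc $cyr;
}

# Longest keys first, so that "lj" is tried before "l".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %map;
my $re = qr/($alternation)/;

sub to_cyrillic {
    my ($text) = @_;
    $text =~ s/$re/$map{$1}/g;          # everything not in %map stays as it is
    return $text;
}

print to_cyrillic("Ljubav, njiva, džep, pčela"), "\n";
```

With a table like this the digraph handling needs no look-ahead logic in the loop, because the regex engine takes the first alternative that matches at each position.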

      Oh, you have already posted code for converting with Parser, thank you so much. I will save that code and try it later. I would still like to show my work, though :) If nothing else, it may be fun for real programmers to look at. Maybe I am too cruel to my own work. I also resolved the problem with the HTML &nbsp;, so my script treats it as code and doesn't convert it.

      I think I have installed HTML::TreeBuilder as well. For now I work with Perl in a virtual machine where I have CentOS 6.7 and an old Ubuntu, and I was pretty lucky installing modules in CentOS; I was a bit scared beforehand. I have installed Perl for Windows too.

        Here is the code of the converter. It is for my non-commercial beekeeping website, which is in the Serbian Latin alphabet. I am working on its new design and would like to have it converted into Cyrillic too. It is not a small site, maybe a few hundred pages. I know there is software for that, but somehow I don't like it. Until recently I never expected I could even have the site in Cyrillic, or could even try to do it myself, but with Perl I think it is possible, even for a beginner like me.

        It works for simple HTML pages. If I have external CSS files, maybe it will work with pages that use CSS too; I haven't tried yet. So I am asking the monks just for comments on my approach.

        It reads an HTML file, converts the text into Cyrillic, leaves the code untouched, and creates a new HTML file in Cyrillic. The next steps are to read a whole directory or the whole website, and a lot of other things remain to be done, but that is not part of my question now.

        I read the input file/string part by part, where a part is either a code string between < and > or a text string between > and <. To determine where the code is and where the text is, I have a parameter k that is set to 1 after "<" and to 2 after ">".
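The part-by-part reading described above can also be sketched with split and a capturing group, which returns the tags and the text between them as one alternating list. This is only an illustration under simplifying assumptions: `convert_html` and `convert_text` are hypothetical names, `convert_text` merely uppercases so the effect is easy to see, and the pattern will misfire on comments, scripts, or a ">" inside an attribute value, which is where a real parser module does better.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real Latin-to-Cyrillic routine.
sub convert_text {
    my ($text) = @_;
    return uc $text;
}

sub convert_html {
    my ($html) = @_;
    # Because of the capturing group, split returns the tags too:
    # even positions hold text, odd positions hold the captured tags.
    my @parts = split /(<[^>]*>)/, $html;
    for my $i (0 .. $#parts) {
        next if $i % 2;                 # a tag: leave it untouched
        $parts[$i] = convert_text($parts[$i]);
    }
    return join '', @parts;
}

print convert_html('<p class="x">med i <b>vosak</b></p>'), "\n";
# prints: <p class="x">MED I <b>VOSAK</b></p>
```

Unlike a loop that only emits text when it reaches the next ">", this keeps any text that follows the last tag in the file.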

        A subroutine converts the strings. A hash contains a dictionary of one-to-one equivalents. Letters that look the same (for example "a", "e" etc.) are omitted, and I wonder whether that is OK: for example, are the Latin and the Cyrillic letter "a" the same in an HTML file and in its encoding?
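On the question of look-alike letters: in Unicode the Latin and Cyrillic letters are separate characters with different code points, even where they look identical. The short check below prints the code point and the official name of three such pairs; `describe` is a hypothetical helper name, and the Cyrillic letters are written as escapes so that the two cannot be confused in the source.

```perl
use strict;
use warnings;
use charnames ();    # core module; provides charnames::viacode()

# Return the code point and Unicode name of a single character.
sub describe {
    my ($char) = @_;
    my $code = ord $char;
    return sprintf 'U+%04X %s', $code, charnames::viacode($code);
}

# "\x{430}", "\x{435}" and "\x{458}" are the Cyrillic letters that
# look like Latin "a", "e" and "j".
for my $pair (['a', "\x{430}"], ['e', "\x{435}"], ['j', "\x{458}"]) {
    my ($latin, $cyrillic) = @$pair;
    printf "%-30s | %s\n", describe($latin), describe($cyrillic);
    print "  identical? ", ($latin eq $cyrillic ? "yes" : "no"), "\n";
}
```

So a page in which "a", "e", "o", "j", "k" and similar letters are left unconverted will probably look right in a browser but will contain mixed scripts, which can affect searching and sorting.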

        The script prints the output file on standard output too.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use utf8;
        binmode(STDOUT, ":utf8");
        use open ':encoding(utf8)';   # input/output default encoding will be
                                      # UTF-8

        my $infile;
        # reads input file into string $infile
        open INPUT, "<index_latin.html";
        undef $/;
        $infile =<INPUT>;
        close INPUT;

        my $k;                # parameter =1 between < > , =2 between > <
        my $string;           # "<code between>"
        my $txtstring = '';   # >"text between"<
        my $outcode = '';     # output: code and converted text together
        my $for_conv;         # string to be converted by sub
        my $char;             # character from input file
        my $convert;          # converted string by sub

        # splits input file into characters
        foreach $char (split//, $infile) {
            if ($char eq "<") {
                $k = 1;
            }
            if ($k ==2) {
                $txtstring= $txtstring . $char;
            }
            else {
                $string = $string .$char;
            }
            if ($char eq ">") {
                if (substr($txtstring, 0, 1) eq "&" ){   # &nbsp will not be converted
                    $string =$txtstring.$string;   #goes to string code
                    $txtstring = '';   ##
                }
                $for_conv = $txtstring;
                $convert = konverter($for_conv);
                $outcode = $outcode .$convert.$string;
                $k = 2;
                $string = '';
                $txtstring = '';
            }   # of if char eq ">"
        }   # of foreach

        # writing to file
        my $filename = "index_cyrilic.htm";
        open(FH, '>', $filename) or die $!;
        print FH $outcode ;
        close(FH);

        print "\n";
        print "code on the output:\n";
        print "\n";
        print "$outcode\n";

        # converting string into Cyrillic
        sub konverter {

            # dictionary
            my %dict = (
                "b"=> "б","B"=> "Б","c"=> "ц","C"=> "Ц","č"=> "ч","Č"=> "Ч",
                "ć"=> "ћ","Ć"=> "Ћ","d"=> "д","D"=> "Д","đ"=> "ђ","Đ"=> "Ђ",
                "f"=> "ф","F"=> "Ф","g"=> "г","G"=> "Г","h"=> "х","H"=> "Х",
                "i"=> "и","I"=> "И","l"=> "л","L"=> "Л","m"=> "м","n"=> "н",
                "N"=> "Н","p"=> "п","P" => "П","r" => "р","R" => "Р","s"=> "с",
                "S"=> "С","š"=> "ш","Š"=> "Ш","t"=> "т","u"=> "у","U"=> "У",
                "v"=> "в","V" => "В","z"=> "з", "Z" => "З","ž"=> "ж","Ž"=> "Ж");

            my @conv_arr = split (//, $for_conv);   # splits input string for conversion
            my $ind = 0;     # index of array element
            my $out = "";    # output, converted string
            my $str_char;    # string character
            my $next;        # next string character
            my $nj;          # Latin two character letters to be replaced with one Cyrillic
            my $Nj;
            my $lj;
            my $Lj;
            my $dz;
            my $Dz;

            while ($ind <= $#conv_arr){
                $str_char = $conv_arr[$ind];   # current character
                if ($ind ==$#conv_arr) {
                    $next ="";   # there are no more characters
                }
                else {
                    $next =$conv_arr[$ind+1];   # next character
                }
                if (exists ($dict{$str_char})) {
                    # combination nj gives $nj = "њ"
                    if (($str_char eq "n") && ($next eq "j")){
                        $nj = "њ";
                        $out = $out.$nj;
                        $ind = $ind+1;
                    }
                    elsif (($str_char eq "N") && ($next eq "j")){
                        $Nj = "Њ";
                        $out = $out.$Nj;
                        $ind = $ind+1;
                    }
                    elsif (($str_char eq "l") && ($next eq "j")){
                        $lj = "љ";
                        $out = $out.$lj;
                        $ind = $ind+1;
                    }
                    elsif (($str_char eq "L") && ($next eq "j")){
                        $Lj = "Љ";
                        $out = $out.$Lj;
                        $ind = $ind+1;
                    }
                    elsif (($str_char eq "d") && ($next eq "ž")){
                        $dz = "џ";
                        $out = $out.$dz;
                        $ind = $ind+1;
                    }
                    elsif (($str_char eq "D") && ($next eq "ž")){
                        $Dz = "Џ";
                        $out = $out.$Dz;
                        $ind = $ind+1;
                    }
                    else {   # one character letters
                        $out = $out.$dict{$str_char};
                    }
                    $ind++;
                }   # of if exists
                else {
                    $out = $out.$str_char;
                    $ind++;
                }
            }   # of while
            return $out;
        }   # of sub

        Here is the HTML code of the input file index_latin.html for testing:

        And this is the output code; I hope I was successful with readmore.

      Hauke D, I've tried your code with the Parser module and it works very well, thanks again. I understand the use of s/foo/bar/g; now. It is the next important step for me, and I am really encouraged to try different uses of that module on my site, as well as the other modules you suggested. Maybe even for some simple search engines etc.

      It seems Parser doesn't work on a non-breaking space

       <p>&nbsp;</p>

      but I think I can resolve that
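One possible explanation, offered as an assumption rather than a diagnosis: with the 'text' argspec, HTML::Parser hands the handler the original source text, so an entity such as &nbsp; arrives undecoded and a letter-by-letter filter would rewrite the letters inside it. A minimal sketch of one way around that is to split the text on entity references and convert only the pieces in between. The names `convert_keeping_entities` and `convert_text` are hypothetical, and `convert_text` just uppercases here as a stand-in for the real conversion.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real Latin-to-Cyrillic routine.
sub convert_text {
    my ($text) = @_;
    return uc $text;
}

sub convert_keeping_entities {
    my ($text) = @_;
    # Capture named (&nbsp;), decimal (&#160;) and hex (&#xA0;) references.
    # Even positions hold plain text, odd positions hold the entities.
    my @parts = split /(&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);)/, $text;
    for my $i (0 .. $#parts) {
        next if $i % 2;                 # an entity: leave it untouched
        $parts[$i] = convert_text($parts[$i]);
    }
    return join '', @parts;
}

print convert_keeping_entities('med&nbsp;i&#160;vosak'), "\n";
# prints: MED&nbsp;I&#160;VOSAK
```

A function like this could be called inside the text handler in place of the plain s/foo/bar/g filter.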

Re^3: Begginer's question: If loops one after the other. Is that code correct? (updated)
by 1nickt (Canon) on Jan 10, 2017 at 15:43 UTC

    Hi predrag, can we use your first paragraph above in our advertising?

    Seriously, Perl is great, isn't it!

    You may have had a difficult time experimenting with Parser modules, but it's definitely the right approach. You may be surprised at how simple the code can be. Someone here has probably got experience and good advice for you.

    Do not worry about a long code example; just use the <readmore></readmore> tags so you don't show it all. See Writeup Formatting Tips.

    Also make sure to post some sample input to your program and the output you desire for that input.

    edit: added link


    The way forward always starts with a minimal test.

      Thank you 1nickt. Of course, you can use that paragraph for Perl advertising and everything else from my posts.

      That was my real experience, and I should add more: I am 57 years old now. I do have an education in electronics, but that was long ago and I never worked in that field professionally. Maybe it helped that I had a good approach to learning and practicing Perl, through examples and many different tasks I set for myself. When I have to realize my ideas, I always divide the task into many small parts, check everything, put counters in loops, print the counters, etc.

      Last winter, after finishing two Linux training courses, I asked my Linux teacher about a problem I could not resolve with bash scripting. He mentioned Perl, Python, and Ruby, but he wasn't so fond of Perl. Somehow I chose Perl, although, I must admit, I didn't find many recommendations for it on the web. But as a complete beginner I eventually wrote scripts for LZW encoding and decoding in Perl, without using modules. Of course, I used some pieces of code from the web and did maybe a hundred different small examples and tests, but in the end I was successful. At that time modules were too complicated for me, so I decided to go on my own two feet. What also surprised me was realizing that with Perl I can do many things for myself that I never expected I would even try.

      As for my conversion script, I will send it in my next post. Should that be in a new node?

      P.S. To check whether Cyrillic letters are visible, here are just a few: љ, ц, п, ж, Ђ

      In the text they are visible, but in code they are not; this should show: $lj = "љ";

      use utf8;
      binmode(STDOUT, ":utf8");
      use open ':encoding(utf8)';   # input/output default encoding will be
                                    # UTF-8

      elsif (($str_char eq "l") && ($next eq "j")){
          $lj = "&#1113;";
