Re: How to scraper ASP websites

Re^2: How to scraper ASP websites by Anonymous Monk on Sep 05, 2012 at 20:07 UTC
    #!/usr/bin/perl
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    my $url = ('http://www.folkeferie.dk/da/ferier/Aktuelle-chartertilbud---afbudsrejser/');
    $mech->get($url);
    my $hsh={};
    $links = $mech->find_all_links(url_regex=>qr/templates\/textPage\.aspx\?id/i, text_regex=>qr/Afbudsrejser/i);
    foreach my $link (@$links) {
        $url = $link->url_abs();
        $mech->get($url);
        my $content = $mech->content();
        while ($content=~/tr class="bgrow1"><td>(.*?)<\/td><td class="countryValue">(.*?)<\/td><td class="destnameValue">(.*?)<\/td><td class="hotelNameValue">(.*?)<\/td><td class="durationValue">(.*?)<\/td><td align="RIGHT" class="priceValue"><a target="_blank" href="(.*?)">(.*?)<\/a><\/td>/gisxm) {
            $hsh->{'url'} = $6;
            $hsh->{'crap_id'} = '';
            $hsh->{'date'} = $1;
            $hsh->{'country'} = $2;
            $hsh->{'destination'} = $3;
            $hsh->{'trip_type'} = $4;
            $hsh->{'trip_length'} = $5;
            $hsh->{'price'}=$7;
            print "$hsh->{'date'}, $hsh->{'country'}, $hsh->{'destination'}, $hsh->{'trip_type'}, $hsh->{'trip_length'}, $hsh->{'price'}, $hsh->{'crap_id'}, $hsh->{'url'}, $airport\n\n";
        }
    }

Please have a look at this code, check the link, and tell me how I can scrape the details from it.

Regards
Re: need help in scrapping asp site by davido (Cardinal) on Sep 06, 2012 at 06:42 UTC
This is less likely to get help than the node you messily copied and pasted it from. My recommendation is to think up an actual programming question relating to the code you are presenting. Something along the lines of:

"I'm trying to scrape a website. The following minimal code snippet is failing to produce the output I was expecting. I was expecting xyz, but instead I'm getting abc, plus an explosion of shards of solidified lava. I think the problem is with the pdq statement, but when I tried lmnop I got hot molten lava instead. How should I rewrite the thingamagizzer so that it would produce xyz rather than abc and hot lava?"

(Fill in the variables and problem description as necessary to reflect the current situation.)

Dave
Re: need help in scrapping asp site by 2teez (Vicar) on Sep 06, 2012 at 05:22 UTC
Hi, please format your code properly. Check How do I post a question effectively? You can also consider using Perl::Tidy on your script. Moreover, when you "copy" code, you should at least edit it before re-posting. Check the "[download]" at the end of your OP.
Re: need help in scrapping asp site by Anonymous Monk on Sep 06, 2012 at 06:56 UTC
Please check the code:

    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new();
    my @urls = ('http://www.folkeferie.dk/da/ferier/Aktuelle-chartertilbud---afbudsrejser/');
    foreach my $url (@urls) {
        $mech->get($url);
        my $hsh={};
        $links = $mech->find_all_links(url_regex=>qr/templates\/textPage\.aspx\?id/i, text_regex=>qr/Afbudsrejser/i);
        foreach my $link (@$links) {
            $url = $link->url_abs();
            print "\n\n\n".$url."\n\n";
            $mech->get($url);
            my $content = $mech->content();
            print $content;
            while ($content=~/tr class="bgrow1"><td>(.*?)<\/td><td class="countryValue">(.*?)<\/td><td class="destnameValue">(.*?)<\/td><td class="hotelNameValue">(.*?)<\/td><td class="durationValue">(.*?)<\/td><td align="RIGHT" class="priceValue"><a target="_blank" href="(.*?)">(.*?)<\/a><\/td>/gisxm) {
                $hsh->{'url'} = $6;
                $hsh->{'crap_id'} = '';
                $hsh->{'date'} = $1;
                $hsh->{'country'} = $2;
                $hsh->{'destination'} = $3;
                $hsh->{'trip_type'} = $4;
                $hsh->{'trip_length'} = $5;
                $hsh->{'price'}=$7;
                print "$hsh->{'date'}, $hsh->{'country'}, $hsh->{'destination'}, $hsh->{'trip_type'}, $hsh->{'trip_length'}, $hsh->{'price'}, $hsh->{'crap_id'}, $hsh->{'url'}, $airport\n\n";
            }
        }
    }

The site is developed in ASP, so the page source is not exactly standard HTML. That is why I am having a lot of trouble fetching data from this site.
Re^2: need help in scraping asp site by Athanasius (Archbishop) on Sep 06, 2012 at 07:31 UTC
When added to a regex, the `x` modifier tells the regex engine to ignore whitespace; that is, to omit the spaces, etc., in the regex from the pattern to be matched. So, if you are trying to match something like:

    <td class="countryValue">
    #  ^ note the space

and your regex has an `x` modifier, you must specify the space(s) to be matched explicitly. For example:

    <td \s+ class="countryValue">

That said, when I run your code with this fix applied:

    while ($content =~ m! tr \s+ class="bgrow1">
                          <td>                                   (.*?)  # $1
                          </td> <td \s+ class="countryValue">    (.*?)  # $2 country
                          </td> <td \s+ class="destnameValue">   (.*?)  # $3 destination
                          </td> <td \s+ class="hotelNameValue">  (.*?)  # $4
                          </td> <td \s+ class="durationValue">   (.*?)  # $5 trip_length
                          </td> <td \s+ align="RIGHT" \s+ class="priceValue">
                                <a \s+ target="_blank" \s+ href="(.*?)">  # $6 url
                                                                 (.*?)  # $7
                                </a> </td>
                        !gisxm)

the regex still gets no matches, so there is more wrong than just the missing whitespace. (Or, there is more whitespace lurking in the target webpages than I have allowed for.)

For further help from the monks, please follow the advice given above by davido, and reduce your problem to a minimal code snippet demonstrating the problem and complete with representative data.

BTW, the variable `$airport` is accessed in the final `print` statement, but never initialized. You would have seen this if you had begun the script with

    use strict;
    use warnings;

as Gangabass advised in Re: How to scraper ASP websites.

Athanasius <°(((>< contra mundum
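A minimal sketch of the two points above (the `x` modifier swallowing literal spaces, and `use strict` catching `$airport`). The HTML row, the date and country values, and the variable names `$html`, `$broken`, `$country` and `$ok` are made up for illustration; they are not taken from the target site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample row, invented for this example (not real site data).
my $html = '<tr class="bgrow1"><td>06-09-2012</td>'
         . '<td class="countryValue">Spanien</td></tr>';

# Under /x the literal space in 'td class' is ignored, so the pattern
# effectively reads '<tdclass="countryValue">' and cannot match.
my $broken = $html =~ m{<td class="countryValue">(.*?)</td>}x ? 1 : 0;

# Spelling the whitespace out as \s+ restores the match.
my ($country) = $html =~ m{<td \s+ class="countryValue"> (.*?) </td>}x;

print "pattern with a bare space matched: $broken\n";
print "country: ", (defined $country ? $country : '(no match)'), "\n";

# With strictures on, using an undeclared variable such as $airport is a
# compile-time error. A string eval is used here only so that this script
# can report the error instead of dying on it.
my $ok = eval q{ use strict; my $line = "x, $airport"; 1 };
print "strict says: $@" unless $ok;
```

Testing a pattern against one small, known string like this is also a quick way to get the minimal snippet with representative data that davido asked for.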
Re^3: need help in scraping asp site by Anonymous Monk on Sep 06, 2012 at 07:44 UTC
Re^4: need help in scraping asp site by Corion (Patriarch) on Sep 06, 2012 at 07:57 UTC
Re^4: need help in scraping asp site by marto (Cardinal) on Sep 06, 2012 at 08:42 UTC


	PerlMonks