http://qs321.pair.com?node_id=11136554

szpt9m has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am using this regex in my Perl script: .*TEXT:.*?([a-zA-Z0-9_\x7f-\xff.\w]+)" My input is TEXT: "C:\temp\test äbc.txt" and the expected output is test äbc.txt, but since there is a space in the filename it returns only äbc.txt. Can someone please help me resolve this?

Replies are listed 'Best First'.
Re: Regex to get file name from the path with spaces (updated)
by haukex (Archbishop) on Sep 08, 2021 at 07:57 UTC

    Don't use a regex, use File::Spec, which is a core module (or you can use e.g. Path::Class or Path::Tiny from CPAN): use File::Spec; my (undef, undef, $filename) = File::Spec->splitpath($fullpath); If this script is running on a non-Windows system but you need to handle Windows filenames, use File::Spec::Win32 instead of File::Spec.
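A minimal sketch of the suggestion above, assuming the full path is already in a variable. The sample path is taken from the question; the variable names are only illustrative. It also shows why File::Spec::Win32 matters when the script does not run on Windows: under Unix rules the backslash is not a separator.

```perl
use strict;
use warnings;
use File::Spec::Win32;   # core module; applies Windows path rules on any OS
use File::Spec::Unix;    # loaded only to show the contrast

# Sample path from the question; \x{e4} is the character "ä".
my $fullpath = qq{C:\\temp\\test \x{e4}bc.txt};

# In list context splitpath returns (volume, directories, file).
my ($volume, $directories, $filename) = File::Spec::Win32->splitpath($fullpath);

# Unix rules do not treat "\" as a separator, so the whole string ends up as the file part.
my (undef, undef, $unix_filename) = File::Spec::Unix->splitpath($fullpath);

print "Win32 rules: $filename\n";
print "Unix rules:  $unix_filename\n";
```

Plain File::Spec picks the rules of the operating system the script runs on, which is why the explicit Win32 class is used here.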

    Update: After rereading I realize you're also trying to extract the path from a string that looks like "TEXT: \"...\"". I feel like we might be missing some context, because the file format is unclear to me. Is this some standardized file format you're trying to extract a part of? If so, what format? Usually quoted strings also have some kind of escaping mechanism; is that the case here? All of these things will affect what the best solution is. If the format really is as simple as it seems, then I would combine Corion's solution to extract the string from the quotes with my suggestion above to extract the filename.

      Thanks for the response. I have many other lines to capture in the regex, and to make it easier I have shown only the part of the regex where I am facing the issue. So I need a fix in the regex so that I can adapt the existing code.

        Maybe you can think of your capture problem in a different way then. Most likely, you want everything after the last "path separator" up until the double quotes:

        TEXT:\s+".*?[\\/]([^\\/"]+)"

        The filename can contain everything except a path separator (\ or /) and double quotes, and must be followed by double quotes.
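A short sketch of how that pattern behaves, assuming one TEXT: entry per line. The first input is the sample from the question; the other two are made up for illustration. Note that the pattern requires at least one path separator inside the quotes, so a bare filename does not match.

```perl
use strict;
use warnings;

# Pattern from the reply above; qr{} avoids having to escape the "/".
my $re = qr{TEXT:\s+".*?[\\/]([^\\/"]+)"};

my @lines = (
    qq{TEXT: "C:\\temp\\test \x{e4}bc.txt"},   # sample from the question (\x{e4} is "ä")
    qq{TEXT: "/var/log/my file.log"},          # hypothetical forward-slash path
    qq{TEXT: "plain.txt"},                     # hypothetical: no separator at all
);

# Capture the last path component, or undef when the line does not match.
my @names = map { $_ =~ $re ? $1 : undef } @lines;

for my $i (0 .. $#lines) {
    my $shown = defined $names[$i] ? $names[$i] : '(no match)';
    print "$lines[$i] => $shown\n";
}
```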

Re: Regex to get file name from the path with spaces
by Corion (Patriarch) on Sep 08, 2021 at 07:48 UTC

    Maybe you want to instead capture everything up to the second double quote?

    .*TEXT:\s+"([^"]+)"

    If you have both lines with and lines without double quotes, you will have to show more data.

    Update: After reading haukex's answer, I now realize that I only understood half of your question. Use my answer to extract the full path and then use File::Spec to get the actual filename.
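The two-step approach could be sketched roughly like this. It assumes the quoted string contains no escaped double quotes and that the paths are Windows paths (hence File::Spec::Win32); the variable names are illustrative only.

```perl
use strict;
use warnings;
use File::Spec::Win32;

# Sample line from the question; \x{e4} is the character "ä".
my $line = qq{TEXT: "C:\\temp\\test \x{e4}bc.txt"};

my $filename;
# Step 1: capture everything between the double quotes.
if ( my ($fullpath) = $line =~ /.*TEXT:\s+"([^"]+)"/ ) {
    # Step 2: let File::Spec split the path; the file part is the third element.
    $filename = ( File::Spec::Win32->splitpath($fullpath) )[2];
}

print defined $filename ? "filename: $filename\n" : "no TEXT: entry found\n";
```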

      Hi, I don't need everything up to the second double quote; I need the last part of the line after the /, which is the file name, and I need this as a regex only so that I can adapt the existing code.

Re: Regex to get file name from the path with spaces
by Anonymous Monk on Sep 10, 2021 at 17:28 UTC

    At the risk of stating the obvious, if you want your character class (the stuff inside the square brackets) to match a space, you should include a space in your character class. If you want only a space (that is, not other space-like characters like tabs) something like this should work: .*TEXT:.*?([a-zA-Z0-9_\x7f-\xff.\w ]+)". This is your regex with a space inserted before the right square bracket. If you want all the space-like stuff, use \s instead of a literal space.
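One thing to watch when trying this on the sample line from the question: once the class accepts a space, the lazy .*? can stop immediately, because the single space between TEXT: and the opening double quote already satisfies the class and is followed by a double quote. The sketch below shows that, plus one possible workaround (an assumption on my part, not something from the original post) that matches the opening quote explicitly before the lazy part.

```perl
use strict;
use warnings;

# Sample line from the question; \x{e4} is the character "ä".
my $line = qq{TEXT: "C:\\temp\\test \x{e4}bc.txt"};

# The class with a space added, exactly as suggested above.
my ($with_space) = $line =~ /.*TEXT:.*?([a-zA-Z0-9_\x7f-\xff.\w ]+)"/;

# Same class, but the opening quote is consumed first, so the capture
# has to end at the closing quote.
my ($anchored) = $line =~ /.*TEXT:\s*".*?([a-zA-Z0-9_\x7f-\xff.\w ]+)"/;

print "space added only:      [$with_space]\n";
print "opening quote matched: [$anchored]\n";
```

The same caveat applies to any variant of the class that accepts a space or \s.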

    You did not ask about this, but I observe that your character class appears to contain unneeded information. I am not aware of any circumstance where \w does not include ranges a-z, A-Z, 0-9, and the underscore (_). Certainly it does under ASCII, the ISO encodings, CP1252, and Unicode. So you should find that .*TEXT:.*?([\x7f-\xff.\w ]+)" matches everything you want, and is easier to understand.
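A quick way to check the redundancy claim, limited to code points 0 to 255 and to Perl's default matching rules for this script (no /u or locale modifiers, which is an assumption about the original code):

```perl
use strict;
use warnings;

# The class with the explicit ranges, and the shortened class.
my $long  = qr/\A[a-zA-Z0-9_\x7f-\xff.\w ]\z/;
my $short = qr/\A[\x7f-\xff.\w ]\z/;

# Collect every code point on which the two classes disagree.
my @differ = grep {
    my $c = chr $_;
    ( $c =~ $long ) xor ( $c =~ $short );
} 0 .. 255;

print @differ
    ? "classes differ at code points: @differ\n"
    : "classes agree on all code points 0..255\n";
```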