Match the path of a URL, minus the filename extension


Question

What would be the best regular expression for this scenario?

Given this URL:

http://php.net/manual/en/function.preg-match.php

How should I go about selecting everything between (but not including) http://php.net and .php:

/manual/en/function.preg-match

This is for an Nginx configuration file.

2014/06/30
6/30/2014 5:46:46 PM

Accepted Answer

Like this:

if (preg_match('/(?<=net).*(?=\.php)/', $subject, $regs)) {
    $result = $regs[0];
}

Explanation:

(?<=      # Positive lookbehind: assert that the regex below can be matched, with the match ending at this position
   net    #   Match the characters “net” literally
)
.*        # Match any single character that is not a line break, between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?=       # Positive lookahead: assert that the regex below can be matched, starting at this position
   \.     #   Match the character “.” literally
   php    #   Match the characters “php” literally
)
2013/10/16
10/16/2013 12:37:06 AM


Try this:

preg_match("/net(.*)\.php$/","http://php.net/manual/en/function.preg-match.php", $matches);
echo $matches[1];
// prints /manual/en/function.preg-match
2011/11/29

There's no need to use a regular expression to dissect a URL. PHP has built-in functions for this, pathinfo() and parse_url().
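
As a rough sketch of that idea (the variable names are mine, and it assumes the URL is well formed enough for parse_url() to return a path), the two functions can be combined like this:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// parse_url() with PHP_URL_PATH returns only the path component.
$path = parse_url($url, PHP_URL_PATH);   // "/manual/en/function.preg-match.php"

// pathinfo() with PATHINFO_EXTENSION returns the text after the last dot.
$extension = pathinfo($path, PATHINFO_EXTENSION);   // "php"

// Cut off the dot plus the extension, but only when there is an extension.
$result = $extension === ''
    ? $path
    : substr($path, 0, -(strlen($extension) + 1));

echo $result, "\n";   // /manual/en/function.preg-match
```

Stripping by extension length rather than by a fixed ".php" keeps the sketch usable for other file types.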

2014/06/30

Just for the fun of it, here are two ways that have not been explored:

substr($s, strpos($s, '/', 8), -4)

Or:

substr($s, strpos($s, '/', 8), -strlen($s) + strrpos($s, '.'))

Both rely on the fact that the scheme prefixes http:// and https:// are at most 8 characters long, so it typically suffices to search for the first slash from offset 8 (the 9th character) onwards. If the extension is always .php, the first snippet will work; otherwise the second one is required.
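
To make the two substr() variants easier to try out, here is a small sketch that wraps each one in a function. The function names and the second sample URL are hypothetical, chosen only for illustration, and the code assumes the input always has a path and a dot after the host.

```php
<?php
// Hypothetical helper: assumes the URL always ends in ".php" (4 characters).
function pathWithoutPhpSuffix(string $s): string
{
    return substr($s, strpos($s, '/', 8), -4);
}

// Hypothetical helper: cuts at the last dot, whatever the extension is.
// strrpos($s, '.') - strlen($s) is the negative length that drops the suffix.
function pathWithoutAnyExtension(string $s): string
{
    return substr($s, strpos($s, '/', 8), strrpos($s, '.') - strlen($s));
}

echo pathWithoutPhpSuffix('http://php.net/manual/en/function.preg-match.php'), "\n";
// /manual/en/function.preg-match

echo pathWithoutAnyExtension('https://example.com/docs/index.html'), "\n";
// /docs/index
```

Neither helper handles a query string or a URL without an extension, so this is only suitable when the input shape is known in advance.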

For a pure regular expression solution you can break the string down like this:

~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~
                              ^

The path ends up in the first capture group (index 1), whose opening parenthesis is marked by the ^ on the line beneath the expression. The extension can then be removed using pathinfo():

$parts = pathinfo($matches[1]);
echo $parts['dirname'] . '/' . $parts['filename'];
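
Put together, the expression and the pathinfo() step might be used as in the following sketch. The $url and $result variable names are assumptions added for the example.

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// Optional scheme, optional //authority, then the path captured in group 1
// (everything up to a "?" or "#").
if (preg_match('~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~', $url, $matches)) {
    // pathinfo() splits the path into dirname, basename, extension and filename.
    $parts = pathinfo($matches[1]);
    $result = $parts['dirname'] . '/' . $parts['filename'];
    echo $result, "\n";   // /manual/en/function.preg-match
}
```

One caveat with the concatenation: for a file directly under the root, such as /index.php, dirname is itself a slash on Unix-like systems, so the result would begin with a doubled slash unless it is trimmed.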

You can also tweak the expression to this:

([^?#]*?)(?:\.[^?#]*)?(?:\?|$)

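This expression is not optimal, though, because it involves some backtracking. Note also that, when applied to the sample path, its lazy first group stops at the first dot, so it captures /manual/en/function rather than /manual/en/function.preg-match. In the end I would go for something less custom: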

$parts = pathinfo(parse_url($url, PHP_URL_PATH));
echo $parts['dirname'] . '/' . $parts['filename'];
2012/09/02

This general URL match allows you to select parts of a URL:

if (preg_match('/\\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?(?P<parameters>\\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i', $subject, $regs)) {
    $result = $regs['file'];
    //or you can append the $regs['parameters'] too
} else {
    $result = "";
}
2012/01/05

|(?<=\w)/.+(?=\.\w+$)|

  • match from the first literal '/' that is preceded (lookbehind) by a word (\w) character
  • continue up to the point where a lookahead finds
    • a literal '.' followed by
    • one or more word (\w) characters
    • at the end of the subject ($)
  re> |(?<=\w)/.+(?=\.\w+$)|
Compile time 0.0011 milliseconds
Memory allocation (code space): 32
  Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Max lookbehind = 1
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0007 milliseconds
 0: /manual/en/function.preg-match

|//[^/]*(.*)\.\w+$|

  • find a literal '//' followed by any run of characters other than '/'
  • capture everything up to
  • a literal '.' followed only by word (\w) characters before the end ($)
  re> |//[^/]*(.*)\.\w+$|
Compile time 0.0010 milliseconds
Memory allocation (code space): 28
  Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 4
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: //php.net/manual/en/function.preg-match.php
 1: /manual/en/function.preg-match

|/[^/]+(.*)\.|

  • find a literal '/' followed by one or more characters other than '/'
  • greedily capture everything before the last literal '.'
  re> |/[^/]+(.*)\.|
Compile time 0.0008 milliseconds
Memory allocation (code space): 23
  Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 3
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /php.net/manual/en/function.preg-match.
 1: /manual/en/function.preg-match

|/[^/]+\K.*(?=\.)|

  • find a literal '/' followed by one or more characters other than '/'
  • reset the start of the match with \K
  • greedily match everything before
  • a lookahead for the last literal '.'
  re> |/[^/]+\K.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
  Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
 0: /manual/en/function.preg-match

|\w+\K/.*(?=\.)|

  • find one or more word (\w) characters before a literal '/'
  • reset the start of the match with \K
  • match the literal '/' followed by
  • everything before
  • a lookahead for the last literal '.'
  re> |\w+\K/.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
  Study time 0.0003 milliseconds
Capturing subpattern count = 0
No options
No first char
Need char = '/'
Subject length lower bound = 2
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P 
  Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z 
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0011 milliseconds
 0: /manual/en/function.preg-match
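
The five patterns above can also be cross-checked from PHP instead of pcretest. The following is only a sketch: the $patterns map and the group index stored next to each pattern (0 for the whole match, 1 for the first capture group) are my own illustration of how the results listed above would be read out.

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// pattern => index of the match group that holds the path
$patterns = [
    '|(?<=\w)/.+(?=\.\w+$)|' => 0,
    '|//[^/]*(.*)\.\w+$|'    => 1,
    '|/[^/]+(.*)\.|'         => 1,
    '|/[^/]+\K.*(?=\.)|'     => 0,
    '|\w+\K/.*(?=\.)|'       => 0,
];

$results = [];
foreach ($patterns as $pattern => $group) {
    // preg_match() returns 1 on a match and fills $m with the groups.
    if (preg_match($pattern, $url, $m)) {
        $results[$pattern] = $m[$group];
    }
}

// Each entry should be /manual/en/function.preg-match
print_r($results);
```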
2012/09/01

Source: https://stackoverflow.com/questions/8313941
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow