Match the path of a URL, minus the filename extension
Question
What would be the best regular expression for this scenario?
Given this URL:
http://php.net/manual/en/function.preg-match.php
How should I go about selecting everything between (but not including) http://php.net and .php:
/manual/en/function.preg-match
This is for an Nginx configuration file.
Accepted Answer
Like this:
if (preg_match('/(?<=net).*(?=\.php)/', $subject, $regs)) {
    $result = $regs[0];
}
Explanation:
"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
net # Match the characters “net” literally
)
. # Match any single character that is not a line break character
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
\. # Match the character “.” literally
php # Match the characters “php” literally
)
"
Popular Answer
A regular expression might not be the most effective tool for this job.
Try using parse_url(), combined with pathinfo():
$url = 'http://php.net/manual/en/function.preg-match.php';
$path = parse_url($url, PHP_URL_PATH);
$pathinfo = pathinfo($path);
echo $pathinfo['dirname'], '/', $pathinfo['filename'];
The above code outputs:
/manual/en/function.preg-match
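One edge case worth knowing: for a file directly under the root, such as /index.php, pathinfo() reports "/" as the dirname, so the echo above would print //index. A minimal sketch that guards against this, assuming an absolute URL; the function name pathWithoutExtension is made up for illustration:

```php
<?php
// Hypothetical helper (not part of the original answer).
function pathWithoutExtension(string $url): string
{
    // parse_url() returns null when the URL has no path; cast it to ''.
    $path = (string) parse_url($url, PHP_URL_PATH);
    $info = pathinfo($path);

    // dirname is "/" for a file directly under the root, so trim it
    // before re-adding the separator to avoid a doubled slash.
    $dir = rtrim($info['dirname'] ?? '', '/\\');

    return $dir . '/' . $info['filename'];
}

echo pathWithoutExtension('http://php.net/manual/en/function.preg-match.php'), "\n";
echo pathWithoutExtension('http://php.net/index.php?x=1'), "\n";
```

This should print /manual/en/function.preg-match and /index. The query string is dropped by parse_url() before pathinfo() ever sees it.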
Try this:
preg_match("/net(.*)\.php$/","http://php.net/manual/en/function.preg-match.php", $matches);
echo $matches[1];
// prints /manual/en/function.preg-match
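Note that the pattern above is tied to the literal "net" in the host name; with a host such as netflix.com the capture would start inside the host. A sketch of a host-agnostic variant, assuming the URL always starts with http:// or https:// and ends in .php; the function name extractPath is hypothetical:

```php
<?php
// Hypothetical helper (not part of the original answer).
function extractPath(string $url): ?string
{
    // ^https?://[^/]+  skip the scheme and the host
    // (/.*)            capture the path ...
    // \.php$           ... up to a trailing ".php"
    if (preg_match('~^https?://[^/]+(/.*)\.php$~', $url, $m)) {
        return $m[1];
    }
    return null; // URL did not have the assumed shape
}

echo extractPath('http://php.net/manual/en/function.preg-match.php'), "\n";
echo extractPath('http://netflix.com/a.php'), "\n";
```

This should print /manual/en/function.preg-match and /a. A URL with a query string after .php would not match and yields null under these assumptions.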
There's no need to use a regular expression to dissect a URL. PHP has built-in functions for this, pathinfo() and parse_url().
Just for the fun of it, here are two ways that have not been explored:
substr($s, strpos($s, '/', 8), -4)
Or:
substr($s, strpos($s, '/', 8), -strlen($s) + strrpos($s, '.'))
Both rely on the fact that the schemes http:// and https:// are at most 8 characters long, so it typically suffices to look for the first slash from the 9th character (offset 8) onwards. If the extension is always .php, the first snippet will work; otherwise the second one is required.
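As a quick check of both snippets, here is a small sketch that wraps them in functions. The function names and the example.com URL are illustrative, and the code assumes the URL really has a path and an extension:

```php
<?php
// Variant 1: the extension is known to be four characters long (".php").
function pathFixedExt(string $s): string
{
    return substr($s, strpos($s, '/', 8), -4);
}

// Variant 2: cut at the last dot, whatever the extension is.
// strrpos($s, '.') - strlen($s) is the same negative length as in the text.
function pathAnyExt(string $s): string
{
    return substr($s, strpos($s, '/', 8), strrpos($s, '.') - strlen($s));
}

echo pathFixedExt('http://php.net/manual/en/function.preg-match.php'), "\n";
echo pathAnyExt('https://example.com/docs/page.html'), "\n";
```

This should print /manual/en/function.preg-match and /docs/page.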
For a pure regular expression solution you can break the string down like this:
~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~
The path portion ends up in the first capture group (index 1), that is, the final ([^?#]*) of the expression. Removing the extension can then be done using pathinfo():
$parts = pathinfo($matches[1]);
echo $parts['dirname'] . '/' . $parts['filename'];
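Put together, the two steps might look like the sketch below. The query string and fragment on the sample URL are added here only to show that they stay out of group 1:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php?foo=1#bar';

// Group 1 holds the path; scheme, authority, query and fragment are left out.
preg_match('~^(?:[^:/?#]+:)?(?://[^/?#]*)?([^?#]*)~', $url, $matches);

// Strip the extension from the captured path.
$parts  = pathinfo($matches[1]);
$result = $parts['dirname'] . '/' . $parts['filename'];

echo $result, "\n";
```

This should print /manual/en/function.preg-match.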
You can also tweak the expression to this:
([^?#]*?)(?:\.[^?#]*)?(?:\?|$)
This expression is not optimal, though, because it involves some backtracking. In the end I would go for something less custom:
$parts = pathinfo(parse_url($url, PHP_URL_PATH));
echo $parts['dirname'] . '/' . $parts['filename'];
This general URL match allows you to select parts of a URL:
if (preg_match('/\\b(?P<protocol>https?|ftp):\/\/(?P<domain>[-A-Z0-9.]+)(?P<file>\/[-A-Z0-9+&@#\/%=~_|!:,.;]*)?(?P<parameters>\\?[-A-Z0-9+&@#\/%=~_|!:,.;]*)?/i', $subject, $regs)) {
    $result = $regs['file'];
    // or you can append the $regs['parameters'] too
} else {
    $result = "";
}
|(?<=\w)/.+(?=\.\w+$)|
- select everything from the first literal '/' that is
- preceded (lookbehind) by a word (\w) character
- up to the point where a lookahead finds
- a literal '.' followed by
- one or more word (\w) characters
- at the end of the string ($)
re> |(?<=\w)/.+(?=\.\w+$)|
Compile time 0.0011 milliseconds
Memory allocation (code space): 32
Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Max lookbehind = 1
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0007 milliseconds
0: /manual/en/function.preg-match
|//[^/]*(.*)\.\w+$|
- find a literal '//' followed by any run of characters other than '/'
- capture everything up to
- a literal '.' followed only by word (\w) characters before the end $
re> |//[^/]*(.*)\.\w+$|
Compile time 0.0010 milliseconds
Memory allocation (code space): 28
Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 4
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
0: //php.net/manual/en/function.preg-match.php
1: /manual/en/function.preg-match
|/[^/]+(.*)\.|
- find a literal '/' followed by one or more characters other than '/'
- greedily capture everything before the last literal '.'
re> |/[^/]+(.*)\.|
Compile time 0.0008 milliseconds
Memory allocation (code space): 23
Study time 0.0002 milliseconds
Capturing subpattern count = 1
No options
First char = '/'
Need char = '.'
Subject length lower bound = 3
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
0: /php.net/manual/en/function.preg-match.
1: /manual/en/function.preg-match
|/[^/]+\K.*(?=\.)|
- find a literal '/' followed by one or more characters other than '/'
- reset the start of the match with \K
- greedily select everything before
- the last literal '.' (lookahead)
re> |/[^/]+\K.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
Study time 0.0002 milliseconds
Capturing subpattern count = 0
No options
First char = '/'
No need char
Subject length lower bound = 2
No set of starting bytes
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0005 milliseconds
0: /manual/en/function.preg-match
|\w+\K/.*(?=\.)|
- find one or more word (\w) characters before a literal '/'
- reset the start of the match with \K
- select the literal '/' followed by
- everything before
- the last literal '.' (lookahead)
re> |\w+\K/.*(?=\.)|
Compile time 0.0009 milliseconds
Memory allocation (code space): 22
Study time 0.0003 milliseconds
Capturing subpattern count = 0
No options
No first char
Need char = '/'
Subject length lower bound = 2
Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
data> http://php.net/manual/en/function.preg-match.php
Execute time 0.0011 milliseconds
0: /manual/en/function.preg-match
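To try the five patterns from PHP rather than pcretest, a small sketch along these lines should work. The variable names are illustrative, and the number next to each pattern says which entry of the match array holds the path:

```php
<?php
$url = 'http://php.net/manual/en/function.preg-match.php';

// pattern => index of the entry that holds the path
// (0 = whole match, 1 = first capture group)
$patterns = [
    '|(?<=\w)/.+(?=\.\w+$)|' => 0,
    '|//[^/]*(.*)\.\w+$|'    => 1,
    '|/[^/]+(.*)\.|'         => 1,
    '|/[^/]+\K.*(?=\.)|'     => 0,
    '|\w+\K/.*(?=\.)|'       => 0,
];

$results = [];
foreach ($patterns as $pattern => $group) {
    preg_match($pattern, $url, $m);
    $results[$pattern] = $m[$group];
    echo $pattern, ' => ', $m[$group], "\n";
}
```

For this URL, each line should end in /manual/en/function.preg-match, matching the pcretest output above.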