

How to determine if an HTML tag splits across multiple lines


Question

I'm writing a PHP script that involves scraping web pages. Currently, the script analyzes the page line by line, but it breaks if there is a tag that spans multiple lines, like

<img src="example.jpg"
alt="example">

If worse comes to worst, I could possibly preprocess the page by removing all line breaks, then re-inserting them at the closest >, but this seems like a kludge.

Ideally, I'd be able to detect a tag that spans lines, join only those lines, and continue processing.
So what's the best method to detect this?

2019/01/18
1/18/2019 11:10:50 AM

Accepted Answer

Perhaps for future projects I'll use a parsing library, but that's kind of aside from the question at hand. This is my current solution. rstrpos is strpos, but from the reverse direction. Example use:

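for ($i = 0; $i < count($lines); $i++)
{
    // $i is taken by reference in the function signature, so it is advanced
    // past any lines that get merged into the current one.
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}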

And here's that implementation:

function rstrpos($string, $charToFind, $relativePos)
{
    // Position of the last occurrence of $charToFind before $relativePos,
    // or FALSE if there is none.
    return strrpos(substr($string, 0, $relativePos), $charToFind);
}

function handle_multiline_tags(&$i, $line, $lines)
{
    // A tag is still open at the line break if the last '<' on the line
    // has no '>' after it.
    $open = rstrpos($line, '<', strlen($line));
    $close = rstrpos($line, '>', strlen($line));
    $unclosed = ($open !== FALSE) && ($close === FALSE || $open > $close);

    if ($unclosed && isset($lines[$i + 1]))
    {
        $i++;
        // Join with a space so attributes stay separated, then re-check the
        // combined line in case the tag spans more than two lines.
        return handle_multiline_tags($i, trim($line).' '.trim($lines[$i]), $lines);
    }
    else
    {
        return trim($line);
    }
}

This could probably be optimized in some way, but for my purposes, it's sufficient.

2008/08/29
8/29/2008 4:20:57 PM


Don't write a parser; use someone else's. DOMDocument::loadHTML is just one option, and I think there are a lot of others.
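A minimal sketch of what that could look like with DOMDocument, using the multi-line img tag from the question as input. The sample markup and the $images variable are assumptions made for illustration, and the DOM extension is assumed to be enabled.

```php
<?php
// Hypothetical input: the multi-line tag from the question, inside a small page.
$html = "<p>Intro</p>\n<img src=\"example.jpg\"\nalt=\"example\">";

$doc = new DOMDocument();
// Scraped HTML is often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// The parser does not care where the line breaks fall inside a tag.
$images = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = [
        'src' => $img->getAttribute('src'),
        'alt' => $img->getAttribute('alt'),
    ];
}

print_r($images);
```

Because the parser works on the whole document, the question of which line a tag ends on never comes up.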

2008/08/29

Well, this doesn't answer the question and is more of an opinion, but...

I think that the best scraping strategy (and consequently, the way to eliminate this problem) is not to analyze HTML line by line, which is unnatural to HTML, but to analyze it by its natural delimiters: <> pairs.

There will be two types, of course:

  • Tag elements that are immediately closed, e.g., <br />
  • Tag elements that need a separate closing tag, e.g., <p>text</p>

You can immediately see the advantage of this strategy in the case of paragraph (p) tags: it is easier to parse multiline paragraphs, since you no longer have to track which line the closing tag falls on.
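As a rough sketch of splitting on <> pairs rather than on line breaks, something like the following might do. The function name split_by_tags is made up for this example, and it assumes no literal > appears inside attribute values, comments, or scripts.

```php
<?php
// Split markup into a flat list of tags and text runs, using '<' ... '>'
// as the delimiters instead of line breaks.
function split_by_tags($html)
{
    // [^>]* matches newlines too, so a tag may span several lines.
    // DELIM_CAPTURE keeps the tags themselves; NO_EMPTY drops empty pieces.
    return preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Hypothetical input: a multiline paragraph followed by the tag from the question.
$tokens = split_by_tags("<p>text\nmore</p><img src=\"example.jpg\"\nalt=\"example\">");

print_r($tokens);
```

Each token is then either a complete tag or a run of text, regardless of how the source was wrapped.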

2008/08/29

Why don't you read in a line and store it in a string, then check the string for tag openings and closings? If a tag spans more than one line, add the next line to the string and move the part before the opening bracket to your processed string. Then just work through the entire file doing this. It's not beautiful, but it should work.
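A simplified sketch of that buffering idea: it keeps appending lines to a buffer until the last < in the buffer has a > after it, then emits the buffer as one logical line. The function name join_split_tags is hypothetical, and unlike the description above it emits whole buffered lines rather than moving partial text across.

```php
<?php
// Merge physical lines into logical lines so that no tag is split
// across a line break.
function join_split_tags(array $lines)
{
    $result = [];
    $buffer = '';

    foreach ($lines as $line) {
        // Join with a space so attributes on separate lines stay separated.
        $buffer = ($buffer === '') ? trim($line) : $buffer . ' ' . trim($line);

        $open = strrpos($buffer, '<');
        $close = strrpos($buffer, '>');
        // The buffer ends inside a tag if its last '<' has no '>' after it.
        $unclosed = $open !== false && ($close === false || $open > $close);

        if (!$unclosed) {
            $result[] = $buffer;
            $buffer = '';
        }
    }

    // Input ended in the middle of a tag: keep what was collected.
    if ($buffer !== '') {
        $result[] = $buffer;
    }

    return $result;
}

$joined = join_split_tags([
    '<p>hi</p>',
    '<img src="example.jpg"',
    'alt="example">',
    'done',
]);

print_r($joined);
```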

2008/08/29

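If you have to stick with your current method of parsing, and it uses a regex, you can run the pattern over the whole page instead of one line at a time and use the "s" modifier so that . also matches line breaks. (The multi-line modifier "m" only changes where ^ and $ match.)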
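For PHP's preg_* functions, the modifier that lets . cross line breaks is "s"; "m" only affects the ^ and $ anchors. A small sketch, assuming the page is held in one string (the sample markup and variable names are made up for illustration):

```php
<?php
// Hypothetical input: the whole page as one string, not an array of lines.
$html = "<p>Intro</p>\n<img src=\"example.jpg\"\nalt=\"example\">";

// 's' lets '.' match newlines; '.*?' is lazy, so the match stops at the first '>'.
$withS = preg_match_all('/<img\b.*?>/s', $html, $matches);

// For comparison: with only 'm', '.' still stops at the line break.
$withM = preg_match_all('/<img\b.*?>/m', $html, $ignored);

print_r($matches[0]);
```

A negated character class such as <img\b[^>]*> gives the same result without needing any modifier.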

2008/08/29

Source: https://stackoverflow.com/questions/33814
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow