How to "scan" a website (or page) for info, and bring it into my program? How to "scan" a website (or page) for info, and bring it into my program? java java

How to "scan" a website (or page) for info, and bring it into my program?


Use an HTML parser like Jsoup. It has my preference over the other HTML parsers available in Java since it supports jQuery-like CSS selectors. Also, its class representing a list of nodes, Elements, implements Iterable, so you can iterate over it in an enhanced for loop (there's no need to hassle with verbose Node and NodeList-like classes as in the average Java DOM parser).

Here's a basic kickoff example (just put the latest Jsoup JAR file in classpath):

package com.stackoverflow.q2835505;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test {

    public static void main(String[] args) throws Exception {
        String url = "https://stackoverflow.com/questions/2835505";
        Document document = Jsoup.connect(url).get();

        String question = document.select("#question .post-text").text();
        System.out.println("Question: " + question);

        Elements answerers = document.select("#answers .user-details a");
        for (Element answerer : answerers) {
            System.out.println("Answerer: " + answerer.text());
        }
    }

}

As you might have guessed, this prints your own question and the names of all answerers.


This is referred to as screen scraping; Wikipedia has an article on the more specific case of web scraping. It can be a major challenge because there's some ugly, messed-up, broken-if-not-for-browser-cleverness HTML out there, so good luck.


I would use JTidy - it is similar to JSoup, but I don't know JSoup well. JTidy handles broken HTML and returns a w3c Document, so you can use that as a source for XSLT to extract the content you are really interested in. If you don't know XSLT, then you might as well go with JSoup, as its Document model is nicer to work with than the w3c one.
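For what it's worth, here is a rough, untested sketch of that route: JTidy cleans the (possibly broken) markup into a standard w3c Document, which is then handed to the stock JAXP XSLT machinery. The stylesheet name extract.xsl is hypothetical - you would write it yourself to pull out whatever parts of the page you need.

import java.io.InputStream;
import java.net.URL;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class JTidyExample {

    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL("https://stackoverflow.com/questions/2835505").openStream()) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);

            // Clean up the HTML and get back a standard org.w3c.dom.Document.
            Document document = tidy.parseDOM(in, null);

            // Feed the cleaned document to an XSLT stylesheet (hypothetical extract.xsl)
            // that extracts the content you are interested in.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("extract.xsl"));
            transformer.transform(new DOMSource(document), new StreamResult(System.out));
        }
    }

}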

EDIT: A quick look at the JSoup website shows that JSoup may indeed be the better choice. It seems to support CSS selectors out of the box for extracting stuff from the document, which may be a lot easier to work with than getting into XSLT.