About html2xml

XSolvo html2xml is a free program that converts HTML data to XML.

HTML is a loose format, which makes parsing and processing data from HTML files difficult. Working with the same data as XML is much simpler.

With the file in XML you can use common utilities and tools to process the data.

So, get rid of the HTML parsing problem with html2xml: convert your HTML pages to XML, then process them however you like with XML parsers, XSLT, and similar tools.
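As an illustration of processing converted data with an ordinary XML parser, here is a minimal Python sketch. The sample document is hypothetical: it only assumes that the converted output is well-formed XML whose elements mirror the HTML tags without a namespace, which the actual html2xml output may not match. The `extract_links` helper is likewise an illustrative name, not part of html2xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical converted output: well-formed XML mirroring an HTML page.
# The real html2xml output may be structured differently.
SAMPLE_XML = """<html>
  <body>
    <h1>News</h1>
    <a href="http://example.com/one">First</a>
    <p>See <a href="http://example.com/two">second</a>.</p>
  </body>
</html>"""


def extract_links(xml_text):
    """Return (href, link text) pairs for every <a> element with an href."""
    root = ET.fromstring(xml_text)
    return [
        (a.get("href"), "".join(a.itertext()).strip())
        for a in root.iter("a")  # iter() walks the tree in document order
        if a.get("href")
    ]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_XML):
        print(href, text)
```

Because the input is well-formed XML, no HTML-specific error recovery is needed; the same approach would not work reliably on raw HTML.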

Current version

You can download the current version from the download page.

See the html2xml ChangeLog for changes between releases.

Added a command line version of the program.

It now supports converting HTML from the following codepages:

ASMO-708 DOS-720 iso-8859-6 arabic csISOLatinArabic
ECMA-114 ISO_8859-6 ISO_8859-6:1987 iso-ir-127 x-mac-arabic
windows-1256 ibm775 CP500 iso-8859-4 csISOLatin4
ISO_8859-4 ISO_8859-4:1988 iso-ir-110 l4 latin4
windows-1257 ibm852 iso-8859-2 csISOLatin2 iso_8859-2
iso_8859-2:1987 iso8859-2 iso-ir-101 l2 latin2
x-mac-ce windows-1250 x-cp1250 cp866 ibm866
cp1251 iso-8859-5 csISOLatin5 csISOLatinCyrillic cyrillic
ISO_8859-5 ISO_8859-5:1988 iso-ir-144 iso8859-5 koi8-r
csKOI8R koi koi8 koi8r koi8-u
koi8-ru x-mac-cyrillic windows-1251 cp1251 x-Europa
x-IA5-German ibm737 cp737 iso-8859-7 csISOLatinGreek
ECMA-118 ELOT_928 greek greek8 ISO_8859-7
ISO_8859-7:1987 iso-ir-126 x-mac-greek windows-1253 ibm869
cp869 DOS-862 iso-8859-8-i logical iso-8859-8
csISOLatinHebrew hebrew ISO_8859-8 ISO_8859-8:1988 ISO-8859-8
iso-ir-138 visual x-mac-hebrew windows-1255 ISO_8859-8-I
ISO-8859-8 visual CP870 CP1026 ibm861
iso-8859-3 csISOLatin3 ISO_8859-3 ISO_8859-3:1988
iso-ir-109 l3 latin3 iso8859_3 iso-8859-15
csISOLatin9 ISO_8859-15 l9 latin9
x-IA5-Norwegian IBM437 437 cp437 csPC8
CodePage437 x-IA5-Swedish windows-874 DOS-874 iso-8859-11
TIS-620 cp874 ibm857 cp857 iso-8859-9
csISOLatin5 ISO_8859-9 ISO_8859-9:1989 iso-ir-148
l5 latin5 x-mac-turkish windows-1254 cp1254
ISO_8859-9 ISO_8859-9:1989 iso-8859-9 iso-ir-148 latin5
us-ascii ANSI_X3.4-1968 ANSI_X3.4-1986 ascii cp367
csASCII IBM367 ISO_646.irv:1991 ISO646-US iso-ir-6us
windows-1258 ibm850 x-IA5 iso-8859-1 cp819
csISOLatin1 ibm819 iso_8859-1 iso_8859-1:1987
iso8859-1 iso-ir-100 l1 latin1 macintosh
Windows-1252 ANSI_X3.4-1968 ANSI_X3.4-1986 ascii cp367
cp819 csASCII IBM367 ibm819 ISO_646.irv:1991
iso_8859-1 iso_8859-1:1987 ISO646-US iso8859-1 iso-8859-1
iso-ir-100 iso-ir-6 latin1 us us-ascii
x-ansi ascii cp1250 cp1251 cp1252
cp1253 cp1254 cp1255 cp1256 cp1257
cp1258 cp437 cp737 cp775 cp850
cp852 cp855 cp856 cp857 cp860
cp861 cp863 cp864 cp865 cp866
cp869 cp874 iso8859_10 iso8859_13 iso8859_14
iso8859_15 iso8859_16 iso8859_1 iso8859_2 iso8859_3
iso8859_4 iso8859_5 iso8859_6 iso8859_7 iso8859_8
iso8859_9 koi8_r 
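Many names in the list above are aliases for the same codepage. The following Python sketch shows the general idea of decoding bytes from a named codepage into Unicode text. It uses Python's own codec registry, not html2xml, and Python does not necessarily recognize every name listed here; the helper names `decode_html_bytes` and `same_codec` are illustrative assumptions.

```python
import codecs


def decode_html_bytes(raw, charset):
    """Decode raw page bytes from the named codepage into a Unicode str."""
    codecs.lookup(charset)  # raises LookupError for names Python does not know
    return raw.decode(charset)


def same_codec(name_a, name_b):
    """Return True when two charset labels resolve to the same Python codec."""
    return codecs.lookup(name_a).name == codecs.lookup(name_b).name


if __name__ == "__main__":
    raw = "Привет".encode("cp1251")              # Cyrillic text as cp1251 bytes
    print(decode_html_bytes(raw, "windows-1251"))  # decoded via an alias of cp1251
    print(same_codec("latin1", "iso-8859-1"))
```

Once the bytes are decoded, the text no longer depends on the source codepage, which is why a single Unicode output format can cover all of the inputs.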


You can report bugs and enhancement requests at http://bugs.xsolvo.com. Use the html2xml queue.


To handle all HTML charsets, the output file is written in Unicode format.
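The page does not say which Unicode encoding the output file uses. As a hedged sketch, a consumer could check for a byte-order mark and fall back to UTF-8 when none is present. The function name `decode_unicode_output` is hypothetical, and the sketch deliberately ignores UTF-32.

```python
import codecs

# Byte-order marks checked in this sketch, with the codec each one implies.
# Assumption: the output is UTF-8 or UTF-16; UTF-32 is not handled.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_unicode_output(data):
    """Decode bytes according to their BOM; assume UTF-8 when there is none."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec)  # strip the BOM before decoding
    return data.decode("utf-8")
```

Most XML parsers perform this detection themselves when given the raw bytes, so a helper like this is mainly useful when the file is read as plain text first.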

