# [SOLVED] Extracting text from a Word Doc into Excel



## dgcarlin (Jul 6, 2012)

Hello,

I'm a VBA novice trying to solve a data processing challenge and would appreciate any suggestions. Basically, I'm trying to extract selected strings of text from a Word doc to Excel.

I have a massive Word doc with thousands of entries like this:
NEGATIVE DECLARATION-NG-12-196-PL: ENV-2012-595. *2701 S. Western Ave.; South Los Angeles*. Council District No. 10. Conditional Use to permit the addition of off-site sale of beer and wine only to an existing, approx. 3,115 sq. ft. convenience market, and to permit the hours of operation from 7:00 am to 9:00 pm daily, on an approx. 26,200 sq. ft. parcel in the C2-1 Zone. Please call a DAY in advance to review file: (213)978-1332. If no answer, please leave message. Documents are available for review by appointment ONLY at: Los Angeles City Hall, 200 N. Spring St., Rm 750, Los Angeles, CA 90012. Comments can be faxed to: (213)978-1343 or e-mailed to: [email protected]. REVIEW/COMMENT period ends: July 25, 2012


What I'd like to output in Excel is (in four separate columns...doesn't render properly in the forum interface):


| DETERMINATION | CODE | PROJECT # | PROJECT |

| NEGATIVE DECLARATION | NG-12-196-PL | ENV-2012-595 | 2701 S. Western Ave. ||


In the DETERMINATION column there are also entries that say "MITIGATED NEGATIVE DECLARATION," and of course the entries for all the other columns are unique as well, and they need to be matched up in a row.

Is this a good candidate for a VBA macro? Maybe I need to process the doc as tab-delimited text first, and then export to Excel? I am tech savvy with some basic programming knowledge but a total Noob at VBA, so any step-by-step instructions would be extremely helpful. Thanks in advance!

- Dan


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

Hi Dan,

Getting the data into Excel isn't a big issue. The real issue is how they're going to be parsed. If you could attach a document with a representative sample of the range of data types to be processed, and a workbook showing the required output, that would give us a better idea of what's required. What's curcual for the parsing is markers in the text that delineate the content. So far, it seems you have colons, semi-colons and periods that might be used. Differentiating the Determination and its code _might _be a problem but, if the markers I've identified so far are consistent, the rest should be OK.


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

Cross-posted at: Extracting text from a Word Doc into Excel
For cross-posting etiquette, please read: Excelguru Help Site - A message to forum cross posters


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Thanks Macropod.

I'm attaching a file showing the data I'm working with (the paragraphs of text following each entry are not going to be used), plus a screenshot of an Excel book showing what I hope the output would look like (used a jpg b/c uploader doesn't seem to support xls).

I appreciate any tips on how to automate this.

-Dan


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

Hi Dan,

Try the following. As yet it doesn't add the data to Excel - it just parses it in Word and generates a Word table ready for pasting into Excel. I've even managed to preserve the hyperlinks in the 'PROJECT #' data.

Check to see that it's essentially what you're after. Give it some time to run too - you posted over 1000 paragraphs of data (I only asked for a _representative sample_) and they take a while to process. Because there are some inconsitencies in the data structures, you'll find a few cells with excess data in the 'PROJECT ID' column. I think you'll have to clean up those cells manually (that can be done later in Excel).

```
Sub ParseData()
' Turn Off Screen Updating
Application.ScreenUpdating = False
Dim i As Long, Rng As Range, StrTxt As String
With ActiveDocument.Range
  'Clear Doc Formatting
  .ParagraphFormat.TabStops.ClearAll
  .ParagraphFormat.Alignment = wdAlignParagraphLeft
  .InsertBefore vbCr
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
    'Replace all hard spaces with soft spaces
    .Text = "^0160"
    .Replacement.Text = " "
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Replace all double spaces with single spaces
    .Text = "[ ]{2,}"
    .Replacement.Text = " "
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete spaces before/after paragraph breaks
    .Text = " [^13]"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Text = "[^13] "
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete surplus paragraph breaks
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Prefix all paragraphs with a tab
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Join date paragraphs with following paragraphs
    .Text = "^t[MTWFS][ONDAYUESHRIT]{5,7}, ([JFMASOND][ANURYEBCHPILGSTMOV]{2,8} [0-9]{1,2}, [12][0-9]{3})^13"
    .Replacement.Text = "^p\1"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete paragraphs not containing a colon or semi-colon.
    .Text = "[^13][!:;]@[^13]"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Replace hyphen preceding NG with tab
    .Text = "-(NG-[0-9]{2}-[0-9]{1,3}-PL)*(ENV-[0-9]{4}-[0-9]{1,4})"
    .Replacement.Text = "^t\1^t-^t\2^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Clear Periods after tabs
    .Text = "^t."
    .Replacement.Text = "^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete unwanted data
    .Text = "([!^13;]{1,}???*.)*(^13)"
    .Replacement.Text = "\1\2"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete all spaces before/after tabs
    .Text = "[ ^t]{2,3}"
    .Replacement.Text = "^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete redundant paragraph breaks
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
  End With
  'Set output format
  With .Font
    .Bold = False
    .Name = "Calibri"
    .Size = "11"
  End With
  'Create headings
  .InsertBefore vbTab & "DETERMINATION" & vbTab & "CODE" & vbTab & "DATE" & vbTab & "PROJECT #" & vbTab & "PROJECT ID"
  'Convert document to a table
  .ConvertToTable vbTab
  DoEvents
  'Add dates to column 4 as required, clearing from column 1
  With .Tables(1)
    For i = 2 To .Rows.Count
      Set Rng = .Cell(i, 1).Range
      With Rng
        .End = .End - 1
        If .Text <> "" Then StrTxt = .Text: .Text = vbNullString
      End With
      .Cell(i, 4).Range.Text = StrTxt
      DoEvents
    Next
  End With
End With
' Restore Screen Updating
Application.ScreenUpdating = True
End Sub
```


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Wow. Thank you so much Paul, it works great!! And sorry for the mega-sample. I wanted to make sure you saw the full range of data and probably overdid it. Really can't thank you enough, your code is going to save me weeks of tedium.


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Quick follow-up: with another chunk of data (sample attached) it doesn't quite work the same way....if I understood the mechanics of your code I'm sure it would be an easy tweak, but most of this is over my head. It parses the date just fine, but all of the other fields end up in the same column. I can't tell what if anything is different about this data set. Maybe you can, and suggest an adjustment to the code? Thanks again!


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Something I noticed: in the data I attached, which is representative of most of the data it turns out, the code (i.e. the string right after the DETERMINATION) does not include a two- or three-letter prefix (e.g. NG, MNG). That might have thrown off the macro.


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

To work with the latest data as well (which is part of the _representative sample _you didn't provide ... ) change:
.Text = "-(NG-[0-9]{2}-[0-9]{1,3}-PL)*(ENV-[0-9]{4}-[0-9]{1,4})"
to:
.Text = "-([A-Z0-9]{2,3}-*[0-9]{1,3}-PL)*(ENV-[0-9]{4}-[0-9]{1,4})"


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Thanks a lot Paul. This works great on most of my data, but I'm running into a strange bug which is that for a couple of files, each one representing a year of data with hundreds of pages of records, once I parse them it comes to only 10 or 15 pages -- it seems to stop parsing after a certain point in the file, omitting hundreds of records. If you still have time to look at this, I would appreciate it. I'm attaching a sample of one file where this error occurs. Thanks for all your help with this.


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

The immediate problem I see is that your latest document includes many records for which the PROJECT # data don't have hyperlinks. Because of the way the Find operation works, the portion of the code that parses that data skips all the way down till it finds the next set of PROJECT # data without the hyperlink and deletes everything in between. Since you have hyperlinks in some cases and not others, the question arises whether the preservation of the hyperlinks is important. If it isn't, I can add code to kill them all at the start of the processing. You'll then get an output that contains the PROJECT # data without the hyperlinks.

Yet again, this comes back to the issue of you not supplying a representaive sample of the data. I don't propose to keep revisiting this project on a piecemeal basis for every new issue you find because you haven't done that.


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Hi Paul, having the hyperlinks is not important. If you don't mind adding the code to kill them that would be great. Thanks!


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

Try:

```
Sub ParseData()
' Turn Off Screen Updating
Application.ScreenUpdating = False
Dim i As Long, Rng As Range, StrTxt As String
With ActiveDocument.Range
  'Clear Doc Formatting
  .ParagraphFormat.TabStops.ClearAll
  .ParagraphFormat.Alignment = wdAlignParagraphLeft
  .InsertBefore vbCr
  .Fields.Unlink
  With .Find
    .ClearFormatting
    .Replacement.ClearFormatting
    .Forward = True
    .Wrap = wdFindContinue
    .Format = False
    .MatchAllWordForms = False
    .MatchSoundsLike = False
    .MatchWildcards = True
    'Ensure there is a space after every colon & semi-colon
    .Text = "[:;]"
    .Replacement.Text = "^& "
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Replace all hard spaces with soft spaces
    .Text = "^0160"
    .Replacement.Text = " "
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Replace all double spaces with single spaces
    .Text = "[ ]{2,}"
    .Replacement.Text = " "
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete spaces before/after paragraph breaks
    .Text = " [^13]"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Text = "[^13] "
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete surplus paragraph breaks
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Prefix all paragraphs with a tab
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Join date paragraphs with following paragraphs
    .Text = "^t[MTWFS][ONDAYUESHRIT]{5,7}, ([JFMASOND][ANURYEBCHPILGSTMOV]{2,8} [0-9]{1,2}, [12][0-9]{3})^13"
    .Replacement.Text = "^p\1"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete paragraphs not containing a colon or semi-colon.
    .Text = "[^13][!:;]@[^13]"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Execute Replace:=wdReplaceAll
    DoEvents
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Diferentiate the DETERMINATION, CODE & PROJECT # data
    .Text = "-([A-Z0-9]{2,3}-[0-9\-]{3,8}PL)[:;.][!E]{1,}(ENV-[0-9]{4}-[0-9]{1,4})"
    .Replacement.Text = "^t\1^t-^t\2^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Clear Periods after tabs
    .Text = "^t."
    .Replacement.Text = "^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete unwanted data
    .Text = "([!^13;]{1,}???*.)*(^13)"
    .Replacement.Text = "\1\2"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete all spaces before/after tabs
    .Text = "[ ^t]{2,3}"
    .Replacement.Text = "^t"
    .Execute Replace:=wdReplaceAll
    DoEvents
    'Delete redundant paragraph breaks
    .Text = "[^13]{1,}"
    .Replacement.Text = "^p"
    .Execute Replace:=wdReplaceAll
    DoEvents
  End With
  'Set output format
  With .Font
    .Bold = False
    .Name = "Calibri"
    .Size = "11"
    .ColorIndex = wdAuto
  End With
  'Create headings
  .InsertBefore vbTab & "DETERMINATION" & vbTab & "CODE" & vbTab & "DATE" & vbTab & "PROJECT #" & vbTab & "PROJECT ID"
  'Convert document to a table
  .ConvertToTable vbTab
  DoEvents
  'Add dates to column 4 as required, clearing from column 1
  With .Tables(1)
    For i = 2 To .Rows.Count
      Set Rng = .Cell(i, 1).Range
      With Rng
        .End = .End - 1
        If .Text <> "" Then StrTxt = .Text: .Text = vbNullString
      End With
      .Cell(i, 4).Range.Text = StrTxt
      DoEvents
    Next
  End With
End With
' Restore Screen Updating
Application.ScreenUpdating = True
End Sub
```
The line that kills the hyperlinks is:
.Fields.Unlink
I also had to add some code to ensure there are spaces after all colons and semi-colons as it was lacking in at least one of your latest records. You'll still end up with an extra row in the output data, though, as the record for 926-07-PL has a paragraph break in it.


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Perfect, thanks!


----------



## macropod (Apr 11, 2008)

*Re: Extracting text from a Word Doc into Excel*

OK, so do you still want the data to go to Excel and, if so, should the macro be run from Word or Excel? Do you have just one file to process, or many? If outputting to Excel, is there just one Excel file or, say, one per document, and does each destination workbook already exist and how are any existing Excel files to be identified?


----------



## dgcarlin (Jul 6, 2012)

*Re: Extracting text from a Word Doc into Excel*

Actually, it works fine the way you set it up -- I just copy and paste from Word to Excel. There are only 10 files, so I just C&P each parsed output to a different sheet in the Excel book. Not much work.


----------



## macropod (Apr 11, 2008)

Ok, I'll consider the topic closed then.


----------

